IBM / powerai

This repo contains ancillary information used to assist users of IBM Watson Machine Learning Community Edition. This repo will contain How To's, Readme's, Dockerfiles, etc. that can be consumed by users looking to get started.
BSD 2-Clause "Simplified" License

Stable Baselines won't run in conda environment #29

Open mark-hoffmann opened 5 years ago

mark-hoffmann commented 5 years ago

I am trying to use the conda environments with powerai packages and am trying to install stable-baselines (https://github.com/hill-a/stable-baselines). I have run into issues that I think are related to mpi4py.

Because there doesn't seem to be a distribution of mpi4py on a conda channel, we installed it on our Power machine. I then added the appropriate .pth file so that its install location is picked up on the Python path. This lets me run the example import, from mpi4py import MPI. However, doing so produces the following warnings:

>>> from mpi4py import MPI
libibverbs: Warning: couldn't load driver 'hns': libhns-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'i40iw': libi40iw-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'bnxt_re': libbnxt_re-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'vmw_pvrdma': libvmw_pvrdma-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb4': libcxgb4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hfi1verbs': libhfi1verbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'qedr': libqedr-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ocrdma': libocrdma-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ipathverbs': libipathverbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hns': libhns-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'i40iw': libi40iw-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'bnxt_re': libbnxt_re-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'vmw_pvrdma': libvmw_pvrdma-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb3': libcxgb3-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'cxgb4': libcxgb4-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'hfi1verbs': libhfi1verbs-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'nes': libnes-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'qedr': libqedr-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ocrdma': libocrdma-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'mthca': libmthca-rdmav2.so: cannot open shared object file: No such file or directory
libibverbs: Warning: couldn't load driver 'ipathverbs': libipathverbs-rdmav2.so: cannot open shared object file: No such file or directory
>>> 
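For readers unfamiliar with the .pth mechanism mentioned above, here is a minimal sketch of how it works. The directory names and the file name extra_mpi4py.pth are hypothetical stand-ins, not the paths used on the machine in this report; the sketch uses throwaway temporary directories so it can be run anywhere.

```python
import site
import sys
import tempfile
from pathlib import Path


def add_via_pth(site_dir: Path, target_dir: Path, name: str = "extra_mpi4py.pth") -> Path:
    """Write a .pth file in site_dir containing one line: the path to target_dir."""
    pth_file = site_dir / name
    pth_file.write_text(str(target_dir) + "\n")
    return pth_file


# Hypothetical stand-ins for a conda env's site-packages directory and
# for the directory where mpi4py was installed outside of conda.
with tempfile.TemporaryDirectory() as tmp:
    site_dir = Path(tmp) / "site-packages"
    target_dir = Path(tmp) / "system-mpi4py"
    site_dir.mkdir()
    target_dir.mkdir()
    add_via_pth(site_dir, target_dir)
    # addsitedir processes .pth files in site_dir, as the interpreter
    # does for real site-packages directories at startup.
    site.addsitedir(str(site_dir))
    pth_target_on_path = str(target_dir) in sys.path
```

Each existing directory listed in a .pth file is appended to sys.path, which is why the import succeeds afterwards. Note that this only affects where Python looks for modules; it does not change which MPI shared libraries the module loads.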

Going ahead, I wanted to see whether stable-baselines would still work. There isn't a conda channel for the package, so I tried to install from source. I cloned the repo and had to comment out opencv-python in the setup.py file, because to get opencv installed properly I had to run: conda install -c conda-forge opencv. I then ran pip install -e . to install it.

Now, if I run either import stable_baselines or the earlier from mpi4py import MPI, I no longer get just the warning messages from above; I get the following errors:

In [3]: import stable_baselines                                                                                                                                                                                                
--------------------------------------------------------------------------
It looks like opal_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during opal_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_shmem_base_select failed
  --> Returned value -1 instead of OPAL_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_init failed
  --> Returned value Error (-1) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "Error" (-1) instead of "Success" (0)
--------------------------------------------------------------------------
*** An error occurred in MPI_Init_thread
*** on a NULL communicator
*** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
***    and potentially your MPI job)
[SNA-MINSKY-N05:95209] Local abort before MPI_INIT completed completed successfully, but am not able to aggregate error messages, and not able to guarantee that all other processes were killed!

Do you guys have any insight as to what could be going on with powerpc and mpi4py?
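Because a failed MPI_Init aborts the whole interpreter, as in the output above, it can be convenient to test the import in a child process. The following is a minimal sketch of that idea; the helper name probe is an assumption made for illustration, while MPI.Get_library_version() is an existing mpi4py call that reports which MPI library was loaded.

```python
import subprocess
import sys


def probe(statement: str, timeout: float = 60.0):
    """Run a Python statement in a child interpreter.

    Returns (ok, output), where ok is True if the child exited with
    status 0 and output is its combined stdout and stderr.
    """
    result = subprocess.run(
        [sys.executable, "-c", statement],
        capture_output=True,
        text=True,
        timeout=timeout,
    )
    return result.returncode == 0, result.stdout + result.stderr


if __name__ == "__main__":
    # The parent process survives even if MPI initialization aborts the child.
    ok, output = probe("from mpi4py import MPI; print(MPI.Get_library_version())")
    print("mpi4py import OK" if ok else "mpi4py import FAILED")
    print(output)
```

If the import succeeds, the printed library version should indicate whether Open MPI or another MPI implementation was actually picked up.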

hartb commented 5 years ago

Hi @mark-hoffmann! Just wanted to confirm that we're looking at this.

In the meantime, can you confirm which MPI you're using for the above?

The Spectrum MPI that ships with PowerAI / Watson Machine Learning Community Edition (WML CE) on Power has a license that restricts its use to the WML CE-provided frameworks. We don't think that's the problem you're seeing above, but it might become an issue once you're past that.

mark-hoffmann commented 5 years ago

I believe we are trying this with OpenMPI as the underlying package as opposed to Spectrum MPI.

hartb commented 5 years ago

OK; good. OpenMPI is what we'd likely recommend for this use case.

We're trying to build Stable Baselines locally to see if we can reproduce.

mark-hoffmann commented 5 years ago

Great! Thank you for helping look into this! If it is possible, this might be a good package to pre-build and include in your powerai distribution as well, since the setup is difficult. I know Google Colab has this in their default installations as well.

bnemanich commented 5 years ago

Hi, I was able to install mpi4py using pip and run stable-baselines. I followed these steps:

1) Install OpenMPI on the system: yum install openmpi3-devel
2) Create a new conda environment that contains tensorflow: conda create -n baseline_test tensorflow-gpu
3) Activate the new environment: conda activate baseline_test
4) Install mpi4py: pip install mpi4py
5) Install the rest of stable-baselines. As you mentioned in your earlier post, the opencv-python package needs to be removed from the setup file. Instead, you can install py-opencv in your conda environment using conda install py-opencv.

One thing to note: if you are trying to run inside a conda environment that previously had spectrum-mpi installed, you might need to run:

unset MPI_ROOT
unset OPAL_PREFIX

It would probably be better to run in a fresh conda environment by following the instructions above.
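For anyone who wants to clear MPI_ROOT and OPAL_PREFIX from within a Python session instead of the shell, a minimal sketch follows. The helper name clear_mpi_overrides is hypothetical, and the sketch assumes the variables are only read when MPI initializes, so it would have to run before mpi4py is imported.

```python
import os

# Variables named in the comment above as possibly left over from spectrum-mpi.
SPECTRUM_MPI_VARS = ("MPI_ROOT", "OPAL_PREFIX")


def clear_mpi_overrides(env=None, names=SPECTRUM_MPI_VARS):
    """Remove the named variables from env and return what was removed.

    env defaults to os.environ, so the change applies to the current
    process and to any child processes it starts afterwards.
    """
    if env is None:
        env = os.environ
    removed = {}
    for name in names:
        if name in env:
            removed[name] = env.pop(name)
    return removed
```

A typical use would be to call clear_mpi_overrides() at the top of a script, print the returned dictionary to see what was set, and only then run from mpi4py import MPI. This does not alter the parent shell, so the unset commands are still needed there.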

mark-hoffmann commented 5 years ago

Thanks for the quick response!

I'm unfortunately still having trouble. I tried to recreate the exact steps you did, but still end up with the error: ImportError: libmpiprofilesupport.so.3: cannot open shared object file: No such file or directory
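One way to narrow down an error like this is to check whether the named file exists in any directory listed in LD_LIBRARY_PATH. The sketch below does only that; the helper name find_shared_library is hypothetical, and the dynamic loader also consults other locations (such as its cache and default system directories) that this check does not cover.

```python
import os
from pathlib import Path


def find_shared_library(filename, search_dirs=None):
    """Return the paths under search_dirs where a file named filename exists.

    search_dirs defaults to the entries of LD_LIBRARY_PATH.
    """
    if search_dirs is None:
        raw = os.environ.get("LD_LIBRARY_PATH", "")
        search_dirs = [d for d in raw.split(os.pathsep) if d]
    candidates = (Path(d) / filename for d in search_dirs)
    return [path for path in candidates if path.is_file()]


if __name__ == "__main__":
    hits = find_shared_library("libmpiprofilesupport.so.3")
    print(hits if hits else "not found in LD_LIBRARY_PATH directories")
```

An empty result would suggest that the environment still refers to a library from an MPI installation that is no longer present or no longer on the search path.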

Also, is there any way to install torch in this same environment as well? I can't find any distribution when I try to do a normal installation of torch. I was usually just running the command: conda install -c https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/linux-ppc64le/ powerai=1.6.0, but this time I was trying to install only what I needed. It seems that to install torch we have to run: conda install -c https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda pytorch, but this also forces the installation of spectrum-mpi, which we can't use for stable_baselines? When I conda uninstall spectrum-mpi, my ability to use torch goes away.

hartb commented 5 years ago

Regarding the PowerAI / WML CE pytorch package and spectrum-mpi...

Yes, our pytorch package on Power has a hard dependency on spectrum-mpi (for distribution using either our Distributed Deep Learning (DDL) or torch's native MPI support). So for now the only solution for using both a framework built against OpenMPI and our pytorch would be to install them in separate conda environments. You could then flip between the environments with conda deactivate/conda activate ... as needed.
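As an alternative to switching environments by hand, a script can dispatch commands to each environment with conda run. The following is a minimal sketch; baseline_test is the environment name used earlier in this thread, pytorch_env is a hypothetical name for a second environment holding the WML CE pytorch package, and the helper names are made up for illustration.

```python
import subprocess


def conda_run_command(env_name, *command):
    """Build the argument list for running a command inside a conda env."""
    return ["conda", "run", "-n", env_name, *command]


def run_in_env(env_name, *command):
    """Run a command in the named conda env and capture its output."""
    return subprocess.run(
        conda_run_command(env_name, *command),
        capture_output=True,
        text=True,
    )


if __name__ == "__main__":
    # Assumes conda is on PATH and that both environments already exist.
    for env, module in (("baseline_test", "stable_baselines"), ("pytorch_env", "torch")):
        result = run_in_env(env, "python", "-c", f"import {module}")
        print(env, module, "OK" if result.returncode == 0 else "FAILED")
```

Keeping the two stacks in separate environments means the OpenMPI-based packages and the spectrum-mpi-based pytorch never share a process, which is the point of the separation described above.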

Our WML CE pytorch package for x86 (in the same conda channel) is built against OpenMPI, rather than Spectrum MPI. That would be more convenient for this build, but would miss out on what we feel are some performance advantages of Power.

In the future we expect to release a CPU-only pytorch package that will forgo MPI support altogether (and so wouldn't clash with OpenMPI). But as the description suggests, that would lack GPU support, and so likely isn't a good choice for model training. (We think CPU-only makes more sense for inference, and that's the rationale for omitting MPI support there: we expect there will be less call for pytorch's distribution support for inference.)