Open mark-hoffmann opened 5 years ago
Hi @mark-hoffmann! Just wanted to confirm that we're looking at this.
In the meantime, can you confirm which MPI you're using for the above?
The Spectrum MPI that ships with PowerAI / Watson Machine Learning Community Edition (WML CE) on Power has a license that restricts its use only to the WML CE-provided frameworks. We don't think that's the problem you're seeing above, but might be an issue once you're past that.
I believe we are trying this with OpenMPI as the underlying package, as opposed to Spectrum MPI.
OK; good. OpenMPI is what we'd likely recommend for this use case.
We're trying to build Stable Baselines locally to see if we can reproduce.
Great! Thank you for helping look into this! If it is possible, this might also be a good package to pre-build and include in your PowerAI distribution, since the setup is difficult. I know Google Colab has this in their default installations as well.
Hi,
I was able to install mpi4py using pip and run stable-baselines. I followed these steps:
1) Installed OpenMPI on the system: yum install openmpi3-devel
2) Created a new conda environment that contains tensorflow: conda create -n baseline_test tensorflow-gpu
3) Activated the new environment: conda activate baseline_test
4) Installed mpi4py: pip install mpi4py
5) Installed the rest of stable-baselines. As you mentioned in your earlier post, the opencv-python package needs to be removed from the setup file. Instead, you can install py-opencv in your conda environment using: conda install py-opencv
One thing to note: if you are trying to run inside a conda environment that previously had spectrum-mpi installed, you might need to run:
unset MPI_ROOT
unset OPAL_PREFIX
It would probably be better to run in a fresh conda environment by following the instructions above.
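For anyone launching training scripts from Python rather than from an interactive shell, the same cleanup can be sketched as a small helper. This is only an illustration: the helper name env_without_spectrum_mpi is hypothetical, and the only variables it removes are the two named above.

```python
import os

# The two variables the thread suggests unsetting after a spectrum-mpi install.
SPECTRUM_MPI_VARS = ("MPI_ROOT", "OPAL_PREFIX")


def env_without_spectrum_mpi(env=None):
    """Return a copy of env (default: os.environ) without the Spectrum MPI variables."""
    source = os.environ if env is None else env
    return {key: value for key, value in source.items()
            if key not in SPECTRUM_MPI_VARS}
```

The returned dictionary can be passed as the env argument of subprocess.run so that the child process starts without those variables, leaving the parent shell untouched.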
Thanks for the quick response!
I'm unfortunately still having trouble. I tried to recreate the exact steps you did, but still end up with the error: ImportError: libmpiprofilesupport.so.3: cannot open shared object file: No such file or directory
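When reporting an error like this, it can help to pull out exactly which shared library the import could not load. The following is a minimal sketch, assuming the error text has the same shape as the message quoted above; the function name missing_shared_library and the importer parameter are hypothetical and exist only to make the check easy to try without a working MPI install.

```python
import importlib
import re

# Matches messages of the form "<libname>: cannot open shared object file ..."
_MISSING_LIB = re.compile(r"^(\S+?): cannot open shared object file")


def missing_shared_library(module_name, importer=importlib.import_module):
    """Import module_name; if that fails on a missing .so, return the library name.

    Returns None when the import succeeds or fails for a different reason.
    """
    try:
        importer(module_name)
    except ImportError as exc:
        match = _MISSING_LIB.search(str(exc))
        return match.group(1) if match else None
    return None
```

Calling missing_shared_library("mpi4py.MPI") inside the affected conda environment would attempt the real import. The library name it returns can then be searched for on disk to see which MPI installation, if any, provides it.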
Also, is there any way to install torch in this same environment as well? I can't find any distribution when I try a normal installation of torch. I was usually just running: conda install -c https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda/linux-ppc64le/ powerai=1.6.0, but this time I was trying to install only what I needed. It seems that to install torch we have to run: conda install -c https://public.dhe.ibm.com/ibmdl/export/pub/software/server/ibm-ai/conda pytorch, but this also forces the installation of spectrum-mpi, which we can't use for stable_baselines? When I conda uninstall spectrum-mpi, my ability to use torch goes away.
Regarding the PowerAI / WML CE pytorch package and spectrum-mpi...
Yes, our pytorch package on Power has a hard dependency on spectrum-mpi (for distribution using either our Distributed Deep Learning (DDL) or torch's native MPI support). So for now the only solution for using both a framework built against OpenMPI and our pytorch would be to install them in separate conda environments. You could then flip between the environments with conda deactivate / conda activate ... as needed.
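If switching environments by hand becomes tedious, one option is to drive each script through conda run from a small wrapper. This is a sketch only: it assumes a conda version that provides the conda run subcommand, and the helper name conda_run_command is hypothetical.

```python
def conda_run_command(env_name, script):
    """Build the argument list that runs script with python inside env_name."""
    return ["conda", "run", "-n", env_name, "python", script]
```

The resulting list can be passed to subprocess.run, once with the environment built against OpenMPI (for example baseline_test from the steps above) and once with the environment that holds the WML CE pytorch package.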
Our WML CE pytorch package for x86 (in the same conda channel) is built against OpenMPI, rather than Spectrum MPI. That would be more convenient for this build, but would miss out on what we feel are some performance advantages of Power.
In the future we expect to release a CPU-only pytorch package that will forgo MPI support altogether (and so wouldn't clash with OpenMPI). But as the description suggests, that would lack GPU support, and so likely isn't a good choice for model training. (We think CPU-only makes more sense for inference only, and that's the rationale for omitting MPI support there: we expect there will be less call for pytorch's distribution for inference.)
I am trying to use conda environments with PowerAI packages and am trying to install stable-baselines (https://github.com/hill-a/stable-baselines). I have run into issues that I think are related to mpi4py.
Because there doesn't seem to be a distribution of mpi4py on a conda channel, we installed it on our Power machine. I then added the appropriate .pth file so that its install location is picked up properly. This allows me to do the example import: from mpi4py import MPI. However, when doing so, we see the following warnings:
Going ahead, I wanted to see if stable-baselines would still work. There isn't a conda channel for the package, so I tried to install from source. I cloned the repo and had to comment out opencv-python in the setup.py file, because to get opencv installed properly I had to run: conda install -c conda-forge opencv. I then installed via: pip install -e .
Now if I try either import stable_baselines or the same command from before, from mpi4py import MPI, instead of just getting the warning messages from above, we end up with the following errors:
Do you guys have any insight as to what could be going on with powerpc and mpi4py?