bccp / nbodykit

Analysis kit for large-scale structure datasets, the massively parallel way
http://nbodykit.rtfd.io
GNU General Public License v3.0

install nbodykit on NERSC Perlmutter #675

Open biweidai opened 1 year ago

biweidai commented 1 year ago

Hi,

The nbodykit build for NERSC Cori ( https://nbodykit.readthedocs.io/en/latest/getting-started/install.html#nbodykit-on-nersc ) does not seem to work on Perlmutter. When I try to load nbodykit with

source /global/common/software/m3035/conda-activate.sh 3.8

I got the following error:

/global/common/software/m3035/conda/envs/bcast-bccp-3.8/etc/conda/activate.d/activate-nersc-prgenv-gnu.sh: line 6: /etc/profile.d/nerschost.sh: No such file or directory
/global/common/software/m3035/conda/envs/bcast-bccp-3.8/etc/conda/activate.d/activate-nersc-prgenv-gnu.sh: line 7: /etc/profile.d/modules.sh: No such file or directory
/global/common/software/m3035/conda/envs/bcast-bccp-3.8/etc/conda/activate.d/activate-nersc-prgenv-gnu.sh: line 8: /etc/profile.d/mpi-selector.sh: No such file or directory
/global/common/software/m3035/conda/envs/bcast-bccp-3.8/etc/conda/activate.d/activate-nersc-prgenv-gnu.sh: line 9: /etc/bash.bashrc.local: No such file or directory

So I tried to create my own conda environment with nbodykit. I installed mpi4py with

module swap PrgEnv-${PE_ENV,,} PrgEnv-gnu
MPICC="cc -shared" pip install --force-reinstall --no-cache-dir --no-binary=mpi4py mpi4py

following https://docs.nersc.gov/development/languages/python/parallel-python/#mpi4py-in-your-custom-conda-environment . I tested it on a Perlmutter compute node and it works fine.

But when I try to install nbodykit with

conda install -c bccp nbodykit

It doesn't use the mpi4py I built; instead it reinstalls mpi4py from conda, which no longer works on Perlmutter compute nodes. Can I force it to use the mpi4py I built?

I also tried reinstalling mpi4py to overwrite the one conda installed, and I got the following error when running the code:

Attempting to use an MPI routine before initializing MPICH
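
One way to check which mpi4py build Python actually picks up, and which MPI library it is linked against, is a short diagnostic like the sketch below. It relies only on `mpi4py.get_config()` and `MPI.Get_library_version()`; the helper name `describe_mpi4py` is made up for illustration, and the output will depend on the environment.

```python
import importlib


def describe_mpi4py():
    """Return details about the mpi4py that Python imports, or None if absent."""
    try:
        mpi4py = importlib.import_module("mpi4py")
    except ImportError:
        return None
    info = {
        # Location on disk: shows whether the pip build or the conda package wins.
        "path": mpi4py.__file__,
        "version": mpi4py.__version__,
        # Build-time configuration recorded by mpi4py (e.g. the compiler wrapper).
        "build_config": mpi4py.get_config(),
    }
    try:
        # Importing mpi4py.MPI initializes MPI, so run this on a compute node.
        from mpi4py import MPI
        info["mpi_library"] = MPI.Get_library_version().strip()
    except Exception as exc:  # a mismatched MPI library can fail at this point
        info["mpi_library"] = f"unavailable: {exc!r}"
    return info


if __name__ == "__main__":
    details = describe_mpi4py()
    if details is None:
        print("mpi4py is not installed in this environment")
    else:
        for key, value in details.items():
            print(f"{key}: {value}")
```

If the reported path or MPI library is not the one built with PrgEnv-gnu, the conda package is probably still shadowing the pip build.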

rainwoodman commented 1 year ago

One difficulty I see is that nbodykit has to be built with the PrgEnv on NERSC, so the bccp conda channel version of nbodykit likely won't work with the PrgEnv version of mpi4py anyway. You will likely need to rebuild all of nbodykit's dependencies with PrgEnv-gnu. It is possible that using pip install after loading PrgEnv-gnu will get you quite far. Did you try that?

The scripts in the m3035 project roughly do that, but via conda-build. They also create a conda channel with these PrgEnv-built packages for Cori (in the m3035 project folder). The bcast-bccp-3.8 environment used that channel. The more 'proper' fix might be to upgrade those channel-building scripts for Perlmutter, especially if you plan to run things at scale with the 'bcast'-style environments.

jmsull commented 1 year ago

Lagging behind Biwei, I am only now running into this issue now that Cori is gone forever (and have not yet found a way to make this work).

biweidai commented 1 year ago

Yu's suggestion works for me! I manually installed the nbodykit dependencies and nbodykit with PrgEnv-gnu and pip install. Have you tried this? By the way, if you are going to run nbodykit in JupyterLab, I think the conda install should work.

jmsull commented 1 year ago

For posterity, here is what I did:

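module load PrgEnv-gnu
module load gsl
conda create --name nbodykit_env python=3.7 pip
conda activate nbodykit_env  # assumed step: the new environment must be active before installing
env MPICC=cc python -m pip install --no-cache-dir mpi4py  # build mpi4py with the Cray cc wrapper
srun -n 5 python -m mpi4py.bench helloworld  # test mpi4py
pip install numpy cython
pip install "nbodykit[extras]"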

which seems to work on a Perlmutter compute node.
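
To confirm that an environment built this way is complete, a small import check along these lines may help. This is a minimal sketch: the helper name `check_imports` is hypothetical, and the module list is an assumption based on the packages installed above.

```python
import importlib


def check_imports(names):
    """Try to import each module; map name -> None on success or the error text."""
    results = {}
    for name in names:
        try:
            importlib.import_module(name)
            results[name] = None
        except Exception as exc:  # report any failure instead of stopping at the first
            results[name] = f"{type(exc).__name__}: {exc}"
    return results


if __name__ == "__main__":
    # Assumed module list; adjust it to whatever the job actually needs.
    for name, error in check_imports(["numpy", "mpi4py", "nbodykit"]).items():
        print(f"{name}: {'ok' if error is None else error}")
```

Running it under srun on a compute node, rather than on a login node, exercises the same MPI setup the real job will use.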

jmsull commented 6 months ago

@rainwoodman As you alluded to near the top of this issue, recreating the bcast-pip scripts would be nice to have. I am running some jobs on fewer than 4 nodes and am seeing 5-10 minutes of startup time. IIRC bcast-pip improves this. Biwei and I were discussing this today and may have some bandwidth to do this if you can get us started.
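
To see how much of that startup time goes into imports, one could time them in each process with a sketch like the following. The helper name `time_imports` is hypothetical, and the numbers only show where time is spent; they do not show what a bcast-style environment would recover.

```python
import importlib
import time


def time_imports(names):
    """Return seconds spent on the first import of each module, in order."""
    timings = {}
    for name in names:
        start = time.perf_counter()
        importlib.import_module(name)
        timings[name] = time.perf_counter() - start
    return timings
```

Calling `time_imports(["numpy", "nbodykit"])` at the top of a job script and printing the result on each rank gives a rough per-process picture. A module that is already imported returns almost immediately, so the call has to come before any other imports. CPython's `-X importtime` option gives a finer per-module breakdown.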