OliviaLynn opened 6 months ago
What is the "ceci path issue" referring to? Is it the issue that there is an old copy of ceci somewhere in the base python path on NERSC? I usually get around that by creating a custom conda env from scratch, which seems to resolve that issue.
On the topic of NERSC-specific instructions and creating a custom conda environment: getting mpi4py and parallel hdf5 writing set up correctly at NERSC can be a bit of a pain. This may already be included somewhere in the RAIL docs, but in case it's not, there's a NERSC page addressing this: https://docs.nersc.gov/development/languages/python/parallel-python/
I've gotten things working by installing both mpi4py and h5py from source with the following procedure (I'm going to copy/paste from a slack message I sent to Josue a while back):
Following the directions on that Parallel Python page, I could not get the pre-built conda environments nersc-mpi4py or nersc-h5py to work correctly; either mpi4py or h5py would have problems. The solution that worked for me was to install both mpi4py and h5py myself in a new conda environment, following the instructions for that on the NERSC webpage. Here's the rough procedure for how I put together an environment to run rail_tpz in parallel two weeks ago:

1. `module load python` — loads the module with a base conda (skip if you have a local conda at NERSC)
2. `conda create -n [envname] python=3.10 numpy scipy`
3. `conda activate [envname]`
4. `module swap PrgEnv-${PE_ENV,,} PrgEnv-gnu` — makes sure that the PrgEnv-gnu module is loaded rather than the other one
5. `MPICC="cc -shared" pip install --force-reinstall --no-cache-dir --no-binary=mpi4py mpi4py`
6. `module load cray-hdf5-parallel`
7. `conda install -c defaults --override-channels numpy "cython<3"`
8. `HDF5_MPI=ON CC=cc pip install -v --force-reinstall --no-cache-dir --no-binary=h5py --no-build-isolation --no-deps h5py`
9. `git clone https://github.com/LSSTDESC/rail_tpz.git`
10. `pip install -e .` — in the rail_tpz directory

We could probably set up a conda environment with steps 1-8 somewhere that users can clone to make things easier. That is essentially what the pre-built nersc-h5py and nersc-mpi4py environments are, though I could not get those to work for me.
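A side note on the `module swap PrgEnv-${PE_ENV,,} PrgEnv-gnu` step, since the `${PE_ENV,,}` syntax trips people up: it is plain bash lowercase parameter expansion, so the command swaps out whichever `PrgEnv-*` module happens to be loaded without hard-coding its name. A minimal sketch (the `NVIDIA` value here is only an example; on NERSC, `PE_ENV` is set by the loaded PrgEnv module):

```shell
# ${PE_ENV,,} is bash lowercase expansion: PE_ENV holds the name of the
# currently loaded programming environment in upper case (GNU, NVIDIA, ...),
# so the expansion names the matching PrgEnv-* module.
PE_ENV=NVIDIA   # example value; on NERSC this is set by the PrgEnv module
echo "module swap PrgEnv-${PE_ENV,,} PrgEnv-gnu"
# prints: module swap PrgEnv-nvidia PrgEnv-gnu
```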
I will try this on NERSC, and if it works out we can put this into the documentation and close this issue.
Coincidentally, I just ran the above set of instructions again today to set up a fresh environment to re-train a rail_tpz model, and things worked fine: I submitted a job to the debug queue using 5 processors and everything worked as intended.
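For reference, a debug-queue job like the one described above might look roughly like the sketch below. This is not from the original thread: the account name, time limit, node constraint, and script name are all placeholders you would need to fill in for your own allocation and pipeline.

```shell
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --nodes=1
#SBATCH --ntasks=5              # matches the 5 processors mentioned above
#SBATCH --constraint=cpu        # placeholder: pick the right partition
#SBATCH --time=00:30:00         # placeholder time limit
#SBATCH --account=mXXXX         # placeholder: your NERSC allocation

# Recreate the environment the install procedure set up
module load python
module load cray-hdf5-parallel
conda activate [envname]

# Placeholder script name: whatever drives your rail_tpz stage
srun -n 5 python run_tpz_stage.py
```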
Oh, and I missed a (hopefully obvious) step in the above instructions: after creating the environment, activate it with `conda activate [envname]`.
I can follow Sam's guide to run the tpz notebook on NERSC; we should include this in the installation documentation (of rail_tpz?).
I think this is more general than rail_tpz; I follow the same procedure if I want to run rail_flexzboost in parallel at NERSC, for example. Not sure where the best place for this would be.
Notes from the meeting:
There are two items from #51 that will not necessarily be addressed for v1, but we may still want to include:
- `ceci` path issue