LSSTDESC / rail

Top level "umbrella" package for RAIL

NERSC-specific installation documentation #122

Open OliviaLynn opened 6 months ago

OliviaLynn commented 6 months ago

There are two items from #51 that will not necessarily be addressed for v1, but we may still want to include:

sschmidt23 commented 6 months ago

What is the "ceci path issue" referring to? Is it the issue that there is an old copy of ceci somewhere in the base python path on NERSC? I usually get around that by creating a custom conda env from scratch, which seems to resolve that issue.

As for NERSC-specific instructions and creating a custom conda environment: getting mpi4py and parallel HDF5 writing set up correctly at NERSC can be a bit of a pain. This may already be covered somewhere in the RAIL docs, but in case it's not, there's a NERSC page addressing it: https://docs.nersc.gov/development/languages/python/parallel-python/
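Once an environment like the one described below is set up, a couple of quick sanity checks (just a sketch, run with the new environment activated) can confirm that MPI support actually made it into both builds:

```bash
# Both commands should report MPI support if the builds worked.
python -c "from mpi4py import MPI; print(MPI.Get_library_version())"
python -c "import h5py; print('h5py built with parallel HDF5:', h5py.get_config().mpi)"
```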

I've gotten things working by installing both mpi4py and h5py from source with the following procedure (I'm going to copy/paste from a Slack message I sent to Josue a while back):

Following the directions on that Parallel Python page, I could not get the pre-built conda environments nersc-mpi4py or nersc-h5py to work correctly; either mpi4py or h5py would have problems. The solution that worked for me was to install both mpi4py and h5py myself in a new conda environment, following the instructions for that on the NERSC webpage. Here's the rough procedure for how I put together an environment to run rail_tpz in parallel two weeks ago:

  1. Log in to NERSC.
  2. Do a `module load python` to load the module with a base conda (skip if you have a local conda at NERSC).
  3. Run `conda create -n [envname] python=3.10 numpy scipy`.
  4. Run `conda activate [envname]`.
  5. Do a `module swap PrgEnv-${PE_ENV,,} PrgEnv-gnu` to make sure that the PrgEnv-gnu module is loaded rather than the default one.
  6. Install mpi4py with `MPICC="cc -shared" pip install --force-reinstall --no-cache-dir --no-binary=mpi4py mpi4py`.
  7. Load the parallel HDF5 module at NERSC with `module load cray-hdf5-parallel`.
  8. Install the h5py build dependencies with `conda install -c defaults --override-channels numpy "cython<3"`.
  9. Install h5py with `HDF5_MPI=ON CC=cc pip install -v --force-reinstall --no-cache-dir --no-binary=h5py --no-build-isolation --no-deps h5py`.
  10. Clone whatever rail package you need, e.g. `git clone https://github.com/LSSTDESC/rail_tpz.git`.
  11. Install rail_tpz (or whichever package) with `pip install -e .` in the rail_tpz directory.
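For convenience, the same commands from steps 2-11 collected into a single copy-pasteable block (the environment name and Python version are just the examples used above, not requirements):

```bash
module load python                                   # step 2 (skip if you have your own conda)
conda create -n [envname] python=3.10 numpy scipy    # step 3
conda activate [envname]                             # step 4
module swap PrgEnv-${PE_ENV,,} PrgEnv-gnu            # step 5
MPICC="cc -shared" pip install --force-reinstall --no-cache-dir --no-binary=mpi4py mpi4py    # step 6
module load cray-hdf5-parallel                       # step 7
conda install -c defaults --override-channels numpy "cython<3"                               # step 8
HDF5_MPI=ON CC=cc pip install -v --force-reinstall --no-cache-dir --no-binary=h5py --no-build-isolation --no-deps h5py   # step 9
git clone https://github.com/LSSTDESC/rail_tpz.git   # step 10
cd rail_tpz && pip install -e .                      # step 11
```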

We could probably set up a conda environment with steps 1-8 somewhere that users can clone to make things easier. That is essentially what the pre-built nersc-h5py and nersc-mpi4py environments are, though I could not get those to work for me.
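If such a shared environment were set up, cloning it would be a one-liner along these lines (the path below is purely hypothetical; no shared environment exists yet):

```bash
module load python
# /global/common/software/... is a hypothetical location for a pre-built shared environment
conda create -n my-rail-env --clone /global/common/software/lsst/rail-parallel-env
conda activate my-rail-env
```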

ztq1996 commented 5 months ago

I will try this on NERSC, and if it works out we can put it into the documentation and close this issue.

sschmidt23 commented 5 months ago

Coincidentally, I just went through the above set of instructions again today to set up a fresh environment to re-train a rail_tpz model, and things worked fine. I submitted a job to the debug queue using 5 processors and everything worked as intended.

Oh, and I had initially missed the (hopefully obvious) `conda activate [envname]` step, which is now included above as step 4.
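For reference, a debug-queue job like the one described above could be submitted with a batch script roughly like the following (the account, environment name, and training script name are placeholders, and the exact Slurm flags should be checked against the current NERSC docs):

```bash
#!/bin/bash
#SBATCH --qos=debug
#SBATCH --constraint=cpu
#SBATCH --nodes=1
#SBATCH --time=00:30:00
#SBATCH --account=<your_account>

module load python
module load cray-hdf5-parallel
source activate [envname]   # or conda activate, depending on your shell setup

# 5 MPI ranks, matching the run described above; replace the script with your own
srun -n 5 python run_tpz_training.py
```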

ztq1996 commented 5 months ago

I was able to follow Sam's guide and run the tpz notebook on NERSC; we should include this in the installation documentation (of rail_tpz?).

sschmidt23 commented 5 months ago

I think this is more general than rail_tpz; I follow the same procedure if I want to run rail_flexzboost in parallel at NERSC, for example. I'm not sure where the best place for this documentation would be.

OliviaLynn commented 5 months ago

Notes from the meeting: