desihub / redrock

Redshift fitting for spectroperfectionism

mpi4py memory issue for healpix coadds #288

Open · araichoor opened 3 months ago

araichoor commented 3 months ago

I've been running the cross-tiles redux for the elg tertiary38 tiles, and I get an error in redrock for half (9/18) of the coadds.

one “specificity” of that rosette is that we have 25 tiles (as for other tertiary programs), but we requested only one obs. per target, so we have a lot more individual targetids than usual; I'd say e.g. a twice denser sample than in sv3 rosettes, which typically had ~13 tiles.

the error likely comes from the fact that those 9 coadds have too many rows:

comment from @sbailey: "mpi4py has a limit of 2 GB for the largest thing it can pass between processes due to a limitation of the underlying python pickle module. I don't know what the exact threshold is in terms of number of targets, but basically your input files are too big, and need to move to smaller healpix with fewer targets, or otherwise to more nodes with more MPI ranks so that there is less data per rank."
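[editor's aside, not from the thread: the call that fails is the results gather in zfind, visible in the traceback below. as a minimal sketch of one generic workaround (not redrock's actual code; chunked_gather and chunk_size are hypothetical names), one can gather a large list in fixed-size chunks so that no single pickled message approaches the 2 GB limit; newer mpi4py versions also ship mpi4py.util.pkl5 for pickling large objects with protocol 5:]

from mpi4py import MPI

def chunked_gather(comm, results, root=0, chunk_size=1000):
    # all ranks must join the same number of gather rounds, so first
    # agree on the maximum number of chunks held by any rank
    nchunks = comm.allreduce((len(results) + chunk_size - 1) // chunk_size,
                             op=MPI.MAX)
    perrank = [[] for _ in range(comm.size)] if comm.rank == root else None
    for i in range(nchunks):
        chunk = results[i * chunk_size:(i + 1) * chunk_size]
        pieces = comm.gather(chunk, root=root)  # each message stays small
        if comm.rank == root:
            for r, p in enumerate(pieces):
                perrank[r].extend(p)
    # at root: a list of per-rank result lists, same shape as comm.gather
    return perrank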

example of one of my calls from an interactive node (note: command edited from the orig. message to add the -v0 suffix):

srun -N 1 -n 32 -c 2 -t 01:00:00 rrdesi_mpi -i /pscratch/sd/r/raichoor/daily/healpix/tertiary38-thru20240313-v0/coadd-27344.fits -o /pscratch/sd/r/raichoor/daily/healpix/tertiary38-thru20240313-v0/redrock-27344.fits --details /pscratch/sd/r/raichoor/daily/healpix/tertiary38-thru20240313-v0/rrdetails-27344.h5 &> /pscratch/sd/r/raichoor/daily/healpix/tertiary38-thru20240313-v0/logs/redrock-27344.log

and the reported error:

[...]
Finding best redshift: 182.2 seconds
--- Process 0 raised an exception ---
Proc 0: Traceback (most recent call last):
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20230111-2.1.0/code/redrock/main/py/redrock/external/desi.py", line 916, in rrdesi
    scandata, zfit = zfind(targets, dtemplates, mpprocs,
Proc 0:   File "/global/common/software/desi/perlmutter/desiconda/20230111-2.1.0/code/redrock/main/py/redrock/zfind.py", line 405, in zfind
    results = targets.comm.gather(results, root=0)
Proc 0:   File "mpi4py/MPI/Comm.pyx", line 1578, in mpi4py.MPI.Comm.gather
Proc 0:   File "mpi4py/MPI/msgpickle.pxi", line 773, in mpi4py.MPI.PyMPI_gather
Proc 0:   File "mpi4py/MPI/msgpickle.pxi", line 778, in mpi4py.MPI.PyMPI_gather
Proc 0:   File "mpi4py/MPI/msgpickle.pxi", line 191, in mpi4py.MPI.pickle_allocv
Proc 0:   File "mpi4py/MPI/msgpickle.pxi", line 182, in mpi4py.MPI.pickle_alloc
Proc 0: SystemError: Negative size passed to PyBytes_FromStringAndSize

MPICH Notice [Rank 0] [job id 22938720.0] [Thu Mar 14 15:23:21 2024] [nid005169] - Abort(0) (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0

aborting job:
application called MPI_Abort(MPI_COMM_WORLD, 0) - process 0
srun: error: nid005169: task 0: Exited with exit code 255
srun: Terminating StepId=22938720.0
slurmstepd: error: *** STEP 22938720.0 ON nid005169 CANCELLED AT 2024-03-14T22:23:22 ***
araichoor commented 3 months ago

from quickly looking at the code:

araichoor commented 3 months ago

for info, as I'll be working on this, I've moved:

/pscratch/sd/r/raichoor/daily/healpix/tertiary38-thru20240313

to:

/pscratch/sd/r/raichoor/daily/healpix/tertiary38-thru20240313-v0

I've edited the example call to rrdesi_mpi in the original message.

moustakas commented 3 months ago

@araichoor have you tried running it single-threaded, with no parallelism? It'll need more time, but in that case you'll have access to all the memory on the (CPU or GPU) node and won't hit the 2GB pickling limit.

araichoor commented 3 months ago

"have you tried running it single-threaded, with no parallelism? " => no, I didn t. I m happy to test; but could you provide the correct srun invocation?

for what it's worth: last night I ran with nside=128 (instead of nside=64), and it worked; the largest number of spectra I then have per pixel is 5901.
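[editor's aside, a quick illustrative check assuming healpy is available: each doubling of nside quarters the pixel area, so a nside=128 coadd holds roughly a quarter of the targets of a nside=64 one:]

import healpy as hp

for nside in (64, 128):
    npix = hp.nside2npix(nside)  # 12 * nside**2 pixels over the full sky
    area = hp.nside2pixarea(nside, degrees=True)
    print(f"nside={nside}: {npix} pixels, {area:.3f} deg2 per pixel")
# nside=64:  49152 pixels, 0.839 deg2 per pixel
# nside=128: 196608 pixels, 0.210 deg2 per pixel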

moustakas commented 2 months ago

I propose we close this ticket as resolved. The 2GB pickling limit is baked into both multiprocessing and MPI, so reducing the memory footprint of the input data (as @araichoor has done) is the way to go.

@araichoor to run "single-threaded" just don't use any parallelism, e.g.,

srun -N 1 -n 1 -c 2 -t 01:00:00 rrdesi_mpi -i ...

but that will be a lot slower, of course (and the node may still run out of memory depending on how many input spectra there are). Using larger-nside healpixels seems like a more reasonable choice, I think, for this specific problem.

araichoor commented 2 months ago

thanks.

for what I was working on, yes, using a larger-nside healpixel worked (nside=128 instead of the "default" nside=64).

however, for e.g. jura, I guess we will want to keep the same healpixel size for all cross-tile coadds. so, for what it's worth, I gave it a try in an interactive session (4h) with nside=64:

srun -N 1 -n 1 -c 2 -t 04:00:00 rrdesi_mpi -i ...

it looks to me like it should take ~9h to run the pixel with the most rows: is that problematic or not for a production? (maybe not if gpus are used; sorry, I'm ignorant here).

here are the numbers of rows per file (nside=64):

# HPXPIXEL NROW
27238 16
27246 876
27244 1945
27239 4037
27251 3178
27250 8349
27245 13016
27247 12501
27256 12034
27260 1669
27257 12157
27258 11457
27262 3772
27333 3205
27259 12983
27348 337
27345 8049
27344 10410

I get the following timings for the four pixels that ran in my 4h session:

# HPXPIXEL NROW SECONDS
27238 16    46.4
27246 876   2160.3
27244 1945  4764.6
27239 4037  9942.7

it roughly scales as SECONDS = 2.5 * NROW. this scaling gives ~9h for HPXPIXEL=27245, which has 13016 rows.
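[editor's aside, spelling out that check with the timings above; plain arithmetic, nothing redrock-specific:]

nrow = [16, 876, 1945, 4037]
secs = [46.4, 2160.3, 4764.6, 9942.7]
print([round(s / n, 2) for s, n in zip(secs, nrow)])
# [2.9, 2.47, 2.45, 2.46] -> ~2.5 s/row (the 16-row pixel is overhead-dominated)
print(2.5 * 13016 / 3600)
# 9.038... -> ~9h for HPXPIXEL=27245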