desihub / specex

DESI spectrograph PSF fitting
BSD 3-Clause "New" or "Revised" License

specex segfaults on KNL #63

Closed sbailey closed 2 years ago

sbailey commented 2 years ago

We're getting specex segfaults on KNL again, though they appear to occur only after the code has run successfully and finished writing its outputs. We saw something like this before, but I don't recall the solution.

Steps to reproduce:

source $CFS/desi/software/desi_environment.sh main

export DESI_SPECTRO_REDUX=$CFS/desi/users/$USER/spectro/redux
export SPECPROD=knlarc
mkdir -p $DESI_SPECTRO_REDUX/$SPECPROD

# works fine on haswell
desi_proc --batch -n 20220401 -e 128284 --cameras a0 --system-name cori-haswell

# will report segfaults in
# $DESI_SPECTRO_REDUX/$SPECPROD/run/scripts/night/20220401/arc-20220401-00128285*.log and have a non-zero return code from the job
desi_proc --batch -n 20220401 -e 128285 --cameras a0 --system-name cori-knl

Example output from /global/cfs/cdirs/desi/users/sjbailey/spectro/redux/knlarc/run/scripts/night/20220401/arc-20220401-00128285-a0-57991462.log :

...
INFO:proc.py:1340:main: All done at Tue Apr 19 11:35:33 2022; duration 10m22s
srun: error: nid02612: tasks 3,19-20: Segmentation fault
srun: launch/slurm: _step_signal: Terminating StepId=57991462.0
srun: error: nid02613: task 28: Segmentation fault
slurmstepd: error: *** STEP 57991462.0 ON nid02612 CANCELLED AT 2022-04-19T11:35:35 ***
srun: error: nid02612: tasks 1-2,4-18: Segmentation fault
srun: error: nid02613: tasks 21-27,29-40: Segmentation fault
FAILED: done at Tue Apr 19 11:35:36 PDT 2022
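The pattern in this log (segfaults reported only after the "All done" line) can be checked mechanically across job logs. The following is a minimal sketch, not part of desispec or specex: `check_log` is a hypothetical helper name, and it assumes the log contains the literal strings "All done at" and "Segmentation fault" as in the excerpt above.

```shell
# check_log LOGFILE
# Hypothetical helper: classify a job log by whether a segfault is
# reported, and whether it appears after the final "All done at" line.
check_log() {
    log="$1"
    # Line number of the last "All done at" message (empty if absent)
    done_line=$(grep -n 'All done at' "$log" | tail -n 1 | cut -d: -f1)
    # Line number of the first reported segfault (empty if absent)
    segv_line=$(grep -n 'Segmentation fault' "$log" | head -n 1 | cut -d: -f1)

    if [ -z "$segv_line" ]; then
        echo "no-segfault"
    elif [ -n "$done_line" ] && [ "$segv_line" -gt "$done_line" ]; then
        echo "segfault-after-completion"
    else
        echo "segfault-before-completion"
    fi
}
```

For example, `check_log arc-20220401-00128285-a0-57991462.log` run in the log directory would be expected to print `segfault-after-completion` for the log excerpted above.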

I don't know when KNL started breaking again, but I'm guessing it was due to the recent Cori OS upgrade.

@marcelo-alvarez please investigate and fix (perhaps with something as simple as a recompile...)

It is critical to fix this before the next run (launching before the end of April) so that we can run arcs on KNL.

tskisner commented 2 years ago

Hi folks, there is a known issue (segfault) on KNL with MKL that we have experienced in our CMB tools. It seems to be triggered by compiled code that uses OpenMP and also links to MKL. I will add @sbailey and @marcelo-alvarez to the CC list for the NERSC ticket (it is still open). In the meantime, I attach my minimal working example that fails, so that you can check whether it is similar to the failure you are seeing:

knl_segfault.tar.gz

marcelo-alvarez commented 2 years ago

A simple workaround is setting

MKL_FAST_MEMORY_LIMIT=0

before running desi_proc on KNL. I have verified that this fixes the segmentation fault problem and have modified the desiconda/20211217-2.0.0 module file on Cori to set this for everyone who loads that module. I plan to update desiconda so that future installations on Cori set this automatically.
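For anyone using an environment that does not yet set the variable, the workaround can be applied by hand in the submitting shell. This is a minimal sketch: it assumes the batch job inherits the submitting shell's environment (Slurm's default behavior), and the commented `desi_proc` call is copied from the reproduction steps above.

```shell
# Workaround from this thread: make MKL use regular system memory
# allocation instead of memkind/MCDRAM fast memory on KNL.
export MKL_FAST_MEMORY_LIMIT=0

# Confirm that child processes see the setting.
sh -c 'echo "MKL_FAST_MEMORY_LIMIT=${MKL_FAST_MEMORY_LIMIT:-unset}"'

# Then rerun the failing step from the original report, e.g.:
# desi_proc --batch -n 20220401 -e 128285 --cameras a0 --system-name cori-knl
```

If the job script is launched with a restricted environment, the `export` line would instead need to go in the job script itself, before the `srun` step.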

@tskisner: Thanks for including us in the NERSC ticket and sharing your example; setting MKL_FAST_MEMORY_LIMIT=0 before running knl_segfault fixes the segmentation fault for me on KNL.

My guess is that threaded MKL allocates memory in a way that is inconsistent with the memkind library for MCDRAM on KNL (the error does not seem to occur when using sequential MKL). Setting MKL_FAST_MEMORY_LIMIT=0 makes MKL use the regular system memory allocation routines instead of memkind, which is likely to prevent these kinds of memkind-related segmentation faults on Cori KNL in general.

sbailey commented 2 years ago

Thanks! I confirm that it works for me too. Closing ticket.