Closed by sbailey 2 years ago
Hi folks, there is a known issue (segfault) on KNL with MKL that we have experienced in our CMB tools. It seems to be triggered by compiled code that uses OpenMP and also links to MKL. I will add @sbailey and @marcelo-alvarez to the CC list for that ticket (it is still open). In the meantime, I am also attaching my minimal example that reproduces the failure, so that you can see whether it is similar to the failure you are seeing:
A simple workaround is setting MKL_FAST_MEMORY_LIMIT=0 before running desi_proc on KNL. I have verified that this fixes the segmentation fault and have modified the desiconda/20211217-2.0.0 module file on Cori to set this generally. I plan to update desiconda so that future installations on Cori set this automatically.
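For anyone who wants to try the workaround on a single command without touching the module file, here is a minimal sketch. The function name run_with_mkl_workaround is hypothetical (it is not part of desiconda or desispec); it only uses the standard env utility to set MKL_FAST_MEMORY_LIMIT=0 for one child process.

```shell
# Hypothetical helper (not part of desiconda): run one command with
# MKL_FAST_MEMORY_LIMIT=0 set only in that command's environment,
# leaving the calling shell's environment unchanged.
run_with_mkl_workaround() {
    env MKL_FAST_MEMORY_LIMIT=0 "$@"
}

# Quick check that the variable reaches the child process.
run_with_mkl_workaround sh -c 'echo "$MKL_FAST_MEMORY_LIMIT"'   # prints 0
```

The same wrapper can be put in front of desi_proc or knl_segfault with whatever arguments you normally pass. To apply the setting to a whole session instead, export the variable once before running anything.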
@tskisner: Thanks for including us in the NERSC ticket and sharing your example; setting MKL_FAST_MEMORY_LIMIT=0 before running knl_segfault fixes the segmentation fault for me on KNL.
My guess is that there is an inconsistency between how threaded MKL allocates memory (the error does not seem to occur when using sequential MKL) and the memkind library for MCDRAM on KNL, see here. Setting MKL_FAST_MEMORY_LIMIT=0 results in regular system memory allocation routines being used instead of memkind, and is likely to prevent these kinds of memkind-related segmentation faults on Cori KNL in general.
Thanks! I confirm that it works for me too. Closing ticket.
We're getting specex segfaults on KNL again, though it appears that they only occur after the code has run successfully and finished writing its outputs. We saw something like this before, but I don't recall the solution.
Steps to reproduce:
Example output from /global/cfs/cdirs/desi/users/sjbailey/spectro/redux/knlarc/run/scripts/night/20220401/arc-20220401-00128285-a0-57991462.log :
I don't know when KNL started breaking again, but I'm guessing it was due to the recent Cori OS upgrade.
@marcelo-alvarez please investigate and fix (perhaps with something as simple as a recompile...)
It is critical to fix this before the next run (launching before the end of April) so that we can run arcs on KNL.
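For triage, it may help to separate "segfault after the outputs were written" from "segfault before any output". Below is a minimal diagnostic sketch, not a fix. The function name check_late_segfault and its arguments are hypothetical; it relies only on the shell convention that a process killed by SIGSEGV (signal 11) is reported with exit status 139 (128 + 11).

```shell
# Hypothetical diagnostic helper: run a command and report whether it ended
# with exit status 139 (128 + SIGSEGV) and whether the expected output file
# exists and is non-empty at that point.
# Usage: check_late_segfault EXPECTED_OUTPUT_FILE COMMAND [ARGS...]
check_late_segfault() {
    expected_output=$1
    shift
    status=0
    "$@" || status=$?
    if [ "$status" -eq 139 ] && [ -s "$expected_output" ]; then
        echo "segfault after output was written: $expected_output"
    elif [ "$status" -eq 139 ]; then
        echo "segfault before output was written"
    else
        echo "exit status $status"
    fi
    return "$status"
}
```

The helper returns the original exit status, so a calling script still sees the failure; it only adds a one-line classification to the log.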