yantosca closed this issue 5 years ago
How large is the core file?
This issue happens when running GCHP both on the Amazon cloud and on Odyssey.
On Amazon, there is no core file, just the error message. The output says: "You can avoid this message by specifying -quiet on the mpirun command line."
On Odyssey, where we run GCHP under SLURM, the core file is about 4 GB:
-rw------- 1 ryantosca jacob_lab 4079263744 2018-12-14 16:16 core.41383
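For anyone comparing core file sizes, here is a minimal sketch that converts the byte count from the listing above into GiB and GB. The helper names `bytes_to_gib` and `bytes_to_gb` are made up for illustration and are not part of GCHP or SLURM.

```shell
# Hypothetical helpers: convert a byte count to GiB (2^30 bytes) or GB (10^9 bytes).
bytes_to_gib() { awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1073741824 }'; }
bytes_to_gb()  { awk -v b="$1" 'BEGIN { printf "%.2f\n", b / 1000000000 }'; }

# Byte count taken from the core.41383 listing above.
bytes_to_gib 4079263744   # prints 3.80
bytes_to_gb  4079263744   # prints 4.08
```

So "4 GB" here means roughly 4.08 GB in decimal units, or about 3.80 GiB.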
On Odyssey, in the slurm-62418670.out file, we get this error output:
*** Error in `/n/regal/jacob_lab/ryantosca/GCHP/gfortran82/gchp_standard/./geos': free(): invalid size: 0x000000001ba65db0 ***
then a lot of hexadecimal addresses, then
Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0 0x2ab12877426f in ???
#1 0x2ab1287741f7 in ???
.. etc ...
srun: error: holy2a05312: task 0: Aborted (core dumped)
slurmstepd: error: holy2a05312 [0] pmixp_client_v2.c:203 [_errhandler] mpi/pmix: ERROR: Error handler invoked: status = -25: Interrupted system call (4)
srun: error: holy2a05312: tasks 1-5: Exited with exit code 244
real 9m42.507s
user 0m0.123s
sys 0m0.471s
On Odyssey, if I use --mpi=pmi2 instead of --mpi=pmix, the "free(): invalid size" error goes away, but the run still crashes. The stderr output is just this:
Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
Backtrace for this error:
#0 0x2ac3527fb26f in ???
#1 0x2ac354a8cd36 in ???
#2 0x2ac354a8de57 in ???
...etc...
srun: error: holy2a05312: task 0: Segmentation fault (core dumped)
real 7m39.491s
user 0m0.118s
sys 0m0.410s
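The two Odyssey runs above differ only in the MPI plugin passed to srun. Below is a minimal sketch of that comparison. It only prints the launch lines rather than running them. The helper name `launch_cmd` is hypothetical, and the task count of 6 is an assumption inferred from "task 0" plus "tasks 1-5" in the log. The actual run script may pass other options.

```shell
# Hypothetical helper: print (do not run) the srun launch line for a given MPI plugin.
launch_cmd() {
    plugin="$1"          # pmix or pmi2, the two plugins compared above
    ntasks="${2:-6}"     # assumed: 6 tasks, inferred from the log output
    echo "srun -n ${ntasks} --mpi=${plugin} ./geos"
}

launch_cmd pmix   # variant that ended in "free(): invalid size" and SIGABRT
launch_cmd pmi2   # variant that ended in SIGSEGV
```

Running `srun --mpi=list` shows which MPI plugins the local SLURM build supports, which may help when checking whether a SLURM update changed the available plugins.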
I am closing this issue because the root cause appears to be #15. Fixing #15 will also fix this issue.
GCHP 12.1.1 runs normally and prints all of its timing information, but it nevertheless drops a core file at the end of the run. This might be particular to the libraries on our Odyssey cluster.
We are using
Not a huge deal, but if anyone has seen the same issue with similar libraries, please let us know. I have a hunch this was caused by a local update to SLURM.