geoschem / gchp_legacy

Repository for GEOS-Chem High Performance: software that enables running GEOS-Chem on a cubed-sphere grid with MPI parallelization.
http://wiki.geos-chem.org/GEOS-Chem_HP

[BUG/ISSUE] GCHP finishes successfully but drops a core file at the end of the run #11

Closed yantosca closed 5 years ago

yantosca commented 5 years ago

GCHP 12.1.1 runs normally and prints out all timing information at the end, but nevertheless drops a core file at the end of the run. This might be particular to the libraries on our Odyssey cluster.

We are using

Not a huge deal, but if anyone has seen the same issue with similar libraries, please let us know. I have a hunch this was caused by a local update to SLURM.

JiaweiZhuang commented 5 years ago

How large is the core file?

yantosca commented 5 years ago

This issue happens when running GCHP both on the Amazon cloud and on Odyssey.

On Amazon, there is no core file, just the error message. The output says: "You can avoid this message by specifying -quiet on the mpirun command line."

On Odyssey (since we run GCHP in SLURM), the core file is 4 GB:

-rw------- 1 ryantosca jacob_lab 4079263744 2018-12-14 16:16 core.41383

On Odyssey, in the slurm-62418670.out file, we get this error output:

*** Error in `/n/regal/jacob_lab/ryantosca/GCHP/gfortran82/gchp_standard/./geos': free(): invalid size: 0x000000001ba65db0 ***
then a lot of hexadecimal addresses, then
Program received signal SIGABRT: Process abort signal.

Backtrace for this error:
#0  0x2ab12877426f in ???
#1  0x2ab1287741f7 in ???
.. etc ...
srun: error: holy2a05312: task 0: Aborted (core dumped)
slurmstepd: error: holy2a05312 [0] pmixp_client_v2.c:203 [_errhandler] mpi/pmix: ERROR: Error handler    invoked: status = -25: Interrupted system call (4)
srun: error: holy2a05312: tasks 1-5: Exited with exit code 244

real    9m42.507s
user    0m0.123s
sys     0m0.471s

yantosca commented 5 years ago

On Odyssey, if I use --mpi=pmi2 instead of --mpi=pmix, we no longer get the "free(): invalid size" error, but we still get this stderr output:

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
#0  0x2ac3527fb26f in ???
#1  0x2ac354a8cd36 in ???
#2  0x2ac354a8de57 in ???
...etc...
srun: error: holy2a05312: task 0: Segmentation fault (core dumped)

real    7m39.491s
user    0m0.118s
sys     0m0.410s
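
For anyone comparing their own runs against the two failure modes above, here is a minimal Python sketch that scans a SLURM output file for the crash signatures quoted in this thread. The names `SIGNATURES`, `scan_log`, and `scan_log_file` are hypothetical helpers, not part of GCHP or SLURM, and the patterns are copied only from the stderr excerpts shown here, so other library stacks may print different messages.

```python
import re

# Crash signatures copied from the stderr excerpts in this thread.
# These are assumptions about what to look for, not an exhaustive list.
SIGNATURES = {
    "heap_corruption": re.compile(r"free\(\): invalid size"),
    "sigabrt": re.compile(r"Program received signal SIGABRT"),
    "sigsegv": re.compile(r"Program received signal SIGSEGV"),
    "core_dumped": re.compile(r"\(core dumped\)"),
}


def scan_log(text):
    """Return the sorted names of all crash signatures found in text."""
    return sorted(
        name for name, pattern in SIGNATURES.items() if pattern.search(text)
    )


def scan_log_file(path):
    """Hypothetical wrapper: read a SLURM stdout/stderr file and scan it."""
    with open(path, errors="replace") as handle:
        return scan_log(handle.read())
```

Calling `scan_log_file("slurm-62418670.out")` on the --mpi=pmix run should flag the abort and the heap-corruption message, while the --mpi=pmi2 run should only show the segmentation fault, assuming the log files contain the lines quoted above.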

yantosca commented 5 years ago

I am closing this issue because the root cause appears to be #15. Fixing #15 will also fix this issue.