geoschem / gchp_legacy

Repository for GEOS-Chem High Performance: software that enables running GEOS-Chem on a cubed-sphere grid with MPI parallelization.
http://wiki.geos-chem.org/GEOS-Chem_HP

[BUG/ISSUE] GCHP c48 runs on AWS within Docker container die within 1 hour #14

Closed · yantosca closed this issue 5 years ago

yantosca commented 5 years ago

I ran a GCHP c48 simulation on the AWS cloud using

AMI         : container_geoschem_tutorial_2018121
Machine     : r4.4xlarge
Diagnostics : SpeciesConc_avg and SpeciesConc_inst 

and it died after an hour.

In runConfig.sh:

# Make sure your settings here match the resources you request on your
# cluster in your run script!!!
NUM_NODES=1
NUM_CORES_PER_NODE=12
NY=12
NX=1

# MAPL shared memory option (0: off, 1: on). Keep off unless you know what
# you are doing. Contact GCST for more information if you have memory
# problems you are unable to fix.
USE_SHMEM=0

#------------------------------------------------
#   Internal Cubed Sphere Resolution
#------------------------------------------------
CS_RES=48    # 24 ~ 4x5, 48 ~ 2x2.5, 90 ~ 1x1.25, 180 ~ 1/2 deg, 360 ~ 1/4 deg

...
Start_Time="20160701 000000"
End_Time="20160701 010000"
Duration="00000000 010000"
...
common_freq="010000"
common_dur="010000"
common_mode="'time-averaged'"
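
For reference, the core layout in runConfig.sh has to be internally consistent: NX x NY must equal NUM_NODES x NUM_CORES_PER_NODE, and NY must be a multiple of 6 (one factor per cubed-sphere face). A minimal sanity-check sketch, using the values above:

# Sanity-check the runConfig.sh core layout (values copied from above).
NUM_NODES=1; NUM_CORES_PER_NODE=12; NX=1; NY=12
[ $((NX * NY)) -eq $((NUM_NODES * NUM_CORES_PER_NODE)) ] || echo "ERROR: NX*NY must equal the total core count"
[ $((NY % 6)) -eq 0 ] || echo "ERROR: NY must be a multiple of 6 (six cube faces)"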

The Docker commands were:

docker pull geoschem/gchp_model
docker run --rm -it -v $HOME/ExtData:/ExtData -v $HOME/OutputDir:/OutputDir geoschem/gchp_model
mpirun -np 12 -oversubscribe --allow-run-as-root ./geos | tee gchp.log.c48
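
Note that the ??? frames in the backtraces below are what you get without debug symbols. A rough sketch for getting a symbolized trace instead (core-file naming inside Docker depends on the host's core_pattern setting, so treat this as an assumption):

# Enable core dumps before launching; after a crash, read the stack with gdb.
# Core file location/name depends on /proc/sys/kernel/core_pattern on the host.
ulimit -c unlimited
mpirun -np 12 -oversubscribe --allow-run-as-root ./geos | tee gchp.log.c48
gdb ./geos core -ex bt -ex quit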

Tail end of log file:

 AGCM Date: 2016/07/01  Time: 00:10:00

Program received signal SIGSEGV: Segmentation fault - invalid memory reference.

Backtrace for this error:
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 0 with PID 0 on node 22262d174fea exited on signal 11 (Segmentation fault).
--------------------------------------------------------------------------

I then commented out SpeciesConc_avg in the HISTORY.rc file and re-ran, so that the only active diagnostic was SpeciesConc_inst. This run also died at the 1-hour mark:

AGCM Date: 2016/07/01  Time: 01:00:00

 Writing:  11592 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.SpeciesConc_inst.20160701_0100z.nc4
free(): invalid next size (normal)

Program received signal SIGABRT: Process abort signal.
Backtrace for this error:
#0  0x7efd1c1dd2da in ???
#1  0x7efd1c1dc503 in ???
..etc..

Times for GIGCenv
TOTAL                   :       0.726
INITIALIZE              :       0.000
RUN                     :       0.723
...etc...
HEMCO::Finalize... OK.
Chem::State_Diag Finalize... OK.
Chem::State_Chm Finalize... OK.
Chem::State_Met Finalize... OK.
Chem::Input_Opt Finalize... OK.
 Using parallel NetCDF for file: gcchem_internal_checkpoint_c48.nc
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun noticed that process rank 6 with PID 0 on node 22262d174fea exited on signal 6 (Aborted).
--------------------------------------------------------------------------

This message:

free(): invalid next size (normal)

may indicate an out-of-bounds write, perhaps where we deallocate arrays (or fields of the State_* objects). glibc only reports the corrupted heap metadata at the moment free() is called, which can be long after the actual overwrite.
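
One way to confirm an out-of-bounds write would be to rebuild with run-time bounds checking, so the bad access aborts at the offending source line instead of corrupting the heap. A sketch, assuming the legacy GNU Make build honors the same DEBUG/BOUNDS/TRACEBACK switches as GEOS-Chem Classic (the target and switch names here are assumptions):

# Rebuild with gfortran array bounds checking, then re-run the same case.
# NOTE: make target and switch names are assumed; adjust for your Makefile.
cd /tutorial/gchp_standard/CodeDir
make clean
make -j4 COMPILER=gfortran DEBUG=y BOUNDS=y TRACEBACK=y
cd /tutorial/gchp_standard
mpirun -np 12 -oversubscribe --allow-run-as-root ./geos | tee gchp.bounds.log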

JiaweiZhuang commented 5 years ago

Do you get the same issue when running natively on the AMI, outside the container?

JiaweiZhuang commented 5 years ago

Stop the instance, change its type to r5.24xlarge, restart, and run again. If it still dies, then it is definitely not an inadequate-memory problem...
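
You can also watch memory directly instead of inferring it from the instance size. A minimal sketch, run in a second shell inside the container while the model is going (assumes free(1) is present in the image):

# Log memory once a minute; if "available" drops to ~0 just before the
# crash, it really is a memory problem.
while sleep 60; do
    date
    free -g | grep -E 'Mem|Swap'
done | tee mem.log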

yantosca commented 5 years ago

So I ran again in the container on an r5.24xlarge instance, and now I get this error:

 AGCM Date: 2016/07/01  Time: 00:10:00
At line 2731 of file /tutorial/gchp_standard/CodeDir/GCHP/ESMF/src/Superstructure/State/src/ESMF_StateAPI.F90
Fortran runtime error: End of record

Error termination. Backtrace:
#0  0x7f657849c2da in ???
#1  0x7f657849cec5 in ???
#2  0x7f657849d68d in ???

... etc...

Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[28387,1],0]
  Exit code:    2
--------------------------------------------------------------------------

So it would appear to be an issue internal to MAPL. Or I might have run out of disk space, though I requested 500 GB.
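
Disk space is easy to rule out from inside the container:

# Check the mounted volumes; a full /OutputDir would explain a failed write
# even on an instance with 500 GB of storage requested.
df -h /OutputDir /ExtData /tutorial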

Also, I had run at c48 in the AMI itself earlier and had similar crashes to the ones in the container.

JiaweiZhuang commented 5 years ago

That's a new message, though:

> At line 2731 of file /tutorial/gchp_standard/CodeDir/GCHP/ESMF/src/Superstructure/State/src/ESMF_StateAPI.F90
> Fortran runtime error: End of record

Haven't ever seen this before...

yantosca commented 5 years ago

There are some references to this issue.

https://gcc.gnu.org/bugzilla/show_bug.cgi?id=20257
https://stackoverflow.com/questions/29489388/end-of-record-error-when-saving-a-variable
https://stackoverflow.com/questions/32684816/end-of-record-error-in-file-opening

It was a bug in gfortran that was supposedly fixed in version 4.1. But who knows...
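
Checking the compiler in the container should settle whether that old bug could even apply:

# The bugzilla entry is against ancient gfortran; the container's compiler
# should be far newer than 4.1.
gfortran --version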

yantosca commented 5 years ago

This issue seems to have been caused by an out-of-bounds error in the Olson landmap module, as described in https://github.com/geoschem/gchp/issues/13#issuecomment-449134471

JiaweiZhuang commented 5 years ago

Interesting! Why is it not happening at c24? 🤔 Can c48 run on AWS now?

yantosca commented 5 years ago

So what appears to be happening is that the Olson landmap is not getting read in properly. The failure is in the code where State_Met%LandTypeFrac is populated from the OLSON pointers imported from ExtData. I am not sure why this is happening, but it may be a MAPL issue: the OLSON data are read by the custom code in MAPL that computes the fraction of each grid box covered by a given land type (the "F:int" feature).

So while you can run on the cloud with the quick fix, I would avoid doing that until we understand the root cause of why State_Met%LandTypeFrac is all zeros.
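
A quick way to probe the read-side hypothesis, assuming the OLSON pointers are declared in the run directory's ExtData.rc (OLSON_FILE below is a hypothetical placeholder for whatever path that entry names):

# Find which file the OLSON entries point to, then confirm it opens as
# valid netCDF. Substitute the path printed by grep for OLSON_FILE.
grep -i olson ExtData.rc
ncdump -h "$OLSON_FILE" | head -n 20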

yantosca commented 5 years ago

I am closing this thread because the root cause is #15. Fixing #15 will fix this issue.