geoschem / gchp_legacy

Repository for GEOS-Chem High Performance: software that enables running GEOS-Chem on a cubed-sphere grid with MPI parallelization.
http://wiki.geos-chem.org/GEOS-Chem_HP

[BUG/ISSUE] HDF5 error due to too many open files #42

Closed JiaweiZhuang closed 4 years ago

JiaweiZhuang commented 5 years ago

Describe the bug

I got this HDF5 error 5 days into a C180 simulation on 288 cores:

 AGCM Date: 2016/07/06  Time: 00:00:00

 Writing:   6975 Slices (  3 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.Emissions.20160706_0000z.nc4
 Writing:  11736 Slices (  4 Nodes,  4 PartitionRoot) to File:  OutputDir/GCHP.SpeciesConc.20160706_0000z.nc4
 Writing:    439 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.StateMet_avg.20160706_0000z.nc4
 Writing:    439 Slices (  1 Nodes,  1 PartitionRoot) to File:  OutputDir/GCHP.StateMet_inst.20160706_0000z.nc4
There are 248958 HDF5 objects open!

Report: open objects on 72057594037930872
Type = File(72057594037927938) name='/'Type = File(72057594037927939) name='/'Type = File(72057594037927940) name='/'Type = File(72057594037927941) name='/'Type = File(72057594037927942) name='/'Type = File(72057594037927943) name='/'Type = File(72057594037927944) name='/'Type = File(72057594037927945) name='/'Type = File(72057594037927946) name='/'Type = File(72057594037927947) name='/'Type = File(72057594037927948) name='/'Type = File(72057594037927949) name='/'Type = File(72057594037927950) name='/'Type = File(72057594037927951) name='/'Type = File(72057594037927952) name='/'Type = File(72057594037927953) name='/'Type = File(72057594037927954) name='/'Type = File(72057594037927955) name='/'Type = File(7
...

Here's the complete log: run_c180_7days_N8n288_hdf5_error.log

Given that the error occurs after 5 days of simulation, I suspect that GCHP keeps opening new files without closing previous ones.
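
One rough way to test this suspicion is to watch how many file descriptors a process holds as the run advances. The sketch below is a generic Linux-only check that reads `/proc`. It is not part of GCHP or MAPL, and the helper names `open_fd_count` and `fd_headroom` are hypothetical. The HDF5 object count in the log is also not the same thing as the OS descriptor count, so this is only a proxy for the leak.

```python
import os
import resource


def open_fd_count(pid="self"):
    """Count open file descriptors of a process via /proc (Linux only).

    pid may be "self" or the numeric PID of a process owned by the same user.
    The directory listing itself briefly holds one descriptor, so compare
    counts taken the same way rather than reading them as absolute values.
    """
    return len(os.listdir(f"/proc/{pid}/fd"))


def fd_headroom():
    """Return (descriptors in use, soft limit) for the current process."""
    soft, _hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    return open_fd_count(), soft


used, soft = fd_headroom()
print(f"{used} descriptors open, soft limit {soft}")
```

If the count for a model process grows by a fixed amount at every diagnostics write and never drops, that would be consistent with files being opened but not closed.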

To Reproduce

Steps to reproduce the behavior:

  1. Same setup as #37
  2. Apply #20 to avoid writing huge checkpoints
  3. Run simulation with 8 c5n.18xlarge EC2 nodes. In runConfig.sh:
    NUM_NODES=8
    NUM_CORES_PER_NODE=36
    NY=48
    NX=6
  4. Use the default diagnostics, which contain 4 collections: Emissions, SpeciesConc, StateMet_avg, StateMet_inst. Change the output frequency to one write per day:
    common_freq="240000"
    common_dur="240000"
    common_mode="'instantaneous'"
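
The core layout in step 3 can be sanity-checked with a few lines of arithmetic. This is a minimal sketch, not GCHP code: the function `check_layout` is hypothetical, and the two rules it encodes (NX*NY equals the total core count, and NY is a multiple of 6 for the six cubed-sphere faces) are stated here as assumptions about what runConfig.sh expects.

```python
def check_layout(num_nodes, cores_per_node, nx, ny):
    """Return (total cores, list of problems) for a core layout.

    Assumed rules: NX * NY must equal the total number of cores, and
    NY must be divisible by 6 (one block of rows per cubed-sphere face).
    """
    total = num_nodes * cores_per_node
    problems = []
    if nx * ny != total:
        problems.append(f"NX*NY = {nx * ny}, but total cores = {total}")
    if ny % 6 != 0:
        problems.append(f"NY = {ny} is not a multiple of 6")
    return total, problems


# Values from the runConfig.sh settings in step 3.
total, problems = check_layout(num_nodes=8, cores_per_node=36, nx=6, ny=48)
print(total, problems)
```

With the values above, 8 nodes of 36 cores give the 288 cores mentioned in the report, and 6 x 48 matches that total.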


JiaweiZhuang commented 5 years ago

I am trying to fix this issue by increasing the maximum number of open files.

On CentOS, the default number is:

$ ulimit -n
10000

Changing the number with ulimit -n 16384 leads to a permission error. But you can edit /etc/security/limits.conf so that it contains

centos soft nofile 16384
centos hard nofile 16384

where centos is the user name. Log in again, and ulimit -n should show the new number.
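
For reference, the soft and hard limits can also be inspected from Python's standard `resource` module. The sketch below is only an illustration and the helper names are hypothetical. It reflects the general rule that an unprivileged process may raise its soft limit only up to the hard limit, which is presumably why ulimit -n 16384 was refused before limits.conf was edited.

```python
import resource


def nofile_limits():
    """Return the (soft, hard) open-file limits of the current process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


def raise_soft_limit(target):
    """Raise the soft open-file limit to at least `target`.

    An unprivileged process cannot exceed the hard limit, so a target
    above it is rejected here instead of failing inside setrlimit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if soft == resource.RLIM_INFINITY:
        return soft  # already unlimited, nothing to do
    if hard != resource.RLIM_INFINITY and target > hard:
        raise ValueError(f"target {target} exceeds hard limit {hard}")
    new_soft = max(soft, target)
    resource.setrlimit(resource.RLIMIT_NOFILE, (new_soft, hard))
    return new_soft


print(nofile_limits())
```

On a multi-node run, the limit presumably has to be raised on every node, since each MPI rank inherits the limits of the node it starts on.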

JiaweiZhuang commented 5 years ago

Problem solved by raising ulimit -n as above and reducing the output collections (keeping only SpeciesConc). Complete log: run_c180_7days_N8n288_pass_hdf5_issue.log

However, the simulation finished with a very long HDF5 error trace (the start and end of the trace are shown below). Hope it doesn't affect anything...

...
 NOT using buffer I/O for file: cap_restart
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
libifcoremt.so.5   00002AE2723BF555  for__signal_handl     Unknown  Unknown
libpthread-2.17.s  00002AE2743215D0  Unknown               Unknown  Unknown
libpthread-2.17.s  00002AE27432075D  __close               Unknown  Unknown
libhdf5.so.103.1.  00002AE271120B5B  Unknown               Unknown  Unknown
libhdf5.so.103     00002AE27110F572  H5FD_close            Unknown  Unknown
libhdf5.so.103     00002AE2710FADC4  H5F__dest             Unknown  Unknown
libhdf5.so.103     00002AE2710FC164  H5F_try_close         Unknown  Unknown
libhdf5.so.103     00002AE2710FBDDC  H5F__close_cb         Unknown  Unknown
libhdf5.so.103.1.  00002AE27118326E  Unknown               Unknown  Unknown
libhdf5.so.103     00002AE271261800  H5SL_try_free_saf     Unknown  Unknown
libhdf5.so.103     00002AE271183169  H5I_clear_type        Unknown  Unknown
libhdf5.so.103     00002AE2710EAA9E  H5F_term_package      Unknown  Unknown
libhdf5.so.103     00002AE27102D08A  H5_term_library       Unknown  Unknown
libc-2.17.so       00002AE27476BC29  Unknown               Unknown  Unknown
libc-2.17.so       00002AE27476BC77  Unknown               Unknown  Unknown
libifcoremt.so.5   00002AE2723B2BEF  for_exit              Unknown  Unknown
geos               00000000006FC3F6  MAIN__                     49  GEOSChem.F90
geos               000000000040FE42  Unknown               Unknown  Unknown
libc-2.17.so       00002AE274754495  __libc_start_main     Unknown  Unknown
geos               000000000040FD49  Unknown               Unknown  Unknown
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
libifcoremt.so.5   00002B6B76EB4555  for__signal_handl     Unknown  Unknown
libpthread-2.17.s  00002B6B78E165D0  Unknown               Unknown  Unknown

...

forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source             
libifcoremt.so.5   00002B9210C0A555  for__signal_handl     Unknown  Unknown
libpthread-2.17.s  00002B9212B6C5D0  Unknown               Unknown  Unknown
libpthread-2.17.s  00002B9212B6B75D  __close               Unknown  Unknown
libhdf5.so.103.1.  00002B920F96BB5B  Unknown               Unknown  Unknown
libhdf5.so.103     00002B920F95A572  H5FD_close            Unknown  Unknown
libhdf5.so.103     00002B920F945DC4  H5F__dest             Unknown  Unknown
libhdf5.so.103     00002B920F947164  H5F_try_close         Unknown  Unknown
libhdf5.so.103     00002B920F946DDC  H5F__close_cb         Unknown  Unknown
libhdf5.so.103.1.  00002B920F9CE26E  Unknown               Unknown  Unknown
libhdf5.so.103     00002B920FAAC800  H5SL_try_free_saf     Unknown  Unknown
libhdf5.so.103     00002B920F9CE169  H5I_clear_type        Unknown  Unknown
libhdf5.so.103     00002B920F935A9E  H5F_term_package      Unknown  Unknown
libhdf5.so.103     00002B920F87808A  H5_term_library       Unknown  Unknown
libc-2.17.so       00002B9212FB6C29  Unknown               Unknown  Unknown
libc-2.17.so       00002B9212FB6C77  Unknown               Unknown  Unknown
libifcoremt.so.5   00002B9210BFDBEF  for_exit              Unknown  Unknown
geos               00000000006FC3F6  MAIN__                     49  GEOSChem.F90
geos               000000000040FE42  Unknown               Unknown  Unknown
libc-2.17.so       00002B9212F9F495  __libc_start_main     Unknown  Unknown
geos               000000000040FD49  Unknown               Unknown  Unknown
--------------------------------------------------------------------------
lizziel commented 4 years ago

Thanks for reporting the fix!

lizziel commented 4 years ago

Ah, now I see there were still problems after this fix. @JiaweiZhuang were you able to resolve this? I have been running at high resolution with ifort19 and OpenMPI4 without issue, but with more recent versions of both GCHP and MAPL History.

lizziel commented 4 years ago

I am closing out this issue due to inactivity. If there are further problems related to this issue, please open a new issue at GCHPctm.