Closed JiaweiZhuang closed 4 years ago
I am trying to fix this issue by increasing the maximum number of open files.
On CentOS, the default number is:
$ ulimit -n
10000
Changing the number with ulimit -n 16384 leads to a permission error, but you can edit /etc/security/limits.conf to add:
centos soft nofile 16384
centos hard nofile 16384
where centos is the user name. After logging in again, ulimit -n should show the new number.
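To check the limits before and after editing limits.conf, something like the following sketch may help. The function names show_nofile_limits and raise_soft_nofile are hypothetical helpers, not part of GCHP or CentOS. The sketch relies only on the shell built-in ulimit with its -S (soft) and -H (hard) flags, and on the fact that a non-root user can raise the soft limit only up to the hard limit.

```shell
# Print the soft and hard open-file limits of the current shell.
show_nofile_limits() {
    echo "soft=$(ulimit -S -n) hard=$(ulimit -H -n)"
}

# Try to raise the soft limit of the current shell to the requested value.
# Without root this can only succeed if the value does not exceed the
# hard limit; otherwise limits.conf has to be edited as described above.
raise_soft_nofile() {
    rsn_want="$1"
    rsn_hard="$(ulimit -H -n)"
    if [ "$rsn_hard" != "unlimited" ] && [ "$rsn_want" -gt "$rsn_hard" ]; then
        echo "requested $rsn_want exceeds hard limit $rsn_hard" >&2
        return 1
    fi
    ulimit -S -n "$rsn_want"
}
```

Calling show_nofile_limits after logging in again should report the values set in limits.conf. raise_soft_nofile 16384 only affects the current shell and the processes started from it.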
Problem solved by raising ulimit -n as above and reducing the output collections (keeping only SpeciesConc). Complete log: run_c180_7days_N8n288_pass_hdf5_issue.log
However, the simulation finished with a very long HDF5 error trace (the start and end of the trace are shown below). I hope it doesn't affect anything...
...
NOT using buffer I/O for file: cap_restart
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libifcoremt.so.5 00002AE2723BF555 for__signal_handl Unknown Unknown
libpthread-2.17.s 00002AE2743215D0 Unknown Unknown Unknown
libpthread-2.17.s 00002AE27432075D __close Unknown Unknown
libhdf5.so.103.1. 00002AE271120B5B Unknown Unknown Unknown
libhdf5.so.103 00002AE27110F572 H5FD_close Unknown Unknown
libhdf5.so.103 00002AE2710FADC4 H5F__dest Unknown Unknown
libhdf5.so.103 00002AE2710FC164 H5F_try_close Unknown Unknown
libhdf5.so.103 00002AE2710FBDDC H5F__close_cb Unknown Unknown
libhdf5.so.103.1. 00002AE27118326E Unknown Unknown Unknown
libhdf5.so.103 00002AE271261800 H5SL_try_free_saf Unknown Unknown
libhdf5.so.103 00002AE271183169 H5I_clear_type Unknown Unknown
libhdf5.so.103 00002AE2710EAA9E H5F_term_package Unknown Unknown
libhdf5.so.103 00002AE27102D08A H5_term_library Unknown Unknown
libc-2.17.so 00002AE27476BC29 Unknown Unknown Unknown
libc-2.17.so 00002AE27476BC77 Unknown Unknown Unknown
libifcoremt.so.5 00002AE2723B2BEF for_exit Unknown Unknown
geos 00000000006FC3F6 MAIN__ 49 GEOSChem.F90
geos 000000000040FE42 Unknown Unknown Unknown
libc-2.17.so 00002AE274754495 __libc_start_main Unknown Unknown
geos 000000000040FD49 Unknown Unknown Unknown
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libifcoremt.so.5 00002B6B76EB4555 for__signal_handl Unknown Unknown
libpthread-2.17.s 00002B6B78E165D0 Unknown Unknown Unknown
...
forrtl: error (78): process killed (SIGTERM)
Image PC Routine Line Source
libifcoremt.so.5 00002B9210C0A555 for__signal_handl Unknown Unknown
libpthread-2.17.s 00002B9212B6C5D0 Unknown Unknown Unknown
libpthread-2.17.s 00002B9212B6B75D __close Unknown Unknown
libhdf5.so.103.1. 00002B920F96BB5B Unknown Unknown Unknown
libhdf5.so.103 00002B920F95A572 H5FD_close Unknown Unknown
libhdf5.so.103 00002B920F945DC4 H5F__dest Unknown Unknown
libhdf5.so.103 00002B920F947164 H5F_try_close Unknown Unknown
libhdf5.so.103 00002B920F946DDC H5F__close_cb Unknown Unknown
libhdf5.so.103.1. 00002B920F9CE26E Unknown Unknown Unknown
libhdf5.so.103 00002B920FAAC800 H5SL_try_free_saf Unknown Unknown
libhdf5.so.103 00002B920F9CE169 H5I_clear_type Unknown Unknown
libhdf5.so.103 00002B920F935A9E H5F_term_package Unknown Unknown
libhdf5.so.103 00002B920F87808A H5_term_library Unknown Unknown
libc-2.17.so 00002B9212FB6C29 Unknown Unknown Unknown
libc-2.17.so 00002B9212FB6C77 Unknown Unknown Unknown
libifcoremt.so.5 00002B9210BFDBEF for_exit Unknown Unknown
geos 00000000006FC3F6 MAIN__ 49 GEOSChem.F90
geos 000000000040FE42 Unknown Unknown Unknown
libc-2.17.so 00002B9212F9F495 __libc_start_main Unknown Unknown
geos 000000000040FD49 Unknown Unknown Unknown
--------------------------------------------------------------------------
Thanks for reporting the fix!
Ah, now I see there were still problems after this fix. @JiaweiZhuang were you able to resolve this? I have been using ifort19 with OpenMPI4 at high resolution without issue, but with more recent versions of both GCHP and MAPL History.
Describe the bug
I got this HDF error after 5 days of C180 simulation with 288 cores:
Here's the complete log: run_c180_7days_N8n288_hdf5_error.log
Given that the error occurs after 5 days of simulation, I suspect that GCHP keeps opening new files without closing previous ones.
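One way to test this suspicion is to sample the number of open file descriptors of a running rank over time: a count that keeps growing would be consistent with files being opened and never closed. The sketch below is not taken from the original run and assumes a Linux system with /proc. The helper names count_open_fds and watch_fds are hypothetical.

```shell
# Count the open file descriptors of a process by listing /proc/<pid>/fd.
count_open_fds() {
    ls "/proc/$1/fd" 2>/dev/null | wc -l
}

# Print "<unix time> <fd count>" for a process a fixed number of times,
# sleeping the given number of seconds after each sample.
watch_fds() {
    wf_pid="$1"
    wf_interval="$2"
    wf_samples="$3"
    wf_i=0
    while [ "$wf_i" -lt "$wf_samples" ] && [ -d "/proc/$wf_pid" ]; do
        echo "$(date +%s) $(count_open_fds "$wf_pid")"
        wf_i=$((wf_i + 1))
        sleep "$wf_interval"
    done
}
```

For example, watch_fds "$(pgrep -n geos)" 60 10 would sample one process once a minute for ten minutes, assuming the executable is named geos as in the traceback. With MPI there is one such process per rank, so this inspects only a single rank on the node where it is run.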
To Reproduce
Steps to reproduce the behavior:
c5n.18xlarge EC2 nodes. In runConfig.sh:
Required information