NOAA-EMC / AQM


Silent failure of gefs2lbc_para with AQMv7 #88

Open JianpingHuang-NOAA opened 1 year ago

JianpingHuang-NOAA commented 1 year ago

@ytangnoaa @bbakernoaa

gefs2lbc_para fails silently under both rocoto and ecflow. Below is the error message:

Stopped FAST_BYTESWAP ALGORITHM HAS BEEN USED AND DATA ALIGNMENT IS CORRECT FOR 4 )
forrtl: error (78): process killed (SIGTERM)
Image              PC                Routine            Line        Source
gefs2lbc_para      000000000049881B  Unknown             Unknown     Unknown
libpthread-2.31.s  00001539CC4908C0  Unknown             Unknown     Unknown
libmpi_intel.so.1  00001539C883A005  Unknown             Unknown     Unknown
libmpi_intel.so.1  00001539C72ECA49  Unknown             Unknown     Unknown
libmpi_intel.so.1  00001539C809BA36  Unknown             Unknown     Unknown
libmpi_intel.so.1  00001539C80AE061  Unknown             Unknown     Unknown
libmpi_intel.so.1  00001539C7FBDA62  Unknown             Unknown     Unknown
libmpi_intel.so.1  00001539C668DC91  Unknown             Unknown     Unknown
libmpi_intel.so.1  00001539C668DDB8  Unknown             Unknown     Unknown
libmpi_intel.so.1  00001539C829C28B  Unknown             Unknown     Unknown
libmpi_intel.so.1  00001539C668DE80  Unknown             Unknown     Unknown
libmpi_intel.so.1  00001539C8292E72  Unknown             Unknown     Unknown
libmpi_intel.so.1  00001539C80DEF25  Unknown             Unknown     Unknown
libmpi_intel.so.1  00001539C842385F  Unknown             Unknown     Unknown
libmpi_intel.so.1  00001539C6B30EEC  MPI_Finalize        Unknown     Unknown
libmpifort_intel.  00001539CCD64639  MPI_FINALIZE        Unknown     Unknown
gefs2lbc_para      000000000041693B  Unknown             Unknown     Unknown
gefs2lbc_para      000000000040B812  Unknown             Unknown     Unknown
libc-2.31.so       00001539CBD6F2BD  __libc_start_main   Unknown     Unknown
gefs2lbc_para      000000000040B72A  Unknown             Unknown     Unknown
nid001448.cactus.wcoss2.ncep.noaa.gov: rank 1 exited with code 1
Application 2a10ddd4-ab39-4d81-bb48-0fcd97e18d14 resources: utime=832s stime=3s maxrss=1600548KB inblock=8521816 oublock=33200 minflt=21689 majflt=28 nvcsw=2497 nivcsw=297

The full message can be found on Cactus under /lfs/h2/emc/stmp/jianping.huang/aqm/ecflow_aqm/aqm_lbcs_00.58334313.cbqs01 (see the errfile there).

Thanks,

Jianping

JianpingHuang-NOAA commented 1 year ago

@ytangnoaa Lin (EIB, ecflow developer) reported that it fails occasionally; he saw one or two failures per day.

It runs successfully once the job is resubmitted.

I saw a similar issue in my recent retro runs: the job failed on the first attempt and ran successfully after it was resubmitted.

ytangnoaa commented 1 year ago

/lfs/h2/emc/stmp/jianping.huang/aqm/ecflow_aqm/aqm_lbcs_00.58334313.cbqs01

In that folder, in file "OUTPUT.199903", it complained

finish reading topofile No such file or directory

So, it failed to open an input NEMSIO file. It looks like some files were not visible to some nodes at that time, which is why your rerun succeeded after a while.

JianpingHuang-NOAA commented 1 year ago

Here is a summary of the feedback I got from EIB:

(1) The error message "No such file or directory" is misleading because the file actually exists and other jobs can access it without any problems.

(2) Instead of displaying "Stopped," the executable should accurately reflect that it has failed. Currently, the job appears as "Stopped" until it reaches the wall clock limit and is subsequently terminated by the system. This indicates an exception handling issue within the executable.

JianpingHuang-NOAA commented 1 year ago

@ytangnoaa You can see all the AQM LBCs jobs have been running for more than 40 minutes until they reach the wall clock limit (1 hour). Usually this kind of job takes only a few minutes. This further demonstrates that the code needs further revision.

[screenshot]

ytangnoaa commented 1 year ago

> Here is a summary of the feedback I got from EIB:

> (1) The error message "No such file or directory" is misleading because the file actually exists and other jobs can access it without any problems.

It is not misleading. It indicates that the GEFS files were not visible to some nodes at that moment. Later, the files on disk were synchronized and the rerun succeeded with the same script/code, which shows that the script/code is sound. Forcing synchronization of the GEFS files to disk should help.

> (2) Instead of displaying "Stopped," the executable should accurately reflect that it has failed. Currently, the job appears as "Stopped" until it reaches the wall clock limit and is subsequently terminated by the system. This indicates an exception handling issue within the executable.

Yes, the code can be changed to print more information about the NEMSIO failure. Please try https://github.com/noaa-oar-arl/AQM-utils/tree/gefs2clbcs-update1

ytangnoaa commented 1 year ago

> @ytangnoaa You can see all the AQM LBCs jobs have been running for more than 40 minutes until they reach the wall clock limit (1 hour). Usually this kind of job takes only a few minutes. This further demonstrates that the code needs further revision.

> [screenshot]

As mentioned, "Usually this kind of job takes only a few minutes." Yes, it usually took several minutes in the past with the exact same script/code on Hera or WCOSS2. The prolonged run times are more likely due to a recent WCOSS2 system issue than to a script/code issue.

JianpingHuang-NOAA commented 11 months ago

@chan-hoo @ytangnoaa @bbakernoaa

I continue to hit an issue running gefs2lbc_para to generate aerosol LBCs from GEFS/Aerosol output files during the retro reruns for August 2022. The issue is the same one I reported in the past, and similar to what Lin reported when he developed the ecflow suite for AQMv7. This time, however, the issue has become more serious than before, since the Dev machine was switched back to Cactus on July 21st.

The job fails only at the 00z and 18z cycles on some of the simulation days; no failures are found for the 06z and 12z cycles.

For the jobs that failed at the 00z and 18z cycles, the job ran successfully after I resubmitted it manually.

You can check the run log and output files below as an example.

1) An example run log file: /lfs/h2/emc/ptmp/jianping.huang/emc.para/output/20220806/aqm_lbcs_2022080600.id_1690302133.log

2) The error message can be found in the following file: /lfs/h2/emc/ptmp/jianping.huang/emc.para/tmp/aqm_s.72978755.cbqs01/OUTPUT.166918

The failed jobs always complained:

"No such file or directory read zh_bottom, zh_top read zh_left, zh_right for /lfs/h2/emc/aqmtemp/gefs/v12.3/20220806/00/geaer.t00z.atmf000.nemsio

However, when you check the above run log file, you can see that the NEMSIO file actually existed and was found by the ex-script. I added several lines (lines 188-201) to the script exregional_aqm_lbcs.sh (/lfs/h2/emc/physics/noscrub/jianping.huang/nwdev/packages/aqm.v7.0.82b/scripts) to check the existence of each NEMSIO file and then run "touch" and "ls" on it; if the script fails to find a NEMSIO file, it waits 5 s and rechecks, up to three times. Even with these checks, the compute nodes still failed to find the first NEMSIO file when gefs2lbc_para was executed through the workflow.
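For reference, here is a minimal sketch of the kind of existence check described above, assuming the geaer.t??z.atmf???.nemsio naming shown in the error message; it is not the actual content of lines 188-201 of exregional_aqm_lbcs.sh, and the variable names (lbc_fcst_len, PDY, cyc) are illustrative:

```bash
# Illustrative sketch only -- not the actual lines 188-201 of exregional_aqm_lbcs.sh.
# Verify that each GEFS/Aerosol NEMSIO file is visible from this node before running
# gefs2lbc_para; retry a few times in case the file system is slow to catch up.
GEFS_DIR="/lfs/h2/emc/aqmtemp/gefs/v12.3/${PDY}/${cyc}"   # assumed directory layout
lbc_fcst_len=${lbc_fcst_len:-6}   # 6 for 00z/18z, 72 for 06z/12z (illustrative variable)
for fhr in $(seq -f "%03g" 0 6 "${lbc_fcst_len}"); do
  nemsio_file="${GEFS_DIR}/geaer.t${cyc}z.atmf${fhr}.nemsio"
  attempt=1
  until [ -s "${nemsio_file}" ]; do
    if [ "${attempt}" -ge 3 ]; then
      echo "FATAL: ${nemsio_file} still not visible after ${attempt} checks" >&2
      exit 1
    fi
    echo "WARNING: ${nemsio_file} not found; waiting 5 s (attempt ${attempt})" >&2
    sleep 5
    attempt=$((attempt + 1))
  done
  ls -l "${nemsio_file}"
done
```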

My questions are: 1) Why do the jobs fail only for the 00z and 18z cycles, with no issue at all for the 06z and 12z cycles?

2) Why do the jobs fail when reading the 1st GEFS/Aerosol output NEMSIO file, but not for the other hour(s) at the 00z and 18z cycles?

3) Why does the job fail when run through the workflow?

4) Why does this failure happen only for the aqm_lbc job and not for other jobs?

I have tried several things: 1) increasing the memory limit in the job card, which did not help; 2) using the same number of compute nodes for the 00z and 18z cycles as for the 06z and 12z cycles, which did not work; and 3) adding a check for the existence of the GEFS/Aerosol NEMSIO files to ensure the data files are present, which did not help either.
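For context, the job-card changes tried in (1) and (2) would look roughly like the PBS sketch below; the queue, account, and resource values shown are illustrative rather than the actual settings used:

```bash
#PBS -N aqm_lbcs_00
#PBS -q dev
#PBS -A AQM-DEV                       # account name is illustrative
#PBS -l walltime=01:00:00
# (1) a larger per-node memory request was tried (values below are illustrative):
#PBS -l select=2:ncpus=128:mem=500GB
# (2) using the same select/ncpus values for 00z/18z as for 06z/12z was also
#     tried; neither change avoided the failure.
```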

In any case, we need to fix this issue as soon as possible in order to move the AQMv7 retro runs forward.

Please treat this as your first priority.

Thanks,

Jianping

JianpingHuang-NOAA commented 11 months ago

Attached is a screenshot of the generated aqm.t00z.gfs_bndy.tile7.f000.nc and aqm.t18z.gfs_bndy.tile7.f000.nc files for your reference. [screenshot]

You can see that for some days the file size of aqm.t00z.gfs_bndy or aqm.t18z.gfs_bndy is much smaller than on other days; these are highlighted with the red boxes.

GeorgeVandenberghe-NOAA commented 11 months ago

Those boundary file sizes look like exactly 4 GB minus 1, suggesting an undetected write error at their creation. I suspect they should all be around 19+ GB. They're netCDF files. Are they being written sequentially or in parallel, and are they written compressed or uncompressed?
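One quick way to screen for the truncated files described here (anything at or below the 4 GiB mark instead of the expected ~19 GB) is a check along these lines; COMOUT is a placeholder for the directory holding the boundary files:

```bash
# Flag boundary files whose size is at or below 4 GiB (a healthy file is ~19+ GB).
# COMOUT is a placeholder for the directory holding the aqm.t??z.gfs_bndy files.
for f in "${COMOUT}"/aqm.t??z.gfs_bndy.tile7.f000.nc; do
  size=$(stat -c %s "$f")
  if [ "${size}" -le $((4 * 1024 * 1024 * 1024)) ]; then
    echo "SUSPECT (possible truncated write): $f  ${size} bytes"
  fi
done
```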

HaixiaLiu-NOAA commented 11 months ago

@chan-hoo would you check on this issue? Thanks.

KaiWang-NOAA commented 11 months ago

Those smaller *f000.nc bdy files contain only meteorological variables and are missing both the gas and aerosol variables. The 19+ GB files are good, with the correct number of variables. However, the 00z or 18z simulations seem to be able to run through with those smaller LBC files without complaint.
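A quick way to confirm which variables made it into a given boundary file is to dump its header and grep for a species that should be present; the species name used below (so2) is only an example:

```bash
# Dump the header of a boundary file and look for an expected species.
# "so2" is only an example variable name; substitute any gas or aerosol
# variable that should be present in a complete LBC file.
ncdump -h aqm.t00z.gfs_bndy.tile7.f000.nc | grep -i "so2" \
  || echo "so2 not found -- file may contain only meteorological variables"
```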

JianpingHuang-NOAA commented 11 months ago

@chan-hoo All the failed aqm_lbc jobs completed successfully after manually resubmitting them with the same job card settings but without using the rocoto workflow.

An example run log file can be found on Cactus under /lfs/h2/emc/ptmp/jianping.huang/emc.para/output/20220806 (aqm_lbcs_2022080600.id_c82b.log).


chan-hoo commented 11 months ago

I don't think I am the right person to resolve this issue; I don't have any experience with this kind of problem.

ytangnoaa commented 11 months ago

> Those boundary file sizes look like exactly 4 GB minus 1, suggesting an undetected write error at their creation. I suspect they should all be around 19+ GB. They're netCDF files. Are they being written sequentially or in parallel, and are they written compressed or uncompressed?

To clarify, the input is a GEFS NEMSIO file and the output is a netCDF file. This issue occurred while reading the input NEMSIO file, and it mainly happened in the 00z and 18z cycles (very short cycles, lasting only 6 hours) after the WCOSS2 machine switch. Resubmitting the job can sometimes solve it. This issue seems related to the machine or system.

JianpingHuang-NOAA commented 11 months ago

This is very difficult to understand. The failure happens only for the 00z and 18z cycles; I did not see any failure for the 06z and 12z cycles. On the other hand, if this were a machine-related issue, I would expect other tasks/jobs to fail too, and I did not see any failures there. It seems that the job submission node was able to find the existing NEMSIO input files but the compute nodes failed to do so.


yangfanglin commented 11 months ago

Are the NEMSIO files soft links in the running directory or hard copies? Are they read by one node and then broadcast, or read simultaneously by all compute nodes?

JianpingHuang-NOAA commented 11 months ago

The ex-script specifies the location of the GEFS/Aerosol NEMSIO output files. I am considering creating symbolic links in the temporary run directory and testing whether the code can then locate the files. The perplexing thing is that the code is unable to find the first NEMSIO file at the 00z and 18z cycles, while all the others are detected without any problem. The difference between 00z/18z and 06z/12z is that the former handles only 6-hr LBCs (i.e., 2 files) while the latter deals with 72-hr LBCs (i.e., 13 files).
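A minimal sketch of the symlink approach being considered might look like the following, assuming the GEFS directory layout from the error message above; the variable names (PDY, cyc, DATA) are illustrative:

```bash
# Sketch of the symlink approach under consideration (paths and variable
# names are illustrative).
GEFS_DIR="/lfs/h2/emc/aqmtemp/gefs/v12.3/${PDY}/${cyc}"
cd "${DATA}"   # temporary run directory used by the task
for f in "${GEFS_DIR}"/geaer.t"${cyc}"z.atmf*.nemsio; do
  ln -sf "${f}" .
done
```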


zmoon commented 11 months ago

@bbakernoaa suggested replacing the ncks usage with a Python script. I am working on this now.

https://github.com/NOAA-EMC/AQM-utils/pull/10