NOAA-EMC / global-workflow

Global Superstructure/Workflow supporting the Global Forecast System (GFS)
https://global-workflow.readthedocs.io/en/latest
GNU Lesser General Public License v3.0

Duplicate mkdir commands cause metp job failure #2971

Open DavidHuber-NOAA opened 1 week ago

DavidHuber-NOAA commented 1 week ago

What is wrong?

The metp* jobs have the potential to attempt to create the same directory twice, raising an OSError in the METplus Python code. This is a known bug in METplus v3.1.1, but we are unfortunately stuck with this version.

What should have happened?

The metp* jobs should create all necessary directories just once.

What machines are impacted?

All or N/A; observed on Orion

What global-workflow hash are you using?

https://github.com/NOAA-EMC/global-workflow/commit/8f0541cd61755e74e1f3116dedcc3afc8fa9cda1

Steps to reproduce

Observed on Orion:

  1. Clone and build
  2. Run the C48_ATM CI test case through the metp jobs (for me, gfsmetpg2o1 failed)

Additional information

No response

Do you have a proposed solution?

Currently, to try to avoid this error, the EMC_verif-global script ush/create_METplus_job_scripts.py sleeps for 1 second between CFP job submissions. Initial testing on Orion suggests that increasing this to 5 seconds prevents the failure for the C48_ATM test case. However, this only adjusts a band-aid solution and may still not fix the issue in all cases.
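For reference, the staggering behavior described above can be sketched as follows; this is not the actual code in ush/create_METplus_job_scripts.py, and the function and script names are placeholders:

```python
import subprocess
import time


def submit_staggered(job_scripts, delay=5):
    """Launch CFP job scripts with a pause between submissions.

    The delay gives each job a head start so its directory creation
    usually finishes before the next job attempts the same mkdir.
    This only shrinks the race window; it does not close it.
    """
    procs = []
    for script in job_scripts:
        procs.append(subprocess.Popen(["/bin/sh", script]))
        time.sleep(delay)  # was 1 s; 5 s avoided the C48_ATM failure on Orion
    return [p.wait() for p in procs]
```

Because the jobs still run concurrently after their staggered starts, any sufficiently slow filesystem can reopen the race, which is why this remains a band-aid.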

Perhaps a more robust solution would be to create all necessary directories before starting the CFP portion of the job. This would require cataloging all of the directories created by METplus for each case (g2o1, g2g1, pcp1).
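A minimal sketch of that pre-creation step, assuming the catalog of shared directories has already been assembled (the directory names below are hypothetical):

```python
import os
import tempfile

# Hypothetical catalog of shared output directories; the real set would
# come from cataloging what METplus creates for each case (g2o1, g2g1, pcp1).
SHARED_DIRS = ["stat/g2o1", "stat/g2g1", "stat/pcp1"]


def premake_dirs(base, dirs):
    """Create every shared directory up front, before any CFP task starts.

    exist_ok=True makes each call idempotent, so repeated runs (or two
    jobs both passing through here) never raise FileExistsError.
    """
    for d in dirs:
        os.makedirs(os.path.join(base, d), exist_ok=True)


base = tempfile.mkdtemp()
premake_dirs(base, SHARED_DIRS)
premake_dirs(base, SHARED_DIRS)  # safe to repeat
```

With the tree in place before the CFP tasks launch, the concurrent METplus processes only ever see directories that already exist.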

DavidHuber-NOAA commented 1 week ago

FYI @DWesl

DWesl commented 1 week ago

This is somewhat expected, though I was hoping that, between the delay and the automatic re-runs Rocoto performs, it would be largely irrelevant.

One way of dealing with this would be hybrid serial-parallel execution: run the first job and wait for it to complete so all shared directories are created, then run the other (three?) jobs in parallel afterward. I'm not sure how well this would fit the current system if it leaves two processors standing idle while the first process runs.
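A sketch of that hybrid scheme, assuming the jobs are plain shell scripts (the names and launcher are placeholders, not the actual CFP interface):

```python
import subprocess


def run_hybrid(job_scripts):
    """Run the first job serially so it creates all shared directories,
    then launch the remaining jobs concurrently once the tree exists.
    """
    rc = subprocess.run(["/bin/sh", job_scripts[0]]).returncode
    if rc != 0:
        return rc  # do not fan out if the directory-creating job failed
    procs = [subprocess.Popen(["/bin/sh", s]) for s in job_scripts[1:]]
    return max((p.wait() for p in procs), default=0)
```

The trade-off named above is visible here: the processors that will run the later jobs sit idle for the full duration of the first job.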

If that's not workable for whatever reason, you could retry the script when the exit code corresponds to this error, perhaps with a maximum attempt count, or grep through a log file for the corresponding file location, so that infinite loops or misconfigured data paths don't trigger retries as well.
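The retry idea could look roughly like this; it is a sketch only, and a real version would inspect the job log for the FileExistsError message before deciding the failure is retryable:

```python
import subprocess
import time


def run_with_retries(cmd, max_attempts=3, delay=5):
    """Re-run a job whose failure may be the known mkdir race.

    The attempt cap guards against infinite loops; checking the log for
    the specific OSError would keep misconfigured data paths from being
    retried pointlessly (omitted here for brevity).
    """
    for _ in range(max_attempts):
        rc = subprocess.run(cmd).returncode
        if rc == 0:
            return 0
        time.sleep(delay)  # let the competing task finish its mkdir
    return rc
```

Since the race only strikes when a directory is missing, the second attempt almost always succeeds: the first attempt's loser still leaves the directory created by the winner.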

mkdir -p ... would work eventually; would du ${model_stat_dir} provide the required list of directories?
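Equivalently, a completed reference run could be walked once in Python to build the same list du would print (the function name is illustrative):

```python
import os


def catalog_dirs(model_stat_dir):
    """Return every directory under a completed stat tree, relative to
    its root: the catalog of paths to pre-create for later runs.
    """
    found = []
    for root, _subdirs, _files in os.walk(model_stat_dir):
        found.append(os.path.relpath(root, model_stat_dir))
    return sorted(found)
```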

Ideally someone would add exist_ok=True to the os.makedirs call, but I think policy prevents that, in part because of all the different places that change would need to happen.
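To make the failure mode and the one-line fix concrete (the path below is purely illustrative):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "stat", "g2o1")

# Without exist_ok, the loser of the race hits FileExistsError (an OSError):
os.makedirs(path)          # first job creates the directory
try:
    os.makedirs(path)      # second job attempts the same mkdir
except FileExistsError:
    print("second makedirs raised, as in the METplus v3.1.1 failure")

# The one-line fix, if policy allowed patching every call site:
os.makedirs(path, exist_ok=True)  # idempotent; never raises for existing dirs
```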