Open DavidHuber-NOAA opened 1 week ago
FYI @DWesl
This is somewhat expected, though I was hoping that, between the delay and the automatic re-runs that rocoto does, it would be largely irrelevant.
One method of dealing with this would be a hybrid serial-parallel execution: run the first job, wait for it to complete so all shared directories are created, then run the other (three?) in parallel afterward. I'm not sure how well this would work with the current system, if there are two processors standing idle while the first process runs.
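A minimal sketch of that hybrid idea, assuming each CFP job can be expressed as a command list for `subprocess`. `run_hybrid` is a hypothetical helper, not part of EMC_verif-global or METplus:

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def run_hybrid(commands, max_workers=None):
    """Run commands[0] alone, then the remaining commands in parallel.

    Returns the exit codes in the same order as `commands`. If the first
    command fails, the others are not started and only its code is returned.
    """
    if not commands:
        return []
    # Serial phase: the first job creates the shared directories.
    first = subprocess.run(commands[0]).returncode
    rest = commands[1:]
    if first != 0 or not rest:
        return [first]
    # Parallel phase: the shared directories already exist, so the
    # remaining jobs no longer race to create them.
    with ThreadPoolExecutor(max_workers=max_workers or len(rest)) as pool:
        codes = list(pool.map(lambda cmd: subprocess.run(cmd).returncode, rest))
    return [first] + codes
```

The cost is the one noted above: the other processors sit idle until the first job finishes.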
If that's not workable for whatever reason, you could try restarting the script when the exit code corresponds to this error, perhaps with a maximum time limit, or after grepping a log file for the corresponding file location, so that infinite loops or misconfigured data paths don't trigger a restart as well.
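Roughly what I have in mind for the restart approach, as a sketch only. `run_with_retry` is a hypothetical wrapper, and it assumes the job's traceback lands in the log file and contains the usual "File exists" text from the `OSError`; here the cap is a number of attempts rather than a time limit:

```python
import subprocess
import time


def run_with_retry(cmd, log_path, marker="File exists", max_attempts=3, delay=0.0):
    """Re-run `cmd` only when it fails and `marker` appears in its log.

    Returns (exit_code, attempts_used).
    """
    for attempt in range(1, max_attempts + 1):
        # Capture stdout and stderr of this attempt in the log file.
        with open(log_path, "w") as log:
            code = subprocess.run(cmd, stdout=log, stderr=subprocess.STDOUT).returncode
        if code == 0:
            return code, attempt
        with open(log_path, errors="replace") as log:
            is_race = marker in log.read()
        if not is_race:
            # Unrelated failure (e.g. a misconfigured data path): do not retry.
            return code, attempt
        if attempt < max_attempts:
            time.sleep(delay)
    # The attempt cap prevents an infinite restart loop.
    return code, max_attempts
```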
mkdir -p ... would work eventually; would du ${model_stat_dir} provide the required list of directories?
Ideally someone would add exist_ok=True to the os.makedirs call, but I think policy prevents that, in part due to all the different places that would need to happen.
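For reference, this is the behaviour `exist_ok=True` gives (available since Python 3.2). The `make_dir` wrapper is only for illustration and is not METplus code:

```python
import os


def make_dir(path):
    """Create `path` and any missing parents; succeed if it already exists."""
    # With exist_ok=True the call is a no-op when the directory is already
    # there, including when another job created it a moment earlier.
    os.makedirs(path, exist_ok=True)
    return os.path.isdir(path)
```

Without `exist_ok=True`, a second `os.makedirs` call on the same path raises `FileExistsError`, which is a subclass of `OSError`.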
What is wrong?
The metp* jobs have the potential to attempt to create the same directory twice, raising an OSError in the METplus Python code. This is a known bug in METplus v3.1.1, but we are unfortunately stuck with this version.
What should have happened?
The metp* jobs should create all necessary directories just once.
What machines are impacted?
All or N/A, Orion
What global-workflow hash are you using?
https://github.com/NOAA-EMC/global-workflow/commit/8f0541cd61755e74e1f3116dedcc3afc8fa9cda1
Steps to reproduce
Observed on Orion:
C48_ATM CI test case through the metp jobs (for me, gfsmetpg2o1 failed)
Additional information
No response
Do you have a proposed solution?
Currently, to try and avoid this error, the EMC_verif-global script ush/create_METplus_job_scripts.py sleeps for 1 second between CFP job submissions. Initial testing on Orion suggests that increasing this to 5 seconds was successful at preventing the failure for the C48_ATM test case. However, this is just an adjustment of a band-aid solution and may still not fix the issue in all cases.
Perhaps a more robust solution would be to make all necessary directories before starting the CFP-portion of the job. This would require cataloging all of the directories created by METplus for each case (g2o1, g2g1, pcp1).
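One possible way to build that catalog is to walk the output tree of a run that completed successfully and replay the directory list before the CFP portion starts. The sketch below assumes the directory layout is the same from run to run, which may not hold if paths embed dates or cycles; `catalog_dirs` and `precreate_dirs` are hypothetical helpers:

```python
import os


def catalog_dirs(root):
    """Return every directory under `root` as sorted paths relative to it."""
    found = []
    for current, subdirs, _files in os.walk(root):
        for name in subdirs:
            found.append(os.path.relpath(os.path.join(current, name), root))
    return sorted(found)


def precreate_dirs(root, relative_dirs):
    """Create each cataloged directory under `root` before the jobs start."""
    for rel in relative_dirs:
        # exist_ok=True keeps this step safe to repeat.
        os.makedirs(os.path.join(root, rel), exist_ok=True)
```

With the directories already in place, whether METplus then gets past its own `os.makedirs` calls depends on how it checks for existing paths, so this would need testing against v3.1.1.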