NOAA-GFDL / fre-workflows

Code to generate, describe, validate, and configure scientific workflows within the FRE software framework

remap-pp-components doesn't confirm that pp components were successfully remapped before exit #13

Open uwagura opened 2 months ago

uwagura commented 2 months ago

There have been several instances where cylc reported that the `remap-pp-components` task had executed successfully even though it had not. Two cases where this occurred for me:

  1. When the `pp_chunk_a` variable in my pp yaml did not match the chunk size that fre inferred from my history files, the remap script simply did nothing at this point in the loop: https://github.com/NOAA-GFDL/fre-workflows/blob/c18dedd918a09fef11a24ffcab692ab8d86a3e7b/app/remap-pp-components/bin/remap-pp-components#L378-L380

  2. When the link command fails, the remap script does nothing in response to the non-zero return value: https://github.com/NOAA-GFDL/fre-workflows/blob/c18dedd918a09fef11a24ffcab692ab8d86a3e7b/app/remap-pp-components/bin/remap-pp-components#L451-L464

These problems were quite difficult to debug because cylc, job.out, and job.err all indicated that the remap job had completed successfully when it had in fact failed silently. There could also be several other points where this loop fails or skips an iteration, and the function would still return 0 and print "Component remapping complete" without ever executing the copy command.
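To illustrate the second case, here is a minimal sketch of a link step that cannot fail silently. It assumes the script is Python and that the intent is to hard-link first and fall back to a copy; `link_or_copy` is a hypothetical helper, not a function from the current script.

```python
import os
import shutil
from pathlib import Path


def link_or_copy(src: Path, dst: Path) -> str:
    """Hard-link src to dst, fall back to a copy, and raise if both fail.

    Hypothetical helper: the name and the return values are illustrative.
    """
    dst.parent.mkdir(parents=True, exist_ok=True)
    if dst.exists():
        dst.unlink()  # replace a stale file left by an earlier attempt
    try:
        os.link(src, dst)
        return "linked"
    except OSError as link_err:
        try:
            shutil.copy2(src, dst)
            return "copied"
        except OSError as copy_err:
            # Surface the failure instead of moving on to the next file.
            raise RuntimeError(
                f"could not link ({link_err}) or copy ({copy_err}) {src} -> {dst}"
            ) from copy_err
```

With something like this, a failed link either falls back to a copy or stops the task with a traceback in job.err, rather than letting the loop continue.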

I think it would be helpful to at least add a check before printing that the remapping completed successfully, so that if the files aren't copied to output_dir the user knows to check this script for issues instead of having cylc fail at some future workflow step that may not be related.
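A rough sketch of the kind of check I have in mind, assuming the script is Python and can collect the list of files it intended to remap. `find_missing_outputs`, `finish`, and `expected_relpaths` are hypothetical names, not part of the current script.

```python
import sys
from pathlib import Path


def find_missing_outputs(output_dir, expected_relpaths):
    """Return the expected paths that are absent or empty under output_dir."""
    missing = []
    for rel in expected_relpaths:
        target = Path(output_dir) / rel
        if not target.is_file() or target.stat().st_size == 0:
            missing.append(rel)
    return missing


def finish(output_dir, expected_relpaths):
    """Return the exit code the task should use (0 only if outputs exist)."""
    if not expected_relpaths:
        # Covers the chunk-size mismatch case: the loop matched nothing.
        print("Error: no files were selected for remapping", file=sys.stderr)
        return 1
    missing = find_missing_outputs(output_dir, expected_relpaths)
    if missing:
        print(f"Error: {len(missing)} file(s) missing from {output_dir}:",
              file=sys.stderr)
        for rel in missing:
            print(f"  {rel}", file=sys.stderr)
        return 1
    print("Component remapping complete")
    return 0
```

The script would then end with `sys.exit(finish(output_dir, expected_relpaths))`, so a non-zero exit makes cylc mark the task as failed at this step instead of at a later one.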