ACCESS-NRI / accessdev-Trac-archive

Archive accessdev Trac contents as issues
Apache License 2.0

Error in retry of a failed run #347

Closed penguian closed 5 years ago

penguian commented 6 years ago

resolution_fixed | by mrd599@nci.org.au


Suites are configured to retry automatically on failure. This works OK for an MPI-related failure right at the start of a run.

However, if the model runs for a few days and writes the partial sum files before failing, the retry will fail because these partial sums are dated later than the restarted model time.


Issue migrated from trac:347 at 2024-01-31 18:32:39 +1100

penguian commented 6 years ago

@martin.dix@anu.edu.au commented


Not an issue for the Met Office because they write both restarts and partial sums every 10 days.

A Gregorian calendar run writes the partial sums every day.

Removing them before the retry should be OK as long as we only save monthly means from the UM, not seasonal means.
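
As a rough illustration of the cleanup described above, the sketch below deletes leftover partial sum files from a directory before a retry. The function name, the directory argument, and the glob pattern are all hypothetical; the actual partial sum file names depend on the suite configuration and are not given in this ticket.

```python
from pathlib import Path


def remove_stale_partial_sums(data_dir, pattern):
    """Delete regular files in data_dir that match a glob pattern.

    Both arguments are assumptions for illustration: the caller must
    supply the directory and a pattern that matches only the partial
    sum files of the suite in question.
    Returns the names of the files that were removed, in sorted order.
    """
    removed = []
    for path in sorted(Path(data_dir).glob(pattern)):
        if path.is_file():
            path.unlink()  # remove the stale partial sum file
            removed.append(path.name)
    return removed
```

A retry handler could call this once before resubmitting the model task, so that the restarted run does not find partial sums dated later than its own start time.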

penguian commented 5 years ago

@martin.dix@anu.edu.au changed status from assigned to closed

penguian commented 5 years ago

@martin.dix@anu.edu.au set resolution to fixed

penguian commented 5 years ago

@martin.dix@anu.edu.au commented


Writing the partial sums to /jobfs gets around this problem because /jobfs is not persistent, so partial sums from a failed attempt are not left behind for the retry. It should also be more efficient than using /short for these files.

https://code.metoffice.gov.uk/trac/roses-u/changeset/106959/b/f/4/8/1/trunk.
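
The idea of directing partial sums to job-local storage can be sketched as follows. This is not the content of the changeset linked above. It assumes the batch system exposes the job-local directory through an environment variable named `PBS_JOBFS`, and the function name, subdirectory name, and fallback path are hypothetical.

```python
import os


def partial_sum_dir(env=None, fallback="/short/partial_sums"):
    """Choose a directory for partial sum files.

    Prefers job-local storage when the (assumed) PBS_JOBFS variable is
    set, so the files do not outlive the job. The fallback path is a
    placeholder, not a real project directory.
    """
    env = os.environ if env is None else env
    jobfs = env.get("PBS_JOBFS")
    if jobfs:
        return jobfs.rstrip("/") + "/partial_sums"
    return fallback
```

Because each retry runs as a new job with its own job-local directory, files written this way cannot carry over from the failed attempt.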