ACCESS-NRI / accessdev-Trac-archive

Archive accessdev Trac contents as issues
Apache License 2.0

New filemove can overwrite output before mppncombine runs #328

Closed penguian closed 6 years ago

penguian commented 6 years ago

keyword_mppncombine resolution_fixed | by rb4844


I have had a couple of runs where mppncombine has failed. This appears to be a consequence of long queue times: by the time mppncombine finally runs, the ocean files have gone from the temporary HistoryData folder and, I assume, been transferred across to short.

As a consequence, housekeep gets stuck waiting. Two more NCI coupled steps are spawned in the rose tree, but none after that unless I manually intervene to delete the failed steps.

See screen captures.

E.g. job.err:

        mv: target `ocean_scalar.nc-00010630' is not a directory
        Received signal EXIT


Issue migrated from trac:328 at 2024-01-31 18:30:20 +1100

penguian commented 6 years ago

rb4844 uploaded file am528_running_3.08pm_270717.png (2331.3 KiB)

mppncombine fail 1

penguian commented 6 years ago

rb4844 uploaded file mppncombine fail 2.png (2351.8 KiB)

mppncombine fail 2

penguian commented 6 years ago

rb4844 commented


How do I delete tiff attachments now?

penguian commented 6 years ago

@scott.wales@bom.gov.au commented


Might be because I'm an admin, but I see a 'delete attachment' button when I follow the attachment link

penguian commented 6 years ago

@martin.dix@anu.edu.au commented


Arnold reports the same problem in #327.

The dependency graph is

        [[[ [RESUB] ]]]
           graph = """
                 filemove[-[RESUB]] => coupled => filemove => mppncombine [ '=> housekeep' if HOUSEKEEP else '' ]
           """

so the model depends only on filemove from the previous run. A new run can start even if mppncombine from the previous run hasn't completed.

The problem is that the filemove script doesn't add a date stamp to the filenames, so files from a previous run can be overwritten before mppncombine has processed them.
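To make the gap in the graph concrete, here is a minimal Python sketch that models the dependencies as plain prerequisite sets and checks whether one task is a direct or transitive prerequisite of another. It is an illustration only, not cylc itself: `build_graph`, `depends_on` and the `carry_over_task` parameter are hypothetical names, cycles are represented by simple integers, and housekeep is left out.

```python
from collections import defaultdict


def build_graph(n_cycles, carry_over_task):
    """Return {node: set(prerequisite nodes)} for n_cycles cycles.

    A node is a (task_name, cycle_index) tuple. carry_over_task names the
    task from the previous cycle that gates the next cycle's 'coupled'.
    """
    prereqs = defaultdict(set)
    for c in range(n_cycles):
        # Within a cycle: coupled => filemove => mppncombine
        prereqs[("filemove", c)].add(("coupled", c))
        prereqs[("mppncombine", c)].add(("filemove", c))
        if c > 0:
            # Between cycles: carry_over_task[previous] => coupled
            prereqs[("coupled", c)].add((carry_over_task, c - 1))
    return prereqs


def depends_on(prereqs, node, target):
    """True if target is a direct or transitive prerequisite of node."""
    stack, seen = [node], set()
    while stack:
        current = stack.pop()
        for p in prereqs.get(current, ()):
            if p == target:
                return True
            if p not in seen:
                seen.add(p)
                stack.append(p)
    return False


graph = build_graph(3, carry_over_task="filemove")

# The next cycle's filemove waits for the previous filemove ...
print(depends_on(graph, ("filemove", 1), ("filemove", 0)))     # True
# ... but nothing makes it wait for the previous mppncombine.
print(depends_on(graph, ("filemove", 1), ("mppncombine", 0)))  # False
```

The second check is the race described in this ticket: filemove for the next cycle is free to run while mppncombine for the previous cycle is still queued. Passing a different `carry_over_task` lets other gating choices be compared in the same way.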

This could be fixed by changing the dependency to

                 mppncombine[-[RESUB]] => coupled => filemove => mppncombine

but it is more efficient to add dates to the filenames so that the model doesn't have to wait unnecessarily.

This is implemented in modified versions of mppcombine.sh and filemove_access.sh in u-ao219. See

https://code.metoffice.gov.uk/trac/roses-u/changeset/48244/a/o/2/1/9
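The effect of date-stamping can be sketched in a few lines of Python. This is not the code from u-ao219 (those scripts are shell and are not reproduced here); `staged_name` and `filemove` are hypothetical stand-ins, the `name-stamp` pattern is only modelled on the filename seen in job.err above, and the second stamp value is invented for the example.

```python
from pathlib import Path
import tempfile


def staged_name(filename, cycle_stamp=None):
    """Name used in the staging area, optionally tagged with a cycle stamp."""
    return filename if cycle_stamp is None else f"{filename}-{cycle_stamp}"


def filemove(staging, filename, payload, cycle_stamp=None):
    """Stand-in for the file move step: write one model output into staging."""
    target = Path(staging) / staged_name(filename, cycle_stamp)
    target.write_text(payload)
    return target


# Undated names: the second cycle silently replaces the first cycle's file.
with tempfile.TemporaryDirectory() as staging:
    filemove(staging, "ocean_scalar.nc", "cycle 1 data")
    filemove(staging, "ocean_scalar.nc", "cycle 2 data")
    undated = sorted(p.name for p in Path(staging).iterdir())

# Dated names: both cycles' files coexist until the combine step handles them.
with tempfile.TemporaryDirectory() as staging:
    filemove(staging, "ocean_scalar.nc", "cycle 1 data", "00010630")
    filemove(staging, "ocean_scalar.nc", "cycle 2 data", "00010930")
    dated = sorted(p.name for p in Path(staging).iterdir())

print(undated)  # ['ocean_scalar.nc']
print(dated)    # ['ocean_scalar.nc-00010630', 'ocean_scalar.nc-00010930']
```

With unique names per cycle, a delayed combine step still finds its own inputs, so the next model run does not need to wait for it.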

penguian commented 6 years ago

@martin.dix@anu.edu.au changed _comment0, which was not transferred by tractive

penguian commented 6 years ago

@martin.dix@anu.edu.au changed status from new to assigned

penguian commented 6 years ago

@martin.dix@anu.edu.au set owner to mrd599

penguian commented 6 years ago

@martin.dix@anu.edu.au commented


This fix can be applied to a running suite by copying the new filemove_access.sh and mppcombine.sh to the cylc-run/SUITE/bin directory on raijin. Note that you can only do this after filemove and mppncombine have both run, because it changes the intermediate filenames.

penguian commented 6 years ago

@martin.dix@anu.edu.au changed status from assigned to closed

penguian commented 6 years ago

@martin.dix@anu.edu.au set resolution to fixed

penguian commented 6 years ago

@martin.dix@anu.edu.au changed title from mppncombine failed to New filemove can overwrite output before mppncombine runs