joakimkjellsson closed this issue 1 year ago
Hi Joakim, that's a puzzling issue. I have not seen something like this recently. Could you provide access to an experiment that failed to tidy up, and that was not restarted, so I can have a look?
Hi @mandresm Unfortunately I deleted the run dir already. My inodes and disk space were way beyond what HLRN allows ;-) After I ran the "tidy" script, the model restarted and kept going just fine, so it is puzzling.
After talking with HLRN support, I have a speculation about what the problem might have been. HLRN has been trying to save energy (as all Germans should nowadays) by putting idle compute nodes to sleep. The problem is, they don't always "wake up" when they are supposed to. I've had jobs that start but just sit there without doing anything. Maybe that was the issue here?
Anyway, this energy saver has been disabled now, and I haven't had any problems since. Should it happen again, I'll put all the log files here.
Feel free to close the issue for now. We can re-open the issue later if needed, right?
Best wishes, Joakim
Unfortunately I have the same issue on Juwels. It only runs the steps prepcompute and compute, but not observe_compute and tidy, as it does on Levante.
Experiment on Juwels:
/p/scratch/chhb19/wieters1/runtime/awicm3-v3.1/run_006_initial/
I will try to figure out what causes these different job steps on the two machines.
Hi @nwieters
After a few months I have noticed that this problem only occurs when I'm running with
#SBATCH --qos=preempt
on HLRN, i.e. running on unused nodes without charging my budget. My speculation is that the "tidy" job sometimes gets kicked off the queue when the nodes it was supposed to use are suddenly requested by someone with a budget. The problem did not occur once during a 700-year run that I did without the "preempt" feature.
Not sure if JUWELS has "preempt" activated, but on HLRN I suspect this was the culprit.
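For anyone who wants to check whether preemption is really the cause, Slurm's accounting can show what happened to the tidy job after the fact. This is a minimal sketch with generic Slurm commands; the job ID and dates are placeholders, not values from this experiment:

# Final state of a suspect job; a preempted job typically shows
# PREEMPTED (or CANCELLED) instead of COMPLETED
sacct -j <jobid> --format=JobID,JobName,QOS,State,ExitCode

# Or list all your jobs in a time window and look for non-COMPLETED states
sacct -u $USER -S <start-date> -E <end-date> --format=JobID,JobName,QOS,State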
Best wishes, Joakim
@nwieters, you have a Python error right at the beginning of the log file. This means the ESM-Tools tidy scripts, or any Python action, won't be able to run. My guess is that you have a conflict of Stages. You probably installed ESM-Tools using the 2023 Stages, while ESM-Tools should only be run with Stages 2022. This was a decision made to keep stability and not to depend on the Juwels Stages upgrade schedule, which has given us big headaches in the past.
If loading and installing the correct Stage on Juwels solves the problem, we probably want to document that here: https://esm-tools.readthedocs.io/en/latest/installation.html#before-you-continue, to be added to the Juwels bash_profile recommendation.
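In case it helps, a quick way to see which Stages and Python are currently loaded before reinstalling; these are generic module/shell commands, nothing ESM-Tools specific:

# List the currently loaded modules and look for the Stages and Python entries
module list

# Confirm which Python a fresh installation would pick up
which python
python --version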
Hi @mandresm, thank you for the hint. I saw this error this morning as well and came up with the same idea as you described, namely that I installed esm-tools with the wrong modules. I will try it with Stages 2022 and Python 3.9.6. If this works, I can also update the documentation.
Hi @joakimkjellsson, thanks for your tip. I have not found out whether preempt is activated on Juwels; there was not much in the Juwels documentation related to it. But in my case I had the wrong modules loaded before I installed esm-tools (as @mandresm suggested).
For awicm3 it now works with
module load Stages/2022
module load Python/3.9.6
and then ./install.sh
I am still not sure whether this depends on the model or setup. I will try to run another setup with this esm-tools installation.
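To double-check that the reinstall really picked up the intended Python, one option is to look at the entry point it created; a sketch assuming esm_runscripts ends up on your PATH after ./install.sh:

# The entry point's shebang reveals which Python interpreter it will use
which esm_runscripts
head -1 "$(which esm_runscripts)"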
Good afternoon all
It has happened to me a few times now that an experiment does not resubmit itself after a finished leg. The most recent example is when running FOCI-OpenIFS on the "feature/focioifs_o12" branch on HLRN (Berlin).
The experiment should do monthly restarts for 5 years (1950-1955), but it stopped after 31 Dec 1951. Everything in the work directory points to a successful run, i.e. the model did not crash. All model components produced restart files, the "debug.root.??" files end with "SUCCESSFUL RUN", and the "log" file from SLURM doesn't show any errors. There is nothing about "disk quota exceeded", a node failure, or anything like that.
It's like the model finishes but ESM-Tools does not run the "tidy" and "resubmit" tasks. The restart files are not in the "/restart/nemo" directory, so ESM-Tools has not moved anything around after the job finished.
No post processing script has been launched either (which it should be).
It also has not produced a run directory "run_19520101-19520131" for the next leg. It's like ESM-Tools thinks the experiment is over.
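One way to verify that no tidy/resubmit job was even submitted is to query Slurm and the experiment logs after the fact; a generic sketch where the date and log path are placeholders for the actual experiment:

# List all jobs around the time the compute job finished and check their states
sacct -u $USER -S <date-of-last-leg> --format=JobID,JobName,State,End

# Search the experiment's log files for any sign that the tidy step started
grep -ri "tidy" <experiment-directory>/log/ | tail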
If I then run the "tidy" script manually, ESM-Tools runs the tidying steps, runs the post-processing, and resubmits the next leg. Then it works again!
So my question is: Why was "tidy" not run directly? Has anyone else encountered problems like this?
I'm using the "feature/focioifs_o12" branch of ESM-Tools on HLRN (Berlin). I'm not using any other models or machines at the moment, so I'm not sure if it's a model- or machine-specific problem.
Many thanks, Joakim