aidanheerdegen closed this issue 1 year ago.
@utkarshgupta95 will take a look at this, but please let us know if you think I've misunderstood, or the proposed solution is unacceptable @aekiss.
Ah, good catch @aidanheerdegen, I hadn't considered `runspersub`.
I don't think the git hash is suitable, because we also want to track the failed jobs, and sometimes these are resubmissions with no git changes.
Why not just use the PBS ID concatenated with a run timestamp? IIRC it doesn't really matter what the key is, so long as it is unique.
Ah, that won't work either - the run completion date is from the PBS log
> I don't think the git hash is suitable, because we also want to track the failed jobs, and sometimes these are resubmissions with no git changes.

Ah. Right. So brute force would be PBS ID joined to git hash, e.g. `pbs_id.git_commit_hash`.
I think so, as long as `runlog: True`.

Or append `PAYU_CURRENT_RUN` from `job.yaml` - would that work? Or is `PAYU_N_RUNS` what I'm thinking of? Is this a counter of runs within the submission?
> I think so, as long as `runlog: True`.

Yes, but I'm happy to list that as a requirement for your tool if you are.

> Or append `PAYU_CURRENT_RUN` from `job.yaml` - would that work? Or is `PAYU_N_RUNS` what I'm thinking of? Is this a counter of runs within the submission?
`PAYU_CURRENT_RUN` is the number of the current run, which corresponds to the numbering of the `outputXXX` and `restartXXX` directories. You're right, `PAYU_N_RUNS` is an internal counter that can be >1 when `runspersub` is used.
We could append the `PAYU_CURRENT_RUN` to the PBS ID to make it unique for the case of `runspersub` > 1. For failed runs the PBS ID would be different when it was re-run, so again that would be unique.

I like the idea of using the git hash as it has more value than a run counter, and maybe also solves #17. It has the downside that it requires `runlog: True`, but I think that is a reasonable restriction.

I am interested in your opinion @aekiss.
Also, which of those is easier to code? At this point the key would have to be changed:

https://github.com/aekiss/run_summary/blob/master/run_summary.py#L607-L608

Is `PAYU_CURRENT_RUN` available at that point? What about the git commit hash? Would both require inspection of files to generate the key?
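For concreteness, the two key options being weighed up would look something like this (a sketch only; the helper and argument names are made up, not from run_summary.py):

```python
# Sketch only: neither helper exists in run_summary.py; pbs_id,
# git_commit_hash and payu_current_run stand in for whatever the
# scraping code already provides.

def key_from_git(pbs_id, git_commit_hash):
    """PBS ID joined to the git hash, e.g. '89650628.abc1234'.
    Requires runlog: True so that every run has a commit."""
    return f"{pbs_id}.{git_commit_hash}"


def key_from_run_counter(pbs_id, payu_current_run):
    """PBS ID plus PAYU_CURRENT_RUN; distinct within one PBS job
    when runspersub > 1 because the run counter differs."""
    return f"{pbs_id}_{payu_current_run}"
```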
Will https://github.com/aekiss/run_summary/pull/29 fix this?
The key is now a string `<jobid>_<PAYU_N_RUNS>` rather than an integer:

https://github.com/aekiss/run_summary/blob/708eff0e0692cb223da9899f55a1114d7c4c535b/run_summary.py#L611

The `jobid` part is from the PBS log filename (as before), so it exists even if the other job info is missing from that file (as sometimes happens). The `PAYU_N_RUNS` part is either an integer or `None` if it's missing from the PBS log file for some reason.

I think this is the most robust solution, as it only requires the existence of the PBS log file, which may even be incomplete, so it enables scraping info from more failed runs. The key is arbitrary and just needs to be unique. It shouldn't be used as data (I've fixed the one instance where I did this).
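For reference, a minimal sketch of how such a key could be assembled (an assumption about the shape, not the actual code at the line linked above):

```python
import re

def make_key(pbs_log_filename, pbs_log_text):
    """Sketch of a '<jobid>_<PAYU_N_RUNS>' key: jobid from the log
    filename (e.g. '1deg_jra55_ryf.o89650628' -> '89650628'),
    PAYU_N_RUNS from the qsub line inside the log, or None if absent."""
    jobid = pbs_log_filename.rsplit(".o", 1)[-1]
    m = re.search(r"PAYU_N_RUNS=(\d+)", pbs_log_text)
    n_runs = int(m.group(1)) if m else None
    return f"{jobid}_{n_runs}"   # e.g. '89650628_1' or '89650628_None'
```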
Although less informative, `<jobid>_<PAYU_N_RUNS>` is more general than also needing to query git. And I'm not sure a git key would help with #17.

`PAYU_CURRENT_RUN` would also require parsing another file, which is why I opted for `PAYU_N_RUNS`. Am I right in thinking `PAYU_N_RUNS` is unique for each run within a PBS job?
> Am I right in thinking `PAYU_N_RUNS` is unique for each run within a PBS job?

Might depend where you get it from.
```
[aph502@gadi-login-02 archive]$ grep PAYU_N_RUNS output00*/job.yaml
output000/job.yaml:PAYU_N_RUNS: 1
output001/job.yaml:PAYU_N_RUNS: 1
output002/job.yaml:PAYU_N_RUNS: 3
output003/job.yaml:PAYU_N_RUNS: 2
output004/job.yaml:PAYU_N_RUNS: 1
[aph502@gadi-login-02 archive]$ grep PAYU_N_RUNS output00*/env.yaml
output002/env.yaml:PAYU_N_RUNS: '3'
output003/env.yaml:PAYU_N_RUNS: '3'
output004/env.yaml:PAYU_N_RUNS: '3'
```
Note that in `env.yaml` it doesn't change, because that was the value when the PBS job was started. In `job.yaml` it decrements every time.
You can just get the canonical run number from the output directory name, which is similar in intent to getting the PBS ID from the logfile name. In the case of failed runs that isn't an option, so I guess you'd need a try/except block when looking for the output directory? TBH I haven't gone into the logic in detail about how you deal with failed runs, so what I'm suggesting might not be a great idea.
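Something along these lines, perhaps (a sketch of the try/except suggestion above; run_summary.py may handle this differently):

```python
import os
import re

def run_number_from_output_dir(output_dir):
    """Canonical run number from the directory name, e.g. 'output004' -> 4.
    Failed runs may have no output directory, so fall back to None."""
    try:
        name = os.path.basename(output_dir.rstrip("/"))
        return int(re.fullmatch(r"output(\d+)", name).group(1))
    except (AttributeError, TypeError):
        # no match, or output_dir was None (e.g. a failed run)
        return None
```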
It isn't clear to me how you get `PAYU_N_RUNS`, as it seems like you're parsing the PBS output files for it, but as far as I can tell it doesn't appear in them:
```
[aph502@gadi-login-02 pbs_logs]$ ls
1deg_jra55_ry_c.e89650082 1deg_jra55_ry_c.e90353749 1deg_jra55_ry_c.o89652142 1deg_jra55_ry_c.o90354701 1deg_jra55_ryf.e89650628 1deg_jra55_ryf.o89648460
1deg_jra55_ry_c.e89652142 1deg_jra55_ry_c.e90354701 1deg_jra55_ry_c.o90352230 1deg_jra55_ryf.e89648068 1deg_jra55_ryf.e90350959 1deg_jra55_ryf.o89650628
1deg_jra55_ry_c.e90352230 1deg_jra55_ry_c.o89650082 1deg_jra55_ry_c.o90353749 1deg_jra55_ryf.e89648460 1deg_jra55_ryf.o89648068 1deg_jra55_ryf.o90350959
[aph502@gadi-login-02 pbs_logs]$ grep PAYU_N_RUNS *
[aph502@gadi-login-02 pbs_logs]$
```
but `PAYU_CURRENT_RUN` does:

```
qsub -q normal -P tm70 -l walltime=3600 -l ncpus=4 -l mem=30GB -N 1deg_jra55_ry_c -l wd -j n -v PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-23.04/bin,PAYU_CURRENT_RUN=4,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data3/hh5/public/modules:/etc/scl/modulefiles:/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -W umask=027 -l storage=gdata/hh5+gdata/ik11+scratch/tm70 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.04/bin/python3.9 /g/data3/hh5/public/apps/miniconda3/envs/analysis3-23.04/bin/payu-collate
```
Hm, maybe it depends on the payu version that was used, as `PAYU_N_RUNS` does show up in some cases:
```
$ grep PAYU_N_RUNS ~aek156/payu/01deg_jra55v140_iaf_cycle3_repro_test/*.o*
/home/156/aek156/payu/01deg_jra55v140_iaf_cycle3_repro_test/01deg_jra55_iaf.o57831648:qsub -q normal -P x77 -l walltime=14400 -l ncpus=12144 -l mem=48576GB -N 01deg_jra55_iaf -l wd -j n -v PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-22.04/bin,PAYU_CURRENT_RUN=589,PAYU_N_RUNS=2,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/hh5/public/modules:/etc/scl/modulefiles:/apps/Modules/restricted-modulefiles/matlab_unsw:/apps/Modules/restricted-modulefiles/matlab_anu:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -W umask=027 -l storage=gdata/hh5+gdata/ik11+gdata/qv56+scratch/v45 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.04/bin/python3.9 /g/data3/hh5/public/apps/miniconda3/envs/analysis3-22.04/bin/payu-run
/home/156/aek156/payu/01deg_jra55v140_iaf_cycle3_repro_test/01deg_jra55_iaf.o57841063:qsub -q normal -P x77 -l walltime=14400 -l ncpus=12144 -l mem=48576GB -N 01deg_jra55_iaf -l wd -j n -v PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-22.04/bin,PAYU_CURRENT_RUN=590,PAYU_N_RUNS=1,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/hh5/public/modules:/etc/scl/modulefiles:/apps/Modules/restricted-modulefiles/matlab_unsw:/apps/Modules/restricted-modulefiles/matlab_anu:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -W umask=027 -l storage=gdata/hh5+gdata/ik11+gdata/qv56+scratch/v45 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.04/bin/python3.9 /g/data3/hh5/public/apps/miniconda3/envs/analysis3-22.04/bin/payu-run
```
I've just pushed a new commit that uses a key that concatenates the PBS jobid, `PAYU_CURRENT_RUN` and `PAYU_N_RUNS`, with the latter two set to `None` if missing. Hopefully that covers all bases?
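In other words, a key with this shape (a sketch, not the code in the commit itself):

```python
import re

def composite_key(jobid, pbs_log_text):
    """'<jobid>_<PAYU_CURRENT_RUN>_<PAYU_N_RUNS>', with None for any
    variable missing from the PBS log (e.g. older payu versions)."""
    def grab(var):
        m = re.search(rf"{var}=(\d+)", pbs_log_text)
        return int(m.group(1)) if m else None
    return f"{jobid}_{grab('PAYU_CURRENT_RUN')}_{grab('PAYU_N_RUNS')}"
```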
> Hm, maybe it depends on the payu version that was used, as `PAYU_N_RUNS` does show up in some cases

Maybe. How odd.
`PAYU_N_RUNS` is specified in the `payu` command line invocation when the number of runs is > 1:
```
$ payu run -n 6
payu: warning: Job request includes 47 unused CPUs.
payu: warning: CPU request increased from 241 to 288
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
payu: Found modules in /opt/Modules/v4.3.0
qsub -q normal -P tm70 -l walltime=10800 -l ncpus=288 -l mem=1000GB -N 1deg_jra55_ryf -l wd -j n -v PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-23.04/bin,PAYU_N_RUNS=6,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data3/hh5/public/modules:/etc/scl/modulefiles:/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -W umask=027 -l storage=gdata/hh5+gdata/ik11+scratch/tm70 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.04/bin/python3.9 /g/data3/hh5/public/apps/miniconda3/envs/analysis3-23.04/bin/payu-run
```
but not when `runspersub == 1`:
```
$ payu run
payu: warning: Job request includes 47 unused CPUs.
payu: warning: CPU request increased from 241 to 288
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
payu: Found modules in /opt/Modules/v4.3.0
qsub -q normal -P tm70 -l walltime=10800 -l ncpus=288 -l mem=1000GB -N 1deg_jra55_ryf -l wd -j n -v PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-23.04/bin,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data3/hh5/public/modules:/etc/scl/modulefiles:/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -W umask=027 -l storage=gdata/hh5+gdata/ik11+scratch/tm70 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.04/bin/python3.9 /g/data3/hh5/public/apps/miniconda3/envs/analysis3-23.04/bin/payu-run
```
TBH I think just PBS jobid and `PAYU_CURRENT_RUN` would suffice:
```
$ grep PAYU_CURRENT_RUN archive/pbs_logs/*.o* | cut -d, -f2
PAYU_CURRENT_RUN=0
PAYU_CURRENT_RUN=1
PAYU_CURRENT_RUN=2
PAYU_CURRENT_RUN=3
PAYU_CURRENT_RUN=4
PAYU_CURRENT_RUN=5
PAYU_CURRENT_RUN=6
PAYU_CURRENT_RUN=7
$
```
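The same extraction in Python, for what it's worth (a sketch; a regex avoids depending on the field order in the `qsub -v` list, which the `cut -d, -f2` one-liner above does):

```python
import glob
import re

# Pull PAYU_CURRENT_RUN out of each PBS stdout log, wherever it sits
# in the qsub -v variable list; None if the variable is absent.
for path in sorted(glob.glob("archive/pbs_logs/*.o*")):
    with open(path, errors="replace") as f:
        m = re.search(r"PAYU_CURRENT_RUN=(\d+)", f.read())
    print(path, int(m.group(1)) if m else None)
```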
Are we sure `PAYU_CURRENT_RUN` will always be available? I'd put a cryptic comment in the code years ago to say sometimes this fails, which is why I thought I should include both `PAYU_N_RUNS` and `PAYU_CURRENT_RUN`. At worst it's harmlessly over-thorough, and at best it allows indexing of some old and weirdly incomplete PBS logs.

The key is arbitrary (except that it must be unique) and is not intended to be used as data - e.g. jobs are not sorted using it.
> Are we sure `PAYU_CURRENT_RUN` will always be available?

I guess I can't guarantee that for some older runs. In which case belt and braces should be fine, I guess.
Terminology:

In this issue I refer to a run, which means an invocation of `payu run`, where the model reads in restart files, runs for a period of time, and writes restart and output files, which are then archived. An experiment is a sequential series of runs.

When `run_summary.py` is run on an experiment it gathers data from a number of runs, and the PBS ID is used as the primary key when storing information from each run in a python `dict` called `run_data`, e.g.

https://github.com/aekiss/run_summary/blob/master/run_summary.py#L607C38-L616

When `runspersub` is greater than 1 there is more than one run with the same PBS ID. This means data is lost: if there is more than one run with the same PBS ID, the data from previous runs will be overwritten by data from subsequent runs.
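To illustrate the overwriting (a toy example with made-up values, not run_summary.py code):

```python
# With runspersub > 1, two runs share a PBS ID, so keying run_data on the
# PBS ID alone silently replaces the earlier run's entry with the later one.
run_data = {}
run_data[89650628] = {"run": 3}   # first run in the PBS job (made-up values)
run_data[89650628] = {"run": 4}   # second run in the same job overwrites it
print(len(run_data))              # 1 -- the data from run 3 is gone
```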
I believe the best fix for this is to use the git commit ID as the key for the `run_data` `dict`.

A possible downside of this is that a run with `runlog: False` is not supported. However, this is uncommon, and as long as it is clear that this is not possible I think this is acceptable.

Does this also address/fix https://github.com/aekiss/run_summary/issues/17?