aidanheerdegen closed this issue 1 year ago.
@utkarshgupta95 will take a look at this, but please let us know if you think I've misunderstood, or the proposed solution is unacceptable @aekiss.
Ah, good catch @aidanheerdegen, I hadn't considered `runspersub`.
I don't think the git hash is suitable, because we also want to track the failed jobs, and sometimes these are resubmissions with no git changes.
Why not just use the PBS ID concatenated with a run timestamp? IIRC it doesn't really matter what the key is, so long as it is unique.
Ah, that won't work either - the run completion date is from the PBS log
> I don't think the git hash is suitable, because we also want to track the failed jobs, and sometimes these are resubmissions with no git changes.

Ah. Right. So brute force would be PBS ID joined to git hash, e.g. `pbs_id.git_commit_hash`.
I think so, as long as `runlog: True`.

Or append `PAYU_CURRENT_RUN` from `job.yaml` - would that work? Or is `PAYU_N_RUNS` what I'm thinking of? Is this a counter of runs within the submission?
> I think so, as long as `runlog: True`.

Yes, but I'm happy to list that as a requirement for your tool if you are.

> Or append `PAYU_CURRENT_RUN` from `job.yaml` - would that work? Or is `PAYU_N_RUNS` what I'm thinking of? Is this a counter of runs within the submission?
`PAYU_CURRENT_RUN` is the number of the current run, which corresponds to the numbering of the `outputXXX` and `restartXXX` directories. You're right, `PAYU_N_RUNS` is an internal counter that can be >1 when `runspersub` is used.
We could append the `PAYU_CURRENT_RUN` to the PBS ID to make it unique for the case of `runspersub` > 1. For failed runs the PBS ID would be different when it was re-run, so again that would be unique.

I like the idea of using the git hash as it has more value than a run counter, and maybe also solves #17. It has the downside that it requires `runlog: True`, but I think that is a reasonable restriction.

I am interested in your opinion @aekiss.
Also, which of those is easier to code? At this point the key would have to be changed:

https://github.com/aekiss/run_summary/blob/master/run_summary.py#L607-L608

Is `PAYU_CURRENT_RUN` available at that point? What about the git commit hash? Would both require inspection of files to generate the key?
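For concreteness, the two key options being weighed up would look something like this (a sketch only; the helper and argument names are made up, not from run_summary.py):

```python
# Sketch only: neither helper exists in run_summary.py; pbs_id,
# git_commit_hash and payu_current_run stand in for whatever the
# scraping code already provides.

def key_from_git(pbs_id, git_commit_hash):
    """PBS ID joined to the git hash, e.g. '89650628.abc1234'.
    Requires runlog: True so that every run has a commit."""
    return f"{pbs_id}.{git_commit_hash}"


def key_from_run_counter(pbs_id, payu_current_run):
    """PBS ID plus PAYU_CURRENT_RUN; distinct within one PBS job
    when runspersub > 1 because the run counter differs."""
    return f"{pbs_id}_{payu_current_run}"
```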
Will https://github.com/aekiss/run_summary/pull/29 fix this?
The key is now a string `<jobid>_<PAYU_N_RUNS>` rather than an integer:

https://github.com/aekiss/run_summary/blob/708eff0e0692cb223da9899f55a1114d7c4c535b/run_summary.py#L611

The `jobid` part is from the PBS log filename (as before), so it exists even if the other job info is missing from that file (as sometimes happens). The `PAYU_N_RUNS` part is either an integer or `None` if it's missing from the PBS log file for some reason.

I think this is the most robust solution, as it only requires the existence of the PBS log file, which may even be incomplete, so it enables scraping info from more failed runs. The key is arbitrary and just needs to be unique. It shouldn't be used as data (I've fixed the one instance where I did this).
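For reference, a minimal sketch of how such a key could be assembled (an assumption about the shape, not the actual code at the line linked above):

```python
import re

def make_key(pbs_log_filename, pbs_log_text):
    """Sketch of a '<jobid>_<PAYU_N_RUNS>' key: jobid from the log
    filename (e.g. '1deg_jra55_ryf.o89650628' -> '89650628'),
    PAYU_N_RUNS from the qsub line inside the log, or None if absent."""
    jobid = pbs_log_filename.rsplit(".o", 1)[-1]
    m = re.search(r"PAYU_N_RUNS=(\d+)", pbs_log_text)
    n_runs = int(m.group(1)) if m else None
    return f"{jobid}_{n_runs}"   # e.g. '89650628_1' or '89650628_None'
```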
Although less informative, `<jobid>_<PAYU_N_RUNS>` is more general than also needing to query git. And I'm not sure a git key would help with #17.

`PAYU_CURRENT_RUN` would also require parsing another file, which is why I opted for `PAYU_N_RUNS`. Am I right in thinking `PAYU_N_RUNS` is unique for each run within a PBS job?
> Am I right in thinking `PAYU_N_RUNS` is unique for each run within a PBS job?

Might depend where you get it from.
```
[aph502@gadi-login-02 archive]$ grep PAYU_N_RUNS output00*/job.yaml
output000/job.yaml:PAYU_N_RUNS: 1
output001/job.yaml:PAYU_N_RUNS: 1
output002/job.yaml:PAYU_N_RUNS: 3
output003/job.yaml:PAYU_N_RUNS: 2
output004/job.yaml:PAYU_N_RUNS: 1
[aph502@gadi-login-02 archive]$ grep PAYU_N_RUNS output00*/env.yaml
output002/env.yaml:PAYU_N_RUNS: '3'
output003/env.yaml:PAYU_N_RUNS: '3'
output004/env.yaml:PAYU_N_RUNS: '3'
```
Note that in `env.yaml` it doesn't change, because that was the value when the PBS job was started. In `job.yaml` it decrements every time.
You can just get the canonical run number from the output directory name, which is similar in intent to getting the PBS ID from the logfile name. In the case of failed runs that isn't an option, so I guess you'd need a try/except block when looking for the output directory? TBH I haven't gone into the logic in detail about how you deal with failed runs, so what I'm suggesting might not be a great idea.
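Something along these lines, perhaps (a sketch of the try/except suggestion above; run_summary.py may handle this differently):

```python
import os
import re

def run_number_from_output_dir(output_dir):
    """Canonical run number from the directory name, e.g. 'output004' -> 4.
    Failed runs may have no output directory, so fall back to None."""
    try:
        name = os.path.basename(output_dir.rstrip("/"))
        return int(re.fullmatch(r"output(\d+)", name).group(1))
    except (AttributeError, TypeError):
        # no match, or output_dir was None (e.g. a failed run)
        return None
```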
It isn't clear to me how you get `PAYU_N_RUNS`, as it seems like you're parsing the PBS output files for it, but as far as I can tell it doesn't appear in them:
```
[aph502@gadi-login-02 pbs_logs]$ ls
1deg_jra55_ry_c.e89650082 1deg_jra55_ry_c.e90353749 1deg_jra55_ry_c.o89652142 1deg_jra55_ry_c.o90354701 1deg_jra55_ryf.e89650628 1deg_jra55_ryf.o89648460
1deg_jra55_ry_c.e89652142 1deg_jra55_ry_c.e90354701 1deg_jra55_ry_c.o90352230 1deg_jra55_ryf.e89648068 1deg_jra55_ryf.e90350959 1deg_jra55_ryf.o89650628
1deg_jra55_ry_c.e90352230 1deg_jra55_ry_c.o89650082 1deg_jra55_ry_c.o90353749 1deg_jra55_ryf.e89648460 1deg_jra55_ryf.o89648068 1deg_jra55_ryf.o90350959
[aph502@gadi-login-02 pbs_logs]$ grep PAYU_N_RUNS *
[aph502@gadi-login-02 pbs_logs]$
```
but `PAYU_CURRENT_RUN` does:

```
qsub -q normal -P tm70 -l walltime=3600 -l ncpus=4 -l mem=30GB -N 1deg_jra55_ry_c -l wd -j n -v PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-23.04/bin,PAYU_CURRENT_RUN=4,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data3/hh5/public/modules:/etc/scl/modulefiles:/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -W umask=027 -l storage=gdata/hh5+gdata/ik11+scratch/tm70 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.04/bin/python3.9 /g/data3/hh5/public/apps/miniconda3/envs/analysis3-23.04/bin/payu-collate
```
Hm, maybe it depends on the payu version that was used, as `PAYU_N_RUNS` does show up in some cases:
```
$ grep PAYU_N_RUNS ~aek156/payu/01deg_jra55v140_iaf_cycle3_repro_test/*.o*
/home/156/aek156/payu/01deg_jra55v140_iaf_cycle3_repro_test/01deg_jra55_iaf.o57831648:qsub -q normal -P x77 -l walltime=14400 -l ncpus=12144 -l mem=48576GB -N 01deg_jra55_iaf -l wd -j n -v PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-22.04/bin,PAYU_CURRENT_RUN=589,PAYU_N_RUNS=2,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/hh5/public/modules:/etc/scl/modulefiles:/apps/Modules/restricted-modulefiles/matlab_unsw:/apps/Modules/restricted-modulefiles/matlab_anu:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -W umask=027 -l storage=gdata/hh5+gdata/ik11+gdata/qv56+scratch/v45 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.04/bin/python3.9 /g/data3/hh5/public/apps/miniconda3/envs/analysis3-22.04/bin/payu-run
/home/156/aek156/payu/01deg_jra55v140_iaf_cycle3_repro_test/01deg_jra55_iaf.o57841063:qsub -q normal -P x77 -l walltime=14400 -l ncpus=12144 -l mem=48576GB -N 01deg_jra55_iaf -l wd -j n -v PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-22.04/bin,PAYU_CURRENT_RUN=590,PAYU_N_RUNS=1,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data/hh5/public/modules:/etc/scl/modulefiles:/apps/Modules/restricted-modulefiles/matlab_unsw:/apps/Modules/restricted-modulefiles/matlab_anu:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -W umask=027 -l storage=gdata/hh5+gdata/ik11+gdata/qv56+scratch/v45 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-22.04/bin/python3.9 /g/data3/hh5/public/apps/miniconda3/envs/analysis3-22.04/bin/payu-run
```
I've just pushed a new commit that uses a key that concatenates the PBS jobid, `PAYU_CURRENT_RUN` and `PAYU_N_RUNS`, with the latter two set to `None` if missing. Hopefully that covers all bases?
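In other words, a key with this shape (a sketch, not the code in the commit itself):

```python
import re

def composite_key(jobid, pbs_log_text):
    """'<jobid>_<PAYU_CURRENT_RUN>_<PAYU_N_RUNS>', with None for any
    variable missing from the PBS log (e.g. older payu versions)."""
    def grab(var):
        m = re.search(rf"{var}=(\d+)", pbs_log_text)
        return int(m.group(1)) if m else None
    return f"{jobid}_{grab('PAYU_CURRENT_RUN')}_{grab('PAYU_N_RUNS')}"
```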
> Hm, maybe it depends on the payu version that was used, as `PAYU_N_RUNS` does show up in some cases

Maybe. How odd.
`PAYU_N_RUNS` is specified in the `payu` command line invocation when the number of runs is > 1:
```
$ payu run -n 6
payu: warning: Job request includes 47 unused CPUs.
payu: warning: CPU request increased from 241 to 288
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
payu: Found modules in /opt/Modules/v4.3.0
qsub -q normal -P tm70 -l walltime=10800 -l ncpus=288 -l mem=1000GB -N 1deg_jra55_ryf -l wd -j n -v PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-23.04/bin,PAYU_N_RUNS=6,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data3/hh5/public/modules:/etc/scl/modulefiles:/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -W umask=027 -l storage=gdata/hh5+gdata/ik11+scratch/tm70 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.04/bin/python3.9 /g/data3/hh5/public/apps/miniconda3/envs/analysis3-23.04/bin/payu-run
```
but not when `runspersub == 1`:
```
$ payu run
payu: warning: Job request includes 47 unused CPUs.
payu: warning: CPU request increased from 241 to 288
Loading input manifest: manifests/input.yaml
Loading restart manifest: manifests/restart.yaml
Loading exe manifest: manifests/exe.yaml
payu: Found modules in /opt/Modules/v4.3.0
qsub -q normal -P tm70 -l walltime=10800 -l ncpus=288 -l mem=1000GB -N 1deg_jra55_ryf -l wd -j n -v PAYU_PATH=/g/data3/hh5/public/apps/miniconda3/envs/analysis3-23.04/bin,MODULESHOME=/opt/Modules/v4.3.0,MODULES_CMD=/opt/Modules/v4.3.0/libexec/modulecmd.tcl,MODULEPATH=/g/data3/hh5/public/modules:/etc/scl/modulefiles:/etc/scl/modulefiles:/opt/Modules/modulefiles:/opt/Modules/v4.3.0/modulefiles:/apps/Modules/modulefiles -W umask=027 -l storage=gdata/hh5+gdata/ik11+scratch/tm70 -- /g/data/hh5/public/apps/miniconda3/envs/analysis3-23.04/bin/python3.9 /g/data3/hh5/public/apps/miniconda3/envs/analysis3-23.04/bin/payu-run
```
TBH I think just PBS jobid and `PAYU_CURRENT_RUN` would suffice:
```
$ grep PAYU_CURRENT_RUN archive/pbs_logs/*.o* | cut -d, -f2
PAYU_CURRENT_RUN=0
PAYU_CURRENT_RUN=1
PAYU_CURRENT_RUN=2
PAYU_CURRENT_RUN=3
PAYU_CURRENT_RUN=4
PAYU_CURRENT_RUN=5
PAYU_CURRENT_RUN=6
PAYU_CURRENT_RUN=7
$
```
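The same extraction in Python, for what it's worth (a sketch; a regex avoids depending on the field order in the `qsub -v` list, which the `cut -d, -f2` one-liner above does):

```python
import glob
import re

# Pull PAYU_CURRENT_RUN out of each PBS stdout log, wherever it sits
# in the qsub -v variable list; None if the variable is absent.
for path in sorted(glob.glob("archive/pbs_logs/*.o*")):
    with open(path, errors="replace") as f:
        m = re.search(r"PAYU_CURRENT_RUN=(\d+)", f.read())
    print(path, int(m.group(1)) if m else None)
```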
Are we sure `PAYU_CURRENT_RUN` will always be available? I'd put a cryptic comment in the code years ago to say sometimes this fails, which is why I thought I should include both `PAYU_N_RUNS` and `PAYU_CURRENT_RUN`. At worst it's harmlessly over-thorough, and at best it allows indexing of some old and weirdly incomplete PBS logs.

The key is arbitrary (except that it must be unique) and is not intended to be used as data - e.g. jobs are not sorted using it.
> Are we sure `PAYU_CURRENT_RUN` will always be available?

I guess I can't guarantee that for some older runs. In which case belt and braces should be fine, I guess.
Terminology:

In this issue I refer to a run, which means an invocation of `payu run`, where the model reads in restart files, runs for a period of time, and writes restart and output files, which are then archived. An experiment is a sequential series of runs.

When `run_summary.py` is run on an experiment it gathers data from a number of runs, and the PBS ID is used as the primary key when storing information from each run in a python `dict` called `run_data`, e.g.

https://github.com/aekiss/run_summary/blob/master/run_summary.py#L607C38-L616

When `runspersub` is greater than 1 there is more than one run with the same PBS ID. This means data is lost: if there is more than one run with the same PBS ID, the data from previous runs will be overwritten by data from subsequent runs.
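To illustrate the overwriting (a toy example with made-up values, not run_summary.py code):

```python
# With runspersub > 1, two runs share a PBS ID, so keying run_data on the
# PBS ID alone silently replaces the earlier run's entry with the later one.
run_data = {}
run_data[89650628] = {"run": 3}   # first run in the PBS job (made-up values)
run_data[89650628] = {"run": 4}   # second run in the same job overwrites it
print(len(run_data))              # 1 -- the data from run 3 is gone
```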
I believe the best fix for this is to use the git commit ID as the key for the `run_data` `dict`.

A possible downside of this is that a run with `runlog: False` is not supported. However, this is uncommon, and as long as it is clear that this is not possible I think this is acceptable.

Does this also address/fix https://github.com/aekiss/run_summary/issues/17?