Open falkamelung opened 3 years ago
Regarding logging improvement: What constitutes the "waiting" status type?
Regarding Support multiple queues: is the $QUEUE variable always set? Can I just use that to look up the limit from qlimits?
Regarding logging format fix: I have seen this before; I thought I had fixed it. It's a bug in how I compute whether or not to split the filename. Easy fix.
Regarding Logging: replace 'Active Jobs' with 'Processed Jobs': I disagree here. That table is meant to monitor the amount of resources being allocated to a set of running jobs, not to monitor the actual status of a set of running jobs. Besides, you can always quickly derive the "10/54" data yourself just by glancing at the table, so I don't see any benefit of adding that information to the table.
Regarding Overview over all running jobs: This is hard, and should be marked very low priority.
With 'waiting' jobs I mean those jobs that submit_jobs.bash has not yet submitted. So if a step has 80 jobs and 10 are RUNNING, 10 PENDING, and 20 COMPLETED, there should be 40 in 'waiting'. 'waiting-to-submit' would be more precise (but too long).
Regarding Processed jobs: the 25 is irrelevant. That is what submit_jobs does; you don't even need to know that there is a 25-job limit. The user wants to know how many are missing. That's why '10/54' is useful. To say '10 COMPLETED' is not useful; it has to be '10/54'. In a similar way, '20 waiting, 10 PENDING' is useful without the 54 information.
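The arithmetic above could be sketched as follows (a minimal illustration; the variable names are mine, and in the real script the state counts would come from squeue/sacct rather than being hard-coded):

```shell
# Sketch: derive the 'waiting-to-submit' count for one step.
# In submit_jobs.bash the three state counts would be queried from
# SLURM; here they are fixed to match the 80-job example above.
total_jobs=80
running=10
pending=10
completed=20

waiting=$((total_jobs - running - pending - completed))
echo "waiting-to-submit: ${waiting}/${total_jobs}"   # prints: waiting-to-submit: 40/80
```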
Regarding multiple queues, you have to use scontrol to find the queue name (there may be an option that just returns the queue name):
scontrol show job 7587266
JobId=7587266 JobName=run_08_generate_burst_igram_16
UserId=tg851601(851601) GroupId=G-820134(820134) MCS_label=N/A
Priority=561 Nice=0 Account=TG-EAR200012 QOS=normal
JobState=PENDING Reason=Priority Dependency=(null)
Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
DerivedExitCode=0:0
RunTime=00:00:00 TimeLimit=00:12:00 TimeMin=N/A
SubmitTime=2021-04-15T22:03:30 EligibleTime=2021-04-15T22:03:30
AccrueTime=2021-04-15T22:03:30
StartTime=Unknown EndTime=Unknown Deadline=N/A
PreemptTime=None SuspendTime=None SecsPreSuspend=0
LastSchedEval=2021-04-15T22:26:42
Partition=skx-normal AllocNode:Sid=login3:295095
ReqNodeList=(null) ExcNodeList=(null)
NodeList=(null)
NumNodes=1-1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
TRES=cpu=48,node=1,billing=48
Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
Features=(null) DelayBoot=00:00:00
OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
Command=/scratch/05861/tg851601/BalotschistanSenAT13/run_files/run_08_generate_burst_igram_16.job
WorkDir=/scratch/05861/tg851601/BalotschistanSenAT13
StdErr=/scratch/05861/tg851601/BalotschistanSenAT13/run_files/run_08_generate_burst_igram_16_%J.e
StdIn=/dev/null
StdOut=/scratch/05861/tg851601/BalotschistanSenAT13/run_files/run_08_generate_burst_igram_16_%J.o
Power=
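For extracting just the queue from output like the above, one option is to pull the Partition= field out of scontrol's one-line (-o) output; for a live job, `squeue -h -j $jobid -o %P` also prints the partition name directly. A minimal sketch using a sample line from the dump above:

```shell
# One line of "scontrol show job -o <id>" output contains Partition=<queue>.
# The sample line below mirrors the job dump above; for a live job you
# would instead run:  queue=$(squeue -h -j "$jobid" -o "%P")
line='JobId=7587266 JobName=run_08_generate_burst_igram_16 Partition=skx-normal QOS=normal'
queue=$(printf '%s\n' "$line" | sed -n 's/.*Partition=\([^ ]*\).*/\1/p')
echo "$queue"    # prints: skx-normal
```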
The multiple queue feature is the top priority. Frontera has 3 similar queues to which we can submit.
Re logging improvement: That text string will only start printing after all of the jobs associated with a task have been submitted, so there will never be any "WAITING" jobs to show.
And I had 'waiting' in lower-case letters to distinguish it from SLURM's PENDING and RUNNING.
That last comment didn't tell me anything. Is this scenario fine? Since those log lines won't appear until everything has been submitted, I don't see a point in leaving a "waiting" count, since it will always be 0.
I see, about the "waiting" count: at that stage it would always be 0, as the table only appears when there is nothing waiting.
What I mean is shown below. We need:
Step active tasks
Total active tasks
Step active jobs
Total active jobs
Our current 'Active jobs' is the same as 'Total active jobs'. 'Step active jobs' would in this case be 3/26, as 3 out of 26 jobs have been submitted (we have some steps with 80 jobs, and it would show 23/80 or similar). We probably don't need 'Total active jobs', but leave it in for now; I can remove it later.
For the jobs it should count files such as:
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_14.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_15.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_16.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_17.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_18.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_19.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_20.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_21.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_22.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_23.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_24.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_25.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_26.job
-------------------------------------------------------------------------------------------------------------------------
| File Name | Additional Tasks | Step Active Tasks | Total Active Tasks | Active Jobs | Message |
-------------------------------------------------------------------------------------------------------------------------
| run_07_mer...y_slc_0 | 6 | 0/500 | 191/600 | 22/25 | Submitted: 7611102 |
| run_07_mer...y_slc_1 | 6 | 6/500 | 197/600 | 23/25 | Submitted: 7611108 |
| run_07_mer...y_slc_2 | 6 | 12/500 | 203/600 | 24/25 | Submitted: 7611111 |
| run_07_mer...y_slc_3 | 6 | 18/500 | 209/600 | 25/25 | Wait 5 min |
Correction concerning logging improvement: jobs could enter the 'waiting' state after the initial job submission if jobs fail and need to be resubmitted. I don't print the table in that case, so jobs could briefly enter a 'waiting' state between submission and PENDING status. It will happen rarely, but it is possible.
For the format, please allow enough digits for step and total task counts like 1000/3000 and 8888/9000.
------------------------------------------------------------------------------------------------------------------
                                         |       | Step    | Total    | Step      |        |
                                         | Extra | active  | active   | processed | Active |
File name                                | tasks | tasks   | tasks    | jobs      | jobs   | Message
------------------------------------------------------------------------------------------------------------------
KokoxiliChunk36SenDT150/run_07_..._0.job | 9 | 160/500 | 160/1000 | 1/18 | 17/25 | Submitted: 7846786
KokoxiliChunk36SenDT150/run_07_..._1.job | 9 | 160/500 | 160/1000 | 2/18 | 18/25 | Submitted: 7846788
KokoxiliChunk36SenDT150/run_07_..._2.job | 9 | 169/500 | 169/1000 | 3/18 | 19/25 | Submitted: 7846796
KokoxiliChunk36SenDT150/run_07_..._3.job | 9 | 187/500 | 178/1000 | 4/18 | 21/25 | Submitted: 7846801
KokoxiliChunk36SenDT150/run_07_..._4.job | 9 | 205/500 | 196/1000 | 5/18 | 23/25 | Submitted: 7846805
KokoxiliChunk36SenDT150/run_07_..._5.job | 9 | 223/500 | 214/1000 | 6/18 | 25/25 | waiting 5 min
KokoxiliChunk36SenDT150/run_07_..._5.job | 9 | 106/500 | 106/1000 | 6/18 | 8/25 | Submitted: 7846821
KokoxiliChunk36SenDT150/run_07_..._6.job | 9 | 108/500 | 108/1000 | 7/18 | 11/25 | Submitted: 7846826
KokoxiliChunk36SenDT150/run_07_..._7.job | 9 | 135/500 | 126/1000 | 8/18 | 15/25 | Submitted: 7846834
KokoxiliChunk36SenDT150/run_07_..._8.job | 9 | 153/500 | 162/1000 | 9/18 | 19/25 | Submitted: 7846842
KokoxiliChunk36SenDT150/run_07_..._9.job | 9 | 189/500 | 189/1000 | 10/18 | 23/25 | Submitted: 7846850
KokoxiliChunk36SenDT150/run_07_...10.job | 9 | 207/500 | 225/1000 | 11/18 | 25/25 | waiting 5 min
- Overview over all running jobs: It would be great to have a function job_status that shows what is running and where. This may require a ~/.minsar file or folder to keep e.g. 'BalotschistanSenAT115 run_11_unwrap'. The number of jobs can then be retrieved from the BalotschistanSenAT115/run_files directory (run_11_unwrap_*.job and run_11_unwrap_*.o, but only counting files such as run_11_unwrap_14_7584060.o and not run_11_unwrap_14_20200211_1.o; the occurrence of _1 is what distinguishes them).

job_status
Currently running:
BalotschistanSenAT115 run_11_unwrap: 10/54 jobs completed
BalotschistanSenDT151 run_09_merge_burst_igram: 18/23 jobs completed
BalotschistanSenDT49 run_04_fullBurst_geo2rdr: 0/5 jobs completed
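The file-name rule described here could be sketched like this (count_completed is a hypothetical helper name, not existing code; it assumes the .job/.o naming shown above):

```shell
# Hypothetical helper: count completed jobs for one step from its SLURM
# output files. A completed run leaves run_11_unwrap_<idx>_<jobid>.o
# (two numeric fields after the step name); rerun logs such as
# run_11_unwrap_14_20200211_1.o carry three numeric fields and are skipped.
count_completed() {
    run_dir=$1
    step=$2
    total=$(ls "$run_dir" | grep -c "^${step}_.*\.job$")
    completed=$(ls "$run_dir" | grep -Ec "^${step}_[0-9]+_[0-9]+\.o$")
    echo "${completed}/${total} jobs completed"
}

# Example usage:
#   count_completed BalotschistanSenAT115/run_files run_11_unwrap
#   -> 10/54 jobs completed
```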
Please expound on how to check the number of completed jobs based on the output files, and I will write this script tomorrow. I have finally gotten annoyed at being unable to quickly check how many jobs from a workflow have already completed.
We have sort of solved this using the process.log. All we need are some shell commands/functions that look at the process*.log files.
There are some functions already in accounts/alias.bash (lista is sophisticated). We could have ls-style options:
job_status *SenDT150 : lists all *SenDT150/process.log
job_status : lists all */process.log
job_status -l : ls -l of all */process.log
job_status -t 5 *SenDT150 : lists last 5 lines of *SenDT150/process.log (-t for tail command)
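The four invocations above could be backed by a small function along these lines (a sketch only; it assumes each project directory holds a process.log as described):

```shell
# Minimal sketch of the job_status options listed above.
job_status() {
    local long=0 lines=0 OPTIND=1 opt log
    while getopts "lt:" opt; do
        case $opt in
            l) long=1 ;;            # job_status -l
            t) lines=$OPTARG ;;     # job_status -t 5
        esac
    done
    shift $((OPTIND - 1))
    local pattern=${1:-*}           # e.g. *SenDT150; default: all projects
    for log in $pattern/process.log; do
        [ -e "$log" ] || continue
        if [ "$long" -eq 1 ]; then
            ls -l "$log"
        elif [ "$lines" -gt 0 ]; then
            echo "== $log =="
            tail -n "$lines" "$log"
        else
            echo "$log"
        fi
    done
}
```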
I think we should rename it to submit_jobs.log. Regarding the numbering, I have seen isce.1.log, isce.2.log instead of isce1.log, isce2.log, so I suspect that this is the convention.
From what I can tell everything works fine (there are still ISCE issues, but this is different). It would be nice to add a submit_jobs.cfg where all parameters are set (number of total tasks, tasks per step, and wait times). Ideally you can overwrite all parameters by setting an environment variable such as MAX_JOBS_PER_QUEUE (at the moment there is confusion about which parameters can be overwritten using an environment variable). To avoid the confusion I would suggest prefixing all variables with SSJOBS for submit_jobs, so it would be called SSJOBS_MAX_JOBS_PER_QUEUE. SLURM does something similar:
https://hpcc.umd.edu/hpcc/help/slurmenv.html https://slurm.schedmd.com/sbatch.html#lbAJ
We should think about how to have different parameters for different platforms based on $PLATFORM_NAME (I believe we have this in utils/read_platform_defaults.bash but have not implemented it yet).
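The proposed precedence (built-in default, then submit_jobs.cfg, then an SSJOBS_-prefixed environment variable) could look like this; submit_jobs.cfg, its path, and the SSJOBS_ names are the suggestion above, not existing code:

```shell
# 1. Defaults inside submit_jobs.bash:
MAX_JOBS_PER_QUEUE=25
MAX_TASKS_TOTAL=1000
WAIT_TIME_MIN=5

# 2. A config file may redefine them (path is an assumption):
if [ -f "${MINSAR_HOME}/defaults/submit_jobs.cfg" ]; then
    . "${MINSAR_HOME}/defaults/submit_jobs.cfg"
fi

# 3. SSJOBS_-prefixed environment variables always win:
MAX_JOBS_PER_QUEUE=${SSJOBS_MAX_JOBS_PER_QUEUE:-$MAX_JOBS_PER_QUEUE}
MAX_TASKS_TOTAL=${SSJOBS_MAX_TASKS_TOTAL:-$MAX_TASKS_TOTAL}
WAIT_TIME_MIN=${SSJOBS_WAIT_TIME_MIN:-$WAIT_TIME_MIN}
```

Usage: `SSJOBS_MAX_JOBS_PER_QUEUE=50 submit_jobs.bash ...` would then override both the default and the config file.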
For the display, use 2-digit formatting: '31' and ' 5'.
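Width-padded counters like that are exactly what printf field widths give (a small illustration, not taken from the script):

```shell
# 2-digit right-aligned counters, as requested: '31' and ' 5'
printf "%2d\n" 31    # prints: 31
printf "%2d\n" 5     # prints:  5

# wider fields would cover task counts like 1000/3000:
printf "%4d/%-4d\n" 160 1000
```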
Support multiple queues: Stampede has skx-normal and normal. It should use the MAX_JOBS_PER_QUEUE from each QUEUE (from defaults/queues.cfg, or use the command qlimits): 25 and 50, respectively. This will also allow submitting to the skx-dev queue when the skx-normal job limit is reached. For tasks, the common task limits apply (e.g. a 1000-task limit over all queues).

logging format fix: The 'short writing' of run_11_unwrap_10 got screwed up. If you change it anyway, may I suggest run_11_unwr..._10 instead of run_11_un...ap_10.
Logging: replace 'Active Jobs' with 'Processed Jobs': I think it would make more sense to show 'Processed jobs' 10/54, with 54 the number of jobs for the step.

Overview over all running jobs: It would be great to have a function job_status that shows what is running and where. This may require a ~/.minsar file or folder to keep e.g. 'BalotschistanSenAT115 run_11_unwrap'. The number of jobs can then be retrieved from the BalotschistanSenAT115/run_files directory (run_11_unwrap_*.job and run_11_unwrap_*.o, but only counting files such as run_11_unwrap_14_7584060.o and not run_11_unwrap_14_20200211_1.o; the occurrence of _1 is what distinguishes them).