geodesymiami / rsmas_insar

RSMAS InSAR code
https://rsmas-insar.readthedocs.io/
GNU General Public License v3.0

submit_jobs.bash improvements #465

Open falkamelung opened 3 years ago

falkamelung commented 3 years ago
run_09_merge_burst_igram: 31 jobs; 15 COMPLETED, 5 RUNNING , 11 PENDING.
run_09_merge_burst_igram: 31 jobs; 15 COMPLETED, 5 RUNNING , 11 PENDING.
run_09_merge_burst_igram: 31 jobs; 16 COMPLETED, 4 RUNNING , 11 PENDING.

show:

BalotschistanSenAT115, run_09_merge_burst_igram 31 jobs: 10 waiting, 1 PENDING, 4 RUNNING.
BalotschistanSenAT115, run_09_merge_burst_igram 31 jobs:  8 waiting, 3 PENDING, 4 RUNNING.
BalotschistanSenAT115, run_09_merge_burst_igram 31 jobs:  5 waiting, 2 PENDING, 6 RUNNING.
BalotschistanSenAT115, run_09_merge_burst_igram 31 jobs:  0 waiting, 0 PENDING, 2 RUNNING.

( use 2-digit formatting: '31' and ' 5' )
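The fixed-width count can be produced with printf. This is a minimal sketch; the function name format_status_line and its argument order are hypothetical and not taken from submit_jobs.bash.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of submit_jobs.bash): print one status line
# with the waiting count padded to two characters.
format_status_line() {
    local project=$1 step=$2 total=$3 waiting=$4 pending=$5 running=$6
    # %2d right-aligns the count in a 2-character field: '10' and ' 5'.
    printf '%s, %s %d jobs: %2d waiting, %d PENDING, %d RUNNING.\n' \
        "$project" "$step" "$total" "$waiting" "$pending" "$running"
}

format_status_line BalotschistanSenAT115 run_09_merge_burst_igram 31 5 2 6
```

With %2d the lines stay aligned as long as the count has at most two digits; a wider field would be needed for steps with 100 or more jobs.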

Ovec8hkin commented 3 years ago

Regarding logging improvement: What constitutes the "waiting" status type? Regarding Support multiple queues: is the $QUEUE variable always set? Can I just use it to look up the limit from qlimits?

Regarding logging format fix: I have seen this before, and I thought I had fixed it. It's a bug in how I compute whether or not to split the filename. Easy fix.

Regarding Logging: replace 'Active Jobs' with 'Processed Jobs': I disagree here. That table is meant to monitor the amount of resources being allocated to a set of running jobs, not to monitor the actual status of a set of running jobs. Besides, you should always be able to quickly derive the "10/54" data yourself just by glancing at the table, so I don't see any benefit in adding that information to the table.

Regarding Overview over all running jobs: This is hard, and should be marked very low priority.

falkamelung commented 3 years ago

By 'waiting' jobs I mean those jobs that submit_jobs.bash has not yet submitted. So if a step has 80 jobs, of which 10 are RUNNING, 10 PENDING and 20 COMPLETED, there should be 40 in 'waiting'. 'waiting-to-submit' would be more precise (but too long).

Regarding Processed jobs, the 25 is irrelevant; that is what submit_jobs takes care of. You don't even need to know that there is a 25-job limit. The user wants to know how many are missing. That's why '10/54' is useful.

Saying '10 COMPLETED' is not useful. It has to be '10/54'. In contrast, '20 waiting, 10 PENDING' is useful without the 54 information.
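The arithmetic behind the 'waiting' count and the 'N/total' form can be sketched as follows. The function name count_waiting is hypothetical; the numbers are the 80-job example from this comment.

```bash
#!/usr/bin/env bash
# Hypothetical helper: jobs not yet submitted are whatever remains after
# subtracting the RUNNING, PENDING and COMPLETED jobs from the step total.
count_waiting() {
    local total=$1 running=$2 pending=$3 completed=$4
    echo $(( total - running - pending - completed ))
}

total=80
completed=20
waiting=$(count_waiting "$total" 10 10 "$completed")
echo "${completed}/${total} COMPLETED, ${waiting} waiting"
```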

Regarding multiple queues, you have to use scontrol to find the queue name (there may be an option that just returns the queue name):

scontroldd 7587266
JobId=7587266 JobName=run_08_generate_burst_igram_16
   UserId=tg851601(851601) GroupId=G-820134(820134) MCS_label=N/A
   Priority=561 Nice=0 Account=TG-EAR200012 QOS=normal
   JobState=PENDING Reason=Priority Dependency=(null)
   Requeue=0 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   DerivedExitCode=0:0
   RunTime=00:00:00 TimeLimit=00:12:00 TimeMin=N/A
   SubmitTime=2021-04-15T22:03:30 EligibleTime=2021-04-15T22:03:30
   AccrueTime=2021-04-15T22:03:30
   StartTime=Unknown EndTime=Unknown Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2021-04-15T22:26:42
   Partition=skx-normal AllocNode:Sid=login3:295095
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=(null)
   NumNodes=1-1 NumCPUs=48 NumTasks=48 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=48,node=1,billing=48
   Socks/Node=* NtasksPerN:B:S:C=0:0:*:* CoreSpec=*
   MinCPUsNode=1 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   OverSubscribe=NO Contiguous=0 Licenses=(null) Network=(null)
   Command=/scratch/05861/tg851601/BalotschistanSenAT13/run_files/run_08_generate_burst_igram_16.job
   WorkDir=/scratch/05861/tg851601/BalotschistanSenAT13
   StdErr=/scratch/05861/tg851601/BalotschistanSenAT13/run_files/run_08_generate_burst_igram_16_%J.e
   StdIn=/dev/null
   StdOut=/scratch/05861/tg851601/BalotschistanSenAT13/run_files/run_08_generate_burst_igram_16_%J.o
   Power=

The multiple queue feature is the top priority. Frontera has 3 similar queues to which we can submit.
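The queue name is the Partition field in the scontrol output above. A minimal sketch for extracting it follows; the function name partition_from_scontrol is hypothetical, and the parser only assumes the key=value layout shown in that output.

```bash
#!/usr/bin/env bash
# Hypothetical helper: read `scontrol show job <jobid>` style text on stdin
# and print the value of the first Partition= field.
partition_from_scontrol() {
    grep -o 'Partition=[^[:space:]]*' | head -n 1 | cut -d= -f2
}

# Usage on a cluster (not executed here):
#   scontrol show job 7587266 | partition_from_scontrol
# squeue can also print only the partition of a job:
#   squeue -h -j 7587266 -o %P
```

The per-queue limit could then be looked up with the returned name, for example from qlimits as suggested earlier in this thread.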

Ovec8hkin commented 3 years ago

Re logging improvement: That text string will only start printing after all of the jobs associated with a task have been submitted, so there will never be any "WAITING" jobs to show.

falkamelung commented 3 years ago

And I wrote 'waiting' in lower-case letters to distinguish it from the SLURM states PENDING and RUNNING.

Ovec8hkin commented 3 years ago

That last comment didn't tell me anything. Is this scenario fine? Since those log lines won't appear until everything has been submitted, I don't see a point in leaving a "waiting" count, since it will always be 0.

falkamelung commented 3 years ago

I see about the "waiting" count. At that stage it would always be 0 as it only appears when there is nothing waiting.

What I mean is shown below. We need four columns: Step active tasks, Total active tasks, Step active jobs and Total active jobs. Our current Active jobs is the same as Total active jobs. Step active jobs would in this case be 3/26, as 3 out of 26 jobs have been submitted. (We have some steps with 80 jobs, and it would show 23/80 or similar.)

We probably don't need Total active jobs, but leave it in for now. I can remove it later.

For the jobs it should talk a

/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_14.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_15.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_16.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_17.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_18.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_19.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_20.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_21.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_22.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_23.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_24.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_25.job
/scratch/05861/tg851601/BalotschistanSenAT115/run_files/run_07_merge_reference_secondary_slc_26.job
-------------------------------------------------------------------------------------------------------------------------
| File Name            | Additional Tasks | Step Active Tasks | Total Active Tasks | Active Jobs | Message              |  
-------------------------------------------------------------------------------------------------------------------------
| run_07_mer...y_slc_0 | 6                | 0/500             | 191/600            | 22/25       | Submitted: 7611102   |
| run_07_mer...y_slc_1 | 6                | 6/500             | 197/600            | 23/25       | Submitted: 7611108   |
| run_07_mer...y_slc_2 | 6                | 12/500            | 203/600            | 24/25       | Submitted: 7611111   |
| run_07_mer...y_slc_3 | 6                | 18/500            | 209/600            | 25/25       | Wait 5 min           |

Ovec8hkin commented 3 years ago

Correction concerning logging improvement: jobs could enter the WAITING state after initial job submission if jobs fail and need to be resubmitted. I don't print the table in that case, and thus jobs could briefly enter a 'waiting' state between submission and pending status. It will happen rarely, but it is possible.

falkamelung commented 3 years ago

For the format, please allow enough digits for values such as 1000/3000 and 8888/9000 in the step and total tasks columns.

----------------------------------------------------------------------------------------------------------------------
|                                          |       | Step      | Total     | Step      |        |                    |
|                                          | Extra | active    | active    | processed | Active |                    |
| File name                                | tasks | tasks     | tasks     | jobs      | jobs   | Message            |
----------------------------------------------------------------------------------------------------------------------
| KokoxiliChunk36SenDT150/run_07_..._0.job | 9     | 160/500   | 160/1000  | 1/18      | 17/25  | Submitted: 7846786 |
| KokoxiliChunk36SenDT150/run_07_..._1.job | 9     | 160/500   | 160/1000  | 2/18      | 18/25  | Submitted: 7846788 |
| KokoxiliChunk36SenDT150/run_07_..._2.job | 9     | 169/500   | 169/1000  | 3/18      | 19/25  | Submitted: 7846796 |
| KokoxiliChunk36SenDT150/run_07_..._3.job | 9     | 187/500   | 178/1000  | 4/18      | 21/25  | Submitted: 7846801 |
| KokoxiliChunk36SenDT150/run_07_..._4.job | 9     | 205/500   | 196/1000  | 5/18      | 23/25  | Submitted: 7846805 |
| KokoxiliChunk36SenDT150/run_07_..._5.job | 9     | 223/500   | 214/1000  | 6/18      | 25/25  | waiting 5 min      |
| KokoxiliChunk36SenDT150/run_07_..._5.job | 9     | 106/500   | 106/1000  | 6/18      | 8/25   | Submitted: 7846821 |
| KokoxiliChunk36SenDT150/run_07_..._6.job | 9     | 108/500   | 108/1000  | 7/18      | 11/25  | Submitted: 7846826 |
| KokoxiliChunk36SenDT150/run_07_..._7.job | 9     | 135/500   | 126/1000  | 8/18      | 15/25  | Submitted: 7846834 |
| KokoxiliChunk36SenDT150/run_07_..._8.job | 9     | 153/500   | 162/1000  | 9/18      | 19/25  | Submitted: 7846842 |
| KokoxiliChunk36SenDT150/run_07_..._9.job | 9     | 189/500   | 189/1000  | 10/18     | 23/25  | Submitted: 7846850 |
| KokoxiliChunk36SenDT150/run_07_...10.job | 9     | 207/500   | 225/1000  | 11/18     | 25/25  | waiting 5 min      |

Ovec8hkin commented 3 years ago

  • Overview over all running jobs: It would be great to have a function job_status that shows which jobs are running and where they are. This may require a ~/.minsar file or folder to keep entries such as BalotschistanSenAT115 run_11_unwrap. The number of jobs can then be retrieved from the BalotschistanSenAT115/run_files directory (run_11_unwrap_*.job and run_11_unwrap_*.o, counting only files such as run_11_unwrap_14_7584060.o and not run_11_unwrap_14_20200211_1.o; the extra _1 is what distinguishes them).
job_status
Currently running:
BalotschistanSenAT115  run_11_unwrap:   10/54 jobs completed
BalotschistanSenDT151  run_09_merge_burst_igram:   18/23 jobs completed
BalotschistanSenDT49  run_04_fullBurst_geo2rdr:   0/5 jobs completed
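One way the counting described in the quoted item could work is sketched below. The function name count_step_jobs is hypothetical, and the sketch assumes that a file named <step>_<N>_<jobid>.o marks a finished job, which may not hold if SLURM creates the .o file as soon as the job starts.

```bash
#!/usr/bin/env bash
# Hypothetical helper: count *.job files for a step and the *.o files whose
# name is exactly <step>_<N>_<jobid>.o. A name with a further numeric field,
# such as run_11_unwrap_14_20200211_1.o, is deliberately not counted.
count_step_jobs() {
    local run_files=$1 step=$2 total=0 finished=0 f base
    local re="^${step}_[0-9]+_[0-9]+[.]o$"
    for f in "$run_files"/"$step"_*.job; do
        if [ -e "$f" ]; then total=$(( total + 1 )); fi
    done
    for f in "$run_files"/"$step"_*.o; do
        base=${f##*/}
        if [ -e "$f" ] && [[ $base =~ $re ]]; then
            finished=$(( finished + 1 ))
        fi
    done
    echo "${finished}/${total} jobs completed"
}

# Usage (paths as in this thread):
#   count_step_jobs BalotschistanSenAT115/run_files run_11_unwrap
```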

Please expound on how to check the number of completed jobs based on the output files and I will write this script tomorrow. I have finally gotten annoyed at being unable to quickly check how many jobs from a workflow have already completed.

falkamelung commented 3 years ago

We have sort of solved this using the process.log. All we need is some shell commands/functions that look at the process*.log files.

There are some functions already in accounts/alias.bash (lista is sophisticated). We could have ls-style options:

job_status *SenDT150       : lists all *SenDT150/process.log
job_status                 : lists all */process.log
job_status -l              : ls -l of all */process.log
job_status -t 5 *SenDT150  : lists the last 5 lines of *SenDT150/process.log (-t for the tail command)
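A minimal sketch of such a function, using getopts for the two options listed above. It is not based on the existing functions in accounts/alias.bash, and it assumes it is run from the directory that contains the project folders.

```bash
#!/usr/bin/env bash
# Sketch of job_status with the ls-style options proposed above.
job_status() {
    local mode=list lines=5 opt dir log
    local OPTIND=1
    while getopts 'lt:' opt; do
        case $opt in
            l) mode=long ;;
            t) mode=tail; lines=$OPTARG ;;
            *) echo "usage: job_status [-l] [-t N] [project ...]" >&2; return 2 ;;
        esac
    done
    shift $(( OPTIND - 1 ))
    # Without project arguments, look at every directory in the current folder.
    if [ $# -eq 0 ]; then set -- */; fi
    for dir in "$@"; do
        log=${dir%/}/process.log
        [ -f "$log" ] || continue
        case $mode in
            list) echo "$log" ;;
            long) ls -l "$log" ;;
            tail) echo "==> $log <=="; tail -n "$lines" "$log" ;;
        esac
    done
}
```

Because the shell expands *SenDT150 before the function is called, the function only ever receives directory names and needs no pattern matching of its own.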

I think we should rename it to submit_jobs.log. Regarding the numbering, I have seen isce.1.log, isce.2.log instead of isce1.log, isce2.log, so I suspect that this is the convention.

falkamelung commented 3 years ago

As far as I can tell, everything works fine (there are still ISCE issues, but that is a different matter). It would be nice to add a submit_jobs.cfg where all parameters are set (number of total tasks, tasks per step and wait times). Ideally, every parameter can be overridden by setting an environment variable such as MAX_JOBS_PER_QUEUE (at the moment it is unclear which parameters can be overridden using an environment variable). To avoid the confusion, I would suggest prefixing all variables with SSJOBS for submit_jobs. So it would be called SSJOBS_MAX_JOBS_PER_QUEUE. SLURM is doing something similar:
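A minimal sketch of what such a submit_jobs.cfg could contain if it is sourced by submit_jobs.bash. Only SSJOBS_MAX_JOBS_PER_QUEUE is named in this thread; the other variable names are hypothetical, and the default values are simply the limits that appear in the tables above (25 jobs, 500 step tasks, 1000 total tasks, 5 minute wait).

```bash
#!/usr/bin/env bash
# Sketch of submit_jobs.cfg. Variable names other than
# SSJOBS_MAX_JOBS_PER_QUEUE are hypothetical.
load_ssjobs_defaults() {
    # ":=" assigns the default only when the variable is unset or empty,
    # so a value set in the environment by the user always wins.
    : "${SSJOBS_MAX_JOBS_PER_QUEUE:=25}"
    : "${SSJOBS_MAX_TASKS_PER_STEP:=500}"
    : "${SSJOBS_MAX_TASKS_TOTAL:=1000}"
    : "${SSJOBS_WAIT_SECONDS:=300}"
}

load_ssjobs_defaults
echo "max jobs per queue: $SSJOBS_MAX_JOBS_PER_QUEUE"
```

Running `SSJOBS_MAX_JOBS_PER_QUEUE=10 submit_jobs.bash ...` would then override the default for that one invocation without editing the file.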

https://hpcc.umd.edu/hpcc/help/slurmenv.html
https://slurm.schedmd.com/sbatch.html#lbAJ

We should think about how to have different parameters for different platforms based on $PLATFORM_NAME (I believe we have this in utils/read_platform_defaults.bash, but I have not implemented it yet).
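One possible layering is a per-platform default that an environment variable can still override. The sketch below does not reflect the contents of utils/read_platform_defaults.bash; the function name, the platform names in the case statement and all numeric values are placeholders.

```bash
#!/usr/bin/env bash
# Hypothetical helper: per-platform default for the job limit.
# Platform names and numbers are placeholders, not real queue limits.
platform_default_max_jobs() {
    case "$1" in
        frontera)  echo 100 ;;   # placeholder value
        stampede2) echo 25 ;;    # placeholder value
        *)         echo 25 ;;    # fallback for unknown or unset platforms
    esac
}

# An exported SSJOBS_MAX_JOBS_PER_QUEUE still takes precedence over the
# platform default.
: "${SSJOBS_MAX_JOBS_PER_QUEUE:=$(platform_default_max_jobs "${PLATFORM_NAME:-}")}"
```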