I modified sbatch_conditional to do some logging when this occurs. It appears to be a max-jobs-per-queue issue:
| run_08_gen..._30.job | 8 | 200/500 | 200/1000 | 31/80 | 25/25 | Wait 5 min |
| run_08_gen..._30.job | 8 | 168/500 | 168/1000 | 31/80 | 18/25 | Submitted: 7763428 |
| run_08_gen..._31.job | 8 | 176/500 | 184/1000 | 32/80 | 20/25 | Submitted: 7763431 |
| run_08_gen..._32.job | 8 | 184/500 | 184/1000 | 33/80 | 22/25 | Submitted: 7763433 |
| run_08_gen..._33.job | 8 | 192/500 | 200/1000 | 34/80 | 23/25 | Submitted: 7763437 |
| run_08_gen..._34.job | 8 | 200/500 | 208/1000 | 35/80 | 24/25 | sbatch message:
-----------------------------------------------------------------
Welcome to the Stampede2 Supercomputer
-----------------------------------------------------------------
No reservation for this job
--> Verifying valid submit host (login3)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/05861/tg851601)...OK
--> Verifying availability of your work2 dir (/work2/05861/tg851601/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/05861/tg851601)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (skx-normal)...OK
--> Verifying job request is within current queue limits...FAILED
[*] Too many simultaneous jobs in queue.
--> Max job limits for skx-normal = 25 jobs
sbatch submit error: exit code 1. Sleep 60 seconds and try again
Jobs submitted:
KokoxiliChunk36SenAT12
/scratch/05861/tg851601/KokoxiliChunk36SenAT12/run_files/run_09_merge_burst_igram_37.job
--------------------------------------------------------------------------------------------------------------------------------------------------
| File Name | Additional Tasks | Step Active Tasks | Total Active Tasks | Step Processed Jobs | Active Jobs | Message |
--------------------------------------------------------------------------------------------------------------------------------------------------
| run_09_mer...m_0.job | 17 | 0/500 | 221/1000 | 1/38 | 23/25 | Submitted: 7774596 |
| run_09_mer...m_1.job | 17 | 17/500 | 271/1000 | 2/38 | 25/25 | Wait 5 min |
| run_09_mer...m_1.job | 17 | 17/500 | 239/1000 | 2/38 | 22/25 | Submitted: 7774612 |
| run_09_mer...m_2.job | 17 | 34/500 | 281/1000 | 3/38 | 25/25 | Wait 5 min |
| run_09_mer...m_2.job | 17 | 34/500 | 256/1000 | 3/38 | 24/25 | Submitted: 7774627 |
| run_09_mer...m_3.job | 17 | 51/500 | 298/1000 | 4/38 | 26/25 | Wait 5 min |
| run_09_mer...m_3.job | 17 | 51/500 | 273/1000 | 4/38 | 24/25 | sbatch message:
-----------------------------------------------------------------
Welcome to the Stampede2 Supercomputer
-----------------------------------------------------------------
No reservation for this job
--> Verifying valid submit host (login3)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/05861/tg851601)...OK
--> Verifying availability of your work2 dir (/work2/05861/tg851601/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/05861/tg851601)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (skx-normal)...OK
--> Verifying job request is within current queue limits...FAILED
[*] Too many simultaneous jobs in queue.
--> Max job limits for skx-normal = 25 jobs
sbatch submit error: exit code 1. Sleep 60 seconds and try again
Jobs submitted:
KokoxiliChunk32SenAT12
/scratch/05861/tg851601/KokoxiliChunk32SenAT12/run_files/run_07_merge_reference_secondary_slc_15.job
/scratch/05861/tg851601/KokoxiliChunk32SenAT12/run_files/run_07_merge_reference_secondary_slc_16.job
/scratch/05861/tg851601/KokoxiliChunk32SenAT12/run_files/run_07_merge_reference_secondary_slc_17.job
/scratch/05861/tg851601/KokoxiliChunk32SenAT12/run_files/run_07_merge_reference_secondary_slc_18.job
/scratch/05861/tg851601/KokoxiliChunk32SenAT12/run_files/run_07_merge_reference_secondary_slc_19.job
/scratch/05861/tg851601/KokoxiliChunk32SenAT12/run_files/run_07_merge_reference_secondary_slc_20.job
--------------------------------------------------------------------------------------------------------------------------------------------------
| File Name | Additional Tasks | Step Active Tasks | Total Active Tasks | Step Processed Jobs | Active Jobs | Message |
--------------------------------------------------------------------------------------------------------------------------------------------------
| run_07_mer...c_0.job | 8 | 0/500 | 254/1000 | 1/21 | 22/25 | Submitted: 7774500 |
| run_07_mer...c_1.job | 8 | 8/500 | 279/1000 | 2/21 | 23/25 | Submitted: 7774502 |
| run_07_mer...c_2.job | 8 | 16/500 | 304/1000 | 3/21 | 25/25 | Wait 5 min |
| run_07_mer...c_2.job | 8 | 8/500 | 176/1000 | 3/21 | 20/25 | Submitted: 7774532 |
| run_07_mer...c_3.job | 8 | 16/500 | 200/1000 | 4/21 | 24/25 | Submitted: 7774536 |
| run_07_mer...c_4.job | 8 | 24/500 | 241/1000 | 5/21 | 25/25 | Wait 5 min |
| run_07_mer...c_4.job | 8 | 16/500 | 169/1000 | 5/21 | 18/25 | Submitted: 7774553 |
| run_07_mer...c_5.job | 8 | 24/500 | 208/1000 | 6/21 | 21/25 | Submitted: 7774557 |
| run_07_mer...c_6.job | 8 | 32/500 | 241/1000 | 7/21 | 24/25 | sbatch message:
-----------------------------------------------------------------
Welcome to the Stampede2 Supercomputer
-----------------------------------------------------------------
No reservation for this job
--> Verifying valid submit host (login3)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/05861/tg851601)...OK
--> Verifying availability of your work2 dir (/work2/05861/tg851601/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/05861/tg851601)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (skx-normal)...OK
--> Verifying job request is within current queue limits...FAILED
[*] Too many simultaneous jobs in queue.
--> Max job limits for skx-normal = 25 jobs
sbatch submit error: exit code 1. Sleep 60 seconds and try again
Jobs submitted:
Hi @Ovec8hkin: Even for the skx-dev queue, with a job limit of 1, it occasionally shows 2/1 active jobs (although it does not fail). Wouldn't that make it easy to debug why the number of active jobs is calculated incorrectly? You could just run a few concurrent workflows on skx-normal with a job limit of 2 or 3.
KokoxiliChunk30SenDT150
| run_04_ful...r_0.job | 22 | 117/500 | 334/1000 | 1/7 | 0/1 | Submitted: 7814671 |
| run_04_ful...r_1.job | 22 | 162/500 | 319/1000 | 2/7 | 2/1 | Wait 5 min |
| run_04_ful...r_1.job | 22 | 162/500 | 414/1000 | 2/7 | 2/1 | Wait 5 min |
| run_04_ful...r_1.job | 22 | 162/500 | 333/1000 | 2/7 | 2/1 | Wait 5 min |
| run_04_ful...r_1.job | 22 | 162/500 | 378/1000 | 2/7 | 2/1 | Wait 5 min |
| run_04_ful...r_1.job | 22 | 162/500 | 360/1000 | 2/7 | 2/1 | Wait 5 min |
| run_04_ful...r_1.job | 22 | 45/500 | 397/1000 | 2/7 | 2/1 | Wait 5 min |
| run_04_ful...r_1.job | 22 | 45/500 | 397/1000 | 2/7 | 1/1 | Wait 5 min |
| run_04_ful...r_1.job | 22 | 0/500 | 426/1000 | 2/7 | 1/1 | Wait 5 min |
| run_04_ful...r_1.job | 22 | 23/500 | 366/1000 | 2/7 | 1/1 | Wait 5 min |
It's almost certainly a race condition. If the submission fails but the offending file ultimately gets submitted successfully, I don't see a big issue. I can try some debugging as you mentioned, but I doubt there's a true solution.
It would be good if you could try some more debugging. This script is key to all our processing. I much prefer to have it bug-free, as I need to run ~15 submit_jobs.bash workflows concurrently.
If you can't find it, we could add, after each run step, a count of the run*.o files and compare it with the number of submitted jobs.
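For reference, a minimal sketch of such a check, assuming the submitted job IDs are collected in an array and the job outputs end in .o<jobid> (both are assumptions, not the actual minsar variable names):

```bash
# Sketch: after a run step finishes, compare the number of run*.o output
# files with the number of jobs that were submitted for that step.
run_step="run_09_merge_burst_igram"          # example step name
num_submitted=${#jobnumbers[@]}              # assumed array of submitted job IDs
num_outputs=$(ls ${run_step}*.o* 2>/dev/null | wc -l)

if (( num_outputs != num_submitted )); then
    echo "ERROR: ${run_step}: ${num_outputs} output files for ${num_submitted} submitted jobs" >&2
    exit 1
fi
```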
I don't think a race condition can cause errors in counting. It could cause a failed job submission if two concurrent workflows determine at the same time that there are fewer than 25 active jobs and both try to submit.
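Roughly, that check-then-submit race looks like this (a sketch with an assumed limit of 25; squeue and sbatch are standard SLURM commands, the retry policy is only illustrative):

```bash
# Another workflow can submit between the squeue check and the sbatch call,
# so sbatch itself may still fail with "Too many simultaneous jobs in queue".
max_jobs=25
for attempt in 1 2 3; do
    active=$(squeue -u "$USER" -h | wc -l)    # my queued + running jobs
    if (( active < max_jobs )) && sbatch "$job_file"; then
        break                                  # submitted successfully
    fi
    sleep 300                                  # wait 5 minutes and re-check
done
```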
Is this with the most updated copy of the codebase? It doesn't look like it, as I modified that table to report the reason for failure (see below). I think that, after updating to sbatch_minsar and some minor structural changes to the control flow, I eliminated this type of error, which I still think is the result of some strange race condition, as I can't replicate it in any capacity.
The new sbatch_conditional table looks as follows. The 2/1 reporting indicates that the file being submitted is number 2, with a maximum of 1 allowed. It subsequently failed as expected. If 1/1 were reported, the file would submit successfully (see the final line of the table), as it would be submission 1 of 1. You can always think of the table as reporting the resource status AFTER the job in question was attempted to be submitted. I could change it to reflect the status BEFORE the job was submitted (i.e., what the actual checks report), but this way makes more sense to me.
---------------------------------------------------------------------------------------------------------------------------------------------------------
| File Name | Extra Tasks | Step Active Tasks | Total Active Tasks | Step Processed Jobs | Active Jobs | Message |
---------------------------------------------------------------------------------------------------------------------------------------------------------
| run_01_unp...e_0.job | 1 | 2/1500 | 2/3000 | 1/1 | 2/1 | Submission failed. |
| | | | | | | Max job count exceeded. |
| | | | | | | Wait 5 minutes. |
| run_01_unp...e_0.job | 1 | 2/1500 | 2/3000 | 1/1 | 2/1 | Submission failed. |
| | | | | | | Max job count exceeded. |
| | | | | | | Wait 5 minutes. |
| run_01_unp...e_0.job | 1 | 2/1500 | 2/3000 | 1/1 | 2/1 | Submission failed. |
| | | | | | | Max job count exceeded. |
| | | | | | | Wait 5 minutes. |
| run_01_unp...e_0.job | 1 | 2/1500 | 2/3000 | 1/1 | 2/1 | Submission failed. |
| | | | | | | Max job count exceeded. |
| | | | | | | Wait 5 minutes. |
| run_01_unp...e_0.job | 1 | 1/1500 | 1/3000 | 1/1 | 1/1 | Submitted: 7818379 |
---------------------------------------------------------------------------------------------------------------------------------------------------------
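As a rough illustration of the AFTER-style accounting described above (a sketch, not the actual sbatch_conditional code):

```bash
# "Active Jobs" is reported including the job we just tried to submit,
# so 2/1 means this attempt would exceed a limit of 1 and must wait.
job_limit=1
queued=$(squeue -u "$USER" -h -p skx-dev | wc -l)   # jobs already in the queue
after_count=$(( queued + 1 ))                        # count AFTER attempting this job
if (( after_count > job_limit )); then
    echo "${after_count}/${job_limit} | Submission failed. Max job count exceeded."
else
    echo "${after_count}/${job_limit} | Submitting."
fi
```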
No, that was the old version. I just posted it because it shows that a smart way to debug would be to use a job limit of, e.g., 3.
I mean, I can't do anything without an actual debug output. You need to post a real one for me to see what's going on.
Here is a problem that might have caused the earlier failure.
It submits 34 jobs but waits forever for the last job. It turns out that one job has status NODE_FAIL, so we should resubmit in this case in the same way as for TIMEOUT.
no known data problem found
Jobfiles to run:
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_0.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_1.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_2.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_3.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_4.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_5.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_6.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_7.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_8.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_9.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_10.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_11.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_12.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_13.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_14.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_15.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_16.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_17.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_18.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_19.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_20.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_21.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_22.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_23.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_24.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_25.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_26.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_27.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_28.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_29.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_30.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_31.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_32.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_33.job
--------------------------------------------------------------------------------------------------------------------------------------------------
| File Name | Additional Tasks | Step Active Tasks | Total Active Tasks | Step Processed Jobs | Active Jobs | Message |
--------------------------------------------------------------------------------------------------------------------------------------------------
| run_09_mer...m_0.job | 18 | 340/500 | 386/1000 | 1/34 | 20/23 | Submitted: 7822670 |
| run_09_mer...m_1.job | 18 | 358/500 | 395/1000 | 2/34 | 21/23 | Submitted: 7822671 |
| run_09_mer...m_2.job | 18 | 376/500 | 413/1000 | 3/34 | 22/23 | Submitted: 7822673 |
| run_09_mer...m_3.job | 18 | 394/500 | 431/1000 | 4/34 | 23/23 | Wait 5 min |
| run_09_mer...m_3.job | 18 | 394/500 | 394/1000 | 4/34 | 23/23 | Wait 5 min |
| run_09_mer...m_3.job | 18 | 377/500 | 360/1000 | 4/34 | 22/23 | Submitted: 7822749 |
| run_09_mer...m_4.job | 18 | 395/500 | 395/1000 | 5/34 | 23/23 | Wait 5 min |
| run_09_mer...m_4.job | 18 | 121/500 | 160/1000 | 5/34 | 8/23 | Submitted: 7822832 |
| run_09_mer...m_5.job | 18 | 156/500 | 195/1000 | 6/34 | 10/23 | Submitted: 7822835 |
| run_09_mer...m_6.job | 18 | 191/500 | 230/1000 | 7/34 | 11/23 | Submitted: 7822837 |
| run_09_mer...m_7.job | 18 | 226/500 | 265/1000 | 8/34 | 13/23 | Submitted: 7822840 |
| run_09_mer...m_8.job | 18 | 261/500 | 300/1000 | 9/34 | 15/23 | Submitted: 7822842 |
| run_09_mer...m_9.job | 18 | 296/500 | 335/1000 | 10/34 | 17/23 | Submitted: 7822844 |
| run_09_mer..._10.job | 18 | 331/500 | 370/1000 | 11/34 | 19/23 | Submitted: 7822846 |
| run_09_mer..._11.job | 18 | 366/500 | 366/1000 | 12/34 | 21/23 | Submitted: 7822849 |
| run_09_mer..._12.job | 18 | 401/500 | 401/1000 | 13/34 | 23/23 | Wait 5 min |
| run_09_mer..._12.job | 18 | 315/500 | 354/1000 | 13/34 | 18/23 | Submitted: 7822876 |
| run_09_mer..._13.job | 18 | 350/500 | 389/1000 | 14/34 | 20/23 | Submitted: 7822877 |
| run_09_mer..._14.job | 18 | 385/500 | 424/1000 | 15/34 | 21/23 | Submitted: 7822880 |
| run_09_mer..._15.job | 18 | 405/500 | 403/1000 | 16/34 | 22/23 | Submitted: 7822882 |
| run_09_mer..._16.job | 18 | 405/500 | 405/1000 | 17/34 | 23/23 | Wait 5 min |
| run_09_mer..._16.job | 18 | 267/500 | 303/1000 | 17/34 | 8/23 | Submitted: 7822900 |
| run_09_mer..._17.job | 18 | 196/500 | 249/1000 | 18/34 | 9/23 | Submitted: 7822902 |
| run_09_mer..._18.job | 18 | 162/500 | 162/1000 | 19/34 | 10/23 | Submitted: 7822903 |
| run_09_mer..._19.job | 18 | 180/500 | 180/1000 | 20/34 | 11/23 | Submitted: 7822904 |
| run_09_mer..._20.job | 18 | 198/500 | 198/1000 | 21/34 | 12/23 | Submitted: 7822906 |
| run_09_mer..._21.job | 18 | 216/500 | 216/1000 | 22/34 | 13/23 | Submitted: 7822909 |
| run_09_mer..._22.job | 18 | 234/500 | 271/1000 | 23/34 | 14/23 | Submitted: 7822913 |
| run_09_mer..._23.job | 18 | 252/500 | 289/1000 | 24/34 | 15/23 | Submitted: 7822917 |
| run_09_mer..._24.job | 18 | 270/500 | 307/1000 | 25/34 | 16/23 | Submitted: 7822919 |
| run_09_mer..._25.job | 18 | 288/500 | 325/1000 | 26/34 | 17/23 | Submitted: 7822920 |
| run_09_mer..._26.job | 18 | 306/500 | 343/1000 | 27/34 | 18/23 | Submitted: 7822922 |
| run_09_mer..._27.job | 18 | 324/500 | 361/1000 | 28/34 | 19/23 | Submitted: 7822923 |
| run_09_mer..._28.job | 18 | 342/500 | 379/1000 | 29/34 | 20/23 | Submitted: 7822924 |
| run_09_mer..._29.job | 18 | 360/500 | 397/1000 | 30/34 | 21/23 | Submitted: 7822925 |
| run_09_mer..._30.job | 18 | 378/500 | 415/1000 | 31/34 | 22/23 | Submitted: 7822926 |
| run_09_mer..._31.job | 18 | 396/500 | 433/1000 | 32/34 | 22/23 | Submitted: 7822932 |
| run_09_mer..._32.job | 18 | 396/500 | 396/1000 | 33/34 | 23/23 | Wait 5 min |
| run_09_mer..._32.job | 18 | 359/500 | 381/1000 | 33/34 | 19/23 | Submitted: 7822959 |
| run_09_mer..._33.job | 16 | 341/500 | 399/1000 | 34/34 | 19/23 | Submitted: 7822961 |
--------------------------------------------------------------------------------------------------------------------------------------------------
Jobs submitted: 7822670 7822671 7822673 7822749 7822832 7822835 7822837 7822840 7822842 7822844 7822846 7822849 7822876 7822877 7822880 7822882 7822900 7822902 7822903 7822904 7822906 7822909 7822913 7822917 7822919 7822920 7822922 7822923 7822924 7822925 7822926 7822932 7822959 7822961
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 15 COMPLETED, 1 RUNNING , 17 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 15 COMPLETED, 1 RUNNING , 17 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 15 COMPLETED, 1 RUNNING , 17 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 15 COMPLETED, 1 RUNNING , 17 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 15 COMPLETED, 6 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 15 COMPLETED, 6 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 16 COMPLETED, 5 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 16 COMPLETED, 5 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 16 COMPLETED, 5 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 16 COMPLETED, 5 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 16 COMPLETED, 5 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 16 COMPLETED, 5 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 16 COMPLETED, 5 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 16 COMPLETED, 5 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 16 COMPLETED, 5 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 17 COMPLETED, 4 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 18 COMPLETED, 3 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 20 COMPLETED, 1 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 22 COMPLETED, 0 RUNNING , 11 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 22 COMPLETED, 0 RUNNING , 11 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 22 COMPLETED, 0 RUNNING , 11 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 22 COMPLETED, 0 RUNNING , 11 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 22 COMPLETED, 0 RUNNING , 11 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 22 COMPLETED, 0 RUNNING , 11 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 22 COMPLETED, 0 RUNNING , 11 PENDING, 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 22 COMPLETED, 11 RUNNING, 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 23 COMPLETED, 10 RUNNING, 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 23 COMPLETED, 10 RUNNING, 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 23 COMPLETED, 10 RUNNING, 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 25 COMPLETED, 8 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 32 COMPLETED, 1 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
sacct -j 7822846
JobID JobName Partition Account AllocCPUS State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
7822846 run_09_me+ skx-normal tg-ear200+ 96 NODE_FAIL 0:0
Another error that I am observing for the first time. The re-submission because of timeout worked well, but it still says sacct error. As this is the first time I see this, no action is needed, but maybe you have an idea.
| run_08_gen..._59.job | 9 | 36/500 | 386/1000 | 60/69 | 21/23 | Submitted: 7829206 |
| run_08_gen..._60.job | 9 | 45/500 | 376/1000 | 61/69 | 23/23 | Wait 5 min |
| run_08_gen..._60.job | 9 | 45/500 | 357/1000 | 61/69 | 23/23 | Wait 5 min |
| run_08_gen..._60.job | 9 | 45/500 | 375/1000 | 61/69 | 22/23 | Submitted: 7829229 |
| run_08_gen..._61.job | 9 | 54/500 | 384/1000 | 62/69 | 23/23 | Wait 5 min |
| run_08_gen..._61.job | 9 | 45/500 | 388/1000 | 62/69 | 23/23 | Wait 5 min |
| run_08_gen..._61.job | 9 | 45/500 | 388/1000 | 62/69 | 23/23 | Wait 5 min |
| run_08_gen..._61.job | 9 | 45/500 | 369/1000 | 62/69 | 23/23 | Wait 5 min |
| run_08_gen..._61.job | 9 | 45/500 | 312/1000 | 62/69 | 20/23 | Submitted: 7829374 |
| run_08_gen..._62.job | 9 | 54/500 | 340/1000 | 63/69 | 23/23 | Wait 5 min |
| run_08_gen..._62.job | 9 | 45/500 | 263/1000 | 63/69 | 17/23 | Submitted: 7829402 |
| run_08_gen..._63.job | 9 | 54/500 | 291/1000 | 64/69 | 19/23 | Submitted: 7829405 |
| run_08_gen..._64.job | 9 | 63/500 | 319/1000 | 65/69 | 23/23 | Wait 5 min |
| run_08_gen..._64.job | 9 | 45/500 | 207/1000 | 65/69 | 13/23 | Submitted: 7829428 |
| run_08_gen..._65.job | 9 | 54/500 | 216/1000 | 66/69 | 14/23 | Submitted: 7829429 |
| run_08_gen..._66.job | 9 | 63/500 | 244/1000 | 67/69 | 17/23 | Submitted: 7829433 |
| run_08_gen..._67.job | 9 | 72/500 | 272/1000 | 68/69 | 20/23 | Submitted: 7829436 |
| run_08_gen..._68.job | 2 | 81/500 | 319/1000 | 69/69 | 23/23 | Wait 5 min |
| run_08_gen..._68.job | 2 | 63/500 | 139/1000 | 69/69 | 6/23 | Submitted: 7829468 |
--------------------------------------------------------------------------------------------------------------------------------------------------
Jobs submitted: 7827208 7827210 7827211 7827212 7827213 7827215 7827227 7827231 7827238 7827261 7827301 7827479 7827511 7827523 7827563 7827578 7827585 7827589 7827626 7827637 7827654 7827657 7827678 7827683 7827687 7827695 7827712 7827730 7827748 7827763 7827811 7827828 7827837 7827852 7827867 7827874 7827879 7827889 7827904 7827911 7827923 7827932 7827949 7827962 7827965 7827970 7828008 7828011 7828179 7828283 7828359 7828365 7828388 7828470 7828472 7828476 7828719 7829058 7829086 7829206 7829229 7829374 7829402 7829405 7829428 7829429 7829433 7829436 7829468
Timedout with walltime of 0:14:16.
Resubmitting file (/scratch/05861/tg851601/KokoxiliChunk37SenDT150/run_files/run_08_generate_burst_igram_5.job) with new walltime of 00:17:07
Resubmitted as jobumber: 7829480 7829498 7829517 7829521 7829527 7829533 7829538 7829544 7829602 7829606 7829609.
sacct: error: Unknown arguments:
sacct: error: 7829498
sacct: error: 7829517
sacct: error: 7829521
sacct: error: 7829527
sacct: error: 7829533
sacct: error: 7829538
sacct: error: 7829544
sacct: error: 7829602
sacct: error: 7829606
sacct: error: 7829609
KokoxiliChunk37SenDT150, run_08_generate_burst_igram, 69 jobs: 68 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING .
sacct: error: Unknown arguments:
sacct: error: 7829498
sacct: error: 7829517
sacct: error: 7829521
sacct: error: 7829527
sacct: error: 7829533
sacct: error: 7829538
sacct: error: 7829544
sacct: error: 7829602
sacct: error: 7829606
sacct: error: 7829609
Here is a problem that might have caused the earlier failure.
It submits 34 jobs but waits forever for the last job. It turns out that one job has status NODE_FAIL, so we should resubmit in this case in the same way as for TIMEOUT.
Ok. Fixed this. For future reference, this is the list of JOB_STATE_CODEs that SLURM can return (https://slurm.schedmd.com/squeue.html). We support 7 right now.
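A minimal sketch of treating NODE_FAIL like TIMEOUT (variable names and the walltime handling are illustrative, not the actual minsar code):

```bash
# Query the job state and resubmit on NODE_FAIL or TIMEOUT.
state=$(sacct -j "$job_id" --format=State --noheader | head -n 1 | xargs)

case "$state" in
    TIMEOUT)
        # resubmit with a longer walltime
        new_id=$(sbatch --time="$new_walltime" "$job_file" | awk '{print $4}')
        ;;
    NODE_FAIL)
        # node crashed: resubmit with the same walltime
        new_id=$(sbatch "$job_file" | awk '{print $4}')
        ;;
esac
```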
Another error that I am observing for the first time. The re-submission because of timeout worked well, but it still says sacct error. As this is the first time that I see this, no action is needed, but maybe you have an idea.
Possibly I introduced this after finding a bug in the timeout resubmission code. I have limited datasets to test on right now, so some edge cases might have slipped through. However, based on the table output, you're still not using the most up-to-date copies of sbatch_conditional and sbatch_minsar.
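One guess about the "sacct: error: Unknown arguments" messages above (an assumption, not a confirmed diagnosis): sacct -j expects a comma-separated list of job IDs, so passing the resubmitted IDs separated by spaces would make everything after the first ID an unknown argument:

```bash
job_ids=(7829480 7829498 7829517)   # example IDs from the log above

# fails: only the first ID is accepted, the rest are "Unknown arguments"
# sacct -j ${job_ids[@]}

# works: join the IDs with commas before calling sacct
sacct -j "$(IFS=,; echo "${job_ids[*]}")" --format=JobID,JobName,State,ExitCode
```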
I ran 7 workflows, three of which stopped prematurely (41, 38, 36). I don't see anything systematic about why they might have stopped.
-rw-rw---- 1 tg851601 G-820134 6005 Jun 7 14:20 KokoxiliChunk41SenDT150/process0.log
-rw-rw---- 1 tg851601 G-820134 71620 Jun 7 22:50 KokoxiliChunk38SenDT150/process0.log
-rw-rw---- 1 tg851601 G-820134 114034 Jun 7 22:55 KokoxiliChunk36SenDT150/process0.log
-rw-rw---- 1 tg851601 G-820134 158629 Jun 8 01:53 KokoxiliChunk35SenDT150/process0.log
-rw-rw---- 1 tg851601 G-820134 158515 Jun 8 01:57 KokoxiliChunk40SenDT150/process0.log
-rw-rw---- 1 tg851601 G-820134 162118 Jun 8 02:18 KokoxiliChunk39SenDT150/process0.log
-rw-rw---- 1 tg851601 G-820134 173522 Jun 8 02:20 KokoxiliChunk37SenDT150/process0.log
KokoxiliChunk36SenDT150
/scratch/05861/tg851601/KokoxiliChunk36SenDT150/run_files/run_10_filter_coherence_33.job
---------------------------------------------------------------------------------------------------------------------------------------------------------
| File Name | Extra Tasks | Step Active Tasks | Total Active Tasks | Step Processed Jobs | Active Jobs | Message |
---------------------------------------------------------------------------------------------------------------------------------------------------------
| run_10_fil...e_0.job | 18 | 0/500 | 299/1000 | 1/34 | 23/25 | Submitted: 7855415 |
| run_10_fil...e_1.job | 18 | 37/500 | 396/1000 | 2/34 | 27/25 | Not submitted. |
| | | | | | | Max job count exceeded. |
| | | | | | | Wait 5 minutes. |
| run_10_fil...e_1.job | 18 | 18/500 | 428/1000 | 2/34 | 27/25 | Not submitted. |
| | | | | | | Max job count exceeded. |
| | | | | | | Wait 5 minutes. |
| run_10_fil...e_1.job | 18 | 18/500 | 325/1000 | 2/34 | 12/25 | Submitted: 7855541 |
| run_10_fil...e_2.job | 18 | 36/500 | 284/1000 | 3/34 | 20/25 | Submitted: 7855558 |
| run_10_fil...e_3.job | 18 | 73/500 | 392/1000 | 4/34 | 25/25 | Not submitted. |
| | | | | | | Max job count exceeded. |
| | | | | | | Wait 5 minutes. |
| run_10_fil...e_3.job | 18 | 18/500 | 294/1000 | 4/34 | 19/25 | Submitted: 7855593 |
KokoxiliChunk38SenDT150
tail -20 process0.log
| | | | | | | Wait 5 minutes. |
| run_08_gen..._58.job | 9 | 63/500 | 316/1000 | 59/69 | 22/25 | Submitted: 7855317 |
| run_08_gen..._59.job | 9 | 74/500 | 378/1000 | 60/69 | 27/25 | Not submitted. |
| | | | | | | Max job count exceeded. |
| | | | | | | Wait 5 minutes. |
| run_08_gen..._59.job | 9 | 65/500 | 352/1000 | 60/69 | 25/25 | Not submitted. |
| | | | | | | Max job count exceeded. |
| | | | | | | Wait 5 minutes. |
| run_08_gen..._59.job | 9 | 0/500 | 179/1000 | 60/69 | 12/25 | Submitted: 7855377 |
| run_08_gen..._60.job | 9 | 9/500 | 247/1000 | 61/69 | 16/25 | Submitted: 7855388 |
| run_08_gen..._61.job | 9 | 18/500 | 324/1000 | 62/69 | 16/25 | Submitted: 7855398 |
| run_08_gen..._62.job | 9 | 27/500 | 333/1000 | 63/69 | 21/25 | Submitted: 7855409 |
| run_08_gen..._63.job | 9 | 36/500 | 359/1000 | 64/69 | 27/25 | Not submitted. |
| | | | | | | Max job count exceeded. |
| | | | | | | Wait 5 minutes. |
| run_08_gen..._63.job | 9 | 36/500 | 428/1000 | 64/69 | 27/25 | Not submitted. |
| | | | | | | Max job count exceeded. |
| | | | | | | Wait 5 minutes. |
| run_08_gen..._63.job | 9 | 18/500 | 359/1000 | 64/69 | 12/25 | Submitted: 7855534 |
| run_08_gen..._64.job | 9 | 27/500 | 232/1000 | 65/69 | 18/25 | Submitted: 7855549
KokoxiliChunk41SenDT150
tail -20 process0.log
| run_07_mer...c_6.job | 9 | 196/500 | 187/1000 | 7/15 | 20/25 | Submitted: 7851408 |
| run_07_mer...c_7.job | 9 | 213/500 | 205/1000 | 8/15 | 23/25 | Submitted: 7851416 |
| run_07_mer...c_8.job | 9 | 240/500 | 240/1000 | 9/15 | 25/25 | Not submitted. |
| | | | | | | Max job count exceeded. |
| | | | | | | Wait 5 minutes. |
| run_07_mer...c_8.job | 9 | 206/500 | 233/1000 | 9/15 | 17/25 | Submitted: 7851460 |
| run_07_mer...c_9.job | 9 | 224/500 | 260/1000 | 10/15 | 19/25 | Submitted: 7851469 |
| run_07_mer..._10.job | 9 | 233/500 | 269/1000 | 11/15 | 22/25 | Submitted: 7851478 |
| run_07_mer..._11.job | 9 | 233/500 | 278/1000 | 12/15 | 25/25 | Not submitted. |
| | | | | | | Max job count exceeded. |
| | | | | | | Wait 5 minutes. |
| run_07_mer..._11.job | 9 | 216/500 | 225/1000 | 12/15 | 24/25 | Submitted: |
| run_07_mer..._12.job | 9 | 225/500 | 234/1000 | 13/15 | 25/25 | Not submitted. |
| | | | | | | Max job count exceeded. |
| | | | | | | Wait 5 minutes. |
| run_07_mer..._12.job | 9 | 117/500 | 243/1000 | 13/15 | 18/25 | Submitted: 7851591 |
| run_07_mer..._13.job | 9 | 153/500 | 288/1000 | 14/15 | 22/25 | Submitted: 7851601 |
| run_07_mer..._14.job | 3 | 189/500 | 333/1000 | 15/15 | 25/25 | Not submitted. |
| | | | | | | Max job count exceeded. |
| | | | | | | Wait 5 minutes.
I believe this is all fixed now.
While tracking down unexplained failures I noticed that a job did not get submitted. It skipped run_07*_17.job:
Do you have an explanation, or is this a rogue failure? This might explain most of my unexplainable failures in recent weeks.
If a failure occurs, it should try to resubmit a few times, and if it is still not successful, it should raise an exception and exit.
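A minimal sketch of that behavior, with an assumed retry count and wait time (not existing minsar settings):

```bash
# Try sbatch a few times; if it never succeeds, fail loudly instead of
# silently skipping the job file.
max_retries=3
submitted=false

for (( i = 1; i <= max_retries; i++ )); do
    if out=$(sbatch "$job_file" 2>&1); then
        job_id=$(awk '/Submitted batch job/ {print $4}' <<< "$out")
        submitted=true
        break
    fi
    echo "sbatch attempt ${i}/${max_retries} failed for ${job_file}; retrying in 60 s" >&2
    sleep 60
done

if [[ "$submitted" != true ]]; then
    echo "ERROR: could not submit ${job_file} after ${max_retries} attempts" >&2
    exit 1
fi
```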
Two things to consider:
1. Capture and analyze the error from the sbatch command. When 25 jobs are already running and you execute sbatch, it tells you:
No reservation for this job
--> Verifying valid submit host (login3)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/05861/tg851601)...OK
--> Verifying availability of your work dir (/work/05861/tg851601/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/05861/tg851601)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (skx-normal)...OK
--> Verifying job request is within current queue limits...FAILED
2. idev detected an error in your resource request (see details above): Here is the command you executed: /bin/idev -p skx-normal -N 1 -n 48
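A minimal sketch of the first suggestion, capturing the sbatch message and distinguishing the queue-limit case from other failures (the matched pattern is taken from the Stampede2 output above; the handling is only illustrative):

```bash
out=$(sbatch "$job_file" 2>&1)
status=$?

if (( status != 0 )); then
    if grep -q "Too many simultaneous jobs in queue" <<< "$out"; then
        echo "Queue limit reached; waiting before retrying ${job_file}" >&2
        sleep 300
    else
        echo "sbatch failed for another reason:" >&2
        echo "$out" >&2
        exit 1
    fi
fi
```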
| run_09_mer..._33.job | 17 | 136/500 | 301/1000 | 34/37 | 20/25 | Submitted: 7737332 |
| run_09_mer..._34.job | 17 | 153/500 | 348/1000 | 35/37 | 23/25 | sbatch message: sbatch submit error: exit code 1. Sleep 60 seconds and try again sbatch message: sbatch submit error: exit code 1. Exiting with status code 1. Submitted: |
| run_09_mer..._35.job | 17 | 153/500 | 369/1000 | 36/37 | 25/25 | Wait 5 min |
| run_09_mer..._35.job | 17 | 153/500 | 369/1000 | 36/37 | 23/25 | Submitted: 7737357 |
| run_09_mer..._36.job | 2 | 153/500 | 369/1000 | 37/37 | 24/25 | Submitted: 7737362 |
Jobs submitted: 7737104 7737105 7737106 7737109 7737110 7737111 7737112 7737113 7737114 7737115 7737116 7737117 7737118 7737119 7737120 7737121 7737123 7737124 7737125 7737127 7737130 7737132 7737134 7737160 7737210 7737242 7737245 7737247 7737280 7737294 7737298 7737315 7737327 7737332 7737357 7737362
Timedout with walltime of 0:08:32.
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk38SenAT114/run_files/run_09_merge_burst_igram_4.job) with new walltime of 00:10:14
Resubmitted as jobumber: 7737366.
Timedout with walltime of 0:08:32.
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk38SenAT114/run_files/run_09_merge_burst_igram_11.job) with new walltime of 00:10:14
Resubmitted as jobumber: 7737369.
Timedout with walltime of 0:08:32.
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk38SenAT114/run_files/run_09_merge_burst_igram_24.job) with new walltime of 00:10:14
Resubmitted as jobumber: 7737395.
Timedout with walltime of 0:08:32.
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk38SenAT114/run_files/run_09_merge_burst_igram_25.job) with new walltime of 00:10:14
Resubmitted as jobumber: 7737413.
KokoxiliBigChunk38SenAT114, run_09_merge_burst_igram, 36 jobs: 24 COMPLETED, 3 RUNNING , 8 PENDING , 1 WAITING .
KokoxiliBigChunk38SenAT114, run_09_merge_burst_igram, 36 jobs: 24 COMPLETED, 3 RUNNING , 9 PENDING , 0 WAITING
########################
Another case:
########################