geodesymiami / rsmas_insar

RSMAS InSAR code
https://rsmas-insar.readthedocs.io/
GNU General Public License v3.0

submit_jobs.bash bug: some jobs are not submitted #480

Closed falkamelung closed 3 years ago

falkamelung commented 3 years ago

While tracking down unexplained failures I noticed that a job did not get submitted. It skipped run_07*_17.job:

| run_07_mer..._14.job | 7                | 63/500            | 269/1000           | 15/23               | 19/25          | Submitted: 7713429   |
| run_07_mer..._15.job | 7                | 70/500            | 301/1000           | 16/23               | 21/25          | Submitted: 7713431   |
| run_07_mer..._16.job | 7                | 77/500            | 314/1000           | 17/23               | 22/25          | Submitted: 7713433   |
| run_07_mer..._17.job | 7                | 84/500            | 346/1000           | 18/23               | 24/25          | sbatch message: 
sbatch submit error: exit code 1. Sleep 60 seconds and try again
sbatch message: 
sbatch submit error: exit code 1. Exiting with status code 1.
Submitted:           |
| run_07_mer..._18.job | 7                | 84/500            | 371/1000           | 19/23               | 24/25          | Submitted: 7713437   |
| run_07_mer..._19.job | 7                | 91/500            | 378/1000           | 20/23               | 25/25          | Wait 5 min           |
| run_07_mer..._19.job | 7                | 49/500            | 230/1000           | 20/23               | 15/25          | Submitted: 7713449   |

Do you have an explanation, or is this a rogue failure? This might explain most of my unexplained failures in the last few weeks.

If a failure occurs, it should try to resubmit a few times, and if it is still not successful, it should raise an exception and exit.
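
For illustration, a minimal sketch of that bounded-retry behavior (function and variable names are hypothetical, not the actual submit_jobs.bash code):

```bash
# Hypothetical sketch: retry sbatch a few times, then fail loudly instead of skipping the job.
submit_with_retries() {
    local job_file="$1"
    local max_retries=3
    local attempt=1

    while (( attempt <= max_retries )); do
        if output=$(sbatch "$job_file" 2>&1); then
            echo "$output"        # e.g. "Submitted batch job 7713429"
            return 0
        fi
        echo "sbatch failed (attempt $attempt/$max_retries): $output" >&2
        sleep 60                  # wait before trying again
        (( attempt++ ))
    done

    echo "ERROR: could not submit $job_file after $max_retries attempts" >&2
    exit 1                        # abort the workflow instead of silently skipping the job
}
```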

Two things to consider:

- The failing check appears to be the max-jobs-per-queue limit, as this idev output shows:

No reservation for this job
--> Verifying valid submit host (login3)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/05861/tg851601)...OK
--> Verifying availability of your work dir (/work/05861/tg851601/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/05861/tg851601)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (skx-normal)...OK
--> Verifying job request is within current queue limits...FAILED

    [*] Too many simultaneous jobs in queue.
        --> Max job limits for skx-normal =  25 jobs

idev detected an error in your resource request (see details above).
Here is the command you executed: /bin/idev -p skx-normal -N 1 -n 48


- At the same time, we might want to have a `sbatch_minsar` that behaves very much like `sbatch` but adds our extras (waiting instead of exiting when MAX_JOBS_PER_QUEUE is reached, and likewise for the number of steps and total tasks). It should produce output very similar to `sbatch`, which we would capture and process. This is somewhat similar to `sbatch_conditional` (see the sketch below).

Is this something to consider?
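
As a rough illustration of the `sbatch_minsar` idea (a sketch only; the limit value, environment variable, and job-counting method are assumptions, and a real version would also check the step/total task limits):

```bash
#!/bin/bash
# Hypothetical sbatch_minsar sketch: wait instead of exiting when the
# per-queue job limit is reached, then forward to the real sbatch so the
# caller can parse the usual "Submitted batch job <id>" output.
job_file="$1"
queue="${QUEUENAME:-skx-normal}"
max_jobs_per_queue="${MAX_JOBS_PER_QUEUE:-25}"    # assumed variable name

while true; do
    active_jobs=$(squeue -u "$USER" -p "$queue" -h | wc -l)   # our jobs in this queue
    if (( active_jobs < max_jobs_per_queue )); then
        exec sbatch "$job_file"                   # same output and exit code as sbatch
    fi
    echo "Queue limit reached ($active_jobs/$max_jobs_per_queue). Waiting 5 minutes..." >&2
    sleep 300
done
```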

###################################
Here is another incident where sbatch failed (after rerunning run_09_mer..._34.job it finished fine):
###################################

| run_09_mer..._33.job | 17               | 136/500           | 301/1000           | 34/37               | 20/25          | Submitted: 7737332   |
| run_09_mer..._34.job | 17               | 153/500           | 348/1000           | 35/37               | 23/25          | sbatch message: 
sbatch submit error: exit code 1. Sleep 60 seconds and try again
sbatch message: 
sbatch submit error: exit code 1. Exiting with status code 1.
Submitted:           |
| run_09_mer..._35.job | 17               | 153/500           | 369/1000           | 36/37               | 25/25          | Wait 5 min           |
| run_09_mer..._35.job | 17               | 153/500           | 369/1000           | 36/37               | 23/25          | Submitted: 7737357   |
| run_09_mer..._36.job | 2                | 153/500           | 369/1000           | 37/37               | 24/25          | Submitted: 7737362   |

Jobs submitted: 7737104 7737105 7737106 7737109 7737110 7737111 7737112 7737113 7737114 7737115 7737116 7737117 7737118 7737119 7737120 7737121 7737123 7737124 7737125 7737127 7737130 7737132 7737134 7737160 7737210 7737242 7737245 7737247 7737280 7737294 7737298 7737315 7737327 7737332 7737357 7737362
Timedout with walltime of 0:08:32.
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk38SenAT114/run_files/run_09_merge_burst_igram_4.job) with new walltime of 00:10:14
Resubmitted as jobumber: 7737366.
Timedout with walltime of 0:08:32.
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk38SenAT114/run_files/run_09_merge_burst_igram_11.job) with new walltime of 00:10:14
Resubmitted as jobumber: 7737369.
Timedout with walltime of 0:08:32.
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk38SenAT114/run_files/run_09_merge_burst_igram_24.job) with new walltime of 00:10:14
Resubmitted as jobumber: 7737395.
Timedout with walltime of 0:08:32.
Resubmitting file (/scratch/05861/tg851601/KokoxiliBigChunk38SenAT114/run_files/run_09_merge_burst_igram_25.job) with new walltime of 00:10:14
Resubmitted as jobumber: 7737413.
KokoxiliBigChunk38SenAT114, run_09_merge_burst_igram, 36 jobs: 24 COMPLETED, 3 RUNNING , 8 PENDING , 1 WAITING .
KokoxiliBigChunk38SenAT114, run_09_merge_burst_igram, 36 jobs: 24 COMPLETED, 3 RUNNING , 9 PENDING , 0 WAITING .

########################

Another case:

########################

| run_10_fil...e_4.job | 18               | 72/500            | 303/1000           | 5/35                | 26/25          | Wait 5 min           |
| run_10_fil...e_4.job | 18               | 72/500            | 214/1000           | 5/35                | 20/25          | Submitted: 7739312   |
| run_10_fil...e_5.job | 18               | 90/500            | 239/1000           | 6/35                | 22/25          | Submitted: 7739314   |
| run_10_fil...e_6.job | 18               | 108/500           | 264/1000           | 7/35                | 24/25          | sbatch message: 
sbatch submit error: exit code 1. Sleep 60 seconds and try again
sbatch message: 
sbatch submit error: exit code 1. Exiting with status code 1.
Submitted:           |
| run_10_fil...e_7.job | 18               | 90/500            | 261/1000           | 8/35                | 22/25          | Submitted: 7739321   |
| run_10_fil...e_8.job | 18               | 90/500            | 243/1000           | 9/35                | 23/25          | Submitted: 7739322   |
| run_10_fil...e_9.job | 18               | 108/500           | 254/1000           | 10/35               | 23/25          | Submitted: 7739324   |
| run_10_fil..._10.job | 18               | 126/500           | 272/1000           | 11/35               | 24/25          | Submitted: 7739326   |
| run_10_fil..._11.job | 18               | 144/500           | 283/1000           | 12/35               | 25/25          | Wait 5 min           |
| run_10_fil..._11.job | 18               | 54/500            | 201/1000           | 12/35               | 24/25          | Submitted: 7739345   |

########################

Another case:

########################

| run_11_unwrap_20.job | 15               | 165/500           | 353/1000           | 21/39               | 25/25          | Wait 3 min           |
| run_11_unwrap_20.job | 15               | 150/500           | 293/1000           | 21/39               | 23/25          | Submitted: 7741538   |
| run_11_unwrap_21.job | 15               | 165/500           | 317/1000           | 22/39               | 25/25          | Wait 3 min           |
| run_11_unwrap_21.job | 15               | 165/500           | 340/1000           | 22/39               | 24/25          | sbatch message: 
sbatch submit error: exit code 1. Sleep 60 seconds and try again
sbatch message: 
sbatch submit error: exit code 1. Exiting with status code 1.
Submitted:           |
| run_11_unwrap_22.job | 15               | 165/500           | 340/1000           | 23/39               | 25/25          | Wait 3 min           |
| run_11_unwrap_22.job | 15               | 165/500           | 340/1000           | 23/39               | 25/25          | Wait 3 min           |
| run_11_unwrap_22.job | 15               | 165/500           | 340/1000           | 23/39               | 25/25          | Wait 3 min           |
| run_11_unwrap_22.job | 15               | 165/500           | 340/1000           | 23/39               | 25/25          | Wait 3 min           |
| run_11_unwrap_22.job | 15               | 165/500           | 340/1000           | 23/39               | 25/25          | Wait 3 min           |
| run_11_unwrap_22.job | 15               | 150/500           | 325/1000           | 23/39               | 24/25          | Submitted: 7741594   |
| run_11_unwrap_23.job | 15               | 165/500           | 340/1000           | 24/39               | 25/25          | Wait 3 min           |
| run_11_unwrap_23.job | 15               | 165/500           | 340/1000           | 24/39               | 25/25          | Wait 3 min           |
| run_11_unwrap_23.job | 15               | 165/500           | 340/1000           | 24/39               | 20/25          | Submitted: 7741609   |
| run_11_unwrap_24.job | 15               | 105/500           | 340/1000           | 25/39               | 20/25          | Submitted: 7741610   |
| run_11_unwrap_25.job | 15               | 120/500           | 295/1000           | 26/39               | 21/25          | Submitted: 7741612   |
| run_11_unwrap_26.job | 15               | 135/500           | 310/1000           | 27/39               | 22/25          | Submitted: 7741614   |
| run_11_unwrap_27.job | 15               | 150/500           | 303/1000           | 28/39               | 24/25          | Submitted: 7741616   |
| run_11_unwrap_28.job | 15               | 165/500           | 336/1000           | 29/39               | 26/25          | Wait 3 min           |
| run_11_unwrap_28.job | 15               | 165/500           | 353/1000           | 29/39               | 25/25          | Wait 3 min           |
| run_11_unwrap_28.job | 15               | 165/500           | 335/1000           | 29/39               | 24/25          | sbatch message: 
sbatch submit error: exit code 1. Sleep 60 seconds and try again
sbatch message: 
sbatch submit error: exit code 1. Exiting with status code 1.
Submitted:           |
| run_11_unwrap_29.job | 15               | 165/500           | 366/1000           | 30/39               | 25/25          | Wait 3 min           |
| run_11_unwrap_29.job | 15               | 165/500           | 366/1000           | 30/39               | 25/25          | Wait 3 min           |
| run_11_unwrap_29.job | 15               | 75/500            | 369/1000           | 30/39               | 25/25          | Wait 3 min           |
falkamelung commented 3 years ago

I modified sbatch_conditional to do some logging when this occurs. It appears to be a max-jobs-per-queue issue:

| run_08_gen..._30.job | 8                | 200/500           | 200/1000           | 31/80               | 25/25          | Wait 5 min           |
| run_08_gen..._30.job | 8                | 168/500           | 168/1000           | 31/80               | 18/25          | Submitted: 7763428   |
| run_08_gen..._31.job | 8                | 176/500           | 184/1000           | 32/80               | 20/25          | Submitted: 7763431   |
| run_08_gen..._32.job | 8                | 184/500           | 184/1000           | 33/80               | 22/25          | Submitted: 7763433   |
| run_08_gen..._33.job | 8                | 192/500           | 200/1000           | 34/80               | 23/25          | Submitted: 7763437   |
| run_08_gen..._34.job | 8                | 200/500           | 208/1000           | 35/80               | 24/25          | sbatch message: 
-----------------------------------------------------------------
          Welcome to the Stampede2 Supercomputer                 
-----------------------------------------------------------------

No reservation for this job
--> Verifying valid submit host (login3)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/05861/tg851601)...OK
--> Verifying availability of your work2 dir (/work2/05861/tg851601/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/05861/tg851601)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (skx-normal)...OK
--> Verifying job request is within current queue limits...FAILED

    [*] Too many simultaneous jobs in queue.
        --> Max job limits for skx-normal =  25 jobs
sbatch submit error: exit code 1. Sleep 60 seconds and try again
Jobs submitted: 
falkamelung commented 3 years ago

KokoxiliChunk36SenAT12

/scratch/05861/tg851601/KokoxiliChunk36SenAT12/run_files/run_09_merge_burst_igram_37.job
--------------------------------------------------------------------------------------------------------------------------------------------------
| File Name            | Additional Tasks | Step Active Tasks | Total Active Tasks | Step Processed Jobs | Active Jobs    | Message              |
--------------------------------------------------------------------------------------------------------------------------------------------------
| run_09_mer...m_0.job | 17               | 0/500             | 221/1000           | 1/38                | 23/25          | Submitted: 7774596   |
| run_09_mer...m_1.job | 17               | 17/500            | 271/1000           | 2/38                | 25/25          | Wait 5 min           |
| run_09_mer...m_1.job | 17               | 17/500            | 239/1000           | 2/38                | 22/25          | Submitted: 7774612   |
| run_09_mer...m_2.job | 17               | 34/500            | 281/1000           | 3/38                | 25/25          | Wait 5 min           |
| run_09_mer...m_2.job | 17               | 34/500            | 256/1000           | 3/38                | 24/25          | Submitted: 7774627   |
| run_09_mer...m_3.job | 17               | 51/500            | 298/1000           | 4/38                | 26/25          | Wait 5 min           |
| run_09_mer...m_3.job | 17               | 51/500            | 273/1000           | 4/38                | 24/25          | sbatch message:
-----------------------------------------------------------------
          Welcome to the Stampede2 Supercomputer
-----------------------------------------------------------------

No reservation for this job
--> Verifying valid submit host (login3)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/05861/tg851601)...OK
--> Verifying availability of your work2 dir (/work2/05861/tg851601/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/05861/tg851601)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (skx-normal)...OK
--> Verifying job request is within current queue limits...FAILED

    [*] Too many simultaneous jobs in queue.
        --> Max job limits for skx-normal =  25 jobs
sbatch submit error: exit code 1. Sleep 60 seconds and try again
Jobs submitted:
falkamelung commented 3 years ago

KokoxiliChunk32SenAT12

/scratch/05861/tg851601/KokoxiliChunk32SenAT12/run_files/run_07_merge_reference_secondary_slc_15.job
/scratch/05861/tg851601/KokoxiliChunk32SenAT12/run_files/run_07_merge_reference_secondary_slc_16.job
/scratch/05861/tg851601/KokoxiliChunk32SenAT12/run_files/run_07_merge_reference_secondary_slc_17.job
/scratch/05861/tg851601/KokoxiliChunk32SenAT12/run_files/run_07_merge_reference_secondary_slc_18.job
/scratch/05861/tg851601/KokoxiliChunk32SenAT12/run_files/run_07_merge_reference_secondary_slc_19.job
/scratch/05861/tg851601/KokoxiliChunk32SenAT12/run_files/run_07_merge_reference_secondary_slc_20.job
--------------------------------------------------------------------------------------------------------------------------------------------------
| File Name            | Additional Tasks | Step Active Tasks | Total Active Tasks | Step Processed Jobs | Active Jobs    | Message              |
--------------------------------------------------------------------------------------------------------------------------------------------------
| run_07_mer...c_0.job | 8                | 0/500             | 254/1000           | 1/21                | 22/25          | Submitted: 7774500   |
| run_07_mer...c_1.job | 8                | 8/500             | 279/1000           | 2/21                | 23/25          | Submitted: 7774502   |
| run_07_mer...c_2.job | 8                | 16/500            | 304/1000           | 3/21                | 25/25          | Wait 5 min           |
| run_07_mer...c_2.job | 8                | 8/500             | 176/1000           | 3/21                | 20/25          | Submitted: 7774532   |
| run_07_mer...c_3.job | 8                | 16/500            | 200/1000           | 4/21                | 24/25          | Submitted: 7774536   |
| run_07_mer...c_4.job | 8                | 24/500            | 241/1000           | 5/21                | 25/25          | Wait 5 min           |
| run_07_mer...c_4.job | 8                | 16/500            | 169/1000           | 5/21                | 18/25          | Submitted: 7774553   |
| run_07_mer...c_5.job | 8                | 24/500            | 208/1000           | 6/21                | 21/25          | Submitted: 7774557   |
| run_07_mer...c_6.job | 8                | 32/500            | 241/1000           | 7/21                | 24/25          | sbatch message:
-----------------------------------------------------------------
          Welcome to the Stampede2 Supercomputer
-----------------------------------------------------------------

No reservation for this job
--> Verifying valid submit host (login3)...OK
--> Verifying valid jobname...OK
--> Enforcing max jobs per user...OK
--> Verifying availability of your home dir (/home1/05861/tg851601)...OK
--> Verifying availability of your work2 dir (/work2/05861/tg851601/stampede2)...OK
--> Verifying availability of your scratch dir (/scratch/05861/tg851601)...OK
--> Verifying valid ssh keys...OK
--> Verifying access to desired queue (skx-normal)...OK
--> Verifying job request is within current queue limits...FAILED

    [*] Too many simultaneous jobs in queue.
        --> Max job limits for skx-normal =  25 jobs
sbatch submit error: exit code 1. Sleep 60 seconds and try again
Jobs submitted:
falkamelung commented 3 years ago

Hi @Ovec8hkin: Even for the skx-dev queue, which has a job limit of 1, it occasionally shows 2/1 active jobs (although it does not fail). Wouldn't that make it easy to debug why the number of active jobs is calculated incorrectly? You could just run a few concurrent workflows on skx-normal with a job limit of 2 or 3.

KokoxiliChunk30SenDT150
| run_04_ful...r_0.job | 22               | 117/500           | 334/1000           | 1/7                 | 0/1            | Submitted: 7814671   |
| run_04_ful...r_1.job | 22               | 162/500           | 319/1000           | 2/7                 | 2/1            | Wait 5 min           |
| run_04_ful...r_1.job | 22               | 162/500           | 414/1000           | 2/7                 | 2/1            | Wait 5 min           |
| run_04_ful...r_1.job | 22               | 162/500           | 333/1000           | 2/7                 | 2/1            | Wait 5 min           |
| run_04_ful...r_1.job | 22               | 162/500           | 378/1000           | 2/7                 | 2/1            | Wait 5 min           |
| run_04_ful...r_1.job | 22               | 162/500           | 360/1000           | 2/7                 | 2/1            | Wait 5 min           |
| run_04_ful...r_1.job | 22               | 45/500            | 397/1000           | 2/7                 | 2/1            | Wait 5 min           |
| run_04_ful...r_1.job | 22               | 45/500            | 397/1000           | 2/7                 | 1/1            | Wait 5 min           |
| run_04_ful...r_1.job | 22               | 0/500             | 426/1000           | 2/7                 | 1/1            | Wait 5 min           |
| run_04_ful...r_1.job | 22               | 23/500            | 366/1000           | 2/7                 | 1/1            | Wait 5 min           |
Ovec8hkin commented 3 years ago

It's almost certainly a race condition. If the submission fails but the offending file ultimately gets submitted successfully, I don't see a big issue. I can try some debugging as you mentioned, but I doubt there is a true solution.

falkamelung commented 3 years ago

It would be good if you could try some more debugging. This script is key to all our processing, and I would much prefer to have it bug-free, as I need to run ~15 submit_jobs.bash workflows concurrently.

If you can't find it, we could, after each run step, count the run*.o files and compare that count with the number of submitted jobs.
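
That sanity check could be as simple as the following sketch (the output-file pattern and variable names are illustrative assumptions):

```bash
# Hypothetical post-step sanity check: compare run*.o files against submitted jobs.
step_name="run_07_merge_reference_secondary_slc"              # example step
num_out_files=$(ls "${RUN_FILES_DIR}/${step_name}"_*.o* 2>/dev/null | wc -l)
num_submitted=${#submitted_job_ids[@]}                        # array filled during submission

if (( num_out_files != num_submitted )); then
    echo "WARNING: ${step_name}: ${num_submitted} jobs submitted but ${num_out_files} *.o files found" >&2
fi
```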

I don't think a race condition can cause errors in counting. It could cause a failed job submission if two concurrent workflows determine at the same time that there are fewer than 25 active jobs and both try to submit.
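
One possible way to avoid that particular race (a sketch only, not existing code) would be to serialize the check-and-submit step across concurrent workflows with a shared lock, e.g. via flock:

```bash
# Hypothetical sketch: make "count active jobs, then sbatch" atomic across
# concurrently running workflows by holding a shared lock file.
lock_file="${SCRATCH}/minsar_sbatch.lock"         # assumed shared location

(
    flock -w 600 9 || { echo "could not acquire submit lock" >&2; exit 1; }
    active_jobs=$(squeue -u "$USER" -p skx-normal -h | wc -l)
    if (( active_jobs < 25 )); then
        sbatch "$job_file"
    fi
) 9>"$lock_file"
```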

Ovec8hkin commented 3 years ago

Hi @Ovec8hkin: Even for the skx-dev queue, which has a job limit of 1, it occasionally shows 2/1 active jobs (although it does not fail). Wouldn't that make it easy to debug why the number of active jobs is calculated incorrectly? You could just run a few concurrent workflows on skx-normal with a job limit of 2 or 3.

KokoxiliChunk30SenDT150
| run_04_ful...r_0.job | 22               | 117/500           | 334/1000           | 1/7                 | 0/1            | Submitted: 7814671   |
| run_04_ful...r_1.job | 22               | 162/500           | 319/1000           | 2/7                 | 2/1            | Wait 5 min           |
| run_04_ful...r_1.job | 22               | 162/500           | 414/1000           | 2/7                 | 2/1            | Wait 5 min           |
| run_04_ful...r_1.job | 22               | 162/500           | 333/1000           | 2/7                 | 2/1            | Wait 5 min           |
| run_04_ful...r_1.job | 22               | 162/500           | 378/1000           | 2/7                 | 2/1            | Wait 5 min           |
| run_04_ful...r_1.job | 22               | 162/500           | 360/1000           | 2/7                 | 2/1            | Wait 5 min           |
| run_04_ful...r_1.job | 22               | 45/500            | 397/1000           | 2/7                 | 2/1            | Wait 5 min           |
| run_04_ful...r_1.job | 22               | 45/500            | 397/1000           | 2/7                 | 1/1            | Wait 5 min           |
| run_04_ful...r_1.job | 22               | 0/500             | 426/1000           | 2/7                 | 1/1            | Wait 5 min           |
| run_04_ful...r_1.job | 22               | 23/500            | 366/1000           | 2/7                 | 1/1            | Wait 5 min           |

Is this with the most up-to-date copy of the codebase? It doesn't look like it, as I modified that table to report the reason for failure (see below). I think that, after updating to sbatch_minsar and making some minor structural changes to the control flow, I eliminated this type of error, which I still think is the result of some strange race condition, as I can't replicate it in any capacity.

The new sbatch_conditional table looks as follows. The 2/1 report indicates that the file being submitted is number 2, with a maximum of 1 allowed; it subsequently failed as expected. If 1/1 were reported, the file would submit successfully (see the final line of the table), as it would be submission 1 of 1. You can always think of the table as reporting the resource status AFTER the job in question was attempted to be submitted. I could change it to reflect the status BEFORE the job was submitted (i.e., what the actual checks report), but this way makes more sense to me.

---------------------------------------------------------------------------------------------------------------------------------------------------------
| File Name            | Extra Tasks | Step Active Tasks | Total Active Tasks | Step Processed Jobs | Active Jobs | Message                             |  
---------------------------------------------------------------------------------------------------------------------------------------------------------
| run_01_unp...e_0.job | 1           | 2/1500            | 2/3000             | 1/1                 | 2/1         | Submission failed.                  |
|                      |             |                   |                    |                     |             | Max job count exceeded.             |
|                      |             |                   |                    |                     |             | Wait 5 minutes.                     |
| run_01_unp...e_0.job | 1           | 2/1500            | 2/3000             | 1/1                 | 2/1         | Submission failed.                  |
|                      |             |                   |                    |                     |             | Max job count exceeded.             |
|                      |             |                   |                    |                     |             | Wait 5 minutes.                     |
| run_01_unp...e_0.job | 1           | 2/1500            | 2/3000             | 1/1                 | 2/1         | Submission failed.                  |
|                      |             |                   |                    |                     |             | Max job count exceeded.             |
|                      |             |                   |                    |                     |             | Wait 5 minutes.                     |
| run_01_unp...e_0.job | 1           | 2/1500            | 2/3000             | 1/1                 | 2/1         | Submission failed.                  |
|                      |             |                   |                    |                     |             | Max job count exceeded.             |
|                      |             |                   |                    |                     |             | Wait 5 minutes.                     |
| run_01_unp...e_0.job | 1           | 1/1500            | 1/3000             | 1/1                 | 1/1         | Submitted: 7818379                  |
---------------------------------------------------------------------------------------------------------------------------------------------------------
falkamelung commented 3 years ago

No, that was the old version. I just posted it because it shows that a smart way to debug would be to use a job limit of, e.g., 3.

Ovec8hkin commented 3 years ago

I mean, I can't do anything without an actual debug output. You need to post a real one for me to see what's going on.

falkamelung commented 3 years ago

Here is a problem which might have caused the earlier failure.

It submits 34 jobs but is waiting forever for the last job. It turns out that one job has status NODE_FAIL. So we should re-submit for this case in a similar way as for TIMEOUT.

no known data problem found

Jobfiles to run:
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_0.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_1.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_2.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_3.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_4.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_5.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_6.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_7.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_8.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_9.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_10.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_11.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_12.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_13.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_14.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_15.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_16.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_17.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_18.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_19.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_20.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_21.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_22.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_23.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_24.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_25.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_26.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_27.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_28.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_29.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_30.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_31.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_32.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_33.job
/scratch/05861/tg851601/KokoxiliChunk32SenDT150/run_files/run_09_merge_burst_igram_33.job
--------------------------------------------------------------------------------------------------------------------------------------------------
| File Name            | Additional Tasks | Step Active Tasks | Total Active Tasks | Step Processed Jobs | Active Jobs    | Message              |
--------------------------------------------------------------------------------------------------------------------------------------------------
| run_09_mer...m_0.job | 18               | 340/500           | 386/1000           | 1/34                | 20/23          | Submitted: 7822670   |
| run_09_mer...m_1.job | 18               | 358/500           | 395/1000           | 2/34                | 21/23          | Submitted: 7822671   |
| run_09_mer...m_2.job | 18               | 376/500           | 413/1000           | 3/34                | 22/23          | Submitted: 7822673   |
| run_09_mer...m_3.job | 18               | 394/500           | 431/1000           | 4/34                | 23/23          | Wait 5 min           |
| run_09_mer...m_3.job | 18               | 394/500           | 394/1000           | 4/34                | 23/23          | Wait 5 min           |
| run_09_mer...m_3.job | 18               | 377/500           | 360/1000           | 4/34                | 22/23          | Submitted: 7822749   |
| run_09_mer...m_4.job | 18               | 395/500           | 395/1000           | 5/34                | 23/23          | Wait 5 min           |
| run_09_mer...m_4.job | 18               | 121/500           | 160/1000           | 5/34                | 8/23           | Submitted: 7822832   |
| run_09_mer...m_5.job | 18               | 156/500           | 195/1000           | 6/34                | 10/23          | Submitted: 7822835   |
| run_09_mer...m_6.job | 18               | 191/500           | 230/1000           | 7/34                | 11/23          | Submitted: 7822837   |
| run_09_mer...m_7.job | 18               | 226/500           | 265/1000           | 8/34                | 13/23          | Submitted: 7822840   |
| run_09_mer...m_8.job | 18               | 261/500           | 300/1000           | 9/34                | 15/23          | Submitted: 7822842   |
| run_09_mer...m_9.job | 18               | 296/500           | 335/1000           | 10/34               | 17/23          | Submitted: 7822844   |
| run_09_mer..._10.job | 18               | 331/500           | 370/1000           | 11/34               | 19/23          | Submitted: 7822846   |
| run_09_mer..._11.job | 18               | 366/500           | 366/1000           | 12/34               | 21/23          | Submitted: 7822849   |
| run_09_mer..._12.job | 18               | 401/500           | 401/1000           | 13/34               | 23/23          | Wait 5 min           |
| run_09_mer..._12.job | 18               | 315/500           | 354/1000           | 13/34               | 18/23          | Submitted: 7822876   |
| run_09_mer..._13.job | 18               | 350/500           | 389/1000           | 14/34               | 20/23          | Submitted: 7822877   |
| run_09_mer..._14.job | 18               | 385/500           | 424/1000           | 15/34               | 21/23          | Submitted: 7822880   |
| run_09_mer..._15.job | 18               | 405/500           | 403/1000           | 16/34               | 22/23          | Submitted: 7822882   |
| run_09_mer..._16.job | 18               | 405/500           | 405/1000           | 17/34               | 23/23          | Wait 5 min           |
| run_09_mer..._16.job | 18               | 267/500           | 303/1000           | 17/34               | 8/23           | Submitted: 7822900   |
| run_09_mer..._17.job | 18               | 196/500           | 249/1000           | 18/34               | 9/23           | Submitted: 7822902   |
| run_09_mer..._18.job | 18               | 162/500           | 162/1000           | 19/34               | 10/23          | Submitted: 7822903   |
| run_09_mer..._19.job | 18               | 180/500           | 180/1000           | 20/34               | 11/23          | Submitted: 7822904   |
| run_09_mer..._20.job | 18               | 198/500           | 198/1000           | 21/34               | 12/23          | Submitted: 7822906   |
| run_09_mer..._21.job | 18               | 216/500           | 216/1000           | 22/34               | 13/23          | Submitted: 7822909   |
| run_09_mer..._22.job | 18               | 234/500           | 271/1000           | 23/34               | 14/23          | Submitted: 7822913   |
| run_09_mer..._23.job | 18               | 252/500           | 289/1000           | 24/34               | 15/23          | Submitted: 7822917   |
| run_09_mer..._24.job | 18               | 270/500           | 307/1000           | 25/34               | 16/23          | Submitted: 7822919   |
| run_09_mer..._25.job | 18               | 288/500           | 325/1000           | 26/34               | 17/23          | Submitted: 7822920   |
| run_09_mer..._26.job | 18               | 306/500           | 343/1000           | 27/34               | 18/23          | Submitted: 7822922   |
| run_09_mer..._27.job | 18               | 324/500           | 361/1000           | 28/34               | 19/23          | Submitted: 7822923   |
| run_09_mer..._28.job | 18               | 342/500           | 379/1000           | 29/34               | 20/23          | Submitted: 7822924   |
| run_09_mer..._29.job | 18               | 360/500           | 397/1000           | 30/34               | 21/23          | Submitted: 7822925   |
| run_09_mer..._30.job | 18               | 378/500           | 415/1000           | 31/34               | 22/23          | Submitted: 7822926   |
| run_09_mer..._31.job | 18               | 396/500           | 433/1000           | 32/34               | 22/23          | Submitted: 7822932   |
| run_09_mer..._32.job | 18               | 396/500           | 396/1000           | 33/34               | 23/23          | Wait 5 min           |
| run_09_mer..._32.job | 18               | 359/500           | 381/1000           | 33/34               | 19/23          | Submitted: 7822959   |
| run_09_mer..._33.job | 16               | 341/500           | 399/1000           | 34/34               | 19/23          | Submitted: 7822961   |
--------------------------------------------------------------------------------------------------------------------------------------------------
Jobs submitted: 7822670 7822671 7822673 7822749 7822832 7822835 7822837 7822840 7822842 7822844 7822846 7822849 7822876 7822877 7822880 7822882 7822900 7822902 7822903 7822904 7822906 7822909 7822913 7822917 7822919 7822920 7822922 7822923 7822924 7822925 7822926 7822932 7822959 7822961
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 15 COMPLETED, 1 RUNNING , 17 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 15 COMPLETED, 1 RUNNING , 17 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 15 COMPLETED, 1 RUNNING , 17 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 15 COMPLETED, 1 RUNNING , 17 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 15 COMPLETED, 6 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 15 COMPLETED, 6 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 16 COMPLETED, 5 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 16 COMPLETED, 5 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 16 COMPLETED, 5 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 16 COMPLETED, 5 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 16 COMPLETED, 5 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 16 COMPLETED, 5 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 16 COMPLETED, 5 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 16 COMPLETED, 5 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 16 COMPLETED, 5 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 17 COMPLETED, 4 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 18 COMPLETED, 3 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 20 COMPLETED, 1 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 0 RUNNING , 12 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 21 COMPLETED, 1 RUNNING , 11 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 22 COMPLETED, 0 RUNNING , 11 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 22 COMPLETED, 0 RUNNING , 11 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 22 COMPLETED, 0 RUNNING , 11 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 22 COMPLETED, 0 RUNNING , 11 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 22 COMPLETED, 0 RUNNING , 11 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 22 COMPLETED, 0 RUNNING , 11 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 22 COMPLETED, 0 RUNNING , 11 PENDING, 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 22 COMPLETED, 11 RUNNING, 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 23 COMPLETED, 10 RUNNING, 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 23 COMPLETED, 10 RUNNING, 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 23 COMPLETED, 10 RUNNING, 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 25 COMPLETED, 8 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 32 COMPLETED, 1 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
KokoxiliChunk32SenDT150, run_09_merge_burst_igram, 34 jobs: 33 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
sacct -j  7822846
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode 
------------ ---------- ---------- ---------- ---------- ---------- -------- 
7822846      run_09_me+ skx-normal tg-ear200+         96  NODE_FAIL      0:0 
falkamelung commented 3 years ago

Here is another error that I am observing for the first time. The re-submission after the timeout worked well, but it still reports an sacct error. As this is the first time I have seen this, no action is needed, but maybe you have an idea.

| run_08_gen..._59.job | 9                | 36/500            | 386/1000           | 60/69               | 21/23          | Submitted: 7829206   |
| run_08_gen..._60.job | 9                | 45/500            | 376/1000           | 61/69               | 23/23          | Wait 5 min           |     
| run_08_gen..._60.job | 9                | 45/500            | 357/1000           | 61/69               | 23/23          | Wait 5 min           |     
| run_08_gen..._60.job | 9                | 45/500            | 375/1000           | 61/69               | 22/23          | Submitted: 7829229   |
| run_08_gen..._61.job | 9                | 54/500            | 384/1000           | 62/69               | 23/23          | Wait 5 min           |     
| run_08_gen..._61.job | 9                | 45/500            | 388/1000           | 62/69               | 23/23          | Wait 5 min           |     
| run_08_gen..._61.job | 9                | 45/500            | 388/1000           | 62/69               | 23/23          | Wait 5 min           |     
| run_08_gen..._61.job | 9                | 45/500            | 369/1000           | 62/69               | 23/23          | Wait 5 min           |     
| run_08_gen..._61.job | 9                | 45/500            | 312/1000           | 62/69               | 20/23          | Submitted: 7829374   |
| run_08_gen..._62.job | 9                | 54/500            | 340/1000           | 63/69               | 23/23          | Wait 5 min           |     
| run_08_gen..._62.job | 9                | 45/500            | 263/1000           | 63/69               | 17/23          | Submitted: 7829402   |
| run_08_gen..._63.job | 9                | 54/500            | 291/1000           | 64/69               | 19/23          | Submitted: 7829405   |
| run_08_gen..._64.job | 9                | 63/500            | 319/1000           | 65/69               | 23/23          | Wait 5 min           |     
| run_08_gen..._64.job | 9                | 45/500            | 207/1000           | 65/69               | 13/23          | Submitted: 7829428   |
| run_08_gen..._65.job | 9                | 54/500            | 216/1000           | 66/69               | 14/23          | Submitted: 7829429   |
| run_08_gen..._66.job | 9                | 63/500            | 244/1000           | 67/69               | 17/23          | Submitted: 7829433   |
| run_08_gen..._67.job | 9                | 72/500            | 272/1000           | 68/69               | 20/23          | Submitted: 7829436   |
| run_08_gen..._68.job | 2                | 81/500            | 319/1000           | 69/69               | 23/23          | Wait 5 min           |     
| run_08_gen..._68.job | 2                | 63/500            | 139/1000           | 69/69               | 6/23           | Submitted: 7829468   |
--------------------------------------------------------------------------------------------------------------------------------------------------
Jobs submitted: 7827208 7827210 7827211 7827212 7827213 7827215 7827227 7827231 7827238 7827261 7827301 7827479 7827511 7827523 7827563 7827578 7827585 7827589 7827626 7827637 7827654 7827657 7827678 7827683 7827687 7827695 7827712 7827730 7827748 7827763 7827811 7827828 7827837 7827852 7827867 7827874 7827879 7827889 7827904 7827911 7827923 7827932 7827949 7827962 7827965 7827970 7828008 7828011 7828179 7828283 7828359 7828365 7828388 7828470 7828472 7828476 7828719 7829058 7829086 7829206 7829229 7829374 7829402 7829405 7829428 7829429 7829433 7829436 7829468
Timedout with walltime of 0:14:16.
Resubmitting file (/scratch/05861/tg851601/KokoxiliChunk37SenDT150/run_files/run_08_generate_burst_igram_5.job) with new walltime of 00:17:07
Resubmitted as jobumber: 7829480 7829498 7829517 7829521 7829527 7829533 7829538 7829544 7829602 7829606 7829609.
sacct: error: Unknown arguments:
sacct: error:  7829498
sacct: error:  7829517
sacct: error:  7829521
sacct: error:  7829527
sacct: error:  7829533
sacct: error:  7829538
sacct: error:  7829544
sacct: error:  7829602
sacct: error:  7829606
sacct: error:  7829609
KokoxiliChunk37SenDT150, run_08_generate_burst_igram, 69 jobs: 68 COMPLETED, 0 RUNNING , 0 PENDING , 1 WAITING   .
sacct: error: Unknown arguments:
sacct: error:  7829498
sacct: error:  7829517
sacct: error:  7829521
sacct: error:  7829527
sacct: error:  7829533
sacct: error:  7829538
sacct: error:  7829544
sacct: error:  7829602
sacct: error:  7829606
sacct: error:  7829609
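
One likely cause (an assumption based on the "Unknown arguments" messages): the resubmitted job numbers are passed to sacct separated by spaces, whereas `sacct -j` expects a single comma-separated job list, so every ID after the first is rejected. For example:

```bash
# Fails: every ID after the first is treated as an unknown argument
sacct -j 7829480 7829498 7829517

# Works: sacct -j takes one comma-separated list of job IDs
sacct -j 7829480,7829498,7829517
```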
Ovec8hkin commented 3 years ago

Here is a problem which might have caused the earlier failure.

It submits 34 jobs but is waiting forever for the last job. It turns out that one job has status NODE_FAIL. So we should re-submit for this case in a similar way as for TIMEOUT.

Ok, fixed this. For future reference, this is the list of JOB_STATE_CODEs that SLURM can return (https://slurm.schedmd.com/squeue.html). We support 7 of them right now.
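
For illustration, the kind of handling involved could look like this sketch (not the actual patch; the resubmission helper name is assumed):

```bash
# Hypothetical sketch: query the final job state and branch on it.
state=$(sacct -j "$job_id" --format=State --noheader | head -n 1 | awk '{print $1}')

case "$state" in
    COMPLETED)
        ;;                                            # nothing to do
    TIMEOUT)
        resubmit_with_longer_walltime "$job_file" ;;  # existing behavior (helper name assumed)
    NODE_FAIL)
        sbatch "$job_file" ;;                         # resubmit as-is, analogous to TIMEOUT
    FAILED|CANCELLED*)
        echo "Job $job_id ended in state $state" >&2 ;;
esac
```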

Ovec8hkin commented 3 years ago

Here is another error that I am observing for the first time. The re-submission after the timeout worked well, but it still reports an sacct error. As this is the first time I have seen this, no action is needed, but maybe you have an idea.

It's possible I introduced this after finding a bug in the timeout resubmission code. I have limited datasets to test on right now, so some edge cases might have slipped through. However, based on the table output, you're still not using the most up-to-date copies of sbatch_conditional and sbatch_minsar.

falkamelung commented 3 years ago

I ran 7 workflows, three of which stopped prematurely (41, 38, 36). I don't see any systematic pattern in why they stopped.

-rw-rw---- 1 tg851601 G-820134   6005 Jun  7 14:20 KokoxiliChunk41SenDT150/process0.log
-rw-rw---- 1 tg851601 G-820134  71620 Jun  7 22:50 KokoxiliChunk38SenDT150/process0.log
-rw-rw---- 1 tg851601 G-820134 114034 Jun  7 22:55 KokoxiliChunk36SenDT150/process0.log
-rw-rw---- 1 tg851601 G-820134 158629 Jun  8 01:53 KokoxiliChunk35SenDT150/process0.log
-rw-rw---- 1 tg851601 G-820134 158515 Jun  8 01:57 KokoxiliChunk40SenDT150/process0.log
-rw-rw---- 1 tg851601 G-820134 162118 Jun  8 02:18 KokoxiliChunk39SenDT150/process0.log
-rw-rw---- 1 tg851601 G-820134 173522 Jun  8 02:20 KokoxiliChunk37SenDT150/process0.log

KokoxiliChunk36SenDT150

/scratch/05861/tg851601/KokoxiliChunk36SenDT150/run_files/run_10_filter_coherence_33.job
---------------------------------------------------------------------------------------------------------------------------------------------------------
| File Name            | Extra Tasks | Step Active Tasks | Total Active Tasks | Step Processed Jobs | Active Jobs | Message                             |  
---------------------------------------------------------------------------------------------------------------------------------------------------------
| run_10_fil...e_0.job | 18          | 0/500             | 299/1000           | 1/34                | 23/25       | Submitted: 7855415                  |
| run_10_fil...e_1.job | 18          | 37/500            | 396/1000           | 2/34                | 27/25       | Not submitted.                      |
|                      |             |                   |                    |                     |             | Max job count exceeded.             |
|                      |             |                   |                    |                     |             | Wait 5  minutes.                    |
| run_10_fil...e_1.job | 18          | 18/500            | 428/1000           | 2/34                | 27/25       | Not submitted.                      |
|                      |             |                   |                    |                     |             | Max job count exceeded.             |
|                      |             |                   |                    |                     |             | Wait 5  minutes.                    |
| run_10_fil...e_1.job | 18          | 18/500            | 325/1000           | 2/34                | 12/25       | Submitted: 7855541                  |
| run_10_fil...e_2.job | 18          | 36/500            | 284/1000           | 3/34                | 20/25       | Submitted: 7855558                  |
| run_10_fil...e_3.job | 18          | 73/500            | 392/1000           | 4/34                | 25/25       | Not submitted.                      |
|                      |             |                   |                    |                     |             | Max job count exceeded.             |
|                      |             |                   |                    |                     |             | Wait 5  minutes.                    |
| run_10_fil...e_3.job | 18          | 18/500            | 294/1000           | 4/34                | 19/25       | Submitted: 7855593                  |

KokoxiliChunk38SenDT150

tail -20 process0.log
|                      |             |                   |                    |                     |             | Wait 5  minutes.                    |
| run_08_gen..._58.job | 9           | 63/500            | 316/1000           | 59/69               | 22/25       | Submitted: 7855317                  |
| run_08_gen..._59.job | 9           | 74/500            | 378/1000           | 60/69               | 27/25       | Not submitted.                      |
|                      |             |                   |                    |                     |             | Max job count exceeded.             |
|                      |             |                   |                    |                     |             | Wait 5  minutes.                    |
| run_08_gen..._59.job | 9           | 65/500            | 352/1000           | 60/69               | 25/25       | Not submitted.                      |
|                      |             |                   |                    |                     |             | Max job count exceeded.             |
|                      |             |                   |                    |                     |             | Wait 5  minutes.                    |
| run_08_gen..._59.job | 9           | 0/500             | 179/1000           | 60/69               | 12/25       | Submitted: 7855377                  |
| run_08_gen..._60.job | 9           | 9/500             | 247/1000           | 61/69               | 16/25       | Submitted: 7855388                  |
| run_08_gen..._61.job | 9           | 18/500            | 324/1000           | 62/69               | 16/25       | Submitted: 7855398                  |
| run_08_gen..._62.job | 9           | 27/500            | 333/1000           | 63/69               | 21/25       | Submitted: 7855409                  |
| run_08_gen..._63.job | 9           | 36/500            | 359/1000           | 64/69               | 27/25       | Not submitted.                      |
|                      |             |                   |                    |                     |             | Max job count exceeded.             |
|                      |             |                   |                    |                     |             | Wait 5  minutes.                    |
| run_08_gen..._63.job | 9           | 36/500            | 428/1000           | 64/69               | 27/25       | Not submitted.                      |
|                      |             |                   |                    |                     |             | Max job count exceeded.             |
|                      |             |                   |                    |                     |             | Wait 5  minutes.                    |
| run_08_gen..._63.job | 9           | 18/500            | 359/1000           | 64/69               | 12/25       | Submitted: 7855534                  |
| run_08_gen..._64.job | 9           | 27/500            | 232/1000           | 65/69               | 18/25       | Submitted: 7855549    

KokoxiliChunk41SenDT150

tail -20 process0.log
| run_07_mer...c_6.job | 9           | 196/500           | 187/1000           | 7/15                | 20/25       | Submitted: 7851408                  |
| run_07_mer...c_7.job | 9           | 213/500           | 205/1000           | 8/15                | 23/25       | Submitted: 7851416                  |
| run_07_mer...c_8.job | 9           | 240/500           | 240/1000           | 9/15                | 25/25       | Not submitted.                      |
|                      |             |                   |                    |                     |             | Max job count exceeded.             |
|                      |             |                   |                    |                     |             | Wait 5  minutes.                    |
| run_07_mer...c_8.job | 9           | 206/500           | 233/1000           | 9/15                | 17/25       | Submitted: 7851460                  |
| run_07_mer...c_9.job | 9           | 224/500           | 260/1000           | 10/15               | 19/25       | Submitted: 7851469                  |
| run_07_mer..._10.job | 9           | 233/500           | 269/1000           | 11/15               | 22/25       | Submitted: 7851478                  |
| run_07_mer..._11.job | 9           | 233/500           | 278/1000           | 12/15               | 25/25       | Not submitted.                      |
|                      |             |                   |                    |                     |             | Max job count exceeded.             |
|                      |             |                   |                    |                     |             | Wait 5  minutes.                    |
| run_07_mer..._11.job | 9           | 216/500           | 225/1000           | 12/15               | 24/25       | Submitted:                          |
| run_07_mer..._12.job | 9           | 225/500           | 234/1000           | 13/15               | 25/25       | Not submitted.                      |
|                      |             |                   |                    |                     |             | Max job count exceeded.             |
|                      |             |                   |                    |                     |             | Wait 5  minutes.                    |
| run_07_mer..._12.job | 9           | 117/500           | 243/1000           | 13/15               | 18/25       | Submitted: 7851591                  |
| run_07_mer..._13.job | 9           | 153/500           | 288/1000           | 14/15               | 22/25       | Submitted: 7851601                  |
| run_07_mer..._14.job | 3           | 189/500           | 333/1000           | 15/15               | 25/25       | Not submitted.                      |
|                      |             |                   |                    |                     |             | Max job count exceeded.             |
|                      |             |                   |                    |                     |             | Wait 5  minutes.         
Ovec8hkin commented 3 years ago

I believe this is all fixed now.