AWGL / TSO500_post_processing


Script 2 job "crashed" and re-queued itself, causing app to crash #18

Closed by SophieS9 2 years ago

SophieS9 commented 2 years ago

Run 220202_A00748_0218_BHMWC2DRXY Sample 22M01519

Script 2 for this sample was submitted as normal on node 5 and then stopped 2.5 hours later, as the err and out file timestamps show. The err file is empty and the out file has no clear error message, but the app did not run to completion. It stopped during or after the TrimFastq step (which does appear to have finished):

-rw-rw-r--. 1 transfer transfer     0 Feb  3 16:31 22M01519_2_TSO500-176824-cs05.err
-rw-rw-r--. 1 transfer transfer 47042 Feb  3 19:05 22M01519_2_TSO500-176824-cs05.out

The job then appears to have resubmitted itself with the same job ID on node 10. This second attempt crashed immediately because the analysis directory for this sample already existed:

-rw-rw-r--. 1 transfer transfer     0 Feb  3 19:48 22M01519_2_TSO500-176824-cs10.err
-rw-rw-r--. 1 transfer transfer    62 Feb  3 19:48 22M01519_2_TSO500-176824-cs10.out
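
One way to stop a requeued attempt from tripping over the leftover analysis directory is to check Slurm's restart counter before starting work. The sketch below is an assumption about how such a guard could look, not the pipeline's actual code: `decide_action` and the example path are hypothetical, while `SLURM_RESTART_COUNT` is the environment variable Slurm sets when a job has been restarted or requeued.

```python
import os
from pathlib import Path


def decide_action(env, analysis_dir):
    """Return 'run', 'clean_and_run', or 'abort' for this job attempt.

    env          -- mapping of environment variables (e.g. os.environ)
    analysis_dir -- hypothetical per-sample output directory
    """
    # Slurm sets SLURM_RESTART_COUNT only when the job was restarted or requeued.
    restarts = int(env.get("SLURM_RESTART_COUNT", "0"))
    exists = Path(analysis_dir).exists()

    if not exists:
        return "run"            # fresh start, nothing to clash with
    if restarts > 0:
        return "clean_and_run"  # leftover from an earlier attempt of this job
    return "abort"              # directory exists for some other reason


if __name__ == "__main__":
    # "analysis/22M01519" is an illustrative path, not the pipeline's real layout.
    print(decide_action(os.environ, "analysis/22M01519"))
```

Alternatively, submitting with `sbatch --no-requeue` asks Slurm not to requeue the job at all, so a node failure leaves a single failed job to investigate.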

Checking the job accounting information in Slurm shows that the job and its batch step failed, while only the extern step completed:

(base) [transfer@ch1 220202_A00748_0218_BHMWC2DRXY]$ sacct -j 176824
       JobID    JobName  Partition    Account  AllocCPUS      State ExitCode
------------ ---------- ---------- ---------- ---------- ---------- --------
176824       2_TSO500.+       high ***test_a+         24     FAILED     13:0
176824.batch      batch            ***test_a+         24     FAILED     13:0
176824.exte+     extern            ***test_a+         24  COMPLETED      0:0
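
When a job ID is queried with `-j`, `sacct` returns the most recent record for that ID unless `--duplicates` is given, so earlier attempts of a requeued job may be hidden. The following is a minimal sketch for listing every record, assuming it runs on a host where `sacct` is available; `parse_sacct` and `job_attempts` are hypothetical helper names, and the field selection is just one reasonable choice.

```python
import subprocess


def parse_sacct(text):
    """Parse `sacct --parsable2` output (pipe-delimited, header first) into dicts."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split("|")
    return [dict(zip(header, ln.split("|"))) for ln in lines[1:]]


def job_attempts(job_id):
    """Run sacct for one job ID, including duplicate (requeued) records."""
    cmd = [
        "sacct", "-j", str(job_id), "--duplicates", "--parsable2",
        "--format=JobID,NodeList,State,ExitCode,Start,End",
    ]
    out = subprocess.run(cmd, capture_output=True, text=True, check=True).stdout
    return parse_sacct(out)
```

Calling `job_attempts(176824)` would return one dictionary per accounting record, which makes it easier to see on which node each attempt ran and how it ended.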

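ExitCode 13:0 is exit status 13 with no terminating signal (sacct prints exit status:signal). "Broken pipe: write to pipe with no readers" is the meaning of signal 13 (SIGPIPE), which would appear after the colon, so it may not be the cause here.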
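
To check which of the two ExitCode fields (`exit status:signal`) is set, the value can be split programmatically. This is a minimal sketch with `describe_exit_code` as a hypothetical helper name; signal names come from Python's `signal` module and depend on the platform.

```python
import signal


def describe_exit_code(exit_code):
    """Split a sacct ExitCode string ('status:signal') into a readable summary."""
    status_text, _, signal_text = exit_code.partition(":")
    status = int(status_text)
    signum = int(signal_text) if signal_text else 0

    if signum:
        # A non-zero second field means the step was terminated by a signal.
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = "unknown signal"
        return f"killed by signal {signum} ({name})"
    if status:
        return f"exited with status {status}, no signal"
    return "exited successfully"


print(describe_exit_code("13:0"))
```

The meaning of a non-zero exit status is defined by the program that returned it, so the number alone does not identify the failure.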

SophieS9 commented 2 years ago

Identical issue on run 220207_A00748_0220_BHMWC3DRXY with Sample 22M01708.

The job initially ran on node 5 and crashed after 10 minutes. It was then resubmitted on node 1, but failed because the analysis directory already existed. Both attempts had the same job ID: 176967.

sacct -j showed the same ExitCode (13) as the run above.

SophieS9 commented 2 years ago

Node cs05 fixed. Issue resolved!