Cannot cancel failed jobs due to stage in/out failures

RSE-Cambridge / data-acc

Data Accelerator: Creates a burst buffer from generic hardware and integrates it with Slurm https://www.hpc.cam.ac.uk/research/data-acc http://www.stackhpc.com

https://rse-cambridge.github.io/data-acc

Apache License 2.0

Cannot cancel failed jobs due to stage in/out failures #112

Closed jsteel44 closed 4 years ago

jsteel44 commented 5 years ago

Job 556 references a file that does not exist for stage in, and 557 references a file that does not exist for stage out.

             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               556     debug use-mult    test1 SO       0:10      1 (burst_buffer/datawarp: dws_data_out: exit status 23
)
               557     debug use-mult    test1 SO       0:09      1 (burst_buffer/datawarp: dws_data_out: exit status 23
)

Running scancel on 556 and 557 does nothing. The buffers also hang around, but that is probably expected:

JobID=557 CreateTime=2019-10-02T10:10:38 Pool=default Size=3200GiB State=staged-in UserID=test1(1001)
JobID=556 CreateTime=2019-10-02T10:07:44 Pool=default Size=3200GiB State=staged-in UserID=test1(1001)

JohnGarbutt commented 5 years ago

I think we need to get the dacctl logs relating to what happens when you cancel the job, and ideally the logs from the dacd that is the primary brick host for that buffer.

I suspect it's because the copy-out failure is blocking the delete of the buffer inside the DAC, because Slurm isn't passing the "skip copy out" flag. But it's a total guess. It might be that we can tell Slurm (or the DAC) to skip the copy out for this particular case. But we need more info.

jsteel44 commented 5 years ago

It seems now that stage-in failures can be cancelled without any problem; it is just the stage-out failures that cannot be cancelled. When I issue an scancel, nothing is printed to the dacctl.log or slurmctld.log.

jsteel44 commented 4 years ago

For jobs that fail that have a stage out process, it seems we need to cancel these with scancel -H:

       -H, --hurry
              Do not stage out any burst buffer data.

After issuing that, buffers are torn down and jobs removed from the queue. A pretty simple oversight...
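For anyone hitting the same thing, here is a minimal sketch of a wrapper around the workaround above. The function name cancel_hurry and the DRY_RUN variable are made up for illustration and are not part of Slurm or data-acc; the only real command used is scancel with its --hurry (-H) flag.

```shell
#!/bin/sh
# Hypothetical helper (not part of Slurm or data-acc): cancel jobs stuck
# in stage-out without attempting to copy burst buffer data out.
# Set DRY_RUN=1 to print the command instead of running it.
cancel_hurry() {
    if [ "$#" -eq 0 ]; then
        echo "usage: cancel_hurry JOBID..." >&2
        return 1
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Show what would be run, without touching the queue.
        echo "scancel --hurry $*"
    else
        # --hurry is the long form of -H: do not stage out burst buffer data.
        scancel --hurry "$@"
    fi
}
```

With the jobs from this issue that would be cancel_hurry 556 557, after which squeue and scontrol show burst should confirm that the jobs and buffers are gone. Note that any data still in the buffer is not staged out, so this is only appropriate when that data can be discarded.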