Closed jsteel44 closed 4 years ago
I think we need to get the dacctl logs relating to what happens when you cancel the job, and ideally the logs from the dacd that is the primary brick host for that buffer.
I suspect it's because the copy-out failure is blocking the delete of the buffer inside the DAC, because Slurm isn't passing the "skip copy out" flag. But it's a total guess. It might be that we can tell Slurm (or the DAC) to skip the copy out for this particular case. But we need more info.
It seems now that stage-in failures can be cancelled without any problem; it is just the stage-out failures that cannot be cancelled. When I issue an scancel, nothing is printed to the dacctl.log or slurmctld.log.
For failed jobs that have a stage-out step, it seems we need to cancel them with scancel -H:
-H, --hurry
Do not stage out any burst buffer data.
After issuing that, buffers are torn down and jobs removed from the queue. A pretty simple oversight...
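For anyone hitting the same thing, here is a minimal sketch of a wrapper around the command above. The function name `cancel_skip_stage_out` and the `DRY_RUN` variable are hypothetical, made up for illustration; the only real Slurm piece is `scancel --hurry` (the long form of `-H` quoted above).

```shell
# Hypothetical helper: cancel a job and skip the burst buffer stage-out.
# Set DRY_RUN=1 to print the command instead of running it.
cancel_skip_stage_out() {
    job_id="$1"
    if [ -z "$job_id" ]; then
        echo "usage: cancel_skip_stage_out <job id>" >&2
        return 2
    fi
    if [ "${DRY_RUN:-0}" = "1" ]; then
        # Show what would be run without touching the queue.
        echo "scancel --hurry $job_id"
    else
        # --hurry tells Slurm not to stage out any burst buffer data.
        scancel --hurry "$job_id"
    fi
}
```

For example, `DRY_RUN=1 cancel_skip_stage_out 557` only prints the scancel command, which is handy for checking before running it against a live queue.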
Job 556 references a file that does not exist for stage in, and 557 references a file that does not exist for stage out.
Running scancel on 556 and 557 does nothing. The buffers also hang around, but that is probably expected: