DIRACGrid / DIRAC

DIRAC Grid
http://diracgrid.org
GNU General Public License v3.0
113 stars 176 forks source link

[8.0] Proper handling of waiting jobs when set to be killed #7690

Closed michmx closed 3 months ago

michmx commented 4 months ago

When jobs are in status SUBMITTING, WAITING, etc, and they are set to be killed, they are not added to the list markKilledJobList. This pull request fix the identation when kill instead of delete is used.

BEGINRELEASENOTES

*WMS FIX: Proper killing of jobs when not matched, running or stalled

ENDRELEASENOTES

iueda commented 4 months ago

I think the current indentation is correct, for eg. SUBMITTING is not allowed to go to KILLED:

            SUBMITTING: State(0, [RECEIVED, CHECKING, DELETED], defState=SUBMITTING),  # initial state

WAITING is allowed to go to KILLED since #7276, so it should be moved from this list of job statuses which are not allowed to go to KILLED to that which can go to KILLED.

Or, can't the possible states transitions (https://github.com/DIRACGrid/DIRAC/blob/rel-v8r0/src/DIRAC/WorkloadManagementSystem/Client/JobStatus.py) be checked instead of these hard-coded list?

fstagni commented 4 months ago

I think @iueda is correct. For checking the state transitions, https://github.com/DIRACGrid/DIRAC/blob/5f268bd719d0b24459ab052464bcbb134bae4efa/src/DIRAC/WorkloadManagementSystem/Client/JobStatus.py#L132 (`ā€ˇfilterJobStateTransition) can be used.

michmx commented 4 months ago

@fstagni we have a fundamental question before moving on with this PR or a more elaborated implementation. When you have a job in Waiting status, is it meaningful to use sendKillCommand=True in killJob()?

I ask because with the current indentation, only statuses not hard-coded are appended in the markKilledJobList, which use sendKillCommand=False (this was the behavior on DIRAC v7r2 for waiting jobs).

Moving the Waiting status to line 520 would add them to killJobList, using sendKillCommand=True.

fstagni commented 4 months ago

sendKillCommand=True is effectively only meaningful when the job is already running. For any other state it does not make sense.

iueda commented 4 months ago

According to WorkloadManagementSystem/Client/JobStatus.py,

the states that can go to Killed are

The current code does not kill jobs in 'WAITING' state

Do we want a) to treat them all the same, eg. killJobList.append(jobID), for sendKillCommand=True or False doesn't matter for jobs not running,
or b) to do as follows?

If the latter (b), then we would need to keep the hard-coding.

fstagni commented 4 months ago

I would go for option "a".

michmx commented 3 months ago

I would go for option "a".

Sorry for the delay @fstagni @iueda . I am back on this issue. Now the new proposal uses filterJobStateTransition() to check if the job can go to killed, plus deleted if requested.

Also, all jobs to be killed (after the filtering) go to killJobList, it means, no usage of sendKillCommand=False

fstagni commented 3 months ago

I took the freedom to push for adding a unit test.

michmx commented 3 months ago

I took the freedom to push for adding a unit test.

Thanks a lot!

DIRACGridBot commented 3 months ago

Sweep summary

Sweep ran in https://github.com/DIRACGrid/DIRAC/actions/runs/10367347752

Failed: