DIRACGrid / DIRAC

DIRAC Grid
http://diracgrid.org
GNU General Public License v3.0

Introducing 'Scouting' Status in WMS state transitions #7083

Open michmx opened 1 year ago

michmx commented 1 year ago

In our BelleDIRAC extension, we have defined Scout jobs as a small subset of the main jobs that run first, and the rest of the jobs are executed only when scouting is done. Main jobs get the status ‘Scouting’ while waiting for the execution of the subset. Scout jobs were presented in the DIRAC Users Workshop in 2021 (link to contribution).

Since DIRAC 7.3, the WMS handles status changes differently: the allowed state transitions are now explicitly defined, see JobStatus.py#L82

When we migrated our system to DIRAC 7.3, we noticed error messages like

    2023-03-09 18:20:12 UTC WorkloadManagement/OptimizationMind ERROR: There was a problem processing task 213431:
    getNextState: 'Scouting' is not a valid state

Our Scouting state is very similar to Staging, in the sense that jobs stay in that state, before Waiting, until some conditions are fulfilled. The transitions that jobs go through when scouting is involved are:

If you agree, we need to

1) Define the state “Scouting” at JOB_STATES: https://github.com/DIRACGrid/DIRAC/blob/rel-v7r3/src/DIRAC/WorkloadManagementSystem/Client/JobStatus.py#L48

2) Enable the transitions

    SCOUTING: State(2, [CHECKING, WAITING, FAILED, STALLED, KILLED], defState=SCOUTING),
    CHECKING: State(2, [SCOUTING, STAGING, WAITING, RESCHEDULED, FAILED, DELETED], defState=CHECKING),
    RECEIVED: State(1, [SCOUTING, CHECKING, WAITING, FAILED, DELETED], defState=RECEIVED),
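
To illustrate the effect of the two changes above, here is a minimal, self-contained sketch. It is not the DIRAC code: the `State` dataclass, `build_table` and `next_state` are hypothetical stand-ins, and the pre-change transition lists are assumed to be the ones above minus SCOUTING. It only shows why an undeclared state is rejected and how the proposed transitions would behave once declared.

```python
from dataclasses import dataclass

# Hypothetical stand-in for a state definition; NOT DIRAC's State class.
@dataclass
class State:
    level: int
    allowed: list   # states this one may move to
    default: str    # state kept when the requested move is not allowed

BASE_STATES = ["Received", "Checking", "Staging", "Waiting", "Rescheduled",
               "Failed", "Stalled", "Killed", "Deleted"]

def build_table(with_scouting):
    """Build a toy transition table; states mapped to None accept any valid target."""
    table = {name: None for name in BASE_STATES}
    # Assumption: the current lists are the proposed ones without SCOUTING.
    table["Received"] = State(1, ["Checking", "Waiting", "Failed", "Deleted"], "Received")
    table["Checking"] = State(2, ["Staging", "Waiting", "Rescheduled", "Failed", "Deleted"], "Checking")
    if with_scouting:
        # Step 1: declare the state. Step 2: enable the transitions from the proposal.
        table["Scouting"] = State(2, ["Checking", "Waiting", "Failed", "Stalled", "Killed"], "Scouting")
        table["Received"].allowed.insert(0, "Scouting")
        table["Checking"].allowed.insert(0, "Scouting")
    return table

def next_state(table, current, candidate):
    """Return the state a job ends up in when `candidate` is requested from `current`."""
    if candidate not in table:
        # Mirrors the wording of the error we see in the logs.
        raise ValueError(f"getNextState: '{candidate}' is not a valid state")
    rule = table.get(current)
    if rule is None or candidate in rule.allowed:
        return candidate
    return rule.default

if __name__ == "__main__":
    table = build_table(with_scouting=True)
    print(next_state(table, "Received", "Scouting"))
    print(next_state(table, "Scouting", "Checking"))
```

With `with_scouting=False` the same call to `next_state(table, "Received", "Scouting")` raises the "not a valid state" error, which is the situation we hit after the migration.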

fstagni commented 1 year ago

Hi, IIUC (correct me!) in BelleDIRAC you developed a specific Optimizer (in addition to those in https://github.com/DIRACGrid/DIRAC/tree/rel-v7r3/src/DIRAC/WorkloadManagementSystem/Executor) that creates the "scouting jobs" and moves the "master job" status to SCOUTING. Before I answer your question, I have one myself: is what you have done very Belle2 specific? (I have the impression it is...).

iueda commented 1 year ago

Yes and No.

"Scout jobs" are created at job submission -- when a user submits a set of jobs, our client tool makes a smaller set of shorter jobs as "scout jobs" and submits them (the original and the scout) all together.

Then, our BelleDIRAC Optimizer (in BelleDIRAC/WorkloadManagementSystem/Executor) changes the status of the original jobs to "Scouting" while waiting for the execution of the scout jobs. We have an Agent that changes the status of the original jobs from "Scouting" to "Checking" so that they can go through the vanilla Optimizer.

The first one is Belle II specific in the sense that we copy jobs expecting some Belle II specific scripts to be in them. The latter two are supposed to be generic, as we have reported in the past.
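
As a rough illustration of the agent step, here is a minimal in-memory sketch. It is not the BelleDIRAC agent: `release_scouted_jobs`, the two dictionaries, and the release criterion (every scout job in status "Done") are assumptions made only for this example.

```python
def release_scouted_jobs(status, scouts_of):
    """Move original jobs from "Scouting" to "Checking" once their scouts are finished.

    status:    {job_id: status string} for original and scout jobs (hypothetical store)
    scouts_of: {original_job_id: [scout_job_ids]} (hypothetical mapping)
    Returns the list of original job ids that were released.
    """
    released = []
    for job_id, scout_ids in scouts_of.items():
        # Only jobs currently parked in Scouting are candidates.
        if status.get(job_id) != "Scouting":
            continue
        # Assumed criterion: at least one scout exists and all of them are Done.
        if scout_ids and all(status.get(s) == "Done" for s in scout_ids):
            status[job_id] = "Checking"  # from here the vanilla Optimizer takes over
            released.append(job_id)
    return released

if __name__ == "__main__":
    status = {1: "Scouting", 2: "Scouting", 10: "Done", 11: "Done", 12: "Running"}
    scouts_of = {1: [10, 11], 2: [11, 12]}
    print(release_scouted_jobs(status, scouts_of))
    print(status[1], status[2])
```

Jobs whose scouts are still running simply stay in "Scouting" until a later cycle; what should happen when scouts fail is left out of this sketch.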

See the slides at the last DUW https://indico.cern.ch/event/1107386/contributions/4846372/

=====

Slide 10: What is included in BelleDIRAC Extensions of Vanilla systems WMS

Slide 13: What else is included in BelleDIRAC Features for end-users

Can be included as part of vanilla DIRAC. Scout job creation performed on BelleDIRAC side. But agent and executor are under WMS. So, possible (with some modifications).

Slide 21: Summary Potential new additions to Vanilla DIRAC:

=====

fstagni commented 1 year ago

What I would suggest is: