Pilot status for pilots in PollTime related sleep cycle

MarcusEbert commented 3 weeks ago

https://github.com/DIRACGrid/DIRAC/blob/60e2d82815da41db2d1dca17fdedaabaf2a41067/src/DIRAC/WorkloadManagementSystem/Client/PilotStatus.py#L23

Among the pilot status options defined at the link above, there seems to be no status that indicates a pilot that is running in the batch queue but did not get any payload in the last cycle.
Should this status be made available to the system for monitoring, as well as for the SiteDirector's decision about submitting new pilot jobs? (If the total number of pilots in such a sleep mode is larger than the number of available payloads, then no new pilots need to be submitted.)

fstagni commented 3 weeks ago

Indeed there's no such status. If we are considering making it a status, that would be something like "FINISHED_EMPTY", which would mean "Done with no matched payloads", because of course not matching jobs is not an error! What you are suggesting makes sense, for example, for resources for which pilots are submitted but for which there are no matching payloads, maybe because of non-fully-supported CPU types (I am thinking about ARM).

Introducing such a status is not difficult, but using it for decisions in the SiteDirector requires some careful work. To be fair, I would have this in DiracX, because we are reluctant to implement new functionality before that. Unless you want to give it a try yourself...

MarcusEbert commented 3 weeks ago

I'm not sure "Finished_empty" is the right status, since a pilot can run multiple payloads and does not need to finish once it could not find a matching payload. It will just go into a sleep mode and try again later.

The main issue I see is the following, which we see on our site in production:

- a pilot can run multiple payloads
- user jobs are very short (let's say ~1h or less)
- there are often times without available payloads for a site
- even if there are 1000 payloads available to run on a site, the pilots submitted (one for each payload) will not all start at the same time

Before all pilots submitted to the site are running, the first ones to start have already finished all available payloads. Once no more payloads are available, all pilots are idle and poll for new payloads periodically, with a sleep time that depends on "PollingTime" and on the hardcoded increase of the sleep time after a poll that did not return a payload (it would be good if that increase could also be enabled/disabled via a configuration option). That means that in the above scenario a large number of pilots can do nothing for a long time, basically wasting CPU resources that could be used otherwise, until new payloads for a specific experiment arrive.
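To illustrate the configuration option I have in mind, here is a minimal sketch. It does not reproduce the actual pilot code: the function name, the `backoff_enabled` flag, the doubling factor, and the cap are all hypothetical, and the real hardcoded increase may follow a different rule.

```python
def next_sleep(polling_time: float, empty_polls: int,
               backoff_enabled: bool = True,
               factor: float = 2.0, max_sleep: float = 3600.0) -> float:
    """Seconds to sleep before the next poll for a payload.

    polling_time    -- base interval (the "PollingTime" setting)
    empty_polls     -- consecutive polls that returned no payload
    backoff_enabled -- hypothetical option to switch the increase on or off
    factor, max_sleep -- assumed growth rule, for illustration only
    """
    if not backoff_enabled or empty_polls <= 0:
        # Increase disabled, or the last poll matched: use the plain interval.
        return polling_time
    # Grow the interval with every empty poll, but never beyond the cap.
    return min(polling_time * factor ** empty_polls, max_sleep)
```

With the flag set to false, a pilot would keep polling at the plain "PollingTime" interval no matter how many polls came back empty.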

Also, when there are pilots in "sleep" mode between polls and new payloads arrive, the site director does not seem to take such pilots into account. It only seems to take into account pilots that are still idle from a batch system point of view, but not pilots that are running but have no payload. That results in more submitted pilots and, in turn, in a larger number of pilots that get no payload and go to sleep, since the payloads will already be picked up when the currently sleeping pilots poll the next time.

What I suggest is that the site director submits new pilots based on (available payloads for a site - idle pilots in the batch system - running pilots without a payload).

To do so, the status of such running pilots without a payload needs to be known. Alternatively, one could get the number of running pilots without a payload via (running pilots at a site - running payloads at a site), which may not need a new status.
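The two expressions above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical (not SiteDirector code), each running pilot is assumed to run at most one payload at a time, and any waiting payload is assumed to be matchable by any pilot at the site.

```python
def sleeping_pilots(running_pilots: int, running_payloads: int) -> int:
    """Estimate running pilots without a payload, without a new status.

    Assumes one payload per running pilot at a time.
    """
    return max(running_pilots - running_payloads, 0)


def pilots_to_submit(waiting_payloads: int, idle_pilots: int,
                     sleeping: int) -> int:
    """New pilots to submit for a site.

    waiting_payloads -- payloads available for the site
    idle_pilots      -- pilots still waiting in the batch system
    sleeping         -- running pilots that currently have no payload
    """
    return max(waiting_payloads - idle_pilots - sleeping, 0)


# Example: 1000 waiting payloads, 200 pilots queued, 300 pilots running
# of which 250 have a payload.
sleeping = sleeping_pilots(running_pilots=300, running_payloads=250)
to_submit = pilots_to_submit(1000, idle_pilots=200, sleeping=sleeping)
```

Both results are clamped at zero so that an excess of idle or sleeping pilots never turns into a negative submission count.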

fstagni commented 3 weeks ago

> I'm not sure "Finished_empty" is the right status, since a pilot can run multiple payloads and does not need to finish once it could not find a matching payload. It will just go into a sleep mode and try again later.

OK, so I slightly misunderstood your first message: you are not talking about pilots that did not match any job, but pilots for which the last n cycles of the JobAgent did not match jobs.

> ... poll for new payloads periodically, with a sleep time that depends on "PollingTime" and on the hardcoded increase of the sleep time after a poll that did not return a payload (it would be good if that increase could also be enabled/disabled via a configuration option).

This can be easily done.

> Also, when there are pilots in "sleep" mode between polls and new payloads arrive, the site director does not seem to take such pilots into account. It only seems to take into account pilots that are still idle from a batch system point of view, but not pilots that are running but have no payload.

The SiteDirector consumes info from the Computing Element. What you are suggesting is to also take into consideration:

1) the number of jobs pilots can potentially match
2) knowing that there are sleeping pilots

1) is almost impossible to assess. 2) is potentially possible.

> What I suggest is that the site director submits new pilots based on (available payloads for a site - idle pilots in the batch system - running pilots without a payload).

This is possible (but won't be very precise anyway).

> To do so, the status of such running pilots without a payload needs to be known. Alternatively, one could get the number of running pilots without a payload via (running pilots at a site - running payloads at a site), which may not need a new status.

Instead of having a status (which at the pilot level would be something like "RUNNING_IDLE" or "SLEEPING"), we can also increment or decrement a (central) counter.
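One possible reading of the counter idea, as a rough sketch: it is assumed here that the counter is kept per site, that a pilot increments it when it goes to sleep after an empty poll, and that it decrements it when it matches a payload or exits. The class and method names are hypothetical, and nothing like this exists in DIRAC today.

```python
import threading
from collections import defaultdict


class SleepingPilotCounter:
    """Hypothetical central counter of sleeping pilots, kept per site."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._counts = defaultdict(int)

    def pilot_sleeps(self, site: str) -> None:
        # Called when a pilot's poll returned no payload and it goes to sleep.
        with self._lock:
            self._counts[site] += 1

    def pilot_wakes(self, site: str) -> None:
        # Called when a sleeping pilot matches a payload or terminates.
        with self._lock:
            if self._counts[site] > 0:
                self._counts[site] -= 1

    def count(self, site: str) -> int:
        # Number of pilots currently recorded as sleeping at the site.
        with self._lock:
            return self._counts.get(site, 0)
```

The SiteDirector could then read `count(site)` before deciding how many pilots to submit. A pilot that is killed while sleeping would never decrement the counter, so some form of expiry would be needed in practice.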

MarcusEbert commented 2 weeks ago

> > I'm not sure "Finished_empty" is the right status, since a pilot can run multiple payloads and does not need to finish once it could not find a matching payload. It will just go into a sleep mode and try again later.

> OK, so I slightly misunderstood your first message: you are not talking about pilots that did not match any job, but pilots for which the last n cycles of the JobAgent did not match jobs.

That's correct.

> > Also, when there are pilots in "sleep" mode between polls and new payloads arrive, the site director does not seem to take such pilots into account. It only seems to take into account pilots that are still idle from a batch system point of view, but not pilots that are running but have no payload.

> The SiteDirector consumes info from the Computing Element. What you are suggesting is to also take into consideration:
>
> 1) the number of jobs pilots can potentially match
> 2) knowing that there are sleeping pilots
>
> 1) is almost impossible to assess. 2) is potentially possible.

1) may be possible if the system knows what a specific pilot could match and if the status of each pilot is known to be "Running_Idle"/"Sleeping".

Having 2) and assuming that any job for that site can be matched to any of the sleeping pilots would help as a first step. Adding a config option for a site, e.g. "account for sleeping pilots = yes/no", could disable this feature in case a pilot can only match a specific payload and it is not known which payloads a pilot could potentially match.

> > What I suggest is that the site director submits new pilots based on (available payloads for a site - idle pilots in the batch system - running pilots without a payload).

> This is possible (but won't be very precise anyway).

Why would that not be more precise than how it is right now?

> > To do so, the status of such running pilots without a payload needs to be known. Alternatively, one could get the number of running pilots without a payload via (running pilots at a site - running payloads at a site), which may not need a new status.

> Instead of having a status (which at the pilot level would be something like "RUNNING_IDLE" or "SLEEPING"), we can also increment or decrement a (central) counter.

What do you suggest would be counted then?

Pilot status for pilots in PollTime related sleep cycle #7636