dmwm / CRABServer

deferred job start (for HC) gets delayed after schedd restart #7410

Open belforte opened 2 years ago

belforte commented 2 years ago

We hit a couple of times the problem where HC appears to stop running at several sites. This happens because the current code does not cope gracefully with a schedd restart: when the schedd restarts and DAGMAN is restarted, the delay count effectively restarts, and jobs still unsubmitted get delayed as if the task had been submitted at the time of the schedd restart, instead of possibly 24 hours earlier or more.

This leads to long gaps in HC submissions and results at several sites, which often go overlooked.

When several schedds are restarted at the same time, the effect becomes clearly visible and SiteSupport people contact us.

This is the usual case of "code is perfect when everything works, but it is not protected against accidents". The relevant code is here https://github.com/dmwm/CRABServer/blob/e65c4a3ec53213df906ef82b7cb2b9d3e62e56d7/src/python/TaskWorker/Actions/PreJob.py#L474-L491 and here https://github.com/dmwm/CRABServer/blob/e65c4a3ec53213df906ef82b7cb2b9d3e62e56d7/src/python/TaskWorker/Actions/DagmanCreator.py#L543

We should find (maybe we already have) a better place than this issue to describe how the deferred start works under the hood. Let's stick here to the problem.

The problem is that when DagmanCreator.py creates the DAG structure, it writes a fixed delay for each job into the DAG specification. When DAGMAN is restarted after a schedd restart, those delays are applied again "from now" rather than "from task submission".
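
To make the mechanism concrete, here is a minimal sketch of how such per-job deferrals could end up in the DAG spec. The helper name, exit code, and script name are hypothetical, not the actual DagmanCreator code:

```python
# Minimal sketch (hypothetical helper and names, not the actual DagmanCreator
# code). DAGMan's "SCRIPT DEFER <status> <time> PRE ..." directive means:
# if the PRE script exits with <status>, run it again after <time> seconds.

DEFER_STATUS = 4        # exit code the PreJob uses to say "retry me later"
SUBMIT_INTERVAL = 300   # 5 minutes between job submissions

def write_dag_deferrals(dagfile, njobs):
    """Append one deferred PRE-script line per job to the DAG spec."""
    with open(dagfile, 'a') as fd:
        for jobnumber in range(1, njobs + 1):
            delay = jobnumber * SUBMIT_INTERVAL
            # The delay is counted from whenever DAGMan (re)starts, not
            # from task submission: after a schedd restart, a job meant to
            # start long ago is pushed another `delay` seconds into the
            # future, which is exactly the bug described above.
            fd.write('SCRIPT DEFER %d %d PRE Job%d prejob.sh %d\n'
                     % (DEFER_STATUS, delay, jobnumber, jobnumber))
```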

belforte commented 2 years ago

QUESTION: assuming that the scheduler stays off for, e.g., a couple of hours, what do we want to happen?

  1. resume submissions at 5 min intervals, with a gap?
  2. or submit immediately all the jobs which were skipped, sort of catching up quickly with the original schedule?
  3. or skip the jobs which should have been submitted while DAGMAN was not running and resume as if nothing had happened?
belforte commented 2 years ago

a clarification note which I just sent to Stephan and SiteSupport at large:

the current mechanism does not check when jobs start to run or complete; it simply submits one new job every 5 minutes. Those jobs may take some time to run: e.g., if a site is busy, several may pile up as Idle in the condor queue and then start all together, or scattered at different times, etc. So if, e.g., the CRAB scheduler (i.e. DAGMAN) is down for half an hour and 6 submissions are skipped, it is quite possible that submitting all of them at the same time (option 2 above) ends up very similar to what would have happened anyhow.
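
For concreteness, the arithmetic behind that example is simply:

```python
# Simple arithmetic behind the example above (not CRAB code): how many
# 5-minute submission slots fall inside a DAGMan outage.
SUBMIT_INTERVAL = 300                 # seconds between submissions
outage = 30 * 60                      # schedd/DAGMan down for half an hour
skipped = outage // SUBMIT_INTERVAL
print(skipped)                        # -> 6 submissions missed
```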

belforte commented 2 years ago

the current behavior was documented in https://github.com/dmwm/CRABServer/wiki/CRAB-vs-HammerCloud and I have just updated it. Hopefully it makes things clear enough to understand the "schedd was restarted" effect. The crucial point is:

belforte commented 2 years ago

An "obvious" solution is to make all the DEFER internals in the DAG specification short (30~60 min) so that each PreJob is "continuously tested" and take all decisions about when to really run in the PreJob.py script. The problem with current situation is that deferral times written in the DAG specification file RunJobs.dag can not be changed once DAG has started !

stlammel commented 2 years ago

Hallo Stefano, the desired behaviour would be for the task to continue submitting where it left off before the restart, at the 5 minute interval, i.e. ignore the past/gap. I believe that is your don't-yet-know-how-to-do option 1). If I recall correctly, condor has a "run at specific time" option. Maybe switching from the relative-time DEFER to that could be a way to overcome this? On the other hand, shorter tasks but overall more tasks (not more tasks at a given time) could resolve the issue for us, external to CRAB. Just some thoughts. Thanks,

belforte commented 2 years ago

amazingly enough, I have just added a new option 3 in https://github.com/dmwm/CRABServer/issues/7410#issuecomment-1263252621 which looks like what you suggest here!

That should be easy to do as well.

It is not really that condor has a "run at a specific time" option, but one may create a DAG which achieves that by using a PRE script which waits until that time. In practice we cannot keep many PRE scripts sleeping, because they use too much memory, so we use the "run me again after X seconds" DEFER option instead.
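
For reference, the DAGMan directive in question looks like this (syntax as in the HTCondor DAGMan manual; the node and script names are just illustrative):

```
# RunJobs.dag fragment (illustrative names): if prejob.sh exits with
# status 4, DAGMan runs it again after 1800 seconds, so the PRE script
# effectively "sleeps" without keeping a process in memory.
SCRIPT DEFER 4 1800 PRE Job1 prejob.sh 1
```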

belforte commented 2 years ago

hmmm... I suspect that if we skip some DAG nodes they will end up showing as Failed jobs in CRAB status, which may confuse HC. So we may have to go for option 1, i.e. shift the submission series in time, preserving the 5 min interval and the overall number of jobs in the task. On the good side, once we do https://github.com/dmwm/CRABServer/issues/7410#issuecomment-1263431929 it should be easy to code.
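
On top of the PreJob-decides sketch above, option 1 could look roughly like this (hypothetical file names, not the actual CRAB code; ordering and races between jobs are glossed over):

```python
# Sketch of option 1 (not the actual CRAB code): defer until 5 minutes
# after the previous job in the task was actually released, so a DAGMan
# outage shifts the whole series instead of resetting or skipping it.
import os
import sys
import time

DEFER_STATUS = 4        # must match the status in the SCRIPT DEFER line
SUBMIT_INTERVAL = 300   # 5 minutes between releases

def ready_to_start():
    try:
        # hypothetical per-task stamp file, touched at each release
        last = os.path.getmtime('last_submission.stamp')
    except OSError:
        return True                      # first job of the series
    return time.time() >= last + SUBMIT_INTERVAL

if ready_to_start():
    # record our own release time for the next job in the series
    open('last_submission.stamp', 'w').close()
    sys.exit(0)
else:
    sys.exit(DEFER_STATUS)
```

Note that with this approach the DEFER retry interval in the DAG spec would need to be no longer than the 5 min spacing for the series to stay tight.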

stlammel commented 2 years ago

Hallo Stefano, yes, 3) would work fine for HC too. Thanks,

stlammel commented 2 years ago

Regarding failure analysis: we are filtering out jobs that failed due to system/condor issues, so if those failed jobs can be identified in Grafana, we can add them (if not filtered properly already). That would be straightforward.

belforte commented 2 years ago

thanks @stlammel I think I know enough to deal with this. It is not bloody urgent though.