belforte opened this issue 2 years ago
QUESTION: assuming that the scheduler stays off for e.g. a couple of hours, what do we want to happen?
A clarification note which I just sent to Stephan and SiteSupport at large:
The current mechanism does not check when jobs start to run or complete; it simply submits one new job every 5 minutes. Those jobs may take some time to run: e.g. if the site is busy, several may pile up as Idle in the condor queue and start all together, or scattered at different times, etc. So if e.g. the CRAB scheduler (i.e. DAGMAN) is down for half an hour and 6 submissions are skipped, it is quite possible that submitting all of them at the same time (option 2. below) is very similar to what would have happened anyhow.
The current behavior was documented in https://github.com/dmwm/CRABServer/wiki/CRAB-vs-HammerCloud and I just updated it. Hopefully it makes things clear enough to understand the "schedd was restarted" thing. The crucial point is:
An "obvious" solution is to make all the DEFER internals in the DAG specification short (30~60 min) so that each PreJob is "continuously tested" and take all decisions about when to really run in the PreJob.py script. The problem with current situation is that deferral times written in the DAG specification file RunJobs.dag
can not be changed once DAG has started !
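For illustration, here is a minimal sketch of a deferred node in a DAGMan file (node and script names are made up, not the actual RunJobs.dag content): if the PRE script exits with the given status, DAGMan re-runs it after the given number of seconds, and those numbers are frozen once the DAG starts.

```
# Hypothetical RunJobs.dag fragment. "DEFER 4 1800" means: if the PRE
# script exits with status 4, run it again after 1800 seconds. These
# values are baked into the file and cannot be changed once the DAG runs.
JOB Job1 Job.1.submit
SCRIPT DEFER 4 1800 PRE Job1 prejob.sh 1
```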
Hallo Stefano, the desired behaviour would be for the task to continue submitting where it left off before the restart, at the 5 minute interval, i.e. ignore the past/gap. I believe that is your don't-yet-know-how-to-do option 1). If I recall correctly, condor has a "run at specific time" option. Maybe switching from the relative time of DEFER to that could be a way to overcome this? On the other hand, shorter tasks but overall more tasks (not more tasks at a given time) could resolve the issue for us external to CRAB. Just some thoughts. Thanks,
Amazingly enough, I have just added a new option 3. in https://github.com/dmwm/CRABServer/issues/7410#issuecomment-1263252621 which looks like what you suggest here!
That should be easy to do as well.
It is not really that "condor has a run at specific time option", but one may create a DAG which achieves that by using a PRE script which "waits until that time". In practice we can't keep many PRE scripts sleeping, because they use too much memory, so we use the "run me again after X seconds" DEFER option instead.
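A rough sketch of that idea in Python (names like `DEFER_EXIT_CODE` and the argument handling are assumptions for illustration, not the actual PreJob.py code): the PRE script compares the wall clock against an absolute scheduled time and asks DAGMan to call it back later, instead of sleeping.

```python
#!/usr/bin/env python3
"""Sketch of a "continuously tested" PRE script (hypothetical names,
not the real PreJob.py): decide from an absolute scheduled time."""
import sys
import time

DEFER_EXIT_CODE = 4     # exit status the DEFER line in the DAG reacts to
SPACING = 300           # one HC job every 5 minutes

def main(task_submit_time, job_index):
    # Absolute start time for this node: task submission + index * 5 min.
    scheduled = task_submit_time + job_index * SPACING
    if time.time() < scheduled:
        # Not time yet: exit so DAGMan re-runs us after the DEFER interval,
        # rather than sleeping (many sleeping scripts use too much memory).
        sys.exit(DEFER_EXIT_CODE)
    sys.exit(0)  # scheduled time reached (or passed): let the node run

if __name__ == "__main__":
    main(int(sys.argv[1]), int(sys.argv[2]))
```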
Hmmm... I suspect that if we skip some DAG nodes they will end up showing as Failed jobs in CRAB status, which may confuse HC. So we may have to go for 1., i.e. shift the submission series in time, preserving the 5 min interval and the overall number of jobs in the task. On the good side, once we do https://github.com/dmwm/CRABServer/issues/7410#issuecomment-1263431929 it should be easy to code.
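A minimal sketch of that shift (a hypothetical helper, not CRAB code): after a restart, the remaining jobs resume the 5-minute cadence from "now", keeping the total job count.

```python
# Hypothetical helper illustrating option 1 (not actual CRAB code): after a
# schedd restart, shift the whole series so the remaining, unsubmitted jobs
# resume the 5-minute cadence from now, preserving the total number of jobs.
def shifted_deferrals(n_jobs, already_submitted, spacing=300):
    """Return seconds-from-now delays for the jobs still to be submitted."""
    return {job: (job - already_submitted - 1) * spacing
            for job in range(already_submitted + 1, n_jobs + 1)}

# Example: 6 submissions were skipped during a half-hour outage.
print(shifted_deferrals(n_jobs=10, already_submitted=6))
# -> {7: 0, 8: 300, 9: 600, 10: 900}
```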
Hallo Stefano, yes, 3) would work fine for HC too. Thanks,
Regarding failure analysis: we are filtering out jobs that failed for system/condor issues, so if those failed jobs can be identified in Grafana, we can add them (if not filtered properly already). That would be straightforward.
Thanks @stlammel, I think I know enough to deal with this. It is not bloody urgent though.
We hit a couple of times the problem where HC appears to stop running at several sites. This is because when the schedd restarts and DAGMAN is restarted, the current code does not cope with it gracefully: effectively the delay count is restarted, and jobs still unsubmitted get delayed as if the task had been submitted at the time of the schedd restart, instead of possibly 24 hours earlier or more. This leads to long gaps in HC submissions and results at several sites, which is often overlooked. When several schedds are restarted at the same time, the effect becomes clearly visible and SiteSupport people contact us. This is the usual "code is perfect when everything works, but it is not protected against accidents". Relevant code is here https://github.com/dmwm/CRABServer/blob/e65c4a3ec53213df906ef82b7cb2b9d3e62e56d7/src/python/TaskWorker/Actions/PreJob.py#L474-L491 and here https://github.com/dmwm/CRABServer/blob/e65c4a3ec53213df906ef82b7cb2b9d3e62e56d7/src/python/TaskWorker/Actions/DagmanCreator.py#L543
We should find (maybe we already have) a better place than this issue to describe how the deferred start works under the hood. Let's stick here to the problem.
The problem is that when DagmanCreator.py creates the DAG structure, it puts absolute delays for each job, so when the DAG is restarted after a schedd restart, the delays are reset as "from now" rather than as "from task start".
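A minimal sketch of the failure mode (hypothetical function and variable names, not the real PreJob.py logic):

```python
import time

# Hypothetical illustration of the bug, not the actual PreJob.py code.

def remaining_delay_buggy(job_delay, dagman_start_time):
    # Current behaviour: the deferral clock is anchored to when DAGMan
    # (re)started, so after a schedd restart every unsubmitted job waits
    # its full delay again, producing the long HC submission gaps.
    return max(0, job_delay - (time.time() - dagman_start_time))

def remaining_delay_fixed(job_delay, task_submit_time):
    # Desired behaviour: anchor the deferral to the original task
    # submission time, so a restart simply resumes the series where it
    # left off (option 1 above).
    return max(0, job_delay - (time.time() - task_submit_time))
```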