dmwm / CRABServer


change how we handle user sandbox #7461

Closed belforte closed 1 week ago

belforte commented 1 year ago

Action item of postmortem CRAB TaskWorker went down after too many task submissions

Maybe there was a special reason in the past, but currently we could simply download sandbox.tar.gz from S3 in the scheduler, e.g. as part of https://github.com/dmwm/CRABServer/blob/master/scripts/AdjustSites.py
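A minimal sketch of what such a download step could look like inside the scheduler bootstrap. The URL and destination path here are hypothetical placeholders, not the actual CRAB REST/S3 endpoints; this only illustrates the idea of fetching the sandbox over HTTP instead of relying on condor file spooling:

```python
import shutil
import urllib.request

def fetch_sandbox(url, dest):
    """Download the user sandbox (e.g. sandbox.tar.gz) from a presigned
    S3 URL to a local path on the schedd. Raises on HTTP/network errors,
    so the caller can decide how to react (retry, hold, abort)."""
    with urllib.request.urlopen(url, timeout=60) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest
```

Any error-handling policy (retries, holds) would wrap this call; the function itself deliberately just propagates failures.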

Problem:

Things which we can do:

Caveats:

mapellidario commented 1 year ago

Wa and I had a quick chat about this, and we have one concern: how should we deal with HTTP errors when downloading sandboxes from S3 directly on the schedd? With this option we could end up with a task that is properly submitted, with dagman_bootstrap running, but that fails to retrieve the sandbox and so cannot submit jobs to the vanilla universe. How long do we keep trying to download the sandbox? How many attempts do we make? Should we put the dagman on hold if it keeps failing for more than a day? Or should we just kill/remove it?

I am sorry, at the moment I have more questions than good proposals, but in general I like the idea: we should try to avoid keeping files where they are not 100% needed.
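The retry questions above could be answered with a small bounded-retry policy. This is only a sketch of one possible answer; the function name, attempt count, and backoff are illustrative and not anything CRAB actually implements:

```python
import time

def download_with_retries(download, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `download()` up to max_attempts times with exponential backoff.

    Returns True on success, False once attempts are exhausted, leaving it
    to the caller (the dagman bootstrap) to decide whether to put the task
    on hold or remove it. `sleep` is injectable for testing."""
    for attempt in range(1, max_attempts + 1):
        try:
            download()
            return True
        except OSError:
            if attempt < max_attempts:
                # back off: base_delay, 2*base_delay, 4*base_delay, ...
                sleep(base_delay * 2 ** (attempt - 1))
    return False
```

Whatever the numbers, the key design choice is that the policy returns a clear success/failure outcome instead of retrying forever, so the "hold vs. kill" decision stays explicit in the bootstrap code.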

belforte commented 1 year ago

There is no clear-cut answer to those (valid) concerns. But keep in mind that the dagman bootstrap process already makes a few calls to CRAB REST, so nothing new on that side; talking to S3 would be a new dependency, though. I do not know why the original developers decided to send the sandbox via condor file spooling, but it is possible that in the original implementation there was no communication from the scheduler to CRAB REST. We have a lot of such situations: original decisions stuck around even after the original motivations were no longer valid, different people took different decisions for similar things, something was done to mitigate the then-problem-of-the-day which eventually got solved otherwise, etc. Very few of those decisions have been documented.

So we are free to take the decision which we think is best, and will have to live with the consequences.

Note on failures in bootstrap: currently, if something goes wrong in dagman bootstrap and related steps (e.g. it can't talk to REST), everything is aborted, crab status reports "task failed to bootstrap", and the user submits again. If things go horribly wrong, the bootstrap does not even manage to abort, the task simply gets stuck in "waiting to bootstrap", and the crab status command prints "If this persists report it to ..computingTools…"

novicecpp commented 3 months ago

This happened again yesterday https://mattermost.web.cern.ch/cms-o-and-c/pl/9m9bcm3dnfbrj8h8i7t9id45no

belforte commented 3 months ago

let's start by keeping the tmp directory for a shorter time: #8542

novicecpp commented 1 month ago

I am looking at this issue today.

belforte commented 1 week ago

once https://github.com/dmwm/CRABServer/issues/6544 is done, the user sandbox will not be downloaded to the TW tmp disk anymore, and the original problem will be gone.

No further action is needed.