dmwm / CRABServer


change how we handle user sandbox #7461

Closed belforte closed 1 week ago

belforte commented 1 year ago

Action item of postmortem CRAB TaskWorker went down after too many task submissions

Maybe there was a special reason in the past, but currently we could simply download sandbox.tar.gz from S3 in the scheduler, e.g. as part of https://github.com/dmwm/CRABServer/blob/master/scripts/AdjustSites.py
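A minimal sketch of what such a download step could look like inside the scheduler bootstrap. The URL and destination path here are hypothetical placeholders, not the actual CRAB REST/S3 endpoints; this only illustrates the idea of fetching the sandbox over HTTP instead of relying on condor file spooling:

```python
import shutil
import urllib.request

def fetch_sandbox(url, dest):
    """Download the user sandbox (e.g. sandbox.tar.gz) from a presigned
    S3 URL to a local path on the schedd. Raises on HTTP/network errors,
    so the caller can decide how to react (retry, hold, abort)."""
    with urllib.request.urlopen(url, timeout=60) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)
    return dest
```

Any error-handling policy (retries, holds) would wrap this call; the function itself deliberately just propagates failures.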

Problem:

Things which we can do:

Caveats:

mapellidario commented 1 year ago

Wa and I had a quick chat about this, and we have one concern: how should we deal with HTTP errors when downloading sandboxes from S3 directly on the schedd? With this option we could end up with a task that is properly submitted, with dagman_bootstrap running, but that fails to retrieve the sandbox and so cannot submit jobs to the vanilla universe. How long do we keep trying to download the sandbox? How many attempts do we make? Should we put the dagman on hold if it keeps failing for more than a day? Or should we just kill/remove it?

I am sorry, at the moment I have more questions than good proposals, but in general I like the idea: we should try to avoid keeping files where they are not 100% needed.
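The retry questions above could be answered with a small bounded-retry policy. This is only a sketch of one possible answer; the function name, attempt count, and backoff are illustrative and not anything CRAB actually implements:

```python
import time

def download_with_retries(download, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call `download()` up to max_attempts times with exponential backoff.

    Returns True on success, False once attempts are exhausted, leaving it
    to the caller (the dagman bootstrap) to decide whether to put the task
    on hold or remove it. `sleep` is injectable for testing."""
    for attempt in range(1, max_attempts + 1):
        try:
            download()
            return True
        except OSError:
            if attempt < max_attempts:
                # back off: base_delay, 2*base_delay, 4*base_delay, ...
                sleep(base_delay * 2 ** (attempt - 1))
    return False
```

Whatever the numbers, the key design choice is that the policy returns a clear success/failure outcome instead of retrying forever, so the "hold vs. kill" decision stays explicit in the bootstrap code.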

belforte commented 1 year ago

There is no clear-cut answer to those (valid) concerns. But keep in mind that the dagman bootstrap process already makes a few calls to CRAB REST, so nothing new on that side; talking to S3 would be a new dependency, though. I do not know why the original developers decided to send the sandbox via condor file spooling, but it is possible that in the original implementation there was no communication from the scheduler to CRAB REST. We have a lot of such situations: original decisions stuck around even after the original motivations were no longer valid, different people took different decisions for similar things, something was done to mitigate the then-problem-of-the-day which eventually got solved otherwise, etc. Very few of those decisions have been documented.

So we are free to take the decision which we think is best, and will have to live with the consequences.

Note on failures in bootstrap: currently, if something goes wrong in dagman bootstrap and related steps (e.g. it can't talk to REST), everything is aborted, crab status reports "task failed to bootstrap", and the user submits again. If things go horribly wrong, the bootstrap does not even manage to abort, the task simply gets stuck in "waiting to bootstrap", and the crab status command prints "If this persists report it to ..computingTools…"

novicecpp commented 3 months ago

This happened again yesterday https://mattermost.web.cern.ch/cms-o-and-c/pl/9m9bcm3dnfbrj8h8i7t9id45no

belforte commented 3 months ago

let's start by keeping the tmp directory for a shorter time: #8542

novicecpp commented 1 month ago

I am looking at this issue today.

belforte commented 1 week ago

once https://github.com/dmwm/CRABServer/issues/6544 is done, the user sandbox will not be downloaded to the TW tmp disk anymore, and the original problem will be gone.

No further action is needed.