dmwm / CRABServer

Store long term user sandboxes for reuse/check months later #4715

Open annawoodard opened 9 years ago

annawoodard commented 9 years ago

as also noted by @matz-e

Users frequently need to process various datasets with the same pset and CMSSW release. We should allow them to "recycle" their sandbox: if they have already started a task with a particular configuration, they could point CRAB to the already-made sandbox instead of recreating it for subsequent tasks. This can save several hours when using multicrab (CRABClientLibraryAPI) with O(100) different datasets and a complex pset that takes a long time to load. It would also reduce issues of this type: a user processes datasets A, B, C; time passes and the user makes changes to their usercode; the user then realizes they need dataset D processed and wants to ensure it is processed exactly like datasets A-C. They can already save the pset they used, but they have no guarantee of getting the same output unless their usercode was also the same, which is not always trivial to reconstruct.

I would argue that it would also be useful to recycle the usercode separately from the pset, but that’s an added complication, and I think 90% of the possible benefit would be gained in the simpler case of allowing recycling of the sandbox as a unit.
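
For concreteness, here is a minimal sketch of what the user-facing side could look like, following the standard CRAB3 configuration layout. The `recycledSandbox` parameter is hypothetical, not an existing CRAB option:

```python
# Hypothetical sketch of the proposed option; only `recycledSandbox`
# is invented here, the rest is standard CRAB3 configuration layout.
from WMCore.Configuration import Configuration

config = Configuration()

config.section_('General')
config.General.requestName = 'reprocess_datasetD'

config.section_('JobType')
config.JobType.pluginName = 'Analysis'
# Instead of psetName (which triggers a fresh sandbox build), point
# CRAB at the input sandbox of an earlier task:
config.JobType.recycledSandbox = '42dcabf65b6df20b2ce431f8c0c26736'  # hypothetical

config.section_('Data')
config.Data.inputDataset = '/DatasetD/Run2015-v1/AOD'  # illustrative dataset name
```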

belforte commented 9 years ago

I thought "multicrab" did this already! Surely it can be done there. Recycling is easy; the problem is preservation for the use case where "time passes". How long should we store the sandbox? And how urgent is this? The last question is because, as I will explain on Thursday, I was wondering whether we could outsource the business of preserving users' configurations and sandboxes to the CERN data preservation people, a.k.a. Invenio, rather than adding more services to CRAB.

PerilousApricot commented 9 years ago

Even if it's "someone else's" job to preserve the sandbox and configuration, there still needs to be the option to actually load them into a task.

+1 to the ticket; some of the more complicated configurations can take literally forever for CMSSW to parse.

PerilousApricot commented 9 years ago

Sorry, I've given this some thought before and forgot the other big reason: right now, if you're trying to run a coordinated effort to turn a whole batch of input datasets into a different batch of output datasets, you're basically dependent on people not screwing up the checkout used to generate the samples. With this, if someone wants to add a new dataset that was previously missed, instead of hoping that they check out and compile the code the right way, the person in charge of the configuration can just say, "The v5 of the samples use 42dcabf65b6df20b2ce431f8c0c26736", and be done with it.
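
For illustration, such an identifier could simply be a content hash of the sandbox tarball, so that anyone holding the same tarball derives the same name. A sketch (the hashing scheme here is illustrative, not necessarily what CRAB does internally):

```python
# Illustrative content-hash naming for a sandbox tarball; not the
# scheme CRAB actually uses internally.
import hashlib

def sandbox_hash(path, algo='sha1', chunk=1024 * 1024):
    """Return the hex digest of the sandbox tarball at `path`."""
    h = hashlib.new(algo)
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk), b''):
            h.update(block)
    return h.hexdigest()

print(sandbox_hash('sandbox.tar.gz'))  # e.g. the identifier quoted above
```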

belforte commented 9 years ago

Ah! Now you talk like Mr. Analysis Preservation. We should try to convince you to take over full responsibility for this. CERN people graciously offered to store those ISBs forever for us; of course, some pull/push machinery is needed on our side. Yes, we could start saving the ISBs somewhere (even on schedds) for O(1 month) to whet users' appetite, and worry later about "let me redo what I did last fall". There are the usual zillion details to define; all I am saying is that I'd rather not be responsible for another service and DB if I can avoid it.

PerilousApricot commented 9 years ago

I'm not arguing that they need to be preserved forever. A policy of "if unused, UFC entries are purged after N days" is fine -- it's what we do for ~everything else in the CRAB machinery, after all.
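
As a rough sketch of such a policy, assuming a flat cache directory and using the file access time as the "last used" marker (both assumptions; the real UFC layout is not described here):

```python
# Illustrative "purge if unused for N days" sweep over a sandbox cache.
# CACHE_DIR and the retention period are assumptions, not UFC settings.
import os
import time

CACHE_DIR = '/data/ufc/sandboxes'   # hypothetical cache location
MAX_AGE_DAYS = 90

def purge_stale_sandboxes(cache_dir=CACHE_DIR, max_age_days=MAX_AGE_DAYS):
    cutoff = time.time() - max_age_days * 86400
    for name in os.listdir(cache_dir):
        path = os.path.join(cache_dir, name)
        # st_atime approximates "last fetched"; on noatime mounts,
        # st_mtime would have to do instead.
        if os.path.isfile(path) and os.stat(path).st_atime < cutoff:
            os.remove(path)
```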

annawoodard commented 9 years ago

I think it would make sense to keep the preservation issue separate. As a first step, we could let the user be responsible for their own preservation, by letting them specify a local copy of their sandbox.

belforte commented 9 years ago

Yes, but only if we save the ISB hash and can verify it. I think the user should say "use the same ISB as this task". Letting them say "use this ISB" is too dangerous; it is OK for experts and debugging, but if we give it to the world there will be too many "interesting" questions on the support list. Anyhow, multicrab is different: there we should reuse the sandbox internally, somehow.
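
For example, a check along these lines could gate the reuse; the names, and where the expected hash comes from (the task database, say), are illustrative:

```python
# Sketch of verifying a user-supplied local ISB before reusing it:
# it must be a readable tarball AND match the hash recorded for the
# original task. Names here are illustrative.
import hashlib
import tarfile

def verify_local_sandbox(path, expected_hash):
    # Reject anything that is not a valid (possibly gzipped) tar archive.
    if not tarfile.is_tarfile(path):
        raise ValueError('%s is not a valid sandbox tarball' % path)
    h = hashlib.sha1()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(1024 * 1024), b''):
            h.update(block)
    if h.hexdigest() != expected_hash:
        raise ValueError('sandbox hash mismatch, refusing to reuse %s' % path)
```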

PerilousApricot commented 9 years ago

If someone manages to fumble-finger a sha1 hash into a collision with another extant hash in UFC, we've got bigger problems (namely, the universe is about to end because we've somehow managed to birthday paradox a 2^160 keyspace).
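
To put a rough number on that: by the birthday bound, the probability of any collision among n uniformly random SHA-1 digests is roughly n^2 / 2^161, which stays negligible for any plausible cache size:

```python
# Birthday-bound estimate: P(collision) ~= n^2 / 2^161 for n random
# sha1 digests. Even a billion cached sandboxes is nowhere close.
n = 10**9
print(n**2 / 2.0**161)   # ~3.4e-31
```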

annawoodard commented 9 years ago

> If someone manages to fumble-finger a sha1 hash into a collision with another extant hash in UFC, we've got bigger problems (namely, the universe is about to end because we've somehow managed to birthday paradox a 2^160 keyspace).

Good point :)

> Letting them say "use this ISB" is too dangerous; it is OK for experts and debugging, but if we give it to the world there will be too many "interesting" questions on the support list.

Hi Stefano. I'm sorry if this is already obvious and I missed it, but can you explain more about what is too dangerous? Maybe there are checks we could do to mitigate the problems. I agree with you that long-term preservation is outside the scope of CRAB3. I think that allowing the user to specify a local sandbox is a compromise that lets them take on that responsibility themselves, without adding much complexity to CRAB.

Just summarizing from above, it seems that there are several related issues, each of which might have different optimal solutions.

  1. Long timescale: users want to be able to reprocess data in exactly the same way, sometimes months down the line. I would argue that, until there is a solution from "whoever's job preservation is", the simplest thing is to allow them to specify a local sandbox (which they can commit to their analysis code repo with a tag, or store wherever they want, for "preservation").
  2. Middle timescale: users want to be able to work in groups to complete large processing tasks, or they want to add missing datasets soon after the processing run has finished. Here Andrew's suggestion of allowing users to specify the hash in the UFC makes the most sense ("The v5 of the samples use 42dcabf65b6df20b2ce431f8c0c26736"), with the caveat that sandboxes would be purged regularly.
  3. Short timescale: users do not want to make a new sandbox for every dataset they process with multicrab, because it can be very slow. This may require a different, internal solution, as Stefano suggests. From glancing at the API, it is not obvious to me what approach would be best; maybe @mmascher has an idea? Thinking out loud -- if we implemented 2. above, we could skip an internal solution and let the user take care of this themselves: they submit the first task, grab the hash of the created sandbox, and set that as the "recycled sandbox" for subsequent tasks (see the sketch after this list).
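
To make 3. concrete, here is how that user-driven reuse could look from the CRABClientLibraryAPI side if 2. were implemented. `crabCommand` is the existing library entry point, but both the `sandboxHash` field in the submit result and the `recycledSandbox` parameter are hypothetical:

```python
# Hypothetical multicrab loop reusing one sandbox across many datasets.
# crabCommand exists today; res['sandboxHash'] and recycledSandbox do not.
from WMCore.Configuration import Configuration
from CRABAPI.RawCommand import crabCommand

config = Configuration()
config.section_('General')
config.section_('JobType')
config.JobType.pluginName = 'Analysis'
config.JobType.psetName = 'pset.py'
config.section_('Data')

datasets = ['/DatasetA/Run2015-v1/AOD',   # illustrative dataset names
            '/DatasetB/Run2015-v1/AOD',
            '/DatasetC/Run2015-v1/AOD']

# The first submission builds and uploads the sandbox as usual.
config.General.requestName = 'task_0'
config.Data.inputDataset = datasets[0]
res = crabCommand('submit', config=config)
sandbox = res['sandboxHash']               # hypothetical return field

# Subsequent submissions skip the slow sandbox build and reuse it.
for i, dataset in enumerate(datasets[1:], 1):
    config.General.requestName = 'task_%d' % i
    config.Data.inputDataset = dataset
    config.JobType.recycledSandbox = sandbox   # hypothetical parameter
    crabCommand('submit', config=config)
```
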
belforte commented 9 years ago

Anna, in a word: yes. In more words: in your point 2, I was trying to say that, instead of or in addition to saying "find the ISB with hash=xxx and use it", the user could say "find the ISB for task=yyyy and use it". What I meant by "too dangerous" is that a user points to a local file which may not be in the right format, etc. But as with all worries, if others do not share it, it is likely excessive, so let's not worry. In any case, it sounds like we want to store the ISB for them, for e.g. 2-3 months, and then see how much space it takes and how much demand there is for longer-term storage. If we can also have some way to make sure that the ISB and the config are consistent, even better, but I suspect we do not.

mmascher commented 9 years ago

Related feedback: https://hypernews.cern.ch/HyperNews/CMS/get/crabDevelopment/2355/1.html

emaszs commented 7 years ago

Number 3 in the list has been done. Numbers 1 and 2 take serious work (like a new UFC) and design, and we may never get to them. I changed the issue subject to reflect the current scope.