Open annawoodard opened 9 years ago
I thought "multicrab" did this already! Surely it can be done there. Recycling is easy; the problem is preservation for the use case where time passes. How long should we store the sandbox? And how urgent is this? The last question is because, as I will explain on Thursday, I was wondering whether we could out-source the business of preserving users' configurations and sandboxes to the CERN data preservation people, i.e. Invenio, rather than adding more services to CRAB.
Even if it's "someone else's" job to preserve the sandbox and configuration, there still needs to be the option to actually load them into a task.
+1 to the ticket; some of the more complicated configurations can take literally forever for CMSSW to parse.
Sorry, I've done some thought about it before and forgot the other big reason - right now if you're trying to have a coordinated effort to turn a whole batch of input datasets into a different batch of output datasets, you're basically dependent on people not screwing up the checkout used to generate the samples. With this, if someone wants to add a new dataset that was previously missed, instead of hoping that they check out and compile the code the right way, the person in charge of the configuration can just say, "The v5 of the samples use 42dcabf65b6df20b2ce431f8c0c26736" and be done with it.
Ah!! Now you talk like Mr. Analysis Preservation. We should try to convince you to take over full responsibility for this. CERN people graciously offer to store those ISBs forever for us; of course, some pull/push machinery is needed on our side. Yeah... we can start saving the ISBs somewhere (even on the schedds) for O(1 month) to whet users' appetite, and worry later about "let me redo what I did last fall". There are the usual zillion details to define; all I am saying is that I'd rather not be responsible for another service and DB if I can avoid it.
I'm not arguing that they need to be preserved forever. A policy of "if unused, UFC entries are purged after N days" is fine -- it's what we do for ~everything else in the CRAB machinery, after all.
I think it would make sense to keep the preservation issue separate. As a first step, we could let the user be responsible for their own preservation, by letting them specify a local copy of their sandbox.
Yes, but only if we save the ISB hash and can verify it. I think the user should say "use the same ISB as this task". Them saying "use this ISB" is too dangerous; it is OK for experts and debugging, but if we give it to the world there will be too many "interesting" questions on the support list. Anyhow, multicrab is different; there we should reuse it internally, somehow.
If someone manages to fumble-finger a sha1 hash into a collision with another extant hash in UFC, we've got bigger problems (namely, the universe is about to end because we've somehow managed to birthday paradox a 2^160 keyspace).
Good point :)
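To make the hash-verification idea above concrete, here is a minimal sketch (not CRAB code; function names are illustrative) of computing a sandbox tarball's SHA-1 and checking it against a recorded ISB hash, which is the check that would make "use this ISB" safe:

```python
import hashlib

def sandbox_hash(path, chunk_size=1 << 20):
    """Compute the SHA-1 of a sandbox tarball, streaming it in chunks
    so large sandboxes are not read into memory at once."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def verify_sandbox(path, expected_hash):
    """Return True only if the local tarball matches the recorded ISB hash."""
    return sandbox_hash(path) == expected_hash
```

With this in place, a fumble-fingered hash simply fails verification and the task is rejected, rather than silently picking up the wrong sandbox.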
> Them saying "use this ISB" is too dangerous, is OK for experts and debugging, but if we give it to the world there will be too many "interesting" questions on the support list.
Hi Stefano. I'm sorry if this is already obvious and I missed it, but can you explain in more detail what is too dangerous? Maybe there are checks we could do to mitigate the problems. I agree with you that long-term preservation is outside the scope of CRAB3. I think that allowing the user to specify a local sandbox is a compromise that lets them take on that responsibility themselves, without adding much complexity to CRAB.
Just summarizing from above, it seems that there are several related issues, each of which might have different optimal solutions.
Anna, in a word: yes. In more words: in your point 2 I was trying to say that, instead of (or in addition to) saying "find the ISB with hash=xxx and use it", the user could say "find the ISB for task=yyyy and use it". What I meant by too dangerous is that a user points to a local file which may not be in the right format, etc. But as with all worries, if others do not share it, it is likely an excessive worry, so let's not worry. In any case it sounds like we want to store the ISB for them, for e.g. 2-3 months, and then see how much space it takes and how much demand there is for longer-term storage. If we can also have some way to make sure that the ISB and configuration are consistent, even better, but I suspect we do not.
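The two resolution modes discussed above (reuse by task name versus reuse by raw hash) could be sketched as follows. This is purely illustrative; neither option exists in CRAB today and the function name is hypothetical:

```python
def resolve_sandbox(recycle_from_task=None, sandbox_hash=None):
    """Hypothetical sketch: decide which ISB to reuse.

    'Same ISB as this task' keeps the server in control, since it looks
    up the hash it recorded for that task; a raw user-supplied hash is
    the expert/debugging path.
    """
    if recycle_from_task is not None:
        # Safer: server resolves the hash recorded for that task.
        return ("task", recycle_from_task)
    if sandbox_hash is not None:
        # Expert mode: trust the user-supplied hash directly.
        return ("hash", sandbox_hash)
    return ("build", None)  # no recycling: build a fresh sandbox
```

The asymmetry is the point: the task-based path can always be verified server-side, while the hash-based path depends on the user getting it right.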
Related feedback: https://hypernews.cern.ch/HyperNews/CMS/get/crabDevelopment/2355/1.html
Number 3 in the list was done. Items 1 and 2 take serious work (like a new UFC) and design, and we may never get to them. I changed the issue subject to reflect the current scope.
as also noted by @matz-e
Users frequently need to process various datasets with the same pset and CMSSW release. We should allow them to "recycle" their sandbox: in other words, if they've already started a task with a particular configuration, they could point crab to the already-made sandbox instead of recreating it for subsequent tasks. This can save several hours when using multicrab (CRABClientLibraryAPI) with O(100) different datasets and a complex pset that takes a long time to load. It would also reduce issues of this type: the user processes datasets A, B, C; time passes and the user makes changes to their usercode; the user then realizes they need dataset D processed, and wants to ensure that it is processed exactly the same way as datasets A-C. They can already save the pset they used, but they have no guarantee they will get the same output unless they make sure that their usercode was also the same, which is not always trivial to reconstruct.
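Combining the recycling idea with the "purge after N days" policy mentioned earlier, a client-side cache keyed by sandbox hash might look like the following. This is a sketch under stated assumptions, not CRAB code; the class and its layout are hypothetical:

```python
import os
import shutil
import time

class SandboxCache:
    """Hypothetical cache: reuse a sandbox tarball by hash instead of
    rebuilding it, purging entries unused for more than max_age_days."""

    def __init__(self, cache_dir, max_age_days=30):
        self.cache_dir = cache_dir
        self.max_age = max_age_days * 86400  # seconds
        os.makedirs(cache_dir, exist_ok=True)

    def _path(self, isb_hash):
        return os.path.join(self.cache_dir, isb_hash + ".tar.gz")

    def lookup(self, isb_hash):
        """Return the cached tarball path, or None if absent."""
        path = self._path(isb_hash)
        if os.path.exists(path):
            os.utime(path)  # touch: mark as recently used
            return path
        return None

    def store(self, isb_hash, tarball):
        """Copy a freshly built tarball into the cache under its hash."""
        shutil.copy(tarball, self._path(isb_hash))

    def purge(self):
        """Drop entries that have not been used within max_age_days."""
        now = time.time()
        for name in os.listdir(self.cache_dir):
            path = os.path.join(self.cache_dir, name)
            if now - os.path.getmtime(path) > self.max_age:
                os.remove(path)
```

Touching the file on every lookup means "unused for N days" rather than "older than N days", which matches the purge policy CRAB applies to other cached artifacts.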
I would argue that it would also be useful to recycle the usercode separately from the pset, but that’s an added complication, and I think 90% of the possible benefit would be gained in the simpler case of allowing recycling of the sandbox as a unit.