dmwm / CRABServer


Allow use of ancestors other than immediate parents #4861

Closed belforte closed 9 years ago

belforte commented 9 years ago

E.g. to connect MINIAOD to RECO/RAW in case the MINIAOD was created from AOD. See https://hypernews.cern.ch/HyperNews/CMS/get/computing-tools/720/1.html

The user could indicate a parent dataset, and CRAB would have to put the proper file names in each job config.

Maybe @matz-e would be interested, since it is a (reasonably complex, but hopefully not too complex) extension of what was done for parents.

ericvaandering commented 9 years ago

We should check with the offline folks if siblings are possible too. I think they are.

belforte commented 9 years ago

?? siblings do not exist in DBS.

Do you mean something like: production makes RECO and AOD in one step and the user wants to read both in the same cmsRun?

ericvaandering commented 9 years ago

Exactly. I presume we only support finding the parent in DBS, and if someone wants grandparents or sisters or cousins, they take responsibility themselves for giving two datasets. All CRAB needs to care about is runs/lumis/files.


belforte commented 9 years ago

I see. You mean something like: split a task on dataset A, then for each job look at which runs/lumis/events it was given as input, find the same runs/lumis/events in dataset B, and add the needed files to the cmsRun config (with, of course, no guarantee that a correspondence can be found).

I thought one would follow the ancestry from A to B and get the list of files that way, which is more robust. But yes, the former is more general, although it may take some time to search for things in B, especially if we want to avoid making thousands of DBS queries keyed on run/lumi.

Let's make sure it is really helpful before we code it :-)


ericvaandering commented 9 years ago

Yes, that is my thought.

matz-e commented 9 years ago

So what's the verdict on necessity? In the past, I could have used this, as I had datasets with broken parentage that I needed to run over in combination with RAW…

It's not as easy as adding parentage support, but it still seems feasible. We could add a configuration parameter secondaryDataset, and all file-lumi data for said dataset would have to be slurped in one go in the CRAB server and matched up with the primary dataset.

I'd then abuse the current parent logic by overwriting the parentage in the data discovery result for the primary dataset. The machinery in place now should then take care of the rest. In essence, use_parent could be done this way, too.
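A minimal sketch of the matching step described above, assuming the file-lumi data for both datasets is already in memory as plain dictionaries. The function names, the dictionary layout and the file names are hypothetical and are not CRAB or DBS code. The secondary dataset is indexed once by (run, lumi), so no per-lumi queries are needed, and the result has the shape of a "parents" list per primary file.

```python
from collections import defaultdict


def build_lumi_index(secondary_files):
    """Map each (run, lumi) pair to the set of secondary files containing it."""
    index = defaultdict(set)
    for lfn, lumis in secondary_files.items():
        for run_lumi in lumis:
            index[run_lumi].add(lfn)
    return index


def attach_secondary(primary_files, secondary_files):
    """Return {primary LFN: sorted secondary LFNs sharing at least one lumi}."""
    index = build_lumi_index(secondary_files)
    matched = {}
    for lfn, lumis in primary_files.items():
        found = set()
        for run_lumi in lumis:
            # .get() avoids creating empty entries for lumis absent from B
            found |= index.get(run_lumi, set())
        matched[lfn] = sorted(found)
    return matched


# Hypothetical file-lumi content: LFN -> set of (run, lumi) pairs
primary = {
    "/store/A/file1.root": {(1, 10), (1, 11)},
    "/store/A/file2.root": {(1, 12)},
}
secondary = {
    "/store/B/file1.root": {(1, 10)},
    "/store/B/file2.root": {(1, 11), (1, 12)},
}
parents = attach_secondary(primary, secondary)
```

Here `parents` plays the role of the overwritten parentage: each primary file is listed with the secondary files that cover any of its lumis.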

belforte commented 9 years ago

I suspect this is going to be needed at least for MINIAOD + RECO. And possibly the general implementation is only a bit more work and more useful anyhow, so we should look into that. If one wanted to be nice to users, we could consider whether we can find the ancestor dataset automatically given the tier name, i.e. I run on this MINIAOD and also want to get the RECO or the RAW. But I am not sure that every dataset we produce has unique ancestors etc. I guess this goes together with "what do we do if dataset B does not have run/lumi x/y which is present in A"?

matz-e commented 9 years ago

I have had the ancestor chain "rip" before. So there may be two configuration methods needed: direct specification of a secondary dataset, and specification of a "parent" datatier, e.g. RAW?

belforte commented 9 years ago

I'd go for an ancestor datatier. Somehow I think it would be more useful; at least the only person who has brought this up so far has such a use case. But this is guess-land. I suspect the trick is to do good error catching and exception handling; if you look up the ancestor dataset name yourself, things are a bit more under control. What I do not know is whether we are guaranteed that there is always a 1:1 mapping of dataset names. I suspect this is not enforced in any tool, even if it looks that way when one simply looks around.

matz-e commented 9 years ago

Generic case where the parentage is not straightforward: Winter GEN-SIM and AODSIM

The generic case would be to allow a generic secondary dataset where the user has to do some research. A specialization of this would be CRAB looking up said secondary dataset w/o any guesswork.

belforte commented 9 years ago

@matz-e I do not understand this comment. You asked DAS for the parent dataset and got the right one? So what "research" is needed? Or are you saying that the DAS answer is wrong? But mostly, if I may ask: are you going to look into coding this? It was never clear whether anybody is doing it :-)

ericvaandering commented 9 years ago

What he means by "research" is that the user has to know that the two datasets are related in a way that lets them be used together in the two-file solution. There was some discussion on Tuesday about whether we wanted functionality like "use grandparent" and "use great grandparent", and I argued that for anything other than use parent we should make the user just do it and not bother CRAB.

The counterargument is that DBS knows the parent files, whereas if we make the user do it, one has to associate the files for each job based on lumi numbers (marginally more work for CRAB).

belforte commented 9 years ago

Thanks Eric. Sorry for missing so many meetings. I do not like guesswork here. Here's a proposal:

1- user says "parent": all clear, we have it.
2- user says "use dataset X as parent": all clear (may fail, of course). Implement next.
3- user says "parent tier T": CRAB crawls up the parentage until it finds a dataset with tier T and uses that as parent. If this is not found (e.g. in the example Matthias wrote there is no RECO) we say sorry, and stop. If there is a RECO ancestor that comes from another RECO, too bad, we stop at the first one. (Exception handling in 3- is more work.)

The more I think about it, the more I suspect we do not want to implement 3- but rather tell users to figure it out themselves, if nothing else because there's a moral duty to know exactly which data I am looking at if I want to have any chance of making sense of my plots. We can always tell them "keep clicking on the parent link in DAS" :-)
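For illustration only, option 3- could be sketched roughly as below. This is not CRAB code: `find_ancestor_with_tier`, `AncestorLookupError` and the `parents_of` callable (a stand-in for whatever DBS lookup returns the parent datasets) are hypothetical names, and the dataset names are made up. The sketch stops at the first ancestor with the requested tier and gives up, rather than guessing, when the chain ends or a dataset has more than one parent.

```python
class AncestorLookupError(Exception):
    """Raised when no unambiguous ancestor with the requested tier exists."""


def tier_of(dataset):
    """The data tier is the last component of /primary/processed/TIER."""
    return dataset.rstrip("/").split("/")[-1]


def find_ancestor_with_tier(dataset, tier, parents_of):
    """Walk up the parentage and return the first ancestor with the given tier.

    parents_of is a callable: dataset name -> list of parent dataset names.
    """
    current = dataset
    seen = {dataset}
    while True:
        parents = parents_of(current)
        if not parents:
            raise AncestorLookupError(f"no {tier} ancestor found for {dataset}")
        if len(parents) > 1:
            # Ambiguous parentage: refuse to guess which branch to follow
            raise AncestorLookupError(f"{current} has multiple parents: {parents}")
        current = parents[0]
        if current in seen:
            raise AncestorLookupError(f"parentage loop at {current}")
        seen.add(current)
        if tier_of(current) == tier:
            return current


# Hypothetical parentage table standing in for DBS
PARENTS = {
    "/Mu/Run-v1/MINIAOD": ["/Mu/Run-v1/AOD"],
    "/Mu/Run-v1/AOD": ["/Mu/Run-v1/RECO"],
    "/Mu/Run-v1/RECO": ["/Mu/Run-v1/RAW"],
    "/Mu/Run-v1/RAW": [],
}


def lookup(name):
    return PARENTS.get(name, [])


reco = find_ancestor_with_tier("/Mu/Run-v1/MINIAOD", "RECO", lookup)
```

Most of the function is error handling, which matches the remark that the exception handling in 3- is where the work is.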

ericvaandering commented 9 years ago

Right. It's not clear to me that this is common enough that #3 is a priority.

belforte commented 9 years ago

Even if it were common :-) it may simply be nearly impossible to have an algorithm that finds the correct ancestor when the immediate parent is not good enough.


belforte commented 9 years ago

Generally speaking, nothing prevents one dataset from having multiple parents. We should make sure this is either properly handled or properly flagged as not supported when detected.

belforte commented 9 years ago

by the way, I am delighted that DBS3 has the concept of parent datasets.

matz-e commented 9 years ago

"Generally speaking, nothing prevents one dataset from having multiple parents. We should make sure this is either properly handled or properly flagged as not supported when detected."

That's another issue. The scope of this one is to allow a user to specify a dataset to use as secondary input.

belforte commented 9 years ago

Yes. And within that scope my comment means: make sure we detect and report the case where the indicated dataset does not fully cover the needed run/lumi range, rather than simply processing some files with [grand]parents and some without them.
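A small sketch of such a check, assuming the run/lumi content of both datasets is available as sets of (run, lumi) pairs. The names `missing_lumis` and `check_coverage` are hypothetical and the choice of `ValueError` is arbitrary. The point is to fail loudly before any job is created instead of silently processing partial input.

```python
def missing_lumis(primary_lumis, secondary_lumis):
    """Return the sorted (run, lumi) pairs needed by A but absent from B."""
    return sorted(set(primary_lumis) - set(secondary_lumis))


def check_coverage(primary_lumis, secondary_lumis, max_reported=5):
    """Raise if the secondary dataset does not cover every primary lumi."""
    missing = missing_lumis(primary_lumis, secondary_lumis)
    if missing:
        shown = ", ".join(f"{run}:{lumi}" for run, lumi in missing[:max_reported])
        raise ValueError(
            f"secondary dataset lacks {len(missing)} run:lumi pair(s), e.g. {shown}"
        )


# Hypothetical lumi content of the primary (A) and secondary (B) datasets
lumis_a = {(1, 10), (1, 11), (2, 1)}
lumis_b = {(1, 10), (2, 1)}
gaps = missing_lumis(lumis_a, lumis_b)
```

Whether a gap should abort the whole task or only be reported is a policy decision the thread leaves open; the sketch takes the strict option.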

matz-e commented 9 years ago

OK, I was thinking that a dataset could have two parents, as in RAWRECO having RAW and RECO as parents. There's an ambiguity of which to choose…

belforte commented 9 years ago

I see. That's indeed a very good argument for "never try to guess the right ancestor!" I was rather worried about something like dimuon_June/RECO and dimuon_July/RECO both being processed into dimuon_run2/MINIAOD, which maybe we have not done yet, but which is technically possible.