dmwm / DBS

CMS Dataset Bookkeeping Service
Apache License 2.0

DBS search patterns #85

Closed ericvaandering closed 10 years ago

ericvaandering commented 10 years ago

Original TRAC ticket 2326 reported by yuyi. This ticket is for discussing the search patterns we will be using in DBS, such as searches on dataset, block_name and so on. Please move the discussion here instead of #1949 or #2280.

ericvaandering commented 10 years ago

Author: valya Replying to Dirk's comments (for completeness I paste them here):

Dirk: I think requiring the users to have some idea of what they are looking for is not too much to ask for, is it? Do we really want to allow wildcard searches across Primary Dataset and Processed Dataset at the same time?

VK: Certainly, but "some" idea is a pattern. If a user knows Top, he/she needs to specify it in a pattern, not in some structure.

Dirk: IMHO, searching for datasets, you can use wildcards, but you need to specify the dataset syntax and specify where you want the wildcards. So Top throws an error. If they want Top, they can run the two searches /Top//Tier and //Top/Tier (ignoring the tier here because I don't see the use case for wildcards there at all). They are different searches for different things.

VK: Things get really wild once you move from a single pattern to multiple ones, e.g. how about this: _Sping_TopRE. If slashes are enforced in pattern searches, please do the combinatorial exercise and tell me how many different paths you'll find here. I bet it would be more than 3. I don't want to invent a specific algorithm to construct all possible combinations for all possible patterns a user may have.

Final word: no one has YET provided a real example of how the DB will struggle between parsing Top and /Top//. Don't complicate things if it is not required!

ericvaandering commented 10 years ago

Author: valya I want to re-phrase my thoughts on this subject.

  1. I am NOT against restricting patterns.
  2. I think strong validation is required when you insert data; asking for 3 slashes there is a benefit.
  3. I think loose validation can be applied when you ask for data.
  4. Taking into account that ALL datasets/blocks start with a slash, we can safely say that a search with /* is equivalent to a search with *. Since the first slash applies to every dataset, the Oracle DB will do a full table scan anyway.
  5. Requiring slashes somewhere in a pattern does not mean that the pattern must have exactly three slashes.
  6. Yes, we're not Google, but common sense should be applied.

If you disagree with any statement above, please provide real example(s) of how a) the DB will benefit (e.g. run faster), b) DBS will benefit (e.g. the API will be simpler), c) the user will benefit (e.g. type less to get results). No one has yet given me a strong argument for how we benefit (at any level) from using a more complex search pattern.
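Points 2 and 3 above can be illustrated with a minimal sketch, assuming a strict check for inserts (exactly three non-empty parts) and a loose check for searches (any arrangement of allowed characters, slashes and `*`). The function names and the allowed character sets are hypothetical and are not taken from the DBS code.

```python
import re

# Strict check for inserts: exactly three non-empty, slash-separated parts.
# The character sets are assumptions chosen for illustration.
INSERT_RE = re.compile(r"/[A-Za-z0-9_.-]+/[A-Za-z0-9_.-]+/[A-Z-]+")

# Loose check for searches: allowed characters, '*' and '/' in any arrangement.
SEARCH_RE = re.compile(r"[A-Za-z0-9_.*/-]+")


def validate_for_insert(dataset):
    """Return True only for a complete /primary/processed/tier path."""
    return INSERT_RE.fullmatch(dataset) is not None


def validate_for_search(pattern):
    """Return True for any non-empty pattern made of allowed characters."""
    return SEARCH_RE.fullmatch(pattern) is not None


print(validate_for_insert("/Higgs/blah-v2/RECO"))  # complete path
print(validate_for_insert("*Top*"))                # not a complete path
print(validate_for_search("*Top*"))                # acceptable as a search
```

With this split, a bare term such as `*Top*` is rejected when data is written but accepted when data is queried.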

ericvaandering commented 10 years ago

Author: yuyi To be specific for the discussion, I list the RE and the patterns we used in the unit tests.

The current RE for dataset search is:
"""
r"^/(|[a-zA-Z][a-zA-Z0-9]{0,100})(/(|[a-zA-Z0-9.-]{1,100})){0,1}(/(|[A-Z-]{1,50})){0,1}$"
"""
Here are the examples of valid and invalid dataset search patterns under this RE.

Valid patterns:
"""
ds1 = '/Higgs/blah-v2/RECO'
ds2 = '/Higgs/blah-v2/RECO'
ds3 = '//blah-v2/RECO'
ds4 = '/Higgs/blah/RECO'
ds5 = '/Higgs/blah-v2/'
ds6 = '///RECO'
ds7 = '/Higgs//'
ds8 = '//blah-v2/'
ds9 = '/QCD_EMenriched_Pt30to80/Summer08_IDEAL_V11_redigiv/GEN-SIM-RAW'
ds10 ='/'
ds11 = '/QCD'
ds12 = '/QCD/Summer_'
ds13 = '/QCD_EMenriched_Pt30to80/Summer08_IDEAL_V11_redigiv2/GEN-SIM-'
ds14 = '/QCD_EMenriched_Pt30to80/Summer08_IDEAL_V11_redigiv2/-SIM-'
"""

Invalid patterns:
"""
ds1 = '/blah-v2/RECO'
ds2 = '/blah-v2/'
ds3 = '///RECO'
ds4 = '////'
ds5 = ''
ds6 = '/Higgs/ /RECO'
ds7 = '/Higgs/%/RECO'
ds8 = 'Higgs'
ds11 = '/MinimumBias/BeamCommissioning09-PromptReco-v2/RECO#bdd066ce-e8fb-488e-beb1-20432d96baaa'
"""
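A minimal sketch of how such a check could be exercised, assuming the RE exactly as it is printed in this ticket. Wildcard and underscore characters may not have survived in this copy, so only patterns that do not depend on them are tried here. The helper name `is_valid_dataset_pattern` is hypothetical and is not a DBS function.

```python
import re

# RE copied as printed in this ticket; the original may have contained
# further characters (for example a wildcard) that were lost in this copy.
DATASET_SEARCH_RE = re.compile(
    r"^/(|[a-zA-Z][a-zA-Z0-9]{0,100})"
    r"(/(|[a-zA-Z0-9.-]{1,100})){0,1}"
    r"(/(|[A-Z-]{1,50})){0,1}$"
)


def is_valid_dataset_pattern(pattern):
    """Return True if the search pattern is accepted by the RE above."""
    return DATASET_SEARCH_RE.match(pattern) is not None


candidates = [
    "/Higgs/blah-v2/RECO",
    "/Higgs//",
    "/QCD",
    "Higgs",
    "",
    "////",
    "/Higgs/%/RECO",
]
for candidate in candidates:
    print(repr(candidate), is_valid_dataset_pattern(candidate))
```

The leading slash is mandatory, while the second and third parts (and their slashes) are optional, which is why `/QCD` is accepted and `Higgs` is not.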

ericvaandering commented 10 years ago

Author: hufnagel The fundamental problem I have here is an old one. "Dataset" as most in CMS understand it does not exist. It's a datasetPATH, a compound object composed by a well defined procedure out of three parts.

If you are searching for a compound item like a dataset(path), the interface needs to be very clear about what single search terms like Top or Top or Top* mean. Because you always have to map these to different parts of the compound item and each user can have a different idea of how the mapping works.

IMHO, the search interface should just avoid these ambiguities altogether and always require resolving the compound items into their separate parts. Either by entering a search term with a clearly defined divider (the slashes) or by specifying the individual parts (like searching for a primary dataset instead).

Also, I believe this is along the lines of how the discovery page and command line interface are working, so users are already used to it.

Following these rules, from your list, the following are invalid:

ds10 ds11 ds12

Maybe a more general comment: IMHO, just because we can accept some ambiguous user input and make up some rules to infer what they actually wanted doesn't mean it's a good idea in terms of ending up with a clear and easy to understand user interface. I would start with the maxim of "less is more" and go from there. It's always easy to support more use cases later on (if they really present themselves).

ericvaandering commented 10 years ago

Author: yuyi I think you were changing the subject here.

Dataset is a short name for datasetpath. IMHO, it does not matter what we call it; it is more important that we give a term a consistent meaning and stick to it. Users will accept it and get used to it.

The DBS API listdataset accepts more than a handful of arguments; primary dataset, data tier and so on can be specified separately if a user knows which part of the dataset/path they are looking for. However, most users like to take advantage of the combined term dataset/path to do their search instead of using the individual ones. The question is how we can serve this use case.

ericvaandering commented 10 years ago

Author: meloam

The fundamental problem I have here is an old one. "Dataset" as most in CMS understand it does not exist. It's a datasetPATH, a compound object composed by a well defined procedure out of three parts.

Yeah, datasets are actually "dataset paths" and the implementation (AFAIK) splits the dataset into three fields on the backend of DBS, but that's an implementation detail, and I'm not convinced that the user should be concerned with it. We don't require users to keep track of the different parts when they use crab or search in phedex. If we want to expose a web interface to users, then it should be full of "do what I mean" and not http://stuffthathappens.com/blog/2008/03/05/simplicity/

Either way, we can all wax on whether apples or nuclear reactors are tastier/generates more electricity. Doing foo instead of /_/foo/_ involves a couple more row scans (for the primary data set and tier columns), but are there really thousands of primary datasets and tiers? Do we know the performance actually sucks enough to give users another wrinkle? There's certainly a tradeoff somewhere, but I think it lands on the side of simplifying the interface to hide the implementation details.

ericvaandering commented 10 years ago

Author: hufnagel I do not really care about the performance here. I care about the clarity of the interface.

If you search for /QCD, you can interpret this as QCD being part of the primary dataset, or as QCD being part of any of the primary dataset, the processed dataset or the tier. For users that have no idea that a datasetpath is three parts, the latter is maybe more intuitive. For users that do know the datasetpath is three parts, the former is more intuitive. For me at least, I would expect it to only give me results where QCD is part of the primary dataset.

My point here is that you cannot support searches like this without defining mapping rules. Which means there is an ambiguity in the interface and some users will get confused by what they get back.
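The ambiguity can be made concrete with a minimal sketch that applies both readings of a bare term to the same list. The dataset paths and function names below are hypothetical and serve only as illustration.

```python
# Hypothetical dataset paths, for illustration only.
DATASETS = [
    "/QCD_Pt30/Summer08-v1/RECO",
    "/MinBias/QCD_Skim-v2/RECO",
    "/TTbar/Summer08-v1/GEN-SIM-RAW",
]


def search_primary_only(term, datasets):
    """Reading 1: the term must occur in the primary dataset part."""
    return [d for d in datasets if term in d.split("/")[1]]


def search_any_part(term, datasets):
    """Reading 2: the term may occur anywhere in the dataset path."""
    return [d for d in datasets if term in d]


print(search_primary_only("QCD", DATASETS))
print(search_any_part("QCD", DATASETS))
```

Under the first reading only the QCD primary dataset comes back; under the second, the skim whose processed dataset name contains QCD is returned as well.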

And at that point the cost/benefit for me just isn't there. The gain (saving three typed characters in your search) just isn't worth the possible downsides IMO.

Also, I do not agree with "we should not make users care about the internal structure of a datasetpath". Maybe they do not need to know that it's three parts and how it's constructed. But everyone should be aware of naming conventions, and datasetpaths are /A/B/C. Also, by now everyone should know that.

ericvaandering commented 10 years ago

Author: yuyi Replying to [comment:6 meloam]:

Yeah, datasets are actually "dataset paths" and the implementation (AFAIK) splits the dataset into three fields on the backend of DBS, but that's an implementation detail, and I'm not convinced that the user should be concerned with it.

NO. We don't split the dataset/path into different parts. If a user gives us a dataset/path, we will search on that.

ericvaandering commented 10 years ago

Author: hufnagel Replying to [comment:8 yuyi]:

NO. We don't split the dataset/path into different parts. If a user gives us a dataset/path, we will search on that.

But datasetpaths are constructed as /PrimDS/ProcDS/Tier ! Why would you allow random string searches against them ???

ericvaandering commented 10 years ago

Author: valya Dirk, a student who needs to find a Top sample should not have to scratch his head to do so. He/she should not need to spend time figuring out how to type a string field: Top, /Top, /_/_Top, etc. Search should be simple: you type and the search engine guides you, not the other way around.

From the DB point of view, searching for a pattern in strings (dataset paths) is a trivial operation. Why do you want to complicate it? We don't have billions of dataset paths, and ORACLE is quite capable of finding a Top pattern in ~10K strings VERY quickly. I don't see ANY abuse of the system.

ericvaandering commented 10 years ago

Author: meloam Replying to [comment:9 hufnagel]:

Replying to [comment:8 yuyi]:

NO. We don't split the dataset/path into different parts. If a user gives us a dataset/path, we will search on that.

But datasetpaths are constructed as /PrimDS/ProcDS/Tier ! Why would you allow random string searches against them ???

Something that's happened quite often recently is people will say, "hey, the new ZJets MC is in. Melo, will you pat-tuple them?". Perhaps I should know better (like you said above), but until this conversation, I didn't know there was a difference. I would go into the phedex subscription page, type in ZJets, scroll till I found the Summer11 sample and subscribe it to a site.

To you, they're a number of well-ordered components, but to me it's just an opaque blob. I don't care about the structure. I just want to fire off the pat-tuple production so I can get back to doing something not boring and menial

ericvaandering commented 10 years ago

Author: hufnagel You are knowledgeable though, just lazy :-). Which means you can make sense of what you get back.

In general, if we allow this, we'll confuse some people. You know where to look for the ZJets string in the result set and pick the correct sample. That hypothetical student might not.

Tradeoffs.

ericvaandering commented 10 years ago

Author: hufnagel Example, AFAIK we do have common substrings in Primary Datasets and in Skim names. A global search for something like that will return results that contain both. Without some knowledge of what the subparts of the datasetpath actually mean and knowledge of what you are looking for, you are lost.

There is no way to avoid the "garbage in == garbage out" problem here. I just like the restrictive search interfaces more because they force you to think about what you actually want before you type. But I realize this is a bit of a philosophical preference.

ericvaandering commented 10 years ago

Author: yuyi Replying to [comment:9 hufnagel]:

But datasetpaths are constructed as /PrimDS/ProcDS/Tier ! Why would you allow random string searches against them ???

Dataset/path search is not a random string search. Dataset is kept in DBS just like primary dataset and datatier. Why can we search on its three different parts, but not on the dataset itself?

ericvaandering commented 10 years ago

Author: meloam Replying to [comment:13 hufnagel]:

Example, AFAIK we do have common substrings in Primary Datasets and in Skim names. A global search for something like that will return results that contain both. Without some knowledge of what the subparts of the datasetpath actually mean and knowledge of what you are looking for, you are lost.

There is no way to avoid the "garbage in == garbage out" problem here. I just like the restrictive search interfaces more because they force you to think about what you actually want before you type. But I realize this is a bit of a philosophical preference.

If a user doesn't know what he's searching for, they're already boned, requiring them to iterate over:

/ZJets// /_/ZJets/ //_/ZJets

instead of ZJets

won't make it any better, if they don't know how to pick the right option. Also, I honestly don't know off-hand where I need to stick the ZJets to get what I'm looking for to come up. With my luck, the right "partitioned" search string would end up being the last one I found. But I would need to type it three times (or have a fun condition string) to eventually get what I want :/
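A minimal sketch of that comparison, assuming `*` as the wildcard character and matching each of the three parts separately so that a wildcard never crosses a slash. The dataset paths and helper names are hypothetical.

```python
from fnmatch import fnmatchcase

# Hypothetical dataset paths, for illustration only.
DATASETS = [
    "/ZJets_M50/Summer11-v1/AODSIM",
    "/SingleMu/ZJets_Skim-v1/RECO",
    "/TTbar/Summer11-v1/AODSIM",
]


def partitioned_patterns(term):
    """Build three patterns that place the term in one part each."""
    return [f"/*{term}*/*/*", f"/*/*{term}*/*", f"/*/*/*{term}*"]


def matches_partitioned(pattern, dataset):
    """Match a three-part pattern against a three-part path, part by part."""
    pattern_parts = pattern.split("/")[1:]
    dataset_parts = dataset.split("/")[1:]
    if len(pattern_parts) != 3 or len(dataset_parts) != 3:
        return False
    return all(
        fnmatchcase(part, pat) for pat, part in zip(pattern_parts, dataset_parts)
    )


def search_partitioned(term, datasets):
    """Run the three partitioned searches and merge the hits in list order."""
    patterns = partitioned_patterns(term)
    return [d for d in datasets if any(matches_partitioned(p, d) for p in patterns)]


def search_bare(term, datasets):
    """Plain substring search over the whole dataset path."""
    return [d for d in datasets if term in d]


print(search_partitioned("ZJets", DATASETS) == search_bare("ZJets", DATASETS))
```

For a term that contains no slash, the merged result of the three partitioned searches is the same set a single bare search returns; the partitioned form only adds the information of which part matched.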

ericvaandering commented 10 years ago

Author: evansde If this ticket gets any longer you can add the requirement that users specify all dataset paths as ascii character codes. Just pick something vaguely sane and implement it.

You know that /primary/processed/tier will decompose into three bits, and then you don't need to worry about slash placements, so start from that. If the API supports listdatasets(primary, processed, tier) and each one of those has separate regexp limits or whatever, that is fine. The front end parsing of the dataset path can break that down.
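A minimal sketch of such front end parsing, assuming the path always has exactly three slash-separated parts, any of which may be empty. The function name `split_dataset_path` is hypothetical; the three returned values are what a call like listdatasets(primary, processed, tier) would receive.

```python
def split_dataset_path(path):
    """Break '/primary/processed/tier' into its three parts.

    Empty parts are allowed, so '/Higgs//' yields ('Higgs', '', '').
    """
    if not path.startswith("/"):
        raise ValueError("dataset path must start with '/': %r" % path)
    parts = path[1:].split("/")
    if len(parts) != 3:
        raise ValueError("dataset path must have three parts: %r" % path)
    primary, processed, tier = parts
    return primary, processed, tier


print(split_dataset_path("/Higgs/blah-v2/RECO"))
print(split_dataset_path("/Higgs//"))
```

Each part can then be validated against its own rule before the query is built, independently of where the slashes were typed.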

At the end of the day these people have at least a degree in physics and should be capable of reading the documentation.

ericvaandering commented 10 years ago

Author: hufnagel Replying to [comment:16 evansde]:

If this ticket gets any longer you can add the requirement that users specify all dataset paths as ascii character codes. Just pick something vaguely sane and implement it.

Bikeshedding :-)

Users will hate whatever we implement anyways, will need feedback on it to improve.

ericvaandering commented 10 years ago

Author: yuyi Stick with what we have.