dmwm / DAS

Data Aggregation System

DAS Query: looking up dataset with file name & list files in a group of datasets #935

Closed (DMWMBot closed this issue 12 years ago)

DMWMBot commented 13 years ago

1. Look up the dataset from a file name; a user will need this when tracing info while doing analysis via CRAB. Confirmed by Ian.

dataset file=/store/mc/Winter09/Wgamma/GEN-SIM-DIGI-RECO/IDEAL_V12_FastSim_v1/0000/06C82617-4A15-DE11-B068-0014C23ADC8E.root

2. Look up files in a group of datasets, which we can currently do in PhEDEx and DBS discovery. Peter confirmed this will be the case for data operations.

file dataset=*IDEAL_V12_FastSim_v1*

vkuznet commented 13 years ago

valya: 1. This is a big issue, since there is no DBS2 API which can look up the dataset for a given file. I have contacted both the DBS developers and L2 (Dave/Simon), and the feedback is that it is not worth the effort in DBS2. Such an API will be provided in DBS3.

  2. The usage of patterns is strongly discouraged by the DBS developers, since it slows down the entire system, makes the SQL queries longer, etc. So at the API level, the file look-up is only done for a full dataset path. I am not aware of a PhEDEx API which looks up files for a dataset pattern; if one exists, I would appreciate it if someone let me know about it.
evansde77 commented 13 years ago

evansde: 1. DBS2 is what it is; new features like a reverse file->dataset lookup will be implemented in DBS3, but it is a full table scan of the files table in DBS. Given that the whole system is designed to streamline queries as dataset->block->file, this seems like it could be a proper pain to implement, and if it is done badly it will cause havoc. (DZero's SAM system used to get tied in knots by this kind of reverse lookup.)

  2. Wildcards are dangerous and IMO shouldn't be allowed past the API. If we have a limited set of wildcard matches, like datasets, then maybe, but with the APIs it should be possible to cache things for a list-and-search approach with the Python APIs. It should be trivial to provide some scripts that users can use to filter results and match strings against cached lists of dataset names, for example.
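
A minimal sketch of the client-side list-and-search approach suggested here, assuming the dataset list has already been fetched and cached (the dataset names below are taken from later comments in this thread); this is illustrative, not an actual DAS or DBS script:

{{{
# Match a user's wildcard pattern against a cached list of dataset names,
# instead of passing the wildcard to the data-service. The cached list is
# hard-coded here for illustration only.
import fnmatch

cached_datasets = [
    '/RelValSinglePiPt10/CMSSW_2_2_9_IDEAL_V12_FastSim_v1/GEN-SIM-DIGI-RECO',
    '/wpln0j-madgraph/CMSSW_1_8_3-RelVal-1206775152/GEN',
]

def match_datasets(pattern, datasets=cached_datasets):
    """Return the cached dataset names matching a shell-style wildcard pattern."""
    return [name for name in datasets if fnmatch.fnmatch(name, pattern)]

print(match_datasets('*IDEAL_V12_FastSim_v1*'))
# ['/RelValSinglePiPt10/CMSSW_2_2_9_IDEAL_V12_FastSim_v1/GEN-SIM-DIGI-RECO']
}}}
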
drsm79 commented 13 years ago

metson: I think DAS can provide 1. itself, since the LFN is a known and fixed structure. E.g. from the filename above, /Winter09Wgamma/IDEAL_V12_FastSim_v1/GEN-SIM-DIGI-RECO is the dataset. Is that sort of string parsing possible (it must be, right, even if it's done in the server and not the database)? We could extend Lexicon.py to give you things like this pretty easily.
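
A minimal sketch of the kind of Lexicon-style parsing proposed here, assuming the /store/<kind>/<era>/<primary>/<tier>/<processing>/<counter>/<file> layout of the example LFN above; the function name and regular expression are illustrative, not the actual WMCore Lexicon.py API:

{{{
# Hypothetical helper in the spirit of extending Lexicon.py; the name and the
# regular expression are illustrative assumptions, not WMCore code.
import re

# Assumes LFNs of the form
# /store/<kind>/<era>/<primary>/<tier>/<processing>/<counter>/<file>.root
LFN_RE = re.compile(
    r'^/store/[^/]+/(?P<era>[^/]+)/(?P<primary>[^/]+)/'
    r'(?P<tier>[^/]+)/(?P<processing>[^/]+)/\d+/[^/]+\.root$'
)

def dataset_from_lfn(lfn):
    """Best-effort guess of the dataset path for an LFN, or None if it doesn't parse."""
    match = LFN_RE.match(lfn)
    if not match:
        return None
    # As later comments point out, the processed-dataset name is not always
    # recoverable from the LFN, so this can only ever be a guess.
    return '/%s/%s/%s' % (match.group('primary'),
                          match.group('processing'),
                          match.group('tier'))

print(dataset_from_lfn('/store/mc/Winter09/Wgamma/GEN-SIM-DIGI-RECO/'
                       'IDEAL_V12_FastSim_v1/0000/'
                       '06C82617-4A15-DE11-B068-0014C23ADC8E.root'))
# /Wgamma/IDEAL_V12_FastSim_v1/GEN-SIM-DIGI-RECO
}}}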

For 2. DAS should print out some warning about unconstrained queries (which raises a question - I'll open another ticket :))

vkuznet commented 13 years ago

valya: 1. I am really against a "hacking" solution, where you parse something whose structure can evolve over time and is not mandatorily enforced. My memory tells me we agreed that if something is missing at the API level we will add/request it. This is not that much different. It is safer and more correct to implement an appropriate API than to rely on things which match at the string level.

  2. Unconstrained queries must be handled appropriately by the APIs as well. DAS is data-agnostic and does not know which data/queries are valid; it is up to the data-service, which has that knowledge, to tell. So if an unconstrained pattern is passed to an API, it should return an appropriate error/exception explaining why it cannot serve such a request. Then in DAS we can display those errors/exceptions (see #942).
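
A minimal sketch of the kind of service-side guard described here, under the assumption that the data-service validates patterns before running them; the exception class and the threshold are hypothetical, not part of any actual DBS or PhEDEx API:

{{{
# Hypothetical service-side check: reject effectively unconstrained patterns
# with an error that DAS can simply pass back to the user (see #942).
class UnconstrainedQueryError(Exception):
    """Raised by the data-service when a query pattern is too loose to run."""

def validate_dataset_pattern(pattern, min_literal_chars=3):
    """Reject patterns such as '*' or '/*/*/*' that would scan everything."""
    literal = pattern.replace('*', '').replace('/', '')
    if len(literal) < min_literal_chars:
        raise UnconstrainedQueryError(
            "dataset pattern %r is unconstrained; please give more of the "
            "dataset path" % pattern)
    return pattern
}}}
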
drsm79 commented 13 years ago

metson: Replying to [comment:4 valya]:

  1. I am really against a "hacking" solution, where you parse something whose structure can evolve over time and is not mandatorily enforced.

Well, it should be enforced, and in fact the lfn->dataset lookup is enforced (what is not enforced is the sub-string contents). I propose a new ticket against WMCore to make the necessary changes to Lexicon.py, and then DAS can use that in this case.

drsm79 commented 13 years ago

metson: #943 is for the improvement to Lexicon.py

vkuznet commented 13 years ago

valya: The lfn->dataset look-up is provided at the DB level via foreign-key constraints. Do not bring in complexity and irrelevant dependencies when the sub-string parsing can be done in one line of Python.

So, I would be more comfortable if we had string matching as an intermediate solution; I can do it in one line of Python in the DAS DBS2 plugin, and once DBS3 is in place I'll use the proper API (it already exists!).

vkuznet commented 13 years ago

valya: Here is the desired one line of Python:

{{{ dataset = '/' + '/'.join(lfn.split('/')[3:6]) }}}

drsm79 commented 13 years ago

metson: Replying to [comment:8 valya]:

Here is the desired one line of Python:

{{{ dataset = '/' + '/'.join(lfn.split('/')[3:6]) }}}

Which is wrong, since the dataset path is {{{/PrimaryDataset/ProcessedDataset/Tier}}}, which here is {{{/Winter09Wgamma/IDEAL_V12_FastSim_v1/GEN-SIM-DIGI-RECO}}} and not {{{/Winter09Wgamma/GEN-SIM-DIGI-RECO}}} ;)

I think this is exactly why it needs to be wrapped up in a library that all tools use (e.g. Lexicon.py).

vkuznet commented 13 years ago

valya: I still think we can't do it with simple string parsing. According to this link https://twiki.cern.ch/twiki/bin/viewauth/CMS/DMWMPG_Namespace

each LFN has the following meaning:

{{{ /store/data/acquisition_era/primary-dataset/data_tier/processing_version/lfn_counter/filename.root }}}

so we cannot reconstruct the dataset path from the LFN. Our best effort would be to get the primary dataset and the tier. As far as I know, the acquisition_era is not the same as the processed dataset name. So if we only use the primary dataset/tier we will not get the proper mapping for a given LFN, and will instead get irrelevant hits. Here is an example:

LFN: /store/relval/CMSSW_2_2_9/RelValSinglePiPt10/GEN-SIM-DIGI-RECO/IDEAL_V12_FastSim_v1/0001/CEA80414-AF31-DE11-88E3-000423D94534.root

has the following dataset: /RelValSinglePiPt10/CMSSW_2_2_9_IDEAL_V12_FastSim_v1/GEN-SIM-DIGI-RECO. So the LFN name does not contain the proper bits to get the dataset name.

Another example is even more informative. The LFN /store/Generators/2008/3/29/RelVal-wpln0j-madgraph-1206775152/0009/18E3AA54-88FD-DC11-9DD0-000423D6B444.root

has the following dataset: /wpln0j-madgraph/CMSSW_1_8_3-RelVal-1206775152/GEN, which obviously violates the convention listed on the aforementioned twiki.

So, please be wise and make things right, which means providing a proper API.
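
A short illustration of the mismatch described above, using the first RelVal example; the index-based parse stands in for any naive LFN-splitting approach and is not DAS code:

{{{
# The naive split-and-join guess versus the real dataset quoted above.
lfn = ('/store/relval/CMSSW_2_2_9/RelValSinglePiPt10/GEN-SIM-DIGI-RECO/'
       'IDEAL_V12_FastSim_v1/0001/CEA80414-AF31-DE11-88E3-000423D94534.root')
actual = '/RelValSinglePiPt10/CMSSW_2_2_9_IDEAL_V12_FastSim_v1/GEN-SIM-DIGI-RECO'

parts = lfn.split('/')
# parts[3]=era, parts[4]=primary, parts[5]=tier, parts[6]=processing version
guess = '/%s/%s/%s' % (parts[4], parts[6], parts[5])

print(guess)            # /RelValSinglePiPt10/IDEAL_V12_FastSim_v1/GEN-SIM-DIGI-RECO
print(actual)           # /RelValSinglePiPt10/CMSSW_2_2_9_IDEAL_V12_FastSim_v1/GEN-SIM-DIGI-RECO
print(guess == actual)  # False: the processed-dataset name embeds the release,
                        # which the LFN path components alone do not provide
}}}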

vkuznet commented 13 years ago

valya: Issue 1 is resolved in the 0.5.10 release via a DBS2 fake API which uses DBS-QL; the proper DBS3 API will then do the right job of looking up the dataset for a given file. The look-up of files for a set of datasets has limited support: the fake APIs can do it, while the real DBS3 API will have limited support for wild-cards (a wild-card may only appear at the end of the dataset path).
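
For reference, a hedged sketch of the kind of DBS-QL the fake DBS2 API could issue for these two look-ups; the exact query strings are assumptions, not the actual DAS plugin code:

{{{
# Illustrative DBS-QL strings; the actual plugin may phrase these differently.
lfn = ('/store/mc/Winter09/Wgamma/GEN-SIM-DIGI-RECO/'
       'IDEAL_V12_FastSim_v1/0000/06C82617-4A15-DE11-B068-0014C23ADC8E.root')

# 1. look up the dataset for a given file
dataset_query = 'find dataset where file = %s' % lfn

# 2. look up files for a dataset pattern, mirroring the DBS3 restriction that
#    the wild-card may only appear at the end of the dataset path
files_query = 'find file where dataset = /RelValSinglePiPt10/CMSSW_2_2_9_IDEAL*'
}}}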