DMWMBot closed 12 years ago
valya: 1. This is a big issue, since there is no DBS2 API that can look up the dataset for a given file. I have contacted both the DBS developers and L2 (Dave/Simon), and the feedback is that it is not worth the effort in DBS2. Such an API will be provided in DBS3.
evansde: 1. DBS2 is what it is; new features like a reverse file - dataset lookup will be implemented in DBS3, but it's a full table scan of the files table in DBS. Given the whole system is designed to streamline queries as dataset->block->file, this seems like it could be a proper pain to implement, and if it's done badly it will cause havoc. (DZero's SAM system used to get tied in knots by this kind of reverse lookup.)
metson: I think DAS itself can provide 1., since the lfn has a known and fixed structure. E.g. from the filename above, /Winte09Wgamma/IDEAL_V12_FastSim_v1/GEN-SIM-DIGI-RECO is the dataset. Is that sort of string parsing possible (it must be, right, even if it's done in the server and not the database)? We could extend Lexicon.py to give you things like this pretty easily.
For 2. DAS should print out some warning about unconstrained queries (which raises a question - I'll open another ticket :))
valya: 1. I am really against a "hacking" solution, where you parse something whose structure can evolve over time and is not strictly enforced. My memory tells me we discussed that if something is missing at the API level, we will add or request it. This is not much different. It is safer and more correct to implement an appropriate API than to rely on things that happen to match at the string level.
metson: Replying to [comment:4 valya]:
- I am really against a "hacking" solution, where you parse something whose structure can evolve over time and is not strictly enforced.
Well, it should be enforced, and in fact the lfn->dataset lookup is enforced (what's not is the sub-string contents). I propose a new ticket against WMCore to make the necessary changes to Lexicon.py, and then DAS can use that in this case.
metson: #943 is for the improvement to Lexicon.py
valya: The lfn->dataset lookup is provided at the DB level via foreign key constraints. Do not bring in complexity and irrelevant dependencies when the sub-string parsing can be done in one line of Python.
So, I will be more comfortable if we have string matching as an intermediate solution, which I can do in one line of Python in the DAS DBS2 plugin; once DBS3 is in place I'll use the proper API (it already exists!).
valya: Here is a desired 1 line in python:
{{{ dataset = '/' + '/'.join(lfn.split('/')[…]) }}}
metson: Replying to [comment:8 valya]:
Here is a desired 1 line in python:
{{{ dataset = '/' + '/'.join(lfn.split('/')[…]) }}}
Which is wrong, since the dataset path is {{{/PrimaryDataset/ProcessedDataset/Tier}}} which is {{{/Winte09Wgamma/IDEAL_V12_FastSim_v1/GEN-SIM-DIGI-RECO}}} and not {{{/Winte09Wgamma/GEN-SIM-DIGI-RECO}}} ;)
I think this is exactly why it needs to be wrapped up in a library that all tools use (e.g. Lexicon.py).
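As an illustration of the three-part layout described above, here is a minimal sketch of a helper that validates and splits a dataset path. The name `split_dataset_path` is hypothetical and is not taken from Lexicon.py; it only checks the `/PrimaryDataset/ProcessedDataset/Tier` shape, not the contents of each part.

```python
def split_dataset_path(path):
    """Split '/PrimaryDataset/ProcessedDataset/Tier' into its three parts.

    Hypothetical helper for illustration only; not the Lexicon.py API.
    """
    parts = path.split('/')
    # A well-formed path has a leading '/' (empty first element)
    # followed by exactly three non-empty components.
    if len(parts) != 4 or parts[0] != '' or not all(parts[1:]):
        raise ValueError(
            "not a /PrimaryDataset/ProcessedDataset/Tier path: %r" % path)
    primary, processed, tier = parts[1:]
    return primary, processed, tier


# The three-part path from the comment above is accepted ...
print(split_dataset_path(
    '/Winte09Wgamma/IDEAL_V12_FastSim_v1/GEN-SIM-DIGI-RECO'))

# ... while the two-part result of the one-liner is rejected.
try:
    split_dataset_path('/Winte09Wgamma/GEN-SIM-DIGI-RECO')
except ValueError as exc:
    print(exc)
```

Keeping a check like this in one shared place would let every tool reject malformed paths the same way, which is the point of wrapping it in a library.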
valya: I still think we can't do it with simple string parsing. According to https://twiki.cern.ch/twiki/bin/viewauth/CMS/DMWMPG_Namespace
each LFN has the following structure:
{{{ /store/data/acquisition_era/primary-dataset/data_tier/processing_version/lfn_counter/filename.root }}}
so we can't reconstruct the dataset path from the LFN name. Our best effort would be to get the primary dataset and tier. As far as I know, acquisition_era is not the same as the processed dataset name. So if we only use primary dataset/tier, we will not get a proper mapping for a given LFN and will instead get irrelevant hits. Here is an example:
LFN: /store/relval/CMSSW_2_2_9/RelValSinglePiPt10/GEN-SIM-DIGI-RECO/IDEAL_V12_FastSim_v1/0001/CEA80414-AF31-DE11-88E3-000423D94534.root
has the following dataset: /RelValSinglePiPt10/CMSSW_2_2_9_IDEAL_V12_FastSim_v1/GEN-SIM-DIGI-RECO. So the LFN name does not contain the proper bits to get the dataset name.
Another example, even more informative. The LFN: /store/Generators/2008/3/29/RelVal-wpln0j-madgraph-1206775152/0009/18E3AA54-88FD-DC11-9DD0-000423D6B444.root
has the following dataset: /wpln0j-madgraph/CMSSW_1_8_3-RelVal-1206775152/GEN, which obviously violates the convention listed on the aforementioned twiki.
So, please be wise and make things right, which means providing a proper API.
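To make the two examples above concrete, here is a minimal sketch that splits an LFN purely by position, following the field order quoted from the twiki. The names `LFN_FIELDS` and `parse_lfn` are hypothetical, and the sketch assumes every LFN has exactly the layout `/store/<area>/<six fields>`; it is not how DAS or DBS actually resolve datasets.

```python
# Field order as quoted from the twiki convention (after /store/<area>/).
LFN_FIELDS = ('acquisition_era', 'primary_dataset', 'data_tier',
              'processing_version', 'lfn_counter', 'filename')


def parse_lfn(lfn):
    """Map LFN path components to convention fields by position only.

    Hypothetical helper for illustration; it cannot tell whether the
    LFN really follows the convention.
    """
    parts = lfn.split('/')
    # parts[0] is '' (leading slash), parts[1] is 'store',
    # parts[2] is the area (data, mc, relval, ...).
    if len(parts) != 9 or parts[0] != '' or parts[1] != 'store':
        raise ValueError("unexpected LFN layout: %r" % lfn)
    return dict(zip(LFN_FIELDS, parts[3:]))


relval = parse_lfn(
    '/store/relval/CMSSW_2_2_9/RelValSinglePiPt10/GEN-SIM-DIGI-RECO/'
    'IDEAL_V12_FastSim_v1/0001/CEA80414-AF31-DE11-88E3-000423D94534.root')
# Primary dataset and tier agree with the real dataset path here.
print(relval['primary_dataset'], relval['data_tier'])

generators = parse_lfn(
    '/store/Generators/2008/3/29/RelVal-wpln0j-madgraph-1206775152/'
    '0009/18E3AA54-88FD-DC11-9DD0-000423D6B444.root')
# Same number of components, so the layout check passes, but the
# "primary dataset" and "tier" are date fragments, not dataset parts.
print(generators['primary_dataset'], generators['data_tier'])
```

The second LFN has the same number of path components as the first, so a positional parser accepts it without complaint and returns date fragments as the primary dataset and tier.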
valya: Issue #1 is resolved in the 0.5.10 release via a DBS2 fake API, which uses DBS-QL. The proper DBS3 API will do the right job of looking up the dataset for a given file. The lookup of files for a set of datasets has limited support: the fake APIs can do it, while the real DBS3 API will have limited wildcard support (a wildcard is allowed only at the end of the dataset path).
1. Look up the dataset from a file name; a user will need it when tracing info while doing analysis via crab. Confirmed by Ian.
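The trailing-wildcard restriction mentioned above could be checked on the client side before a query is sent. The following is a minimal sketch under that assumption; `has_trailing_wildcard_only` is a hypothetical name, `*` is assumed to be the wildcard character, and this is not DBS3's own validation.

```python
def has_trailing_wildcard_only(pattern):
    """Return True if the pattern has no '*' or a single '*' at the very end.

    Hypothetical client-side check; '*' as the wildcard is an assumption.
    """
    star = pattern.find('*')
    # No wildcard at all, or the first wildcard is the last character.
    return star == -1 or star == len(pattern) - 1


# A wildcard at the end of the dataset path is accepted.
print(has_trailing_wildcard_only('/RelValSinglePiPt10/CMSSW_2_2_9*'))

# A wildcard in the middle of the path is rejected.
print(has_trailing_wildcard_only(
    '/RelVal*/CMSSW_2_2_9_IDEAL_V12_FastSim_v1/GEN-SIM-DIGI-RECO'))
```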
dataset file=/store/mc/Winter09/Wgamma/GEN-SIM-DIGI-RECO/IDEAL_V12_FastSim_v1/0000/06C82617-4A15-DE11-B068-0014C23ADC8E.root
file dataset=_IDEAL_V12_FastSimv1