dmwm / DAS

Data Aggregation System

Need custom DAS map-reduce for Oli use case #547

Closed vkuznet closed 12 years ago

vkuznet commented 13 years ago

Oli wants to have custom views in DAS to get his data:

''Essentially the sum of data for each T1 site for each combination of acq era, tier, custodial/non-custodial. ''

I think it can be accomplished as a two-step procedure in DAS.

  1. DAS asks DBS3/Phedex for dataset/block info:
    • DAS asks DBS3 for the list of all datasets. This brings tier/era info into DAS.
    • DAS asks Phedex for the list of all blocks. This brings block info, which contains the replicas, into DAS.
  2. We develop a script which loops over all unique tier/era combinations and asks for each of them the sum of replica sizes from the stored blocks (see the sketch below).
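For illustration, such a script could look roughly like the following sketch. The das_query() helper and the exact record layouts are assumptions (modeled on the JSON examples further down); real DBS3/Phedex client calls will differ.

{{{
# Sketch of the two-step aggregation; das_query() is a hypothetical
# helper returning parsed DAS records.
from collections import defaultdict

def aggregate(das_query):
    # Step 1: pull everything into memory.
    datasets = das_query('dataset')  # DBS3: era/tier per dataset
    blocks = das_query('block')      # Phedex: replicas per block

    # Map dataset name -> (era, tier); the tier is the last path
    # element of the dataset name /primary/processed/TIER.
    meta = {}
    for rec in datasets:
        name = rec['dataset']
        meta[name] = (rec.get('acquisition_era_name'), name.split('/')[-1])

    # Step 2: sum replica sizes per unique (era, tier, custodial, site).
    totals = defaultdict(float)
    for rec in blocks:
        era, tier = meta.get(rec['block']['name'].split('#')[0], (None, None))
        for rep in rec['block']['replica']:
            if rep['site'].startswith('T1_'):
                totals[(era, tier, rep['custodial'], rep['site'])] += rep['size']
    return totals
}}}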
drsm79 commented 13 years ago

metson: wrong ticket...

vkuznet commented 13 years ago

valya: I had a look at this request. My conclusion is that with the current set of APIs (both Phedex and DBS2/DBS3) it is impossible via map-reduce, but it can be done as an external script/application. Here are my observations:

Right now Phedex returns the following information for a given block name:

{{{

{"das_id": "4cbe02fdf823c63be5000004", "_id": "4cbe02fef823c63be5000011", "block": {"name": "/Wgamma/Winter09_IDEAL_V12_FastSim_v1/GEN-SIM-DIGI-RECO#9f5c396b-b6a1-4efc-aaca-d2193ff1c341", "replica": [{"group": "", "complete": "y", "subscribed": "n", "custodial": "n", "creation_time": 1238191966.3910799, "site": "T1_UK_RAL_Buffer", "modification_time": 1272486414.8509099, "node_id": 18.0, "nfiles": 6.0, "se": "srm-cms.gridpp.rl.ac.uk", "size": 14953555838.0}, {"group": "DataOps", "complete": "y", "subscribed": "y", "custodial": "y", "creation_time": 1238185414.3682899, "site": "T1_UK_RAL_MSS", "modification_time": 1272486414.8509099, "node_id": 19.0, "nfiles": 6.0, "se": "srm-cms.gridpp.rl.ac.uk", "size": 14953555838.0}], "is_open": "n", "nfiles": 6.0, "id": 549575.0, "size": 14953555838.0}, "das": {"empty_record": 0, "expire": 1287521920.614502, "primary_key": "block.name"}}

}}}

So we can compute sum(block.replica.size) for custodial=y/n and site=T1.
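For a single record like the one above, that reduces to something like this sketch (record is the parsed JSON shown above):

{{{
# Sketch: sum the replica sizes of one parsed Phedex block record
# (as shown above) for custodial='y' or 'n' replicas at T1 sites.
def t1_replica_size(record, custodial='y'):
    return sum(rep['size'] for rep in record['block']['replica']
               if rep['custodial'] == custodial
               and rep['site'].startswith('T1_'))
}}}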

In DBS3 there is not enough block-level information to aggregate era/tier with the Phedex info. Currently DBS3 returns the dataset, era, and tier for a given dataset, e.g.

{{{ [{"is_dataset_valid": 1, "primary_ds_name": "MinBias900GeV", "physics_group_name": "dataOps", "acquisitio n_era_name": null, "create_by": "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ceballos/CN=488892/CN=Guille lmo Gomez Ceballos", "dataset_access_type": "VALID", "data_tier_name": "AODSIM", "last_modified_by": "/DC =ch/DC=cern/OU=Organic Units/OU=Users/CN=ceballos/CN=488892/CN=Guillelmo Gomez Ceballos", "creation_date" : 1224317896, "processing_version": null, "processed_ds_name": "Summer08_IDEAL_V9_AODSIM_v1", "global_tag ": null, "xtcrosssection": null, "last_modification_date": 1239728278, "dataset_id": 10512, "dataset": "/ MinBias900GeV/Summer08_IDEAL_V9_AODSIM_v1/AODSIM", "primary_ds_type": "mc"}] }}}

but this output does not contain block info which we need for aggregation. If we ask for a block we get

{{{ [{"block_id": 249378, "create_by": "/DC=ch/DC=cern/OU=Organic Units/OU=Users/CN=ceballos/CN=488892/CN=Guillelmo Gomez Ceballos", "creation_date": 1224324546, "open_for_writing": 0, "dataset": "/MinBias900GeV/Summer08_IDEAL_V9_AODSIM_v1/AODSIM", "block_name": "/MinBias900GeV/Summer08_IDEAL_V9_AODSIM_v1/AODSIM#35e2fa40-7d2d-4f17-84dc-d1bf0821b4fe", "file_count": 219, "origin_site_name": "dbs.test.server", "dataset_id": 10512, "block_size": 352274652233}] }}}

which does not contain any era/tier information, even though the tier can be deduced from the dataset/block name.
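For illustration, deducing the tier (and, relying on the naming convention, the era) from a block name could look like this sketch:

{{{
# Sketch: deduce tier (and era, via the naming convention) from a
# block name /primary/processed/TIER#guid, where the processed name
# usually starts with the acquisition era, e.g. Summer08_..._v1.
def parse_block_name(block_name):
    dataset = block_name.split('#')[0]
    _, primary, processed, tier = dataset.split('/')
    era = processed.split('_')[0]  # relies on the naming convention
    return era, tier

# parse_block_name('/MinBias900GeV/Summer08_IDEAL_V9_AODSIM_v1/AODSIM#...')
# -> ('Summer08', 'AODSIM')
}}}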

So, in order to complete Oli's use case I see two solutions:

  1. Find all datasets, loop over the datasets to get the list of blocks, and for each block loop over the Phedex info and sum up its size for a given Tier center.
  2. Get all blocks from Phedex, such that I can map-reduce block name, custodial flag, and replica size for a given Tier; then send a request to DBS3 to fetch the dataset info for each block in order to get the era.

Both solutions seem inappropriate for the DAS auto-workflow, since they require customized logic, but they can be done as a stand-alone script. Certainly we can (must) feed such output back to DAS and run it periodically, such that DAS will contain documents with ''era, tier, sum(block.replica), site'' keys.
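For example, each such fed-back document might look roughly like this (a sketch of the record shape using values from the Phedex example above, not the actual DAS schema):

{{{
{"era": "Winter09", "tier": "GEN-SIM-DIGI-RECO", "custodial": "y",
 "site": "T1_UK_RAL_MSS", "size": 14953555838.0}
}}}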

gutsche commented 13 years ago

gutsche: Replying to [ticket:547 valya]:

Oli wants to have custom views in DAS to get his data:

''Essentially the sum of data for each T1 site for each combination of acq era, tier, custodial/non-custodial. ''

I think it can be accomplished as a two-step procedure in DAS.

  1. DAS asks DBS3/Phedex for dataset/block info:
    • DAS asks DBS3 for the list of all datasets. This brings tier/era info into DAS.
    • DAS asks Phedex for the list of all blocks. This brings block info, which contains the replicas, into DAS.
  2. We develop a script which loops over all unique tier/era combinations and asks for each of them the sum of replica sizes from the stored blocks.

Looks good. We currently use a script once a week to extract all the needed info from TMDB directly and then use a Python script to parse it. The last step is a manual step in Excel to make tables and plots. All a bit clumsy, but it grew over time.

drsm79 commented 13 years ago

metson: How about pulling the era/tier from the dataset or block name? Can you do things like {{{ sum(data.size) where block.name like '_/ACQERA/TIER#*' and site = T1_US_FNAL }}} and just call that for all sites, tiers, and eras? The naming convention of datasets should be sufficient to do this from the block name, unless people have been 'inventive'.
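Generating those calls for every combination could be as simple as this sketch (the query form is the one proposed above; the site/era/tier lists would come from a prior lookup, the values here are only illustrative):

{{{
# Sketch: emit one query per (site, era, tier) combination; the lists
# below are illustrative and would come from a prior lookup.
import itertools

sites = ['T1_US_FNAL', 'T1_UK_RAL_MSS']
eras = ['Summer08', 'Winter09']
tiers = ['AODSIM', 'GEN-SIM-RAW']

for site, era, tier in itertools.product(sites, eras, tiers):
    print("sum(data.size) where block.name like '_/%s/%s#*' and site = %s"
          % (era, tier, site))
}}}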

vkuznet commented 13 years ago

valya: This is solution #2 in my post, even though the query will be slightly different. The point is that we need an external step to get all tiers/eras.

drsm79 commented 13 years ago

metson: I think that's fine - got to have something for cron to do ;)

vkuznet commented 13 years ago

valya: Oli asked me to provide aggregation for the following query:

{{{ dbs search --query="find file,file.size,file.numevents where dataset = GEN-SIM-RAW and file = /store/mc/* and file.createdate >= 2010-01-01 and file.createdate < 2010-01-02" --noheader }}}

My reply to him:

DBS2 provides the listFiles API, whose output is the following:

{{{ <file id='11460378' lfn='/store/mc/Winte09Wgamma/GEN-SIM-DIGI-RECO/IDEAL_V12_FastSim_v1/0000/84FC9EC3-A014-DE11-9909-0017A4ECB031.root' checksum='1774385121' adle27d857db64fa577374e297670732e58a83e56ee4'NOTSET' md5='NOTSET' size='2692183456' queryable_meta_data='NOTSET' number_of_events='36000' block_name='/Wgamma/Winter09_IDEAL_V12_FastSim_v1/GEN-SIM-DIGI-RECO#9f5c396b-b6a1-4efc-aaca-d2193ff1c341'> }}}

As far as I can tell, you get the file name, size, and numevents, but you can't see the file creation date. So you would need to scan every single file with another query/API to get its creation date.

In DBS3, on the other hand, everything seems to be in place. Here is the output for a test dataset:

{{{ {"check_sum": "1504266448", "branch_hash_id": null, "adle27d857db64fa577374e297670732e58a83e56ee4: null, "block_id": 99, "event_count": 1619, "file_type": "EDM", "create_by": "yuyi", "logical_file_name": "/store/mc/10244/9.root", "creation_date": 1294785350, "last_modified_by": "yuyi", "dataset": "/unittest_web_primary_ds_name_10244/unittest_web_dataset_10244/GEN-SIM-RAW", "block_name": "/unittest_web_primary_ds_name_10244/unittest_web_dataset_10244/GEN-SIM-RAW#10244", "file_id": 7850, "file_size": 2012211901, "last_modification_date": 1294785350, "dataset_id": 135, "file_type_id": 1, "auto_cross_section": 0.0, "md5": null, "is_file_valid": 1} }}}

It is JSON and it lists the file name, type, creation_date, and size. Using this output it would be possible to aggregate information through DAS in the following way (still need to test, but the syntax is correct and all operators are supported):

{{{ file dataset=/bla* | grep file.name, file.size, file.creation_date>123, file.creation_date<123 | sum(file.size) }}}

Oliver replied to me:

I played around yesterday a bit and found that if I query for blocks I can get the same information and still be able to parse the block name to distinguish between MC production and reprocessing. I now have a script which extracts the information for 2010: http://cmssw.cvs.cern.ch/cgi-bin/cmssw.cgi/UserCode/Gutsche/GutSoftConfigurations/Scripts/DBS/mc_statistics_2010.sh?view=log

Basically he made the following query:

{{{ find block,block.size,block.numevents,block.createdate where tier = GEN-SIM-RAW* and block.createdate >= 2010-01-01 and block.createdate < 2010-02-01 }}}

and then used {{{ cat mc_2010_janraw.blocks | grep -v '/CMSSW' | grep -vi test | grep -vi preprod | grep -v Jan29 | awk -F\/ '{print $3}' | sort -u }}}

to get the info he wants.

So I need to cover this case once DBS3/DAS integration is ready.

vkuznet commented 13 years ago

valya: Oli,

I made a new service which I hope can do what you want. Since getting the data is a slow operation, I want to run it separately from DAS. It will retrieve the desired info from DBS/Phedex and will answer your question in one of two ways:

  1. If the data is not yet populated, you/DAS will get a dict with

{{{ {'busy':'please try later', 'reason':'waiting for DBS3/Phedex info'} }}}

which means that you/DAS need to try it later.

  2. If it has all the data, you'll get records like this (I asked for dataset info for site=T2_CH_CAF):

{{{
{u'count': 23.0, u'name': u'/QCD_Pt_170to300_TuneZ2_55M_7TeV_pythia6/Fall10-START38_V12-v1/DQM', u'custodial': u'n', u'site': u'T2_CH_CAF', u'nfiles': 348.0, u'se': u'caf.cern.ch', u'size': 1576260039.0}
{u'count': 7.0, u'name': u'/WtoTauNu_TuneP0_7TeV-pythia6-tauola/Fall10-START38_V12-v1/DQM', u'custodial': u'n', u'site': u'T2_CH_CAF', u'nfiles': 40.0, u'se': u'caf.cern.ch', u'size': 159781457.0}
{u'count': 7.0, u'name': u'/WtoTauNu_TuneProPT0_7TeV-pythia6-tauola/Fall10-START38_V12-v1/DQM', u'custodial': u'n', u'site': u'T2_CH_CAF', u'nfiles': 46.0, u'se': u'caf.cern.ch', u'size': 152100375.0}
{u'count': 7.0, u'name': u'/QCD_Pt-80toInf_6GenJets_TuneZ2_7TeV-pythia6/Fall10-START38_V12-v1/DQM', u'custodial': u'n', u'site': u'T2_CH_CAF', u'nfiles': 82.0, u'se': u'caf.cern.ch', u'size': 217423491.0}
{u'count': 5.0, u'name': u'/DYtoMuMu_M_20_TuneProPT0_7TeV-pythia6/Fall10-START38_V12-v1/DQM', u'custodial': u'n', u'site': u'T2_CH_CAF', u'nfiles': 7.0, u'se': u'caf.cern.ch', u'size': 50849605.0}
}}}

I can supplement records with additional info if you want (I used DBS3/Phedex APIs). The information will be fetched periodically and kept around for some amount of time, e.g. 1 hour.

Once the service is in place, DAS can talk to it. But it means that sometimes you can get a busy record and need to retry your query later. Usually the DBS3/Phedex combination takes around 5 minutes; once it's completed, the look-up by dataset/site is really fast.
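A client (or DAS itself) would then just poll until the busy record goes away; a minimal sketch (the service URL and the exact response layout are assumptions based on the description above):

{{{
# Sketch of a client polling the new service; the URL and the exact
# response layout are assumptions based on the description above.
import json
import time
import urllib2

def fetch(url, pause=60, attempts=10):
    for _ in range(attempts):
        data = json.load(urllib2.urlopen(url))
        if isinstance(data, dict) and 'busy' in data:
            time.sleep(pause)  # DBS3/Phedex harvesting still in progress
            continue
        return data
    raise RuntimeError('service still busy, giving up')

# records = fetch('http://localhost:8212/datasvc?site=T2_CH_CAF')  # hypothetical URL
}}}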

gutsche commented 13 years ago

gutsche: Hi there,

looks good. We'll use it when DBS3 comes online.

Thanks,

OLI

vkuznet commented 13 years ago

valya: The solution has been implemented as of revision b121b75c2b298c17b205559ade26544bdccce56e using DBS3/Phedex APIs. Closing the ticket.