dmwm / dasgoclient

Data Aggregation System (DAS) Go client
https://cmsweb.cern.ch/das/
MIT License

"jsonparser failure of DAS record" when querying for events in lumisections #32

Closed justinasr closed 2 years ago

justinasr commented 2 years ago

I am trying to get the number of events in lumisections, and dasgoclient is throwing an error. The command I am using:

dasgoclient --query="file,run,lumi,events dataset=/DoubleMuon/Run2018A-v1/RAW run in [315257,315258,315259]"

We used to get output where each line was <filename> <lumis> <events in lumis>, but now there are error messages and <events in lumis> is missing:

jsonparser failure of DAS record={"das":{"expire":1646053218,"instance":"prod/global","primary_key":"file.name","record":1,"services":["dbs3:file_run_lumi_evts4dataset"]},"file":[{"name":"/store/data/Run2018A/DoubleMuon/RAW/....root"}],"lumi":[{"number":[40,43,41,44,46,39,45,42,47]}]}
, select sub keys=[run [0] run_number], error=Key path not found
...
/store/data/Run2018A/DoubleMuon/RAW/....root [40,43,41,44,46,39,45,42,47]
...

I enabled verbose output and noticed that dasgoclient is issuing queries to the filelumis API, e.g.:

DAS GET https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filelumis?block_name=/DoubleMuon/Run2018A-v1/RAW#331526bd-2a34-4024-a2fc-18f56060b288&run_num=315257&run_num=315258&run_num=315259 37.076875ms

and if I try to open that link myself, I get an error message:

"error": {
  "reason": "DBSError Code:114 Description:DBS validation error when wrong pattern is provided Function:dbs.validator.checkBlockHash Message:wrong parts in block name /DoubleMuon/Run2018A-v1/RAW Error: validation error",
  "message": "not str type",
  "function": "dbs.Validate",
  "code": 113
},

Could it be that the # in the block name is treated as a special symbol in the URL and is not correctly sent to or interpreted by DBS? I had some success manually encoding # as %23, e.g.: https://cmsweb.cern.ch/dbs/prod/global/DBSReader/filelumis?block_name=/DoubleMuon/Run2018A-v1/RAW%23572a6da9-56a5-4854-b1b7-eec20b72a536

dasgoclient version:

Build: git=v02.04.41 go=go1.17.7 date=2022-02-28 14:15:31.180884891 +0100 CET m=+0.009200614
vkuznet commented 2 years ago

@justinasr I confirm the error. It happens with my recent attempt to fix CMSSW IB failures, which requires DBS aggregated results. The new DBS server does not provide aggregation by default, and we overlooked a few use-cases. It seems that yours is one of them.

To properly address the issue, I first need to recreate the previous output. So far I have enabled DBS aggregation by default for file,lumi dataset=XXX run=XXX queries, and added a flag to use plain DBS results. With this new option I can produce the following:

/afs/cern.ch/user/v/valya/public/dasgoclient/dasgoclient_amd64 --query="file,run,lumi,events dataset=/DoubleMuon/Run2018A-v1/RAW run in [315257,315258,315259]" -noDbsAgg
/store/data/Run2018A/DoubleMuon/RAW/v1/000/315/257/00000/666CEA92-9F49-E811-B1B4-FA163E081C30.root 315257 25 528
...

Is this the output you used to get before? Feel free to try out the new executable, since it is in my public area. I need to decide how to move forward. I have two options here:

Once I have a better idea based on your input, I can decide how to move forward and provide a proper fix.

justinasr commented 2 years ago

I have tried your executable with -noDbsAgg and it seems to have all the info that we need, just in a slightly different format. Before, we used to get only one line per file, with lumis represented as [1,2,3,4] and the corresponding events as [10,20,30,40]. Now we get 4 lines: 1 10, 2 20, 3 30, 4 40. Just mentioning it, not a problem at all.

What worries me more is that the numbers on our side no longer add up to a nice 100%. I am still looking into where the problem is, whether it is on our side or not. If I remember correctly, they used to add up exactly.

For context, an example: we take a workflow pdmvserv_Run2018A_DoubleMuon_30Jan2022_UL2018_220204_191216_2749, its input dataset, and "LumiList". With dasgoclient we query for the number of events in these runs and lumisections to obtain the "expected" number of events. Then we compare the sum to the events in the output dataset and get a completion fraction. At the moment I am getting 74767375 expected events (with your new binary) and 75877207 events in AOD (1.1M too many). I wonder whether in this case there is an actual overproduction OR we are incorrectly summing up the number of events in each lumisection. In total, we have 4 workflows where we do such calculations, and all of them are off by some number of events.

As for how to move forward, it is up to you; I see no problem adapting to changes, for example adding -noDbsAgg.

I think this issue could be closed as soon as you add -noDbsAgg or update DBS; the example mentioned above is a separate issue, most likely not a dasgoclient one.

vkuznet commented 2 years ago

Ok, thanks for the clarification. Let's keep the issue open and I'll work on it once time permits (first, I should resolve an issue with the DBS Migration server reported in a different forum). I'll try to recreate the original format, with lumis grouped together along the run number within a given file, and will keep the -noDbsAgg option as well. Once the code is ready, I'll post an update here.

vkuznet commented 2 years ago

@justinasr, I fixed the issue and made a new version of dasgoclient. You can grab its executable from the release page, version v02.04.42. Meanwhile, the CMSSW PR is here: https://github.com/cms-sw/cmsdist/pull/7653. I'll close this ticket now; feel free to re-open if necessary.