dmwm / CRABServer


attempt to streamline parent file discovery #8352

Closed davidlange6 closed 2 months ago

davidlange6 commented 2 months ago

After a bit of discussion with @belforte, this is some untested code that should considerably reduce the time spent checking for overlapping lumis when looking for secondary files.

cmsdmwmbot commented 2 months ago

Can one of the admins verify this patch?

belforte commented 2 months ago

To get a baseline, I timed the use case that prompted this, https://cms-talk.web.cern.ch/t/crab-task-queued-on-command-submit-for-a-while/39714 i.e.
primary dataset: /WtoLNu-4Jets_TuneCP5_13p6TeV_madgraphMLM-pythia8/Run3Summer22EEMiniAODv3-124X_mcRun3_2022_realistic_postEE_v1-v2/MINIAODSIM
secondary dataset: /WtoLNu-4Jets_TuneCP5_13p6TeV_madgraphMLM-pythia8/Run3Summer22EEDRPremix-124X_mcRun3_2022_realistic_postEE_v1-v2/AODSIM

The primary dataset has 6833 files and the secondary dataset has 31962 files. The code is like:

for primary_file in primary:  # loop 1
  for secondary_file in secondary:  # loop 2
    find parents of primary_file by matching lumis with secondary_file

I found that each iteration of loop 1 takes 15~20 seconds. Of course I killed the job after a few tens of iterations, but all files should be pretty similar.

That gives a total of the order of 30 hours (maybe I should have let that task run :grin:!).
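For reference, the nested-loop pattern above can be sketched as runnable Python. This is only an illustration: it assumes each file is represented as a name mapped to a set of (run, lumi) pairs, and the function name, file names, and data layout are hypothetical, not CRAB's actual structures. The last lines redo the total-time estimate from the figures quoted above.

```python
# Hypothetical data layout: file name -> set of (run, lumi) pairs.
# This is an illustration only, not the structure CRAB actually uses.
def find_parents_naive(primary, secondary):
    """For each primary file, scan every secondary file for shared lumis."""
    parents = {}
    for primary_name, primary_lumis in primary.items():            # loop 1
        matched = set()
        for secondary_name, secondary_lumis in secondary.items():  # loop 2
            if primary_lumis & secondary_lumis:  # at least one common (run, lumi)
                matched.add(secondary_name)
        parents[primary_name] = matched
    return parents


# Toy inputs with made-up file names.
primary = {
    "miniaod_1.root": {(1, 1), (1, 2)},
    "miniaod_2.root": {(1, 3)},
}
secondary = {
    "aod_a.root": {(1, 1)},
    "aod_b.root": {(1, 2), (1, 3)},
    "aod_c.root": {(1, 4)},
}
parents = find_parents_naive(primary, secondary)

# Back-of-the-envelope total: 6833 primary files at 15~20 s per iteration.
n_primary = 6833
low_hours = n_primary * 15 / 3600
high_hours = n_primary * 20 / 3600
print(f"{low_hours:.1f} to {high_hours:.1f} hours")
```

The cost grows with the product of the two file counts, which is why the estimate lands at roughly 28 to 38 hours for this pair of datasets.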

belforte commented 2 months ago

With the new code (from this PR), the time for each iteration went down to 0.15~0.20 seconds, a neat x100 improvement. Looking forward to a "reasonable" 20 minutes for the whole match.
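One common way to get a speed-up of this kind is to index the secondary files by (run, lumi) once, so that each primary file needs only dictionary lookups instead of a full scan. The sketch below shows that general technique under the same assumed layout (file name mapped to a set of (run, lumi) pairs); it is not necessarily what this PR implements, and all names in it are hypothetical.

```python
from collections import defaultdict


def build_lumi_index(secondary):
    """Map each (run, lumi) pair to the secondary files that contain it."""
    index = defaultdict(set)
    for name, lumis in secondary.items():
        for run_lumi in lumis:
            index[run_lumi].add(name)
    return index


def find_parents_indexed(primary, secondary):
    """Find parents with one pass over the secondary files plus lookups."""
    index = build_lumi_index(secondary)  # built once, reused for every primary file
    parents = {}
    for name, lumis in primary.items():
        matched = set()
        for run_lumi in lumis:
            # .get() avoids creating empty entries for lumis with no parent
            matched |= index.get(run_lumi, set())
        parents[name] = matched
    return parents


# Toy inputs with made-up file names (hypothetical layout, not CRAB's).
primary = {
    "miniaod_1.root": {(1, 1), (1, 2)},
    "miniaod_2.root": {(1, 3)},
}
secondary = {
    "aod_a.root": {(1, 1)},
    "aod_b.root": {(1, 2), (1, 3)},
    "aod_c.root": {(1, 4)},
}
parents = find_parents_indexed(primary, secondary)

# Total-time estimate: 6833 primary files at 0.15~0.20 s per iteration.
n_primary = 6833
low_minutes = n_primary * 0.15 / 60
high_minutes = n_primary * 0.20 / 60
print(f"{low_minutes:.0f} to {high_minutes:.0f} minutes")
```

With the quoted per-iteration times, the total works out to roughly 17 to 23 minutes, consistent with the "reasonable" 20 minutes mentioned above.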

:bowing_man:

Onward to validation

belforte commented 2 months ago

I tested on primary: /DoubleMuon/Run2018B-02Apr2020-v1/NANOAOD secondary: /DoubleMuon/Run2018B-17Sep2018-v1/MINIAOD

and got identical results
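The comment does not say how the two results were compared. As a rough sketch of one way such a check could be done, assuming each implementation returns a mapping from primary file name to a collection of parent file names (a hypothetical layout, with a hypothetical helper name), one could diff the two mappings while ignoring ordering:

```python
def diff_parent_maps(old, new):
    """Return {primary_file: (only_in_old, only_in_new)} for every mismatch."""
    differences = {}
    for name in old.keys() | new.keys():
        old_parents = set(old.get(name, ()))
        new_parents = set(new.get(name, ()))
        if old_parents != new_parents:
            differences[name] = (old_parents - new_parents,
                                 new_parents - old_parents)
    return differences


# Toy results with made-up file names; parent order differs but content matches.
old_result = {
    "nano_1.root": ["mini_a.root", "mini_b.root"],
    "nano_2.root": ["mini_c.root"],
}
new_result = {
    "nano_1.root": ["mini_b.root", "mini_a.root"],
    "nano_2.root": ["mini_c.root"],
}
identical = diff_parent_maps(old_result, new_result)

# A deliberately incomplete result, to show what a mismatch looks like.
broken = diff_parent_maps(old_result, {"nano_1.root": ["mini_a.root"]})
```

An empty dictionary means the two implementations agree on every primary file; any entry shows which parents were lost or gained.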