some TW actions take very long at times. Improve code ?

dmwm / CRABServer

some TW actions take very long at times. Improve code ? #6775

Open belforte opened 2 years ago

belforte commented 2 years ago

Sometimes DBSDataDiscovery or Splitter run for hours. While rare, there is no clear understanding of why, nor any control over things when it happens, so we are exposed to problems if usage patterns change. A recent example with DBSDataDiscovery: it took 5+ hours to associate files with the corresponding secondary dataset parents in this task: 210916_135021:sbaradia_crab_EmulatedTagAndProbe_DYToLL_M-50_112X_mcRun3_2021_realistic_v16-v2_CMSSW_11_2_2_patch1. We should look for ways to speed up the code, or return an error and force saner inputs, or... whatever. But it would be good to be able to put a time limit on every slave action.
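One possible way to put a time limit on an action is sketched below. This is only an illustration, not the TaskWorker's actual mechanism: it assumes the action runs in the main thread of a Unix worker process (SIGALRM is not available otherwise), and the names ActionTimeout, time_limit and run_with_limit are hypothetical.

```python
import signal
import time
from contextlib import contextmanager


class ActionTimeout(Exception):
    """Raised when an action exceeds its time budget (hypothetical name)."""


@contextmanager
def time_limit(seconds):
    """Raise ActionTimeout if the enclosed block runs longer than `seconds`.

    Assumption: called from the main thread of a Unix process.
    """
    def _on_alarm(signum, frame):
        raise ActionTimeout("action exceeded %s seconds" % seconds)

    # Install our handler and remember the previous one so it can be restored.
    previous = signal.signal(signal.SIGALRM, _on_alarm)
    signal.setitimer(signal.ITIMER_REAL, seconds)
    try:
        yield
    finally:
        # Always cancel the timer and restore the old handler.
        signal.setitimer(signal.ITIMER_REAL, 0)
        signal.signal(signal.SIGALRM, previous)


def run_with_limit(action, seconds, *args, **kwargs):
    """Run `action(*args, **kwargs)`, giving up after `seconds`."""
    with time_limit(seconds):
        return action(*args, **kwargs)


if __name__ == "__main__":
    try:
        run_with_limit(time.sleep, 0.2, 5)  # stands in for a slow action
    except ActionTimeout as exc:
        print("gave up:", exc)
```

The caller can then catch ActionTimeout and fail the task with a clear message instead of letting the slave stay busy for hours. A limit enforced from a separate supervising process would be more robust against code stuck in a long C call, at the price of more plumbing.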

[Image: Screenshot from 2021-09-17 10-26-34]

mmascher commented 2 years ago

I checked the logs of the task, this is the good old secondary dataset matching:

https://github.com/dmwm/CRABServer/blob/master/src/python/TaskWorker/Actions/DBSDataDiscovery.py#L435-L442

I even found an old issue where this was reported: https://github.com/dmwm/CRABServer/issues/5244

mmascher commented 2 years ago

Original ticket and PR: https://github.com/dmwm/CRABServer/issues/4861 https://github.com/dmwm/CRABServer/pull/4934 Matthias says the algorithm is N^2, but that does not include this line: lumis & secinfos['lumiobj'].

belforte commented 2 years ago

thanks @mmascher

mmascher commented 2 years ago

IIUC the implementation of the & in lumis & secinfos['lumiobj'] is here:

https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/DataStructs/LumiList.py#L184-L212

It contains three nested for loops. Considering these are nested inside two other for loops I can see how and why things can get pretty bad.

I think things can be improved considering & is trying to build the common runLumiList, but CRAB only needs to know if there is an overlap.
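A minimal sketch of an overlap-only check, which stops at the first common lumi instead of building the full intersection. It works on plain dicts of the form {run: [[firstLumi, lastLumi], ...]} and assumes the ranges for each run are sorted, inclusive and non-overlapping; the function names are hypothetical and nothing here is taken from the LumiList code.

```python
def ranges_overlap(ranges_a, ranges_b):
    """Return True if two sorted lists of inclusive [first, last] ranges
    share at least one lumi. Two-pointer walk, linear in the list lengths."""
    i = j = 0
    while i < len(ranges_a) and j < len(ranges_b):
        a_first, a_last = ranges_a[i]
        b_first, b_last = ranges_b[j]
        if a_first <= b_last and b_first <= a_last:
            return True  # first common lumi found, no need to go on
        # Drop the range that ends first: it cannot overlap anything later.
        if a_last < b_last:
            i += 1
        else:
            j += 1
    return False


def lumis_overlap(compact_a, compact_b):
    """Return True if the two {run: ranges} dicts share any run/lumi."""
    for run in compact_a.keys() & compact_b.keys():
        if ranges_overlap(compact_a[run], compact_b[run]):
            return True
    return False


if __name__ == "__main__":
    primary = {"1": [[1, 10], [20, 30]], "2": [[5, 5]]}
    secondary = {"1": [[11, 19], [30, 40]]}
    print(lumis_overlap(primary, secondary))  # lumi 30 of run 1 is shared
```

This only reduces the cost of each primary/secondary file comparison. The two outer loops over files are still there, so indexing secondary files by run beforehand would be a separate improvement.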

belforte commented 2 years ago

yeah.. I am quite confident that a leaner algorithm can be identified. But so far this is quite rare. I am a bit more puzzled by Splitter taking a long time, but it is possible that it will not happen anymore after #6691