CMSCompOps / WmAgentScripts

CMS Workflow Team Scripts

What do we lack without job overflow? #701

Open haozturk opened 3 years ago

haozturk commented 3 years ago

The Unified job overflow mechanism was disabled 3 weeks ago, and since then we have been trying to understand what we lack without it. We did not see any under-utilization of resources. On the contrary, last week production was at a very high level (213k cores on average). However, resource utilization depends on many things, and it is hard to assess the importance of the overflow mechanism by looking only at that. Therefore, the question is: where should we look to understand what we miss without overflow? The answer will determine the priority of its replacement in WMCore.

Here is a log of the module from the time when overflow was enabled: last.log

@drkovalskyi @z4027163 @amaltaro @todor-ivanov Any comment is very much appreciated.

haozturk commented 3 years ago

If anyone remembers why it was developed in the first place, that would also help us. In other words, what problem in the past led to this logic being implemented?

Offloading a busy site's jobs to empty neighboring sites makes sense, but we need monitoring that shows how much we gained from this functionality in practice.

haozturk commented 3 years ago

An example task on which overflow was applied:

[INFO]: Task TOP-RunIIFall17DRPremix-00747_0 has 96 running, 97 idle. Action needed: True
secondary is at [u'T1_US_FNAL_Disk', u'T2_CH_CERN']
1
[u'T1_ES_PIC', u'T1_FR_CCIN2P3', u'T1_IT_CNAF', u'T2_EE_Estonia', u'T2_ES_CIEMAT', u'T2_FR_CCIN2P3', u'T2_FR_GRIF_IRFU', u'T2_FR_GRIF_LLR', u'T2_FR_IPHC', u'T2_IT_Bari', u'T2_IT_Legnaro', u'T2_IT_Pisa', u'T2_IT_Rome', u'T2_US_UCSD', u'T3_FR_IPNL'] around primary location [u'T2_EE_Estonia', u'T2_ES_CIEMAT', u'T2_FR_CCIN2P3', u'T2_FR_GRIF_IRFU', u'T2_FR_IPHC', u'T2_IT_Legnaro', u'T2_IT_Rome', u'T2_US_UCSD']
[u'T0_CH_CERN', u'T1_DE_KIT', u'T1_ES_PIC', u'T1_FR_CCIN2P3', u'T1_IT_CNAF', u'T1_RU_JINR', u'T1_UK_RAL', u'T1_US_FNAL', u'T2_BE_IIHE', u'T2_BE_UCL', u'T2_CH_CERN', u'T2_CH_CERN_HLT', u'T2_CH_CSCS', u'T2_DE_DESY', u'T2_DE_RWTH', u'T2_ES_CIEMAT', u'T2_FR_CCIN2P3', u'T2_FR_GRIF_IRFU', u'T2_FR_GRIF_LLR', u'T2_FR_IPHC', u'T2_IT_Bari', u'T2_IT_Legnaro', u'T2_IT_Pisa', u'T2_IT_Rome', u'T2_UK_London_Brunel', u'T2_UK_London_IC', u'T2_UK_SGrid_Bristol', u'T2_UK_SGrid_RALPP', u'T2_US_Caltech', u'T2_US_MIT', u'T2_US_Nebraska', u'T2_US_Purdue', u'T2_US_Vanderbilt', u'T2_US_Wisconsin', u'T3_CH_CERN_HelixNebula', u'T3_CH_CERN_HelixNebula_REHA', u'T3_FR_IPNL', u'T3_UK_London_RHUL', u'T3_UK_SGrid_Oxford', u'T3_US_Baylor', u'T3_US_Colorado', u'T3_US_NERSC', u'T3_US_OSG', u'T3_US_PSC', u'T3_US_Rutgers', u'T3_US_SDSC', u'T3_US_TACC'] aroudn secondary location [u'T1_US_FNAL', u'T2_CH_CERN']
[u'T1_ES_PIC', u'T1_FR_CCIN2P3', u'T1_IT_CNAF', u'T2_ES_CIEMAT', u'T2_FR_CCIN2P3', u'T2_FR_GRIF_IRFU', u'T2_FR_GRIF_LLR', u'T2_FR_IPHC', u'T2_IT_Bari', u'T2_IT_Legnaro', u'T2_IT_Pisa', u'T2_IT_Rome', u'T3_FR_IPNL'] for premix
[u'T1_ES_PIC', u'T1_FR_CCIN2P3', u'T1_IT_CNAF', u'T2_ES_CIEMAT', u'T2_FR_CCIN2P3', u'T2_FR_GRIF_IRFU', u'T2_FR_GRIF_LLR', u'T2_FR_IPHC', u'T2_IT_Bari', u'T2_IT_Legnaro', u'T2_IT_Pisa', u'T2_IT_Rome', u'T3_FR_IPNL'] that are ready
Extending site whitelist for /cmsunified_task_TOP-RunIIFall17DRPremix-00747__v1_T_200917_084000_6778/TOP-RunIIFall17DRPremix-00747_0 to [u'T1_ES_PIC', u'T1_FR_CCIN2P3', u'T1_IT_CNAF', u'T2_ES_CIEMAT', u'T2_FR_CCIN2P3', u'T2_FR_GRIF_IRFU', u'T2_FR_GRIF_LLR', u'T2_FR_IPHC', u'T2_IT_Bari', u'T2_IT_Legnaro', u'T2_IT_Pisa', u'T2_IT_Rome', u'T3_FR_IPNL'] due to PREMIX_overflow
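
The site lists in this log are consistent with a simple set intersection: the sites "for premix" are exactly those that appear both around the primary location and around the secondary location. Below is a minimal sketch of that reading, not the actual equalizor code. The function name `premix_overflow_sites` and the `ready_sites` argument are hypothetical, the secondary list is abridged from the log, and the assumption that every candidate was "ready" is inferred only from the two identical lists in the log.

```python
def premix_overflow_sites(primary_neighbors, secondary_neighbors, ready_sites):
    """Sites reachable from both inputs, restricted to sites marked ready."""
    candidates = set(primary_neighbors) & set(secondary_neighbors)
    return sorted(candidates & set(ready_sites))


# Sites "around primary location", copied from the log.
primary_neighbors = [
    "T1_ES_PIC", "T1_FR_CCIN2P3", "T1_IT_CNAF", "T2_EE_Estonia",
    "T2_ES_CIEMAT", "T2_FR_CCIN2P3", "T2_FR_GRIF_IRFU", "T2_FR_GRIF_LLR",
    "T2_FR_IPHC", "T2_IT_Bari", "T2_IT_Legnaro", "T2_IT_Pisa",
    "T2_IT_Rome", "T2_US_UCSD", "T3_FR_IPNL",
]

# Sites "around secondary location", ABRIDGED from the log: only the entries
# shared with the primary list plus a few of the others are kept here.
secondary_neighbors = [
    "T0_CH_CERN", "T1_DE_KIT", "T1_ES_PIC", "T1_FR_CCIN2P3", "T1_IT_CNAF",
    "T1_US_FNAL", "T2_CH_CERN", "T2_ES_CIEMAT", "T2_FR_CCIN2P3",
    "T2_FR_GRIF_IRFU", "T2_FR_GRIF_LLR", "T2_FR_IPHC", "T2_IT_Bari",
    "T2_IT_Legnaro", "T2_IT_Pisa", "T2_IT_Rome", "T3_FR_IPNL",
]

# Assumption: every candidate site was ready, as the log suggests.
ready_sites = set(primary_neighbors) | set(secondary_neighbors)

overflow_sites = premix_overflow_sites(
    primary_neighbors, secondary_neighbors, ready_sites
)
print(len(overflow_sites), overflow_sites)
```

Under this reading, T2_EE_Estonia and T2_US_UCSD drop out because they are not in the list around the secondary location.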

Kibana monitoring: https://monit-kibana.cern.ch/kibana/goto/d8afae5ba753a8b31abd39baba4fb5b5 I cannot see a difference in the sites executing this task before and after the overflow.

vlimant commented 3 years ago

FYI: No one writes a thousand lines of code and complex logic "just for fun". Most of it came from observations by Ops, reports to WMCore developers, and dead ends in taking any action other than using JobRouter via Unified. We could dig through the archive of all WMCore issues dating from 4-5 years ago, but THAT would be just for fun.

Resource utilization has always been ~great; that is not the focus. Overflow is meant to shorten the delivery time and to unstick some workflows in difficult situations. The metric is "outliers in completion time" (those samples that people will bite you in the ass about, no matter how fast the rest of production goes...). Job overflow mostly gives a handle for changing the course of a workflow post-assignment, which does not exist in wmagent. One would look at workflow progression, identify a high-priority workflow that is progressing slowly, and realize that it could also run at a neighboring site. You say "include it in the whitelist in the first place"? It is not always efficient to do so upfront; it is only needed when you have to buy "time to completion" (what users want) with "computing efficiency" (what funding agencies want).
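
One possible way to look at "outliers in completion time" is to flag workflows whose completion time sits far above the typical one. The sketch below uses median plus a multiple of the median absolute deviation (MAD). This is only an illustration: the function name, the threshold factor `k`, and the sample durations are all hypothetical, and nothing here comes from Unified or WMCore.

```python
from statistics import median


def completion_time_outliers(durations_days, k=3.0):
    """Return workflows whose completion time exceeds median + k * MAD."""
    values = list(durations_days.values())
    med = median(values)
    # Median absolute deviation: a spread estimate that the outliers
    # themselves barely influence.
    mad = median(abs(v - med) for v in values)
    threshold = med + k * mad
    return {name: d for name, d in durations_days.items() if d > threshold}


# Hypothetical completion times in days, keyed by made-up workflow names.
sample = {"wf_a": 5, "wf_b": 6, "wf_c": 7, "wf_d": 6, "wf_e": 30}
slow = completion_time_outliers(sample)
print(slow)
```

Comparing the number and size of such outliers between periods with and without overflow would be one way to quantify what is being missed. Note that if the MAD is zero, every workflow above the median is flagged, so real data may need a minimum threshold.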

The various strategies were:

PRIM: a workflow would get assigned to the one site holding the input dataset (with trustsitelist=false). The dataset gets replicated to other sites in the meantime to speed things up. wmagent has no means of including the new sites (and AAA neighbors) in the site whitelist; equalizor does.

PU: very similar. The minbias (MB) is at one site only and many workflows get assigned to it (while the MB gets replicated to other sites), creating a bottleneck that only kill/clone would let you solve. equalizor would let the site whitelist grow (modulo some interference from how the MB file list is handled in wmagent; @amaltaro can confirm it was made more dynamic).

LHE: a taskchain containing a classical PU digital would get assigned only to the sites holding the MB, while the root task (LHE) could in principle run anywhere on earth. wmagent has no chance of changing that (@amaltaro, is a site whitelist per task possible?), whereas equalizor would overflow the LHE step to "anywhere on earth", leaving the site whitelist for the rest of the taskchain untouched and functioning.

PREMIX: the site whitelist of a digi job within a taskchain would be reduced to where the "sim" file was created, and would not benefit from the fact that both the secondary and the "sim" file can be read via xrootd. equalizor extends the site whitelist (modulo the presence of an input dataset; see the code for details, I'm too rusty on this).
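
The four rules above can be summarized as a toy model that extends a per-task site whitelist. This is a sketch of the description only, not the equalizor implementation: the function `extend_whitelist`, its parameters, and the example taskchain are all hypothetical, and the real code has extra conditions (for example, the input dataset check mentioned for PREMIX) that are left out here.

```python
# Toy model of the four overflow rules; NOT the actual equalizor code.
# All function and parameter names here are hypothetical.

def extend_whitelist(rule, whitelist, replica_sites=(), aaa_neighbors=(),
                     all_sites=()):
    """Return the extended site whitelist for one task under a given rule."""
    current = set(whitelist)
    if rule in ("PRIM", "PU"):
        # The input (or minbias) was replicated after assignment: add the
        # new hosting sites and their AAA neighbors.
        extended = current | set(replica_sites) | set(aaa_neighbors)
    elif rule == "LHE":
        # The root LHE task is not tied to the minbias location, so it may
        # run at any site.
        extended = current | set(all_sites)
    elif rule == "PREMIX":
        # Both inputs are readable via xrootd: add neighboring sites.
        extended = current | set(aaa_neighbors)
    else:
        raise ValueError("unknown overflow rule: %s" % rule)
    return sorted(extended)


# Hypothetical taskchain: only the LHE root task is widened, the whitelist
# of the rest of the chain stays untouched.
taskchain = {
    "LHE_task": ["T2_CH_CERN"],
    "DIGI_task": ["T2_CH_CERN"],
}
all_sites = ["T1_US_FNAL", "T2_CH_CERN", "T2_DE_DESY"]
taskchain["LHE_task"] = extend_whitelist(
    "LHE", taskchain["LHE_task"], all_sites=all_sites
)
print(taskchain)
```

The point the model makes is that each rule acts on a single task after assignment, which is the handle described above as missing in wmagent.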

I might find the strength and time to identify workflows that are currently in a "dead end" situation that would be solved by the rules above. One will find them among the exceptions, not in the bulk.