dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Incorrect TotalEstimatedJobs Calculation #12065

Open hassan11196 opened 2 months ago

hassan11196 commented 2 months ago

Impact of the bug: ReqMgr2

Describe the bug: The TotalEstimatedJobs value is inaccurate and does not always match the actual number of jobs created by the agents.

For example:

For this ReReco ACDC1 workflow, TotalEstimatedJobs is 1,084,175 (about 1 million jobs), while the actual number of jobs created is ~14K, as shown in wmstats.

The created job count also roughly matches the number of failed jobs for ACDC0 of this request, i.e. ~13K jobs. However, the splitting was modified by 2x when the ACDC was created.

TLDR

ACDC1: TotalEstimatedJobs ~1 million, while actual jobs created ~14K.
ACDC0: 13K failed jobs and TotalEstimatedJobs 380K, while actual jobs created ~18K.
Original workflow: 15K failed jobs and TotalEstimatedJobs 53K, while actual jobs created ~37K.

TotalEstimatedJobs does not match actual number of jobs created.
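
To put the gap in numbers, here is a minimal Python sketch that divides each estimate by the created job count. It uses only the rounded figures quoted in this report. The dictionary layout and the `overestimate_factor` helper are illustrative assumptions, not part of WMCore or ReqMgr2.

```python
# Rounded figures quoted in this report; the data layout is hypothetical.
workflows = {
    "Original": {"estimated": 53_000, "created": 37_000},
    "ACDC0": {"estimated": 380_000, "created": 18_000},
    "ACDC1": {"estimated": 1_084_175, "created": 14_000},
}


def overestimate_factor(estimated, created):
    """Return how many times larger the estimate is than the created job count."""
    if created <= 0:
        raise ValueError("created must be positive")
    return estimated / created


factors = {
    name: overestimate_factor(counts["estimated"], counts["created"])
    for name, counts in workflows.items()
}

for name, factor in factors.items():
    print(f"{name}: TotalEstimatedJobs is about {factor:.1f}x the created jobs")
```

With these inputs the estimate is roughly 1.4x the created jobs for the original workflow, about 21x for ACDC0, and about 77x for ACDC1, so the gap widens with each ACDC round.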

How to reproduce it: Not sure

Expected behavior: TotalEstimatedJobs should match the actual number of jobs created.

Additional context and error message: I found this discrepancy while implementing protection in Unified for ACDCs with a large number of failed jobs.

FYI @amaltaro @haozturk @anpicci

hassan11196 commented 1 month ago

I noticed a very high number of jobs created by the cmsgwms-submit13 agent today, roughly 100K more jobs than usual.

I got the list of created jobs grouped by workflow and found these two ReReco workflows with absurdly high job counts.

| name                                                                                              | count(wmbs_job.id) |
+---------------------------------------------------------------------------------------------------+--------------------+
| cmsunified_ACDC0_Run2024F_JetMET0_ECAL_CC_HCAL_DI_240924_052814_4418                              |              69473 |
| cmsunified_ACDC0_Run2024F_JetMET1_ECAL_CC_HCAL_DI_240924_052230_5477                              |              50666 |

These 2 ACDC workflows created more than 50K jobs each.

When submitting the ACDC for wf1, we expected the number of jobs to be 11584 (based on the number of failed jobs), but the estimated job count in ReqMgr is more than 100K ("TotalEstimatedJobs": 111755), which is about 10x the expected amount.

The second ACDC is similar: when submitting the ACDC for wf2, we expected the number of jobs to be 11604 (based on the number of failed jobs), but the estimated job count in ReqMgr is more than 100K ("TotalEstimatedJobs": 111813), which is about 10x the expected amount.

This seems to be a different issue from the one described above, since it is a discrepancy between failed jobs and TotalEstimatedJobs. Let me know if you think I should open a separate issue for it.

I would appreciate some feedback on why the number of jobs for ACDCs can essentially "blow up" and end up creating such a high number of jobs in the agent. How can we (P&R) predict the expected number of jobs more accurately?

As far as I know, agents are not designed to handle such a high number of created jobs for a single workflow, as it takes significant disk space.

From P&R's side, these workflows had an extremely high failure rate (97%), so ideally we should have killed and cloned them with higher memory instead of submitting ACDCs. We are making improvements in our systems to catch these kinds of scenarios.
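
A check of this kind could look like the Python sketch below. It is only an illustration: the function name `recovery_action`, the returned labels, and the 0.9 default threshold are all hypothetical and are not taken from Unified or WMCore.

```python
def recovery_action(failed_jobs, total_jobs, max_failure_rate=0.9):
    """Pick a recovery strategy from the failure rate of a workflow.

    The threshold and the returned labels are hypothetical placeholders.
    """
    if total_jobs <= 0:
        raise ValueError("total_jobs must be positive")
    failure_rate = failed_jobs / total_jobs
    # Above the threshold an ACDC would redo almost the whole workflow,
    # so cloning with adjusted resources is assumed to be the better option.
    if failure_rate >= max_failure_rate:
        return "kill-and-clone"
    return "acdc"


# A 97% failure rate, as quoted above, crosses the assumed threshold.
print(recovery_action(failed_jobs=97, total_jobs=100))
```

The threshold is left as a parameter because the appropriate cut-off is a policy decision for P&R rather than something this thread defines.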

FYI @amaltaro @anpicci

amaltaro commented 1 month ago

@hassan11196 looking into the job splitting for this ACDC workflow: https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_ACDC0_Run2024F_JetMET0_ECAL_CC_HCAL_DI_240924_052814_4418

I see:

      "algorithm": "EventAwareLumiBased",
      "events_per_job": 2880,

while its original workflow: https://cmsweb.cern.ch/reqmgr2/fetch?rid=pdmvserv_Run2024F_JetMET0_ECAL_CC_HCAL_DI_240830_141234_9115

has the following job splitting:

      "algorithm": "EventAwareLumiBased",
      "events_per_job": 28800,

To me, it is working as expected and I don't see any problem with that. Given that the ACDC workflow has a job splitting 10x smaller (events_per_job of 2880 instead of 28800), it is expected to create about 10x more jobs. Am I missing anything?
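
The scaling argument above can be sketched in Python as a back-of-the-envelope estimate. The helper `expected_acdc_jobs` is hypothetical and is not how ReqMgr2 computes TotalEstimatedJobs. It assumes every failed job is re-split evenly by the ratio of the two `events_per_job` values, which EventAwareLumiBased splitting only approximates because jobs are built from whole lumi sections.

```python
import math


def expected_acdc_jobs(failed_jobs, original_events_per_job, acdc_events_per_job):
    """Rough ACDC job count: failed jobs scaled by the splitting ratio.

    Hypothetical helper; assumes an even re-split of every failed job.
    """
    if acdc_events_per_job <= 0:
        raise ValueError("acdc_events_per_job must be positive")
    ratio = original_events_per_job / acdc_events_per_job
    return math.ceil(failed_jobs * ratio)


# 11584 failed jobs (quoted earlier in this thread), 28800 -> 2880 events per job.
estimate = expected_acdc_jobs(11584, 28800, 2880)
print(estimate)
```

For these inputs the sketch gives 115840, the same order of magnitude as the reported "TotalEstimatedJobs": 111755.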

hassan11196 commented 1 month ago

Ah, thank you for pointing that out, Alan. The splitting was indeed made 10x finer (events_per_job went from 28800 to 2880), which explains the increase in jobs. I will look into including this in our calculations before ACDC submission.

Sorry for the noise, and thank you again for looking into it.