Open hassan11196 opened 2 months ago
I noticed a very high number of jobs created by the cmsgwms-submit13 agent today, roughly 100K more than usual.
I listed the created jobs grouped by workflow and found these two ReReco workflows with absurdly high job counts:
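+----------------------------------------------------------------------+--------------------+
| name                                                                 | count(wmbs_job.id) |
+----------------------------------------------------------------------+--------------------+
| cmsunified_ACDC0_Run2024F_JetMET0_ECAL_CC_HCAL_DI_240924_052814_4418 |              69473 |
| cmsunified_ACDC0_Run2024F_JetMET1_ECAL_CC_HCAL_DI_240924_052230_5477 |              50666 |
+----------------------------------------------------------------------+--------------------+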
These two ACDC workflows created more than 50K jobs each.
When submitting the ACDC for wf1, we expected about 11584 jobs (based on the number of failed jobs), but the estimated job count in ReqMgr is more than 100K ("TotalEstimatedJobs": 111755), roughly 10x the expected amount.
The second ACDC is similar: when submitting the ACDC for wf2, we expected about 11604 jobs (based on the number of failed jobs), but the estimated job count in ReqMgr is again more than 100K ("TotalEstimatedJobs": 111813), roughly 10x the expected amount.
This seems to be a different issue from the one originally reported in this issue, since it is a discrepancy between the number of failed jobs and TotalEstimatedJobs. Let me know if you think I should open a separate issue for this.
I would appreciate some feedback on why the number of jobs for ACDCs can essentially "blow up" and end up creating such a high number of jobs in the agent. How can we (P&R) predict the expected number of jobs more accurately?
As far as I know, agents are not designed to handle such a high number of created jobs for a single workflow, as it takes significant disk space.
From P&R's side, these workflows had an extremely high failure rate (97%), so ideally we should have killed and cloned them with higher memory instead of using ACDC. We are improving our systems to catch these kinds of scenarios.
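A check along these lines could look like the sketch below. It is only an illustration: `recovery_action` is a hypothetical helper, not part of Unified or WMCore, and the 0.9 threshold is an assumed value rather than an agreed P&R policy.

```python
def recovery_action(failed_jobs, total_jobs, max_failure_rate=0.9):
    """Suggest a recovery strategy for a workflow.

    Hypothetical policy: the 0.9 default threshold is an assumption,
    not a value taken from Unified or any P&R procedure.
    """
    if total_jobs <= 0:
        raise ValueError("total_jobs must be positive")
    failure_rate = failed_jobs / total_jobs
    # Above the threshold, almost everything failed, so an ACDC would
    # re-create nearly the whole workflow; prefer kill-and-clone instead.
    if failure_rate >= max_failure_rate:
        return "kill-and-clone"
    return "acdc"


# A 97% failure rate, as reported for these workflows.
print(recovery_action(97, 100))
```

With the assumed threshold, a 97% failure rate is routed to kill-and-clone, while a workflow with only a few failures would still go through ACDC.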
FYI @amaltaro @anpicci
@hassan11196 looking into the job splitting for this ACDC workflow: https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_ACDC0_Run2024F_JetMET0_ECAL_CC_HCAL_DI_240924_052814_4418
I see:
"algorithm": "EventAwareLumiBased",
"events_per_job": 2880,
while its original workflow: https://cmsweb.cern.ch/reqmgr2/fetch?rid=pdmvserv_Run2024F_JetMET0_ECAL_CC_HCAL_DI_240830_141234_9115
has the following job splitting:
"algorithm": "EventAwareLumiBased",
"events_per_job": 28800,
To me, it is working as expected and I don't see any problem with that. Given that the ACDC workflow has an events_per_job 10x smaller (2880 vs. 28800), it is expected to create 10x more jobs. Am I missing anything?
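The arithmetic can be sketched as follows. This is a rough estimate only: it assumes the ACDC job count scales with the ratio of the two `events_per_job` values, while EventAwareLumiBased splits along lumi boundaries, so the real count will differ somewhat. `estimate_acdc_jobs` is a hypothetical helper, not a WMCore or ReqMgr2 function.

```python
import math


def estimate_acdc_jobs(failed_jobs, original_events_per_job, acdc_events_per_job):
    """Rough ACDC job-count estimate (hypothetical helper, not a WMCore API).

    Assumes each failed job is re-split in proportion to the change in
    events_per_job; lumi boundaries are ignored.
    """
    if acdc_events_per_job <= 0:
        raise ValueError("acdc_events_per_job must be positive")
    return math.ceil(failed_jobs * original_events_per_job / acdc_events_per_job)


# Numbers quoted in this thread for the JetMET0 ACDC:
# 11584 failed jobs, events_per_job changed from 28800 to 2880.
print(estimate_acdc_jobs(11584, 28800, 2880))  # 115840
```

The result, 115840, is of the same order as the reported "TotalEstimatedJobs": 111755 for that workflow.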
Ah, thank you for pointing that out, Alan. The splitting was indeed made 10x finer (events_per_job reduced by 10x), which explains the increase in jobs. I will look into including this in our calculations before ACDC submission.
Sorry for the noise, and thank you again for looking into it.
Impact of the bug
ReqMgr2

Describe the bug
The TotalEstimatedJobs calculation is not accurate and does not always correlate with the actual number of jobs created by the agents. For example:
For this ReReco ACDC1 workflow, TotalEstimatedJobs is 1,084,175 (about 1 million jobs), while the actual number of jobs created is ~14K, as seen in wmstats. This number also correlates with the number of failed jobs for ACDC0 of this request, i.e. ~13K jobs. However, when creating the ACDC, the splitting was modified to 2x.
TLDR
ACDC1: TotalEstimatedJobs ~= 1 million jobs, while actual jobs created ~14K.
ACDC0: 13K failed jobs and a TotalEstimatedJobs of 380K jobs, while actual jobs created ~18K.
Original workflow: 15K failed jobs and a TotalEstimatedJobs of 53K jobs, while actual jobs created ~37K.
TotalEstimatedJobs does not match the actual number of jobs created.

How to reproduce it
Not sure

Expected behavior
TotalEstimatedJobs should match the actual number of jobs created.

Additional context and error message
I found this discrepancy while implementing protection in Unified for ACDCs with a large number of failed jobs.
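
To make the size of the discrepancy concrete, the ratio of estimated to created jobs can be computed from the approximate figures quoted in this issue. This is a minimal sketch; `overestimate_factor` is an illustrative helper, and the created-job counts are the rounded values given above, not exact numbers from wmstats.

```python
# (TotalEstimatedJobs, approximate jobs actually created), as quoted in this issue.
workflows = {
    "original": (53_000, 37_000),
    "ACDC0": (380_000, 18_000),
    "ACDC1": (1_084_175, 14_000),
}


def overestimate_factor(estimated, created):
    """Ratio of estimated jobs to jobs actually created (illustrative helper)."""
    if created <= 0:
        raise ValueError("created must be positive")
    return estimated / created


for name, (estimated, created) in workflows.items():
    factor = overestimate_factor(estimated, created)
    print(f"{name}: estimate is {factor:.1f}x the created jobs")
```

With these figures the estimate is roughly 1.4x the created jobs for the original workflow, 21x for ACDC0, and 77x for ACDC1, so the gap grows with each ACDC round.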
FYI @amaltaro @haozturk @anpicci