Checking running jobs, I see that there is either one LogCollect or Merge job dangling for days. I suspect again this might be related to the high number of requests currently in the system, but anyhow, if the output is 100% produced, why are we waiting for a LogCollect or Merge job?
I agree that waiting for a LogCollect job is not necessary. However, I think the merge jobs need to be complete before the output is usable. Keeping in mind that I'm not an expert, my understanding was that merge jobs take "too small" output files from individual jobs and turn them into "large enough" output files for the DM system by combining outputs together. My understanding (again possibly incomplete) is that merge jobs take files from /store/unmerged, combine them, and then write them to /store/mc or /store/data, etc. So, I think skipping that merge job would mean losing the output that job is responsible for merging. Does that make sense?
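To illustrate the idea (a toy sketch of the concept only, not WMCore's actual merge logic; the size threshold is a made-up assumption):

# Toy sketch: pack "too small" unmerged output files into groups that are
# "large enough" for the data management system.
MIN_MERGED_SIZE = 2 * 1024**3   # assume a ~2 GB target per merged file

def plan_merge_groups(unmerged_files, min_size=MIN_MERGED_SIZE):
    """unmerged_files: list of (lfn, size_bytes) under /store/unmerged.
    Returns lists of LFNs, each list intended to become one merged file."""
    groups, current, current_size = [], [], 0
    for lfn, size in unmerged_files:
        current.append(lfn)
        current_size += size
        if current_size >= min_size:
            groups.append(current)
            current, current_size = [], 0
    if current:   # leftover small files still need their own merge job
        groups.append(current)
    return groups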
Do we see any indication of why these merge jobs are not running? Are they just getting crowded out by other work? I'm not sure how prioritization with merge jobs works.
Hi,
(just an observation, and I might be wrong) Not sure if this is related, but as far as I recall, JobPrio for Merge/Cleanup/LogCollect jobs used to always be a very high value so that they run faster than anything else. Looking at the global pool right now, it seems some of the Merge/Cleanup/LogCollect jobs have a lower priority than Production/Processing jobs, which might starve them and keep workflows from finishing.
condor_q -const 'JobStatus=?=1' -g -af CMS_JobType JobPrio JobStatus | sort | uniq -c
7131 Analysis 0 1
39454 Analysis 10 1
9079 Analysis 600000 1
1 Analysis 99999 1
4 Cleanup 10110002 1
1 Cleanup 10110004 1
1 Cleanup 10110005 1
3 Cleanup 10111000 1
2 Cleanup 10120002 1
154 Cleanup 110002 1
10 Cleanup 110003 1
7 Cleanup 110004 1
4 Cleanup 110005 1
1 Cleanup 110006 1
12 Cleanup 110007 1
1 Cleanup 110009 1
1 Cleanup 110111 1
11 Cleanup 111001 1
11 Cleanup 111002 1
12 Cleanup 111102 1
11 Cleanup 111201 1
1 Cleanup 111301 1
3 Cleanup 112005 1
5 Cleanup 112201 1
7 Cleanup 120001 1
1 Cleanup 120002 1
1 Cleanup 125001 1
3 Cleanup 126001 1
6 Cleanup 150001 1
2 Cleanup 200001 1
1 Cleanup 85028 1
2 Cleanup 90007 1
40 Express 260000 1
1 Harvesting 260005 1
1 LogCollect 110002 1
8 LogCollect 110003 1
10 LogCollect 110004 1
16 LogCollect 110005 1
14 LogCollect 110006 1
28 LogCollect 110008 1
2 LogCollect 110009 1
3 LogCollect 110010 1
1 LogCollect 110087 1
18 LogCollect 111002 1
33 LogCollect 111202 1
2 LogCollect 111302 1
1 LogCollect 112002 1
3 LogCollect 112006 1
1 LogCollect 120003 1
1 LogCollect 120004 1
1 LogCollect 126002 1
2 LogCollect 150002 1
2 LogCollect 200002 1
14 LogCollect 20110002 1
2 LogCollect 20110004 1
2 LogCollect 20110006 1
12 LogCollect 20111000 1
4 LogCollect 20120002 1
46 LogCollect 251002 1
1 LogCollect 63002 1
1 LogCollect 90008 1
29 Merge 110004 1
36 Merge 110005 1
3 Merge 110007 1
7 Merge 110008 1
2 Merge 110009 1
7 Merge 110010 1
1 Merge 110011 1
2 Merge 110012 1
1 Merge 110017 1
2 Merge 110089 1
9 Merge 111004 1
1 Merge 112004 1
1 Merge 112204 1
177 Merge 113009 1
1 Merge 120005 1
1 Merge 200004 1
15 Merge 251004 1
2 Merge 260004 1
1 Merge 40110003 1
2 Merge 40110004 1
3 Merge 40110005 1
2 Merge 40112000 1
1 Merge 85025 1
241 Processing 110000 1
5092 Processing 110001 1
29142 Processing 110002 1
516 Processing 110003 1
232 Processing 110004 1
39 Processing 110005 1
99 Processing 110006 1
7 Processing 110007 1
22 Processing 110008 1
22 Processing 110009 1
550 Processing 110013 1
5 Processing 110014 1
12 Processing 110015 1
245 Processing 110080 1
157 Processing 110085 1
709 Processing 110110 1
9 Processing 110115 1
30369 Processing 111000 1
1788 Processing 111200 1
3 Processing 111300 1
1058 Processing 112000 1
3 Processing 112200 1
2134 Processing 113000 1
45 Processing 113001 1
113 Processing 125000 1
1388 Processing 150000 1
1 Processing 200000 1
1 Processing 240004 1
3 Processing 251000 1
1 Processing 440012 1
7 Processing 500000 1
13 Processing 85017 1
80 Processing 85020 1
590 Processing 85023 1
705 Processing 85025 1
4166 Processing 85026 1
47 Processing 85027 1
2 Processing 90005 1
148 Processing 90006 1
16545 Production 110000 1
96758 Production 110001 1
16706 Production 110002 1
7227 Production 113000 1
7185 Production 113001 1
148 Production 113006 1
2 Production 300000 1
475 Production 85018 1
25 Production 85020 1
382 Production 85022 1
10519 Production 85023 1
13874 Production 85024 1
5618 Production 85025 1
35336 Production 85026 1
32 undefined 0 1
Also, on gwmsmon it looks like something changed a few weeks ago and the priority distribution for most of the jobs now falls into one range (~110k).
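For reference, a roughly equivalent summary can be produced with the HTCondor Python bindings (a sketch; it assumes the htcondor module and access to the schedd):

# Sketch: count idle jobs per (CMS_JobType, JobPrio), similar to the condor_q above.
from collections import Counter
import htcondor

schedd = htcondor.Schedd()   # local schedd, e.g. on a submit host
counts = Counter()
for ad in schedd.query(constraint='JobStatus =?= 1',
                       projection=['CMS_JobType', 'JobPrio']):
    counts[(ad.get('CMS_JobType', 'undefined'), ad.get('JobPrio', 0))] += 1

for (job_type, prio), n in sorted(counts.items()):
    print(f"{n:7d} {job_type} {prio}")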
Keeping in mind that I'm not an expert either, according to my understanding, if an output is 100% produced then all merge jobs should already have completed successfully. So, if I am not wrong (probably I am), there should not be any running merge jobs at all if we have the complete output. On the other hand, it makes sense to me to have LogCollect jobs at this stage, but they should not hold us back this long.
Checking one of the requests w/ condor_q, I do not see any job at all, though:
$ condor_q -constr 'WMAgent_RequestName =?= "pdmvserv_task_EXO-RunIISummer20UL16MiniAODAPVv2-00155__v1_T_210602_131151_4999"'
-- Schedd: vocms0284.cern.ch : <137.138.62.223:4080?... @ 07/14/21 10:57:19
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
Total for query: 0 jobs; 0 completed, 0 removed, 0 idle, 0 running, 0 held, 0 suspended
Total for all users: 71809 jobs; 0 completed, 330 removed, 68624 idle, 1263 running, 1592 held, 0 suspended
gwmsmon shows 1 LogCollect: https://cms-gwmsmon.cern.ch/prodview/pdmvserv_task_EXO-RunIISummer20UL16MiniAODAPVv2-00155__v1_T_210602_131151_4999
Not sure which to believe.
Just to comment on gwmsmon - it is correct, let me explain. Filter for the workflow here: https://cms-gwmsmon.cern.ch/prodview - if the workflow is not in the list, it is not in the system. If you go directly to https://cms-gwmsmon.cern.ch/prodview/pdmvserv_task_EXO-RunIISummer20UL16MiniAODAPVv2-00155__v1_T_210602_131151_4999 you can see on the 'last week' diagram that it was running on Friday, and after that the graph is empty, meaning no jobs are idle/running in the system (otherwise there would be red/blue lines). Also note the last update date: '7/9/2021, 11:49:28 PM'. I agree that the Subtask section still showing a job is confusing; with a small change this could be fixed in gwmsmon so it does not show jobs that are no longer in the system. gwmsmon keeps this historical information on individual workflow pages for 30 days, which means that if no update comes for this workflow by 8/8/2021, it will delete that history page for good.
So we have potentially two issues here?
a) Requests whose outputs are 100% produced are still in running-closed, the original issue
b) what Justas discovered: auxiliary jobs having lower priorities than other production work in the system. If this is the case, this should never happen, IMO, as it would apparently block workflows from getting done. Who should be correcting this, @amaltaro?
Before we start debugging it, I'd suggest to first figure out whether it's a general issue or likely related to a subset of agents and/or workflows. So, the first question is: are other requests getting into the completed status? According to Grafana monitoring, there seem to be requests going into completed status, so it must be something related to specific agents.
The running-closed to completed transition is likely performed by this ReqMgr2 CherryPy thread: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/ReqMgr/CherryPyThreads/StatusChangeTasks.py executed in CMSWEB as the reqmgr2-tasks instance.
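For a quick look at which requests are currently sitting in running-closed, something along these lines should work (a sketch; the /reqmgr2/data/request endpoint, the "result" wrapper, and the certificate paths are assumptions to adapt to your setup):

# Sketch: list requests currently in running-closed via the ReqMgr2 REST API.
import requests

url = "https://cmsweb.cern.ch/reqmgr2/data/request"
resp = requests.get(url,
                    params={"status": "running-closed"},
                    cert=("/path/to/usercert.pem", "/path/to/userkey.pem"),  # hypothetical paths
                    verify="/etc/grid-security/certificates",
                    headers={"Accept": "application/json"})
resp.raise_for_status()
# CMSWEB REST services typically wrap their payload in a "result" list (assumption)
for item in resp.json().get("result", []):
    print(item)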
Once you identify a workflow or two to be investigated, I'd suggest to:
1. check whether it still has active workqueue elements; as long as it does, the workflow will stay in running-closed. Why 100% of the output was produced though, this is yet to be figured out then.
2. check the state of its jobs in the agent's WMBS database, with something like:
SELECT wmbs_job_state.name, count(wmbs_job.id)
FROM wmbs_job
INNER JOIN wmbs_jobgroup ON
wmbs_job.jobgroup = wmbs_jobgroup.id
INNER JOIN wmbs_subscription ON
wmbs_jobgroup.subscription = wmbs_subscription.id
INNER JOIN wmbs_sub_types ON
wmbs_subscription.subtype = wmbs_sub_types.id
INNER JOIN wmbs_job_state ON
wmbs_job.state = wmbs_job_state.id
INNER JOIN wmbs_workflow ON
wmbs_subscription.workflow = wmbs_workflow.id
WHERE wmbs_workflow.name = 'pdmvserv_task_HIG-RunIIWinter15wmLHE-00561__v1_T_151207_101950_7247'
GROUP BY wmbs_job_state.name;
3. check for open filesets and unfinished subscriptions for the workflow, with something like:
SELECT wmbs_subscription.id AS subId, wmbs_workflow.task AS wfName, wmbs_workflow.injected AS wfInj, wmbs_fileset.name AS fsetName,
wmbs_fileset.open AS fsetOpen, wmbs_subscription.finished AS subFin, wmbs_sub_types.name AS subType
FROM wmbs_subscription
INNER JOIN wmbs_sub_types ON wmbs_subscription.subtype = wmbs_sub_types.id
INNER JOIN wmbs_workflow ON wmbs_subscription.workflow = wmbs_workflow.id
INNER JOIN wmbs_fileset ON wmbs_subscription.fileset = wmbs_fileset.id
WHERE wmbs_workflow.name='sagarwal_task_TOP-RunIIFall18wmLHEGS-00109__v1_T_181105_200726_3792'
AND wmbs_fileset.open = 1
AND wmbs_subscription.finished = 0
ORDER BY fsetName;
Regarding the auxiliary jobs with lower priority: that should never happen, or at least it should never happen from the agent side. Of course, someone can always ssh to the agent and run a condor_qedit to change those... When someone changes an active workflow's priority, this is seen by JobUpdater, which runs this code: https://github.com/dmwm/WMCore/blob/master/src/python/WMComponent/JobUpdater/JobUpdaterPoller.py#L155-L166
which in the end executes this code: https://github.com/dmwm/WMCore/blob/d66237302581b8d6b6a41295d82bb64c95a5296c/src/python/WMCore/BossAir/Plugins/SimpleCondorPlugin.py#L467
which takes into consideration the job type (task type) and its priority factor. As one can see here: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMBS/CreateWMBSBase.py#L504-L505
Production/Processing should always - no matter the workflow priority - be lower priority than Merge/LogCollect/etc.
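To make the priority calculation described above concrete, here is a toy sketch of such a type-based scheme; the factors and scale are assumptions inferred from the JobPrio values quoted in this thread, not the actual WMCore constants:

# Illustration only: a task-type factor that dominates the workflow priority.
# Assumed multipliers/scale, consistent with e.g. Merge 40110004 = 4*10^7 + 110004.
TASK_TYPE_FACTOR = {
    'Production': 0,
    'Processing': 0,
    'Cleanup': 1,
    'LogCollect': 2,
    'Merge': 4,
}

def job_priority(task_type, request_priority, scale=10_000_000):
    """The type factor scales far above any request priority, so Merge/LogCollect/
    Cleanup jobs always outrank Production/Processing jobs, whatever the workflow."""
    return TASK_TYPE_FACTOR.get(task_type, 0) * scale + request_priority

# A Merge job of a low-priority workflow still beats the highest-priority Production job:
assert job_priority('Merge', 110004) > job_priority('Production', 500000)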
Last but not least, workflow completion does NOT wait for auxiliary jobs to be completed. Only Production/Processing/Merge/Harvesting jobs are considered for completing a workflow or not.
Thanks a lot @amaltaro for the detailed explanation. Before diving into debugging, I made an impact analysis and I see that 210 workflows got affected by this issue; I attach the list here. I compared the LFN bases of these 210 workflows w/ the unmerged files at Caltech (one of the sites where unmerged usage is high), but I discovered that these 210 workflows are responsible for only 178 files out of 108k. Probably, completing these requests will save space on other grid sites, not so much at Caltech. So, I'd suggest manually completing these requests for the sake of the unmerged issue. Any objections? I'll double-check the list before the operation. So far, it looks good, though. Probably, leaving a few requests might help to debug the issue.
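For reference, the comparison described above can be sketched roughly like this (the input file names and one-entry-per-line format are assumptions, not the actual files used):

# Sketch: count how many unmerged LFNs at a site fall under the LFN bases
# of the affected workflows.
def count_matching(unmerged_lfns, lfn_bases):
    bases = tuple(lfn_bases)
    return sum(1 for lfn in unmerged_lfns if lfn.startswith(bases))

with open('caltech_unmerged_dump.txt') as fd:          # hypothetical site dump, one LFN per line
    unmerged = [line.strip() for line in fd if line.strip()]
with open('affected_workflows_lfn_bases.txt') as fd:   # hypothetical list, one LFN base per line
    lfn_bases = [line.strip() for line in fd if line.strip()]

print(f"{count_matching(unmerged, lfn_bases)} of {len(unmerged)} unmerged files match")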
Commenting on item 1 of Alan's roadmap: There are WQEs in Running and so far I spotted only one agent related to this issue: vocms0284.cern.ch
(I didn't check all the requests, only 7-8)
completeWorkflowsWMCoreBug#10672.txt
FYI @drkovalskyi @nsmith-
I am hesitant to force-complete these 210 workflows due to the issue w/ force-completions (https://github.com/dmwm/WMCore/issues/10675). I think we should fix these issues automatically; manual operation does not seem very promising.
I force-completed one request to check if it will be affected by #10675 or not. If it succeeds, I'll force-complete all the workflows affected by this issue to decrease the stress on the system. https://cmsweb.cern.ch/reqmgr2/fetch?rid=cmsunified_task_EXO-RunIIFall17wmLHEGS-03554__v1_T_210628_132241_8381
Thanks @haozturk
OK, it's been over 24 hrs and the workflow that was force-completed is still sitting in force-complete, waiting for a merge job to run. @todor-ivanov this is next on your priority list; it doesn't pay to force-complete if it doesn't work.
While going through the agents, I just noticed an error which may be worth mentioning in this issue [1]. It may somehow be related to the not properly updated priorities of the merge jobs. So far I found the error in one agent only: vocms0257. Looking deeper now.
[1]
2021-07-22 22:59:10,485:140157693433600:ERROR:SimpleCondorPlugin:Failed to update JobPrio for WMAgent_SubTaskName=/pdmvserv_task_B2G-RunIISummer15wmLHEGS-04984__v1_T_210429_122736_8292/B2G-RunIISummer15wmLHEGS-04984_0/B2G-RunIISummer16DR80Premix-05239_0/B2G-RunIISummer16DR80Premix-05239_1/B2G-RunIISummer16MiniAODv3-06090_0/B2G-RunIISummer16MiniAODv3-06090_0MergeMINIAODSIMoutput
2021-07-22 22:59:10,486:140157693433600:ERROR:SimpleCondorPlugin:Unable to edit jobs matching constraint
Traceback (most recent call last):
File "/data/srv/wmagent/v1.4.7.patch3/sw/slc7_amd64_gcc630/cms/wmagent/1.4.7.patch3/lib/python2.7/site-packages/WMCore/BossAir/Plugins/SimpleCondorPlugin.py", line 486, in updateJobInformation
schedd.edit(constraint, 'JobPrio', classad.Literal(newPriority))
File "/data/srv/wmagent/v1.4.7.patch3/sw/slc7_amd64_gcc630/external/condor/8.9.7-comp/lib/python2.7/site-packages/htcondor/_lock.py", line 75, in wrapper
rv = func(*args, **kwargs)
RuntimeError: Unable to edit jobs matching constraint
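For what it's worth, a defensive wrapper around that call could look like the sketch below; the assumption that Schedd.edit() raises RuntimeError when the constraint matches no jobs is inferred from the error above and from the condor_qedit behaviour shown further down, not taken from the bindings' documentation:

# Sketch: tolerate "no matching jobs" when bumping JobPrio (assumption noted above).
import classad
import htcondor

def update_job_prio(schedd, constraint, new_priority, logger):
    """schedd is an htcondor.Schedd(); constraint selects the jobs to edit."""
    try:
        schedd.edit(constraint, 'JobPrio', classad.Literal(int(new_priority)))
    except RuntimeError as exc:
        # Possibly just means no idle jobs matched the constraint at this moment.
        logger.warning("JobPrio update skipped for constraint %s: %s", constraint, exc)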
The workflow Hasan tried to force-complete is still stuck in force-complete, with the same job in status 'queued' on vocms0284.cern.ch.
It should be noted that vocms0284 has been in this mode for several days:
agent: vocms0284.cern.ch (1.4.7.patch2)
agent last updated: 2021/7/23 (Fri) 19:24:22 UTC : 0 h 2 m
data last updated: 1 h 19 m
status: Data collect error; team: production
data collect error: (28, 'Operation timed out after 270001 milliseconds with 0 bytes received')
A complete restart of the agent did nothing.
@haozturk I believe this problem was caused by the same issue reported and resolved in https://github.com/dmwm/WMCore/issues/10675
I'm keeping this open because I want to check the job priority (possible) issue that has been reported here as well.
Thanks Alan - yes, I see that the workflows I reported in the issue description are moving. Let's keep it open for a couple of hours; then I can check whether any workflow affected by this issue is left. If not, we can close the issue.
@amaltaro are we sure that this is only related to vocms0284? Please check the following request: https://dmytro.web.cern.ch/dmytro/cmsprodmon/workflows.php?prep_id=task_SMP-RunIISummer20UL18NanoAODv9-00048 It has been at 100% for 2 days, but it is not handled by that agent.
It has WQEs in Running state in both submit7 and submit8: https://cmsweb.cern.ch/couchdb/workqueue/_design/WorkQueue/_rewrite/elementsInfo?request=pdmvserv_task_SMP-RunIISummer20UL18NanoAODv9-00048__v1_T_210720_100722_78
and those agents do not have the same problem as vocms0284 had. I will have a deeper look though to make sure those agents are working properly as well.
Yes, I've seen those WQEs, but it did not make sense to me to have WQEs in Running state while the output is at 100%.
@haozturk Hasan, here is a likely reason for these workflows taking longer than expected to go to completed: https://github.com/dmwm/WMCore/issues/8477#issuecomment-887411403
Note that this is not a permanent problem, just that that specific thread is under-performing. I'm monitoring those agents for the next few hours to evaluate whether the issue 8477 will need to be re-prioritized or not.
@haozturk I'm changing the title of this GH issue because the main root cause has been fixed. The second issue is mentioned above, with an under-performing agent thread that has its own issue to be addressed in #8477
Thanks to @juztas, I can confirm that indeed Merge jobs are being prioritized according to the workflow's absolute priority number alone, without taking into consideration the task priority factor (job type), as just queried in submit7 [&]. I think this is a serious issue, which can lead to a) data loss; and b) workflow completion delay; so I'd like to address this merge issue in this GH issue.
[&]
[cmsdataops@cmsgwms-submit7 current]$ condor_q -const 'CMS_JobType =?= "Merge" && JobStatus =?= 1' -af:h WMAgent_SubTaskName JobPrio > alan
[cmsdataops@cmsgwms-submit7 current]$ head alan
WMAgent_SubTaskName JobPrio
/pdmvserv_Run2018A_EGamma_UL2018_MiniAODv2_NanoAODv9_210726_092815_5804/DataProcessing/DataProcessingMergeNANOEDMAODoutput 112204
/pdmvserv_task_BTV-RunIISummer20UL17NanoAODv9-00018__v1_T_210718_184611_6255/BTV-RunIISummer20UL17NanoAODv9-00018_0/BTV-RunIISummer20UL17NanoAODv9-00018_0MergeNANOEDMAODSIMoutput 110114
/pdmvserv_Run2018A_EGamma_UL2018_MiniAODv2_NanoAODv9_210726_092815_5804/DataProcessing/DataProcessingMergeNANOEDMAODoutput 112204
/pdmvserv_task_B2G-RunIISummer15wmLHEGS-04964__v1_T_210429_084712_982/B2G-RunIISummer15wmLHEGS-04964_0/B2G-RunIISummer16DR80Premix-05187_0/B2G-RunIISummer16DR80Premix-05187_1/B2G-RunIISummer16MiniAODv3-06039_0/B2G-RunIISummer16MiniAODv3-06039_0MergeMINIAODSIMoutput 110009
/pdmvserv_task_B2G-RunIISummer15wmLHEGS-04962__v1_T_210429_084655_4123/B2G-RunIISummer15wmLHEGS-04962_0/B2G-RunIISummer16DR80Premix-05191_0/B2G-RunIISummer16DR80Premix-05191_1/B2G-RunIISummer16MiniAODv3-06043_0/B2G-RunIISummer16MiniAODv3-06043_0MergeMINIAODSIMoutput 110009
/pdmvserv_task_B2G-RunIISummer20UL17NanoAODv9-00014__v1_T_210717_141920_7421/B2G-RunIISummer20UL17NanoAODv9-00014_0/B2G-RunIISummer20UL17NanoAODv9-00014_0MergeNANOEDMAODSIMoutput 110114
/pdmvserv_Run2016F_BTagMu_UL2016_MiniAODv2_NanoAODv9_210726_165149_7078/DataProcessing/DataProcessingMergeNANOEDMAODoutput 112204
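A quick way to flag such jobs from the Python bindings could look like this (a sketch; it assumes htcondor/schedd access and that correctly prioritized Merge jobs carry the ~4x10^7 type factor, per the numbers above):

# Sketch: list idle Merge jobs whose JobPrio seems to miss the task-type factor.
import htcondor

EXPECTED_MIN_MERGE_PRIO = 40_000_000   # assumption based on this thread
schedd = htcondor.Schedd()
low = [ad for ad in schedd.query(
           constraint='CMS_JobType =?= "Merge" && JobStatus =?= 1',
           projection=['WMAgent_SubTaskName', 'JobPrio'])
       if ad.get('JobPrio', 0) < EXPECTED_MIN_MERGE_PRIO]
for ad in low:
    print(ad.get('JobPrio'), ad.get('WMAgent_SubTaskName'))
print(f"{len(low)} Merge job(s) below the expected priority")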
While we work on this Merge priority update issue, I think we should also take the opportunity to make a feature change, as requested in this GH issue: https://github.com/dmwm/WMCore/issues/10445 - this would be a very good improvement.
Well, even though Merge jobs are prioritized the same as the Production/Processing jobs, it looks like they still manage to get slots pretty quickly. I ran condor_qedit on all our currently active agents, and more than half of them did not have any jobs to be updated:
cmst1@vocms0257:/data/srv/wmagent/current $ condor_qedit -const 'CMS_JobType =?= "Merge" && JobStatus =?= 1' JobPrio 4000000
Failed to set attribute "JobPrio" by constraint: ( (CMS_JobType =?= "Merge" && JobStatus =?= 1) )
Queue transaction failed. No attributes were set.
while others had only a handful of jobs to be updated.
We noticed the same thing when we found the unexpectedly low-priority merge jobs: they still run quite easily. We suspect it may be because they only request a single core and can match slots more easily than the usual 4 or 8-core taskchains. In some sense it is very lucky we don't have a giant queue of merge jobs; on the other hand, that would be a much more obvious culprit if it were true.
As discussed with @khurtado, he will probably manage to work on this issue today/tomorrow. Kenyi, please consider the suggestion made in this GH issue (and move it to Q3 if you start covering it, please): https://github.com/dmwm/WMCore/issues/10445
@khurtado haven't we fixed this issue with https://github.com/dmwm/WMCore/pull/10760 ?
If so, could you please have a last look at the pool and check whether Merge jobs now have a meaningful priority (>=40M)? If things look okay in the condor pool, I'd suggest to close this issue.
@amaltaro I queried the global pool just now and did not find any Merge job with a JobPrio below 40M:
[khurtado@vocms0254 ~]$ condor_q -global -constraint 'CMS_JobType == "Merge"' -af:h JobPrio | sort | uniq -c
4 40085025
2 40085029
7 40085030
295 40085031
3 40085032
20 40085034
5 40110000
1 40110003
46 40180002
2 40180014
3 40200000
2 40360008
1 40444800
Closing this issue
Impact of the bug
Whoever is responsible for the running-closed to completed transition - WMAgent, I believe.

Describe the bug
This is reported by PdmV on Monday: https://twiki.cern.ch/twiki/bin/view/CMS/CompOpsMeeting?rev=1698
List of such workflows: https://cms-pdmv.cern.ch/pmp/historical?r=RunIISummer20UL16MiniAODAPVv2,RunIISummer20UL16MiniAODv2&sortSubmittedOn=percentage&sortSubmittedOrder=-1
Here are some examples:
I checked out several requests and I see that their outputs were completely produced days ago, and for some, weeks ago. Checking running jobs, I see that there is either one LogCollect or Merge job dangling for days. I suspect again this might be related to the high number of requests currently in the system, but anyhow, if the output is 100% produced, why are we waiting for a LogCollect or Merge job?
How to reproduce it
Check the requests above.

Expected behavior
If the outputs are fully produced, we should not delay the completion of the workflows. The current issue can be fixed manually, but I'd like to understand why the system behaves this way and fix it automatically if possible.

Additional context and error message
@nsmith- @drkovalskyi @jenimal @z4027163 FYI