Closed amaltaro closed 1 year ago
I didn't check everything, but zooming in on the following ParentlessMergeBySize subscription:
47068 /cmsunified_task_BPH-RunIISummer20UL17GEN-00127__v1_T_220224_123156_2874/BPH-RunIISummer20UL17GEN-00127_0/BPH-RunIISummer20UL17MiniAODv2-00167_0MergeMINIAODSIMoutput
we have one file available for job creation, see:
> select * from wmbs_sub_files_available where subscription=47068;
subscription fileid
47068 7150962
The way files are loaded for this subscription algorithm is done through this DAO: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMBS/MySQL/Subscriptions/GetFilesForParentlessMerge.py#L25
which returns an empty list. That means one of its constraints is not being fulfilled, and here it is: that file id does not have any location (the query below returns no rows):
> select * from wmbs_file_location where fileid=7150962;
CONCLUSION: jobs are not getting created because unmerged output files are inserted into the database without any location, thus keeping that subscription - and its dependencies - stuck. For vocms0282, we have a total of 6 available-file entries without any location (3 distinct files, each in a Merge and a Cleanup subscription), explaining those 3 workflows still stuck in this agent. The following query can be used to find them:
SELECT wsfa.* FROM wmbs_sub_files_available wsfa
LEFT JOIN wmbs_file_location wfl ON wfl.fileid = wsfa.fileid
WHERE wfl.fileid is NULL;
but to see the task name and subscription type, one can expand it to:
SELECT wsfa.*, wst.name, ww.task FROM wmbs_sub_files_available wsfa
LEFT JOIN wmbs_file_location wfl ON wfl.fileid = wsfa.fileid
INNER JOIN wmbs_subscription ws ON ws.id = wsfa.subscription
INNER JOIN wmbs_workflow ww ON ww.id = ws.workflow
INNER JOIN wmbs_sub_types wst ON wst.id = ws.subtype
WHERE wfl.fileid is NULL;
subscription fileid name task
58577 12982636 Cleanup /cmsunified_task_BPH-RunIISummer20UL16GEN-00127__v1_T_220316_105637_8586/BPH-RunIISummer20UL16GEN-00127_0/BPH-RunIISummer20UL16MiniAODv2-00169_0CleanupUnmergedMINIAODSIMoutput
47067 7150962 Cleanup /cmsunified_task_BPH-RunIISummer20UL17GEN-00127__v1_T_220224_123156_2874/BPH-RunIISummer20UL17GEN-00127_0/BPH-RunIISummer20UL17MiniAODv2-00167_0CleanupUnmergedMINIAODSIMoutput
65217 12841725 Cleanup /cmsunified_task_BPH-RunIISummer20UL18GEN-00146__v1_T_220316_105630_9389/BPH-RunIISummer20UL18GEN-00146_0/BPH-RunIISummer20UL18MiniAODv2-00211_0CleanupUnmergedMINIAODSIMoutput
58578 12982636 Merge /cmsunified_task_BPH-RunIISummer20UL16GEN-00127__v1_T_220316_105637_8586/BPH-RunIISummer20UL16GEN-00127_0/BPH-RunIISummer20UL16MiniAODv2-00169_0MergeMINIAODSIMoutput
47068 7150962 Merge /cmsunified_task_BPH-RunIISummer20UL17GEN-00127__v1_T_220224_123156_2874/BPH-RunIISummer20UL17GEN-00127_0/BPH-RunIISummer20UL17MiniAODv2-00167_0MergeMINIAODSIMoutput
65218 12841725 Merge /cmsunified_task_BPH-RunIISummer20UL18GEN-00146__v1_T_220316_105630_9389/BPH-RunIISummer20UL18GEN-00146_0/BPH-RunIISummer20UL18MiniAODv2-00211_0MergeMINIAODSIMoutput
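The anti-join used in the queries above can be tried out on a toy database. This is only a sketch: it uses an in-memory SQLite database instead of the agent's MariaDB/Oracle backend, the two tables carry only the columns the query touches (not the real WMBS schema), and all ids are made up.

```python
import sqlite3

# Toy schema with only the columns the query needs (an assumption, not the real WMBS DDL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wmbs_sub_files_available (subscription INTEGER, fileid INTEGER);
CREATE TABLE wmbs_file_location (fileid INTEGER, pnn INTEGER);
""")
conn.executemany(
    "INSERT INTO wmbs_sub_files_available VALUES (?, ?)",
    [(1, 100), (2, 100), (3, 200)],  # made-up subscription and file ids
)
# Only file 200 has a location row; file 100 has none.
conn.execute("INSERT INTO wmbs_file_location VALUES (200, 7)")

# Anti-join: keep available files for which the LEFT JOIN found no location row.
orphans = conn.execute("""
    SELECT wsfa.subscription, wsfa.fileid
    FROM wmbs_sub_files_available wsfa
    LEFT JOIN wmbs_file_location wfl ON wfl.fileid = wsfa.fileid
    WHERE wfl.fileid IS NULL
    ORDER BY wsfa.subscription
""").fetchall()
print(orphans)
```

The same file shows up once per subscription that has it available, which is why each stuck file appears twice (Merge and Cleanup) in the output above.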
To move forward with this agent draining, I will set any location - actually the most common for those workflows - for those 3 files waiting to be merged.
Now that we know what the problem really is, I thought it would be better to repurpose this issue (instead of making it a one-agent problem).
And this issue happened again, this time on vocms0255, for this workflow: cmsunified_task_BPH-RunIISummer20UL16GENAPV-00121__v1_T_220316_105601_9405
I copied all the tarballs from JobArchiver and ran this script on all of them: https://github.com/amaltaro/ProductionTools/blob/master/untarLogArchive.py
without any success in finding the LFN that doesn't have a PNN associated with it, which is:
SQL> select * from wmbs_sub_files_available where subscription=30;
SUBSCRIPTION FILEID
------------ ----------
30 892671
SQL> select lfn from wmbs_file_details where id=892671;
ID LFN FILESIZE EVENTS FIRST_EVENT M
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
892671 /store/unmerged/RunIISummer20UL16MiniAODAPVv2/LambdaBTopKMuMu_TuneCP5_13TeV-pythia8-evtgen/MINIAODSIM/106X_mcRun2_asymptotic_preVFP_v11-v2/2550004/237CF230-F455-4C46-BEE4-C0A8CF759F70.root 5525759 78 2125229681 0
This node was running out of space for many days and I had to actively delete data to make room; perhaps some of the deleted data included the logs for this output file. Anyhow, I will get back to this next time it occurs.
And this happened to vocms0256 as well, for the following file:
select lfn from wmbs_file_details where id=8374072;
/store/unmerged/RunIISummer20UL16MiniAODv2/ZH_HToZZTo4Q_2LFilter_M2500_TuneCP5_13TeV_powheg2-minlo-HZJ_JHUGenV735_pythia8/MINIAODSIM/106X_mcRun2_asymptotic_v17-v3/2560008/B0D156FB-7D6B-B141-971D-29518395A913.root
and these are the subscriptions:
SELECT wsfa.*, wst.name, ww.task FROM wmbs_sub_files_available wsfa
LEFT JOIN wmbs_file_location wfl ON wfl.fileid = wsfa.fileid
INNER JOIN wmbs_subscription ws ON ws.id = wsfa.subscription
INNER JOIN wmbs_workflow ww ON ww.id = ws.workflow
INNER JOIN wmbs_sub_types wst ON wst.id = ws.subtype
WHERE wfl.fileid is NULL;
SUBSCRIPTION FILEID NAME TASK
----------------------------------------------------------------------------------------------------------------------------------------------------------------
46109 8374072 Cleanup /cmsunified_task_HIG-RunIISummer20UL16wmLHEGEN-00772__v1_T_220602_160150_1846/HIG-RunIISummer20UL16wmLHEGEN-00772_0/HIG-RunIISummer20UL16MiniAODv2-04037_0CleanupUnmergedMINIAODSIMoutput
46110 8374072 Merge /cmsunified_task_HIG-RunIISummer20UL16wmLHEGEN-00772__v1_T_220602_160150_1846/HIG-RunIISummer20UL16wmLHEGEN-00772_0/HIG-RunIISummer20UL16MiniAODv2-04037_0MergeMINIAODSIMoutput
and again I untarred all the log tarballs under the JobArchiver component, and yet again I found no occurrence of that file's UID.
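For reference, this kind of search can be sketched in a few lines of Python. This is not the untarLogArchive.py script linked earlier, just a minimal stand-in: the function name is hypothetical, it assumes the archives are `.tar.bz2` files somewhere under one directory, and it does a plain byte search, so it would miss a string that only appears inside a nested or compressed member.

```python
import tarfile
from pathlib import Path


def tarballs_mentioning(log_dir, needle):
    """Return paths of .tar.bz2 archives under log_dir with a member containing needle."""
    needle_bytes = needle.encode()
    hits = []
    for archive in sorted(Path(log_dir).rglob("*.tar.bz2")):
        with tarfile.open(archive, "r:bz2") as tar:
            for member in tar:
                if not member.isfile():
                    continue  # skip directories, links, etc.
                handle = tar.extractfile(member)
                if handle is not None and needle_bytes in handle.read():
                    hits.append(str(archive))
                    break  # one match per archive is enough
    return hits
```

Calling it with the JobArchiver log directory and the file GUID (for example `B0D156FB-7D6B-B141-971D-29518395A913`) would list the archives worth opening by hand.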
I think we will have to start logging the file names produced by a job in the logs, such that we can eventually track this down.
And we have this same problem with vocms0253 as well. Here is a list of subscriptions with files without any location:
SELECT wsfa.*, wst.name, ww.task FROM wmbs_sub_files_available wsfa
LEFT JOIN wmbs_file_location wfl ON wfl.fileid = wsfa.fileid
INNER JOIN wmbs_subscription ws ON ws.id = wsfa.subscription
INNER JOIN wmbs_workflow ww ON ww.id = ws.workflow
INNER JOIN wmbs_sub_types wst ON wst.id = ws.subtype
WHERE wfl.fileid is NULL;
SUBSCRIPTION FILEID NAME TASK
----------------------------------------------------------------------------------------------------------------------------------------------------------------
42287 7942382 Cleanup /cmsunified_task_EXO-RunIISummer20UL17wmLHEGEN-02150__v1_T_220727_175210_5336/EXO-RunIISummer20UL17wmLHEGEN-02150_0/EXO-RunIISummer20UL17MiniAODv2-02446_0CleanupUnmergedMINIAODSIMoutput
42288 7942382 Merge /cmsunified_task_EXO-RunIISummer20UL17wmLHEGEN-02150__v1_T_220727_175210_5336/EXO-RunIISummer20UL17wmLHEGEN-02150_0/EXO-RunIISummer20UL17MiniAODv2-02446_0MergeMINIAODSIMoutput
44786 8436720 Merge /cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-09137__v1_T_220815_213551_9946/HIG-RunIISummer20UL16wmLHEGENAPV-09137_0/HIG-RunIISummer20UL16MiniAODAPVv2-07800_0MergeMINIAODSIMoutput
44785 8436720 Cleanup /cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-09137__v1_T_220815_213551_9946/HIG-RunIISummer20UL16wmLHEGENAPV-09137_0/HIG-RunIISummer20UL16MiniAODAPVv2-07800_0CleanupUnmergedMINIAODSIMoutput
45127 8988497 Cleanup /cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-09890__v1_T_220720_133208_1929/HIG-RunIISummer20UL16wmLHEGENAPV-09890_0/HIG-RunIISummer20UL16MiniAODAPVv2-08417_0CleanupUnmergedMINIAODSIMoutput
45128 8988497 Merge /cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-09890__v1_T_220720_133208_1929/HIG-RunIISummer20UL16wmLHEGENAPV-09890_0/HIG-RunIISummer20UL16MiniAODAPVv2-08417_0MergeMINIAODSIMoutput
with the same pattern as the one above, one Merge and one Cleanup task for each workflow.
I zoomed in on the following workflow: cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-09137__v1_T_220815_213551_9946
and I could not find any trace of this file
/store/unmerged/RunIISummer20UL16MiniAODAPVv2/NMSSM_XToYHTo2Tau2B_MX-3000_MY-2400_TuneCP5_13TeV-madgraph-pythia8/MINIAODSIM/106X_mcRun2_asymptotic_preVFP_v11-v2/2530000/B4E4A367-EFF8-2B4F-9EC4-3A7C28B7EC00.root
in the job logs from JobArchiver (wmagentJob.log), nor in any of the tarballs uploaded to EOSCMS: https://eoscmsweb.cern.ch/eos/cms/store/logs/prod/recent/PRODUCTION/cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-09137__v1_T_220815_213551_9946/HIG-RunIISummer20UL16wmLHEGENAPV-09137_0
To be able to complete these workflows, I am going to set a location for those files, trying to pick the most common location. For the workflow above, this is what I did in the database:
INSERT INTO wmbs_file_location (fileid, pnn)
SELECT 8436720, (SELECT id from wmbs_pnns where pnn='T1_DE_KIT_Disk') FROM DUAL;
the other two have been updated with (note that the location might not be the correct one):
INSERT INTO wmbs_file_location (fileid, pnn)
SELECT 7942382, (SELECT id from wmbs_pnns where pnn='T2_US_Purdue') FROM DUAL;
INSERT INTO wmbs_file_location (fileid, pnn)
SELECT 8988497, (SELECT id from wmbs_pnns where pnn='T2_US_Wisconsin') FROM DUAL;
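The "pick the most common location, then insert it" step could be scripted roughly as below. This is a sketch under assumptions: it runs against an in-memory SQLite database with a trimmed-down copy of the two tables, the list of sibling locations is invented, and `most_common_pnn` is a hypothetical helper, not something that exists in WMCore.

```python
import sqlite3
from collections import Counter


def most_common_pnn(sibling_pnns):
    """Pick the PNN that hosts most of the workflow's other output files."""
    return Counter(sibling_pnns).most_common(1)[0][0]


conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE wmbs_pnns (id INTEGER PRIMARY KEY, pnn TEXT UNIQUE);
CREATE TABLE wmbs_file_location (fileid INTEGER, pnn INTEGER);
INSERT INTO wmbs_pnns (id, pnn) VALUES (1, 'T1_DE_KIT_Disk'), (2, 'T2_US_Purdue');
""")

# Hypothetical locations of sibling files from the same workflow.
chosen = most_common_pnn(["T1_DE_KIT_Disk", "T2_US_Purdue", "T1_DE_KIT_Disk"])

# Same idea as the manual INSERT ... SELECT, but with bound parameters.
conn.execute(
    "INSERT INTO wmbs_file_location (fileid, pnn) "
    "SELECT ?, id FROM wmbs_pnns WHERE pnn = ?",
    (8436720, chosen),
)
rows = conn.execute("SELECT fileid, pnn FROM wmbs_file_location").fetchall()
print(chosen, rows)
```

One difference from the statements above: selecting directly from `wmbs_pnns` inserts nothing when the PNN name is unknown, whereas the scalar-subquery form would attempt to insert a NULL `pnn`.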
To make this easier to debug, I think we need to record in the JobAccountant log all the output files created by a given job.
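What such logging might look like, as a rough sketch only: the function names are hypothetical, and the shape of the file records (a dict with an `lfn` and a `locations` collection) is an assumption based on the file dump in the AccountantWorker warning quoted in this issue, not the actual JobAccountant code.

```python
import logging

# Logger name borrowed from the log lines quoted in this issue.
logger = logging.getLogger("AccountantWorker")


def describe_outputs(job_id, output_files):
    """Build one log line per output file: job id, LFN and known locations."""
    lines = []
    for fileinfo in output_files:
        locations = sorted(fileinfo.get("locations") or [])
        lines.append(
            "Job %s output file %s locations=%s" % (job_id, fileinfo.get("lfn"), locations)
        )
    return lines


def log_outputs(job_id, output_files):
    """Write the lines built above to the component log at INFO level."""
    for line in describe_outputs(job_id, output_files):
        logger.info(line)
```

With a line like this per file, grepping the component log for the LFN or GUID would be enough to find the job that produced it, without untarring any archive.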
Another update on this, this time on submit5 with the following workflow stuck in running-closed: cmsunified_task_HIG-RunIISummer20UL17wmLHEGEN-00638__v1_T_220623_202351_1662
Performing the same procedure as in the previous workflows, querying for files without any location in WMBS, gives us this:
MariaDB [wmagent]> SELECT wsfa.*, wst.name, ww.task FROM wmbs_sub_files_available wsfa
-> LEFT JOIN wmbs_file_location wfl ON wfl.fileid = wsfa.fileid
-> INNER JOIN wmbs_subscription ws ON ws.id = wsfa.subscription
-> INNER JOIN wmbs_workflow ww ON ww.id = ws.workflow
-> INNER JOIN wmbs_sub_types wst ON wst.id = ws.subtype
-> WHERE wfl.fileid is NULL;
+--------------+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| subscription | fileid | name | task |
+--------------+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 19099 | 1357104 | Cleanup | /cmsunified_task_HIG-RunIISummer20UL17wmLHEGEN-00638__v1_T_220623_202351_1662/HIG-RunIISummer20UL17wmLHEGEN-00638_0/HIG-RunIISummer20UL17MiniAODv2-06201_0CleanupUnmergedMINIAODSIMoutput |
| 19100 | 1357104 | Merge | /cmsunified_task_HIG-RunIISummer20UL17wmLHEGEN-00638__v1_T_220623_202351_1662/HIG-RunIISummer20UL17wmLHEGEN-00638_0/HIG-RunIISummer20UL17MiniAODv2-06201_0MergeMINIAODSIMoutput |
+--------------+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.001 sec)
FNAL and NERSC seem to be popular destinations for this workflow, so here we go defining that file as available at FNAL:
MariaDB [wmagent]> INSERT INTO wmbs_file_location (fileid, pnn)
SELECT 1357104, (SELECT id from wmbs_pnns where pnn='T1_US_FNAL_Disk');
with no need to commit these changes, given that MariaDB has auto-commit already enabled.
And I found 3 workflows stuck in submit6, as follows
+--------------+----------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| subscription | fileid | name | task |
+--------------+----------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 35719 | 5596650 | Cleanup | /pdmvserv_task_SUS-RunIIFall17FSPremix-00207__v1_T_220929_170653_8282/SUS-RunIIFall17FSPremix-00207_0/SUS-RunIIFall17FSPremix-00207_0MergeAODSIMoutput/SUS-RunIIFall17MiniAODv2-00642_0/SUS-RunIIFall17MiniAODv2-00642_0CleanupUnmergedMINIAODSIMoutput |
| 35720 | 5596650 | Merge | /pdmvserv_task_SUS-RunIIFall17FSPremix-00207__v1_T_220929_170653_8282/SUS-RunIIFall17FSPremix-00207_0/SUS-RunIIFall17FSPremix-00207_0MergeAODSIMoutput/SUS-RunIIFall17MiniAODv2-00642_0/SUS-RunIIFall17MiniAODv2-00642_0MergeMINIAODSIMoutput |
| 74269 | 9644986 | Cleanup | /cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-11204__v1_T_220908_160011_6520/HIG-RunIISummer20UL16wmLHEGENAPV-11204_0/HIG-RunIISummer20UL16MiniAODAPVv2-09387_0CleanupUnmergedMINIAODSIMoutput |
| 74270 | 9644986 | Merge | /cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-11204__v1_T_220908_160011_6520/HIG-RunIISummer20UL16wmLHEGENAPV-11204_0/HIG-RunIISummer20UL16MiniAODAPVv2-09387_0MergeMINIAODSIMoutput |
| 89125 | 10690502 | Cleanup | /cmsunified_task_HIG-RunIISummer20UL16wmLHEGEN-11893__v1_T_220908_132706_4996/HIG-RunIISummer20UL16wmLHEGEN-11893_0/HIG-RunIISummer20UL16MiniAODv2-10758_0CleanupUnmergedMINIAODSIMoutput |
| 89126 | 10690502 | Merge | /cmsunified_task_HIG-RunIISummer20UL16wmLHEGEN-11893__v1_T_220908_132706_4996/HIG-RunIISummer20UL16wmLHEGEN-11893_0/HIG-RunIISummer20UL16MiniAODv2-10758_0MergeMINIAODSIMoutput |
+--------------+----------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.008 sec)
Luckily, this time I could find a log tarball for the _4996 workflow above, for which this file has no location:
MariaDB [wmagent]> select lfn from wmbs_file_details where id=10690502;
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| lfn |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| /store/unmerged/RunIISummer20UL16MiniAODv2/NMSSM_XToYHTo2W2BTo4Q2B_MX-3500_MY-1600_TuneCP5_13TeV-madgraph-pythia8/MINIAODSIM/106X_mcRun2_asymptotic_v17-v2/60000/CFC2B499-098E-0143-8A7D-BED766ED7D87.root |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.000 sec)
Running the untar script already mentioned, here is the tarball location: /data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir/c/cmsunified_task_HIG-RunIISummer20UL16wmLHEGEN-11893__v1_T_220908_132706_4996/JobCluster_4345/Job_4345927.tar.bz2
of which I also made a copy under https://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_11232/
Inspecting the last report pickle file (Report.3.pkl) and dumping things as json, I can see that cmsRun6 is actually missing the location attribute, which is very likely the cause for this bug. Snippet of the report jsonify is:
'cmsRun5': {'analysis': {},
'output': {'AODSIMoutput': [{'InputPFN': '/srv/job/WMTaskSpace/cmsRun5/AODSIMoutput.root',
'inputpfns': ['../cmsRun4/RAWSIMoutput.root'],
'lfn': '/store/unmerged/RunIISummer20UL16RECO/NMSSM_XToYHTo2W2BTo4Q2B_MX-3500_MY-1600_TuneCP5_13TeV-madgraph-pythia8/AODSIM/106X_mcRun2_asymptotic_v13-v2/60000/83BC5087-21BD-6140-9118-51204C0B64B9.root',
'location': 'T2_CH_CSCS',
'merged': False,
'module_label': 'AODSIMoutput',
'site': 'T2_CH_CSCS',
'cmsRun6': {'analysis': {},
'output': {'MINIAODSIMoutput': [{'acquisitionEra': 'RunIISummer20UL16MiniAODv2',
'inputpfns': ['../cmsRun5/AODSIMoutput.root'],
'lfn': '/store/unmerged/RunIISummer20UL16MiniAODv2/NMSSM_XToYHTo2W2BTo4Q2B_MX-3500_MY-1600_TuneCP5_13TeV-madgraph-pythia8/MINIAODSIM/106X_mcRun2_asymptotic_v17-v2/60000/CFC2B499-098E-0143-8A7D-BED766ED7D87.root',
'merged': False,
'module_label': 'MINIAODSIMoutput',
'site': 'T2_CH_CSCS',
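Spotting this by eye in a large report is error-prone, so a small helper can do the scan. The sketch below assumes the dumped report maps each step name to a dict with an `output` section keyed by output module, as in the snippet above; the function name and the shortened placeholder LFNs are made up for illustration.

```python
def outputs_missing_location(steps):
    """Return (step, module, lfn) for every output file without a 'location' value."""
    missing = []
    for step_name, step in steps.items():
        for module, files in step.get("output", {}).items():
            for fileinfo in files:
                if not fileinfo.get("location"):
                    missing.append((step_name, module, fileinfo.get("lfn")))
    return missing


# Trimmed-down stand-in for the dumped report; LFNs shortened to placeholders.
steps = {
    "cmsRun5": {"analysis": {}, "output": {"AODSIMoutput": [
        {"lfn": "/store/unmerged/.../AODSIM/file5.root", "location": "T2_CH_CSCS"}]}},
    "cmsRun6": {"analysis": {}, "output": {"MINIAODSIMoutput": [
        {"lfn": "/store/unmerged/.../MINIAODSIM/file6.root", "site": "T2_CH_CSCS"}]}},
}
print(outputs_missing_location(steps))
```

Note that `site` is present for both steps in the real report; only `location` is absent for cmsRun6, which is what the check keys on.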
I am setting a file location for those 3 files such that this agent can be completely shut down in the coming days.
For the fix, even if we cannot reproduce or identify the root cause, I am pretty sure we can update JobAccountant to actually fail a job that comes with output files without any location. It seems to happen once every 5M jobs anyways.
Okay, here we go with hopefully our last batch of problems related to this issue. vocms0282 had the following work stuck:
SUBSCRIPTION FILEID NAME TASK
------------ ---------- ------- ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
12346 2693646 Cleanup /pdmvserv_task_SMP-RunIIFall18GS-00033__v1_T_210321_070153_1970/SMP-RunIIFall18GS-00033_0/SMP-RunIIFall18GS-00033_0CleanupUnmergedRAWSIMoutput
2415 488915 Cleanup /cmsunified_task_BPH-RunIISummer20UL17GEN-00164__v1_T_221116_142121_9065/BPH-RunIISummer20UL17GEN-00164_0/BPH-RunIISummer20UL17RECO-00236_0CleanupUnmergedAODSIMoutput
50627 12368897 Cleanup /cmsunified_task_BPH-RunIISummer20UL17GEN-00172__v1_T_221223_095339_6847/BPH-RunIISummer20UL17GEN-00172_0/BPH-RunIISummer20UL17MiniAODv2-00239_0CleanupUnmergedMINIAODSIMoutput
2416 488915 Merge /cmsunified_task_BPH-RunIISummer20UL17GEN-00164__v1_T_221116_142121_9065/BPH-RunIISummer20UL17GEN-00164_0/BPH-RunIISummer20UL17RECO-00236_0MergeAODSIMoutput
50628 12368897 Merge /cmsunified_task_BPH-RunIISummer20UL17GEN-00172__v1_T_221223_095339_6847/BPH-RunIISummer20UL17GEN-00172_0/BPH-RunIISummer20UL17MiniAODv2-00239_0MergeMINIAODSIMoutput
I picked the sites that ran most of the jobs for these workflows, and updated the file location in the vocms0282 database as follows:
INSERT INTO wmbs_file_location (fileid, pnn) SELECT 2693646, (SELECT id from wmbs_pnns where pnn='T2_US_Purdue') FROM DUAL;
INSERT INTO wmbs_file_location (fileid, pnn) SELECT 488915, (SELECT id from wmbs_pnns where pnn='T1_US_FNAL_Disk') FROM DUAL;
INSERT INTO wmbs_file_location (fileid, pnn) SELECT 12368897, (SELECT id from wmbs_pnns where pnn='T1_RU_JINR_Disk') FROM DUAL;
While draining the agent vocms0281, I found the following 6 workflows also affected by this issue:
cmsunified_task_HIG-RunIISummer20UL16wmLHEGEN-13260__v1_T_221216_194936_5512
cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-12397__v1_T_221222_104854_8737
cmsunified_task_BPH-RunIISummer20UL16GENAPV-00150__v1_T_221116_142318_5175
cmsunified_task_JME-RunIISummer20UL17GEN-00013__v1_T_221214_205126_9607
cmsunified_task_HIG-RunIISummer20UL18wmLHEGEN-12454__v1_T_221222_120250_14
cmsunified_task_HIG-RunIISummer20UL18wmLHEGEN-12437__v1_T_221216_201835_711
here are the database updates performed on this agent:
INSERT INTO wmbs_file_location (fileid, pnn) SELECT 10023121, (SELECT id from wmbs_pnns where pnn='T1_FR_CCIN2P3_Disk') FROM DUAL;
INSERT INTO wmbs_file_location (fileid, pnn) SELECT 12802815, (SELECT id from wmbs_pnns where pnn='T1_FR_CCIN2P3_Disk') FROM DUAL;
INSERT INTO wmbs_file_location (fileid, pnn) SELECT 1223760, (SELECT id from wmbs_pnns where pnn='T1_US_FNAL_Disk') FROM DUAL;
INSERT INTO wmbs_file_location (fileid, pnn) SELECT 6361241, (SELECT id from wmbs_pnns where pnn='T1_US_FNAL_Disk') FROM DUAL;
INSERT INTO wmbs_file_location (fileid, pnn) SELECT 11633841, (SELECT id from wmbs_pnns where pnn='T1_FR_CCIN2P3_Disk') FROM DUAL;
INSERT INTO wmbs_file_location (fileid, pnn) SELECT 10301323, (SELECT id from wmbs_pnns where pnn='T1_FR_CCIN2P3_Disk') FROM DUAL;
And submit7 had this workflow with files without any location:
+--------------+----------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| subscription | fileid | name | task |
+--------------+----------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 85490 | 12127202 | Cleanup | /cmsunified_task_B2G-RunIISummer20UL17wmLHEGEN-03621__v1_T_221221_123945_866/B2G-RunIISummer20UL17wmLHEGEN-03621_0/B2G-RunIISummer20UL17MiniAODv2-02311_0CleanupUnmergedMINIAODSIMoutput |
| 85491 | 12127202 | Merge | /cmsunified_task_B2G-RunIISummer20UL17wmLHEGEN-03621__v1_T_221221_123945_866/B2G-RunIISummer20UL17wmLHEGEN-03621_0/B2G-RunIISummer20UL17MiniAODv2-02311_0MergeMINIAODSIMoutput |
+--------------+----------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.016 sec)
which has been updated to:
MariaDB [wmagent]> INSERT INTO wmbs_file_location (fileid, pnn) SELECT 12127202, (SELECT id from wmbs_pnns where pnn='T2_DE_RWTH');
And hopefully the last such incident to be reported here, this time on submit8:
+--------------+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| subscription | fileid | name | task |
+--------------+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| 10613 | 2549632 | Cleanup | /cmsunified_task_SMP-RunIISummer20UL17wmLHEGEN-00533__v1_T_220929_194847_9948/SMP-RunIISummer20UL17wmLHEGEN-00533_0/SMP-RunIISummer20UL17MiniAODv2-00254_0CleanupUnmergedMINIAODSIMoutput |
| 10614 | 2549632 | Merge | /cmsunified_task_SMP-RunIISummer20UL17wmLHEGEN-00533__v1_T_220929_194847_9948/SMP-RunIISummer20UL17wmLHEGEN-00533_0/SMP-RunIISummer20UL17MiniAODv2-00254_0MergeMINIAODSIMoutput |
+--------------+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.080 sec)
which has been fixed with:
MariaDB [wmagent]> INSERT INTO wmbs_file_location (fileid, pnn) SELECT 2549632, (SELECT id from wmbs_pnns where pnn='T1_US_FNAL_Disk');
In case it helps, here is a log file: https://amaltaro.web.cern.ch/forWMCore/Issue_11232/Job_3956040.tar.bz2
where the first job attempt failed in JobAccountant with:
2024-08-07 16:11:58,578:140543013328640:WARNING:AccountantWorker:Job 3956040 accepted for multi-step CMSSW, even though the expected outputModules does not match content of the FWJR.
2024-08-07 16:11:58,579:140543013328640:WARNING:AccountantWorker:The following file does not have any location: {'lfn': '/store/unmerged/Run3Summer22DRPremix/JPsiMuMuMuMu_JPsiNoFilter_4MuPtEtaFilter_TuneCP5_13p6TeV-pythia8-evtgen/AODSIM/124X_mcRun3_2022_realistic_v12-v3/2810016/087a60ad-655e-4eb8-94a3-6d07d088a9a1.root', 'size': 38562758, 'events': 88, 'checksums': {'adler32': 'f8d1041e', 'cksum': '3468691647'}, 'runs': {<WMCore.DataStructs.Run.Run object at 0x7fd2afeb6670>}, 'merged': False, 'last_event': 0, 'first_event': 0, 'locations': set(), 'parents': set(), 'pfn': '/srv/job/WMTaskSpace/cmsRun3/AODSIMoutput.root', 'branches': [], 'input': [''], 'inputpfns': ['file:../cmsRun2/PREMIXRAWoutput.root'], 'branch_hash': '66aedf5878c8cb3b708d7fca0fa6bce1', 'catalog': '', 'guid': '087a60ad-655e-4eb8-94a3-6d07d088a9a1', 'module_label': 'AODSIMoutput', 'dataset': {'applicationName': 'cmsRun', 'applicationVersion': 'CMSSW_12_4_16', 'primaryDataset': 'JPsiMuMuMuMu_JPsiNoFilter_4MuPtEtaFilter_TuneCP5_13p6TeV-pythia8-evtgen', 'processedDataset': 'Run3Summer22DRPremix-124X_mcRun3_2022_realistic_v12-v3', 'dataTier': 'AODSIM'}, 'acquisitionEra': 'Run3Summer22DRPremix', 'processingVer': 3, 'validStatus': 'PRODUCTION', 'globalTag': '124X_mcRun3_2022_realistic_v12', 'prep_id': 'BPH-Run3Summer22DRPremix-00185', 'configURL': 'https://cmsweb.cern.ch/couchdb;;reqmgr_config_cache;;3c9eba5165582b80e1c9b828b7b0945d', 'inputPath': None, 'outputModule': 'AODSIMoutput', 'fileRef': <WMCore.Configuration.ConfigSection object at 0x7fd2aef2a730>}
2024-08-07 16:11:58,579:140543013328640:WARNING:AccountantWorker:Job 3956040 , bad jobReport, failing job
while the next retry went through (but at a different site).
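The behaviour seen in that log (reject the job report when an output file carries no location) boils down to a guard like the one below. This is a simplified illustration, not the actual AccountantWorker code: the exception class and function name are hypothetical, and the file records are assumed to be dicts with a `locations` set, as in the file dump printed in the warning.

```python
class BadJobReport(Exception):
    """Hypothetical exception used for this sketch; not a WMCore class."""


def check_output_locations(job_id, output_files):
    """Raise BadJobReport if any output file has an empty or missing 'locations' set."""
    no_location = [f.get("lfn") for f in output_files if not f.get("locations")]
    if no_location:
        raise BadJobReport(
            "Job %s has output files without any location: %s" % (job_id, no_location)
        )
```

Failing the job at this point sends it through the normal retry path instead of registering a file that no Merge or Cleanup subscription can ever pick up.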
Impact of the bug
WMAgent
Describe the bug
I am creating this issue because it's the second or third time that I see such problems with agents in the 2.0.2 series.
Running the drainAgent.py script on vocms0282, returns a report like: https://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_11232/drain_vocms0282.log with the following unfinished work:
Running
in both local and global workqueue
How to reproduce it
Needs to be debugged
Expected behavior
First, we need to find out why those files remain available for those subscriptions (JobCreator should pick those up, create jobs and move those files to the list of acquired files for those same subscriptions). This might trigger the required actions to complete all the others.
Additional context and error message
None