dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Unmerged files inserted into WMBS without any location, thus being forever stuck #11232

Closed amaltaro closed 1 year ago

amaltaro commented 2 years ago

Impact of the bug: WMAgent

Describe the bug: I am creating this issue because this is the second or third time that I have seen such problems with agents in the 2.0.2 series.

Running the drainAgent.py script on vocms0282 returns a report like: https://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_11232/drain_vocms0282.log with the following unfinished work:

How to reproduce it: Needs to be debugged

Expected behavior: First, we need to find out why those files remain available for those subscriptions (JobCreator should pick them up, create jobs and move the files to the list of acquired files for those same subscriptions). This might trigger the required actions to complete all the others.

Additional context and error message: None

amaltaro commented 2 years ago

I didn't check everything, but zooming in on the following ParentlessMergeBySize subscription:

47068   /cmsunified_task_BPH-RunIISummer20UL17GEN-00127__v1_T_220224_123156_2874/BPH-RunIISummer20UL17GEN-00127_0/BPH-RunIISummer20UL17MiniAODv2-00167_0MergeMINIAODSIMoutput

we have one file available for job creation, see:

> select * from wmbs_sub_files_available where subscription=47068;
subscription      fileid
47068   7150962

Files for this subscription algorithm are loaded through this DAO: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMBS/MySQL/Subscriptions/GetFilesForParentlessMerge.py#L25

which returns an empty list. That means one of its constraints is not being fulfilled, and here it is: that file id does not have any location:

> select * from wmbs_file_location where fileid=7150962;
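Why a missing location row makes the file invisible to job creation can be reproduced on a toy schema. The sketch below is a simplified stand-in and not the actual SQL in GetFilesForParentlessMerge; it only assumes that the DAO joins available files against wmbs_file_location, which is consistent with the behavior seen here. The tables are reduced to the columns used in this thread, and file id 7150963 is a made-up neighbour added for contrast.

```python
import sqlite3


def build_toy_wmbs():
    """Create an in-memory toy database: two available files, one without a location."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE wmbs_sub_files_available (subscription INTEGER, fileid INTEGER);
        CREATE TABLE wmbs_file_location (fileid INTEGER, pnn INTEGER);
        -- 7150962 is the stuck file from this issue; 7150963 is hypothetical
        INSERT INTO wmbs_sub_files_available VALUES (47068, 7150962);
        INSERT INTO wmbs_sub_files_available VALUES (47068, 7150963);
        -- only the hypothetical file gets a location row
        INSERT INTO wmbs_file_location VALUES (7150963, 1);
        """
    )
    return conn


def files_with_location(conn, subscription):
    """Return available file ids that survive an INNER JOIN on the location table."""
    rows = conn.execute(
        """
        SELECT wsfa.fileid
          FROM wmbs_sub_files_available wsfa
         INNER JOIN wmbs_file_location wfl ON wfl.fileid = wsfa.fileid
         WHERE wsfa.subscription = ?
         ORDER BY wsfa.fileid
        """,
        (subscription,),
    ).fetchall()
    return [row[0] for row in rows]
```

With this data, `files_with_location(build_toy_wmbs(), 47068)` only returns the file that has a location row; the other one stays available but is never handed to job creation.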

CONCLUSION: jobs are not getting created because unmerged output files are inserted into the database without any location, thus keeping that subscription, and its dependencies, stuck. For vocms0282, we have a total of 6 available-file entries without any location (3 distinct files, each in one Merge and one Cleanup subscription), explaining the 3 workflows still stuck in this agent. The following query can be used:

SELECT wsfa.* FROM wmbs_sub_files_available wsfa
  LEFT JOIN wmbs_file_location wfl ON wfl.fileid = wsfa.fileid
  WHERE wfl.fileid is NULL;

but to see the task name and subscription type, one can expand it to:

SELECT wsfa.*, wst.name, ww.task FROM wmbs_sub_files_available wsfa
  LEFT JOIN wmbs_file_location wfl ON wfl.fileid = wsfa.fileid
  INNER JOIN wmbs_subscription ws ON ws.id = wsfa.subscription
  INNER JOIN wmbs_workflow ww ON ww.id = ws.workflow
  INNER JOIN wmbs_sub_types wst ON wst.id = ws.subtype
  WHERE wfl.fileid is NULL;

subscription  fileid    name     task
58577         12982636  Cleanup  /cmsunified_task_BPH-RunIISummer20UL16GEN-00127__v1_T_220316_105637_8586/BPH-RunIISummer20UL16GEN-00127_0/BPH-RunIISummer20UL16MiniAODv2-00169_0CleanupUnmergedMINIAODSIMoutput
47067         7150962   Cleanup  /cmsunified_task_BPH-RunIISummer20UL17GEN-00127__v1_T_220224_123156_2874/BPH-RunIISummer20UL17GEN-00127_0/BPH-RunIISummer20UL17MiniAODv2-00167_0CleanupUnmergedMINIAODSIMoutput
65217         12841725  Cleanup  /cmsunified_task_BPH-RunIISummer20UL18GEN-00146__v1_T_220316_105630_9389/BPH-RunIISummer20UL18GEN-00146_0/BPH-RunIISummer20UL18MiniAODv2-00211_0CleanupUnmergedMINIAODSIMoutput
58578         12982636  Merge    /cmsunified_task_BPH-RunIISummer20UL16GEN-00127__v1_T_220316_105637_8586/BPH-RunIISummer20UL16GEN-00127_0/BPH-RunIISummer20UL16MiniAODv2-00169_0MergeMINIAODSIMoutput
47068         7150962   Merge    /cmsunified_task_BPH-RunIISummer20UL17GEN-00127__v1_T_220224_123156_2874/BPH-RunIISummer20UL17GEN-00127_0/BPH-RunIISummer20UL17MiniAODv2-00167_0MergeMINIAODSIMoutput
65218         12841725  Merge    /cmsunified_task_BPH-RunIISummer20UL18GEN-00146__v1_T_220316_105630_9389/BPH-RunIISummer20UL18GEN-00146_0/BPH-RunIISummer20UL18MiniAODv2-00211_0MergeMINIAODSIMoutput
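For repeated use across agents, the anti-join above can be wrapped in a small helper that groups the result by file id, which makes it easier to see how many distinct files are affected. This is only a sketch: it runs against an in-memory SQLite copy of the two tables (the real agents use MariaDB or Oracle), and the function names are mine, not part of WMCore. The demo rows are the subscription and file ids from the output above.

```python
import sqlite3
from collections import defaultdict

# Same LEFT JOIN ... IS NULL pattern as the query in this comment
ANTI_JOIN = """
SELECT wsfa.subscription, wsfa.fileid
  FROM wmbs_sub_files_available wsfa
  LEFT JOIN wmbs_file_location wfl ON wfl.fileid = wsfa.fileid
 WHERE wfl.fileid IS NULL
 ORDER BY wsfa.fileid, wsfa.subscription
"""


def locationless_files(conn):
    """Map each file id without a location to the subscriptions waiting on it."""
    grouped = defaultdict(list)
    for subscription, fileid in conn.execute(ANTI_JOIN):
        grouped[fileid].append(subscription)
    return dict(grouped)


def demo_connection():
    """Toy database holding the six rows reported for vocms0282, with no locations."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE wmbs_sub_files_available (subscription INTEGER, fileid INTEGER)")
    conn.execute("CREATE TABLE wmbs_file_location (fileid INTEGER, pnn INTEGER)")
    rows = [
        (58577, 12982636), (47067, 7150962), (65217, 12841725),
        (58578, 12982636), (47068, 7150962), (65218, 12841725),
    ]
    conn.executemany("INSERT INTO wmbs_sub_files_available VALUES (?, ?)", rows)
    return conn
```

Calling `locationless_files(demo_connection())` yields three file ids, each with its Cleanup and Merge subscription.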

To move forward with draining this agent, I will set a location (the most common one for those workflows) for the 3 files waiting to be merged.

amaltaro commented 2 years ago

Now that we know what the problem really is, I thought it would be better to repurpose this issue (instead of making it a one-agent problem).

amaltaro commented 2 years ago

And this issue happened again, this time on vocms0255, for this workflow: cmsunified_task_BPH-RunIISummer20UL16GENAPV-00121__v1_T_220316_105601_9405

I copied all the tarballs from JobArchiver and ran this script: https://github.com/amaltaro/ProductionTools/blob/master/untarLogArchive.py

on all of them, without any success in finding the LFN that has no PNN associated with it, which is:

SQL> select * from wmbs_sub_files_available where subscription=30;
SUBSCRIPTION     FILEID
------------ ----------
      30     892671

SQL> select lfn from wmbs_file_details where id=892671;
ID      LFN                                                                                                                                                                                             FILESIZE    EVENTS  FIRST_EVENT M
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
892671  /store/unmerged/RunIISummer20UL16MiniAODAPVv2/LambdaBTopKMuMu_TuneCP5_13TeV-pythia8-evtgen/MINIAODSIM/106X_mcRun2_asymptotic_preVFP_v11-v2/2550004/237CF230-F455-4C46-BEE4-C0A8CF759F70.root    5525759     78      2125229681  0

This node had been running out of space for many days and I had to actively delete data to make room; perhaps some of the deleted data contained this output file. Anyhow, I will get back to this the next time it occurs.

amaltaro commented 2 years ago

And this happened to vocms0256 as well, for the following file:

select lfn from wmbs_file_details where id=8374072;
/store/unmerged/RunIISummer20UL16MiniAODv2/ZH_HToZZTo4Q_2LFilter_M2500_TuneCP5_13TeV_powheg2-minlo-HZJ_JHUGenV735_pythia8/MINIAODSIM/106X_mcRun2_asymptotic_v17-v3/2560008/B0D156FB-7D6B-B141-971D-29518395A913.root

and these are the subscriptions:

SELECT wsfa.*, wst.name, ww.task FROM wmbs_sub_files_available wsfa
  LEFT JOIN wmbs_file_location wfl ON wfl.fileid = wsfa.fileid
  INNER JOIN wmbs_subscription ws ON ws.id = wsfa.subscription
  INNER JOIN wmbs_workflow ww ON ww.id = ws.workflow
  INNER JOIN wmbs_sub_types wst ON wst.id = ws.subtype
  WHERE wfl.fileid is NULL;

SUBSCRIPTION    FILEID      NAME        TASK
----------------------------------------------------------------------------------------------------------------------------------------------------------------
46109           8374072     Cleanup     /cmsunified_task_HIG-RunIISummer20UL16wmLHEGEN-00772__v1_T_220602_160150_1846/HIG-RunIISummer20UL16wmLHEGEN-00772_0/HIG-RunIISummer20UL16MiniAODv2-04037_0CleanupUnmergedMINIAODSIMoutput
46110           8374072     Merge       /cmsunified_task_HIG-RunIISummer20UL16wmLHEGEN-00772__v1_T_220602_160150_1846/HIG-RunIISummer20UL16wmLHEGEN-00772_0/HIG-RunIISummer20UL16MiniAODv2-04037_0MergeMINIAODSIMoutput

and again I untarred all the log tarballs under the JobArchiver component, and yet again found no occurrence of that file UID.
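The search through the log tarballs can also be done without unpacking them to disk. The sketch below is not the untarLogArchive.py script linked earlier; it is a minimal alternative that only uses the standard-library tarfile module, and the function name is hypothetical. It looks for a plain byte match in each regular member and does not descend into tarballs nested inside the archive.

```python
import tarfile


def find_in_tarball(tarball_path, needle):
    """Return the names of archive members whose raw content contains `needle`."""
    needle_bytes = needle.encode("utf-8")
    hits = []
    # "r:*" lets tarfile detect the compression (bz2, gz, ...) on its own
    with tarfile.open(tarball_path, "r:*") as archive:
        for member in archive:
            if not member.isfile():
                continue  # skip directories, links and other special entries
            handle = archive.extractfile(member)
            if handle is None:
                continue
            if needle_bytes in handle.read():
                hits.append(member.name)
    return hits
```

Passing the file UID (the last path component of the LFN without `.root`) as `needle` keeps the match independent of how the full path is written in each log.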

I think we will have to start logging the file names produced by a job in the logs, such that we can eventually track this down.

amaltaro commented 1 year ago

And we have this same problem with vocms0253 as well. Here is a list of subscriptions with files without any location:

SELECT wsfa.*, wst.name, ww.task FROM wmbs_sub_files_available wsfa
  LEFT JOIN wmbs_file_location wfl ON wfl.fileid = wsfa.fileid
  INNER JOIN wmbs_subscription ws ON ws.id = wsfa.subscription
  INNER JOIN wmbs_workflow ww ON ww.id = ws.workflow
  INNER JOIN wmbs_sub_types wst ON wst.id = ws.subtype
  WHERE wfl.fileid is NULL;

SUBSCRIPTION    FILEID      NAME        TASK
----------------------------------------------------------------------------------------------------------------------------------------------------------------
42287   7942382 Cleanup /cmsunified_task_EXO-RunIISummer20UL17wmLHEGEN-02150__v1_T_220727_175210_5336/EXO-RunIISummer20UL17wmLHEGEN-02150_0/EXO-RunIISummer20UL17MiniAODv2-02446_0CleanupUnmergedMINIAODSIMoutput
42288   7942382 Merge   /cmsunified_task_EXO-RunIISummer20UL17wmLHEGEN-02150__v1_T_220727_175210_5336/EXO-RunIISummer20UL17wmLHEGEN-02150_0/EXO-RunIISummer20UL17MiniAODv2-02446_0MergeMINIAODSIMoutput
44786   8436720 Merge   /cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-09137__v1_T_220815_213551_9946/HIG-RunIISummer20UL16wmLHEGENAPV-09137_0/HIG-RunIISummer20UL16MiniAODAPVv2-07800_0MergeMINIAODSIMoutput
44785   8436720 Cleanup /cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-09137__v1_T_220815_213551_9946/HIG-RunIISummer20UL16wmLHEGENAPV-09137_0/HIG-RunIISummer20UL16MiniAODAPVv2-07800_0CleanupUnmergedMINIAODSIMoutput
45127   8988497 Cleanup /cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-09890__v1_T_220720_133208_1929/HIG-RunIISummer20UL16wmLHEGENAPV-09890_0/HIG-RunIISummer20UL16MiniAODAPVv2-08417_0CleanupUnmergedMINIAODSIMoutput
45128   8988497 Merge   /cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-09890__v1_T_220720_133208_1929/HIG-RunIISummer20UL16wmLHEGENAPV-09890_0/HIG-RunIISummer20UL16MiniAODAPVv2-08417_0MergeMINIAODSIMoutput

with the same pattern as the one above, one Merge and one Cleanup task for each workflow.

I zoomed in on the following workflow: cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-09137__v1_T_220815_213551_9946

and I could not find any trace of this file

/store/unmerged/RunIISummer20UL16MiniAODAPVv2/NMSSM_XToYHTo2Tau2B_MX-3000_MY-2400_TuneCP5_13TeV-madgraph-pythia8/MINIAODSIM/106X_mcRun2_asymptotic_preVFP_v11-v2/2530000/B4E4A367-EFF8-2B4F-9EC4-3A7C28B7EC00.root

either in the job logs from JobArchiver (wmagentJob.log) or in any of the tarballs uploaded to EOSCMS: https://eoscmsweb.cern.ch/eos/cms/store/logs/prod/recent/PRODUCTION/cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-09137__v1_T_220815_213551_9946/HIG-RunIISummer20UL16wmLHEGENAPV-09137_0

To be able to complete these workflows, I am going to set a location for those files, trying to pick the most common location. For the workflow above, this is what I did in the database:

INSERT INTO wmbs_file_location (fileid, pnn)
SELECT 8436720, (SELECT id from wmbs_pnns where pnn='T1_DE_KIT_Disk') FROM DUAL;

the other two have been updated with (note that the location might not be the correct one):

INSERT INTO wmbs_file_location (fileid, pnn)
SELECT 7942382, (SELECT id from wmbs_pnns where pnn='T2_US_Purdue') FROM DUAL;

INSERT INTO wmbs_file_location (fileid, pnn)
SELECT 8988497, (SELECT id from wmbs_pnns where pnn='T2_US_Wisconsin') FROM DUAL;
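The manual procedure (pick the most common location, then insert a row) can be sketched as follows. This is an illustration on SQLite, not something that exists in WMCore; both function names are hypothetical. One deliberate difference from the statements above: the INSERT selects directly from wmbs_pnns, so an unknown PNN name inserts zero rows instead of relying on a scalar subquery that would evaluate to NULL.

```python
import sqlite3
from collections import Counter


def most_common_pnn(job_sites):
    """Return the PNN that appears most often in `job_sites` (a list of PNN names)."""
    if not job_sites:
        raise ValueError("no candidate locations given")
    return Counter(job_sites).most_common(1)[0][0]


def set_file_location(conn, fileid, pnn):
    """Attach `pnn` to `fileid`; return the number of rows inserted (0 if the PNN is unknown)."""
    cursor = conn.execute(
        "INSERT INTO wmbs_file_location (fileid, pnn) "
        "SELECT ?, id FROM wmbs_pnns WHERE pnn = ?",
        (fileid, pnn),
    )
    return cursor.rowcount
```

As noted above, the chosen location might not be the correct one; the helper only automates the guess, it does not verify where the file actually is.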

To debug this better, I think we need to record in the JobAccountant log all the output files created by a given job.

amaltaro commented 1 year ago

Another update on this, this time on submit5 with the following workflow stuck in running-closed: cmsunified_task_HIG-RunIISummer20UL17wmLHEGEN-00638__v1_T_220623_202351_1662

Performing the same procedure as in the previous workflows, querying for files without any location in WMBS, gives us this:

MariaDB [wmagent]> SELECT wsfa.*, wst.name, ww.task FROM wmbs_sub_files_available wsfa
    ->   LEFT JOIN wmbs_file_location wfl ON wfl.fileid = wsfa.fileid
    ->   INNER JOIN wmbs_subscription ws ON ws.id = wsfa.subscription
    ->   INNER JOIN wmbs_workflow ww ON ww.id = ws.workflow
    ->   INNER JOIN wmbs_sub_types wst ON wst.id = ws.subtype
    ->   WHERE wfl.fileid is NULL;
+--------------+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| subscription | fileid  | name    | task                                                                                                                                                                                      |
+--------------+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|        19099 | 1357104 | Cleanup | /cmsunified_task_HIG-RunIISummer20UL17wmLHEGEN-00638__v1_T_220623_202351_1662/HIG-RunIISummer20UL17wmLHEGEN-00638_0/HIG-RunIISummer20UL17MiniAODv2-06201_0CleanupUnmergedMINIAODSIMoutput |
|        19100 | 1357104 | Merge   | /cmsunified_task_HIG-RunIISummer20UL17wmLHEGEN-00638__v1_T_220623_202351_1662/HIG-RunIISummer20UL17wmLHEGEN-00638_0/HIG-RunIISummer20UL17MiniAODv2-06201_0MergeMINIAODSIMoutput           |
+--------------+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.001 sec)

FNAL and NERSC seem to be popular destinations for this workflow, so I am defining that file as available at FNAL:

MariaDB [wmagent]> INSERT INTO wmbs_file_location (fileid, pnn)
SELECT 1357104, (SELECT id from wmbs_pnns where pnn='T1_US_FNAL_Disk');

with no need to commit these changes, given that MariaDB has auto-commit already enabled.

amaltaro commented 1 year ago

And I found 3 workflows stuck in submit6, as follows:

+--------------+----------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| subscription | fileid   | name    | task                                                                                                                                                                                                                                                    |
+--------------+----------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|        35719 |  5596650 | Cleanup | /pdmvserv_task_SUS-RunIIFall17FSPremix-00207__v1_T_220929_170653_8282/SUS-RunIIFall17FSPremix-00207_0/SUS-RunIIFall17FSPremix-00207_0MergeAODSIMoutput/SUS-RunIIFall17MiniAODv2-00642_0/SUS-RunIIFall17MiniAODv2-00642_0CleanupUnmergedMINIAODSIMoutput |
|        35720 |  5596650 | Merge   | /pdmvserv_task_SUS-RunIIFall17FSPremix-00207__v1_T_220929_170653_8282/SUS-RunIIFall17FSPremix-00207_0/SUS-RunIIFall17FSPremix-00207_0MergeAODSIMoutput/SUS-RunIIFall17MiniAODv2-00642_0/SUS-RunIIFall17MiniAODv2-00642_0MergeMINIAODSIMoutput           |
|        74269 |  9644986 | Cleanup | /cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-11204__v1_T_220908_160011_6520/HIG-RunIISummer20UL16wmLHEGENAPV-11204_0/HIG-RunIISummer20UL16MiniAODAPVv2-09387_0CleanupUnmergedMINIAODSIMoutput                                                      |
|        74270 |  9644986 | Merge   | /cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-11204__v1_T_220908_160011_6520/HIG-RunIISummer20UL16wmLHEGENAPV-11204_0/HIG-RunIISummer20UL16MiniAODAPVv2-09387_0MergeMINIAODSIMoutput                                                                |
|        89125 | 10690502 | Cleanup | /cmsunified_task_HIG-RunIISummer20UL16wmLHEGEN-11893__v1_T_220908_132706_4996/HIG-RunIISummer20UL16wmLHEGEN-11893_0/HIG-RunIISummer20UL16MiniAODv2-10758_0CleanupUnmergedMINIAODSIMoutput                                                               |
|        89126 | 10690502 | Merge   | /cmsunified_task_HIG-RunIISummer20UL16wmLHEGEN-11893__v1_T_220908_132706_4996/HIG-RunIISummer20UL16wmLHEGEN-11893_0/HIG-RunIISummer20UL16MiniAODv2-10758_0MergeMINIAODSIMoutput                                                                         |
+--------------+----------+---------+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
6 rows in set (0.008 sec)

Luckily, this time I could find a log tarball for the _4996 workflow above, where this file has no location:

MariaDB [wmagent]> select lfn from wmbs_file_details where id=10690502;
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| lfn                                                                                                                                                                                                        |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| /store/unmerged/RunIISummer20UL16MiniAODv2/NMSSM_XToYHTo2W2BTo4Q2B_MX-3500_MY-1600_TuneCP5_13TeV-madgraph-pythia8/MINIAODSIM/106X_mcRun2_asymptotic_v17-v2/60000/CFC2B499-098E-0143-8A7D-BED766ED7D87.root |
+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.000 sec)

Running the untar script mentioned earlier, I found the tarball at: /data/srv/wmagent/current/install/wmagentpy3/JobArchiver/logDir/c/cmsunified_task_HIG-RunIISummer20UL16wmLHEGEN-11893__v1_T_220908_132706_4996/JobCluster_4345/Job_4345927.tar.bz2

I also placed a copy of it under https://amaltaro.web.cern.ch/amaltaro/forWMCore/Issue_11232/

Inspecting the last report pickle file (Report.3.pkl) and dumping its content as JSON, I can see that cmsRun6 is actually missing the location attribute, which is very likely the cause of this bug. A snippet of the dumped report:

           'cmsRun5': {'analysis': {},
                       'output': {'AODSIMoutput': [{'InputPFN': '/srv/job/WMTaskSpace/cmsRun5/AODSIMoutput.root',
                                                    'inputpfns': ['../cmsRun4/RAWSIMoutput.root'],
                                                    'lfn': '/store/unmerged/RunIISummer20UL16RECO/NMSSM_XToYHTo2W2BTo4Q2B_MX-3500_MY-1600_TuneCP5_13TeV-madgraph-pythia8/AODSIM/106X_mcRun2_asymptotic_v13-v2/60000/83BC5087-21BD-6140-9118-51204C0B64B9.root',
                                                    'location': 'T2_CH_CSCS',
                                                    'merged': False,
                                                    'module_label': 'AODSIMoutput',
                       'site': 'T2_CH_CSCS',

           'cmsRun6': {'analysis': {},
                       'output': {'MINIAODSIMoutput': [{'acquisitionEra': 'RunIISummer20UL16MiniAODv2',
                                                        'inputpfns': ['../cmsRun5/AODSIMoutput.root'],
                                                        'lfn': '/store/unmerged/RunIISummer20UL16MiniAODv2/NMSSM_XToYHTo2W2BTo4Q2B_MX-3500_MY-1600_TuneCP5_13TeV-madgraph-pythia8/MINIAODSIM/106X_mcRun2_asymptotic_v17-v2/60000/CFC2B499-098E-0143-8A7D-BED766ED7D87.root',
                                                        'merged': False,
                                                        'module_label': 'MINIAODSIMoutput',
                       'site': 'T2_CH_CSCS',
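Spotting this by eye in a large dump is error-prone, so the check can be scripted. The sketch below assumes the dumped report is a plain dict keyed by step name, where each step has an `output` dict mapping module labels to lists of file dicts, as in the snippet above; the real report may wrap this in further levels. The function name is hypothetical.

```python
def outputs_missing_location(report):
    """List (step, module, lfn) for every output file without a non-empty 'location'."""
    missing = []
    for step_name, step in report.items():
        if not isinstance(step, dict):
            continue  # ignore non-step entries at the top level
        for module, files in step.get("output", {}).items():
            for fileinfo in files:
                # a missing key and an empty value are treated the same way
                if not fileinfo.get("location"):
                    missing.append((step_name, module, fileinfo.get("lfn")))
    return missing
```

For a report shaped like the snippet, this would flag the MINIAODSIMoutput file of cmsRun6 and leave the AODSIMoutput file of cmsRun5 alone.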

I am setting a file location for those 3 files such that this agent can be completely shut down in the coming days.

For the fix, even if we cannot reproduce or identify the root cause, I am pretty sure we can update JobAccountant to actually fail a job that comes with output files without any location. It seems to happen once every 5M jobs anyways.
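A minimal sketch of such a guard, to make the proposal concrete. It is not the JobAccountant code: the exception and function names are hypothetical, and it assumes each output file is a dict carrying an `lfn` and a `locations` collection.

```python
class MissingLocationError(Exception):
    """Raised when a job report lists an output file without any location."""


def check_output_locations(job_id, output_files):
    """Return the number of output files, or raise if any of them has no location."""
    missing = [fileinfo.get("lfn") for fileinfo in output_files
               if not fileinfo.get("locations")]
    if missing:
        raise MissingLocationError(
            "Job %s has output files without any location: %s"
            % (job_id, ", ".join(str(lfn) for lfn in missing))
        )
    return len(output_files)
```

Failing the whole job leaves the decision to the normal retry logic, which seems acceptable given how rarely this occurs.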

amaltaro commented 1 year ago

Okay, here we go with hopefully our last batch of problems related to this issue. vocms0282 had the following work stuck:

SUBSCRIPTION     FILEID NAME    TASK
------------ ---------- ------- ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
       12346    2693646 Cleanup /pdmvserv_task_SMP-RunIIFall18GS-00033__v1_T_210321_070153_1970/SMP-RunIIFall18GS-00033_0/SMP-RunIIFall18GS-00033_0CleanupUnmergedRAWSIMoutput
        2415    488915  Cleanup /cmsunified_task_BPH-RunIISummer20UL17GEN-00164__v1_T_221116_142121_9065/BPH-RunIISummer20UL17GEN-00164_0/BPH-RunIISummer20UL17RECO-00236_0CleanupUnmergedAODSIMoutput
       50627   12368897 Cleanup /cmsunified_task_BPH-RunIISummer20UL17GEN-00172__v1_T_221223_095339_6847/BPH-RunIISummer20UL17GEN-00172_0/BPH-RunIISummer20UL17MiniAODv2-00239_0CleanupUnmergedMINIAODSIMoutput
        2416     488915 Merge   /cmsunified_task_BPH-RunIISummer20UL17GEN-00164__v1_T_221116_142121_9065/BPH-RunIISummer20UL17GEN-00164_0/BPH-RunIISummer20UL17RECO-00236_0MergeAODSIMoutput
       50628   12368897 Merge   /cmsunified_task_BPH-RunIISummer20UL17GEN-00172__v1_T_221223_095339_6847/BPH-RunIISummer20UL17GEN-00172_0/BPH-RunIISummer20UL17MiniAODv2-00239_0MergeMINIAODSIMoutput

I picked the sites that ran most of the jobs for these workflows, and updated the file locations in the vocms0282 database as follows:

INSERT INTO wmbs_file_location (fileid, pnn) SELECT 2693646, (SELECT id from wmbs_pnns where pnn='T2_US_Purdue') FROM DUAL;

INSERT INTO wmbs_file_location (fileid, pnn) SELECT 488915, (SELECT id from wmbs_pnns where pnn='T1_US_FNAL_Disk') FROM DUAL;

INSERT INTO wmbs_file_location (fileid, pnn) SELECT 12368897, (SELECT id from wmbs_pnns where pnn='T1_RU_JINR_Disk') FROM DUAL;

amaltaro commented 1 year ago

While draining the agent vocms0281, I found the following 6 workflows also affected by this issue:

cmsunified_task_HIG-RunIISummer20UL16wmLHEGEN-13260__v1_T_221216_194936_5512
cmsunified_task_HIG-RunIISummer20UL16wmLHEGENAPV-12397__v1_T_221222_104854_8737
cmsunified_task_BPH-RunIISummer20UL16GENAPV-00150__v1_T_221116_142318_5175
cmsunified_task_JME-RunIISummer20UL17GEN-00013__v1_T_221214_205126_9607
cmsunified_task_HIG-RunIISummer20UL18wmLHEGEN-12454__v1_T_221222_120250_14
cmsunified_task_HIG-RunIISummer20UL18wmLHEGEN-12437__v1_T_221216_201835_711

here are the database updates performed on this agent:

INSERT INTO wmbs_file_location (fileid, pnn) SELECT 10023121, (SELECT id from wmbs_pnns where pnn='T1_FR_CCIN2P3_Disk') FROM DUAL;
INSERT INTO wmbs_file_location (fileid, pnn) SELECT 12802815, (SELECT id from wmbs_pnns where pnn='T1_FR_CCIN2P3_Disk') FROM DUAL;
INSERT INTO wmbs_file_location (fileid, pnn) SELECT 1223760, (SELECT id from wmbs_pnns where pnn='T1_US_FNAL_Disk') FROM DUAL;
INSERT INTO wmbs_file_location (fileid, pnn) SELECT 6361241, (SELECT id from wmbs_pnns where pnn='T1_US_FNAL_Disk') FROM DUAL;
INSERT INTO wmbs_file_location (fileid, pnn) SELECT 11633841, (SELECT id from wmbs_pnns where pnn='T1_FR_CCIN2P3_Disk') FROM DUAL;
INSERT INTO wmbs_file_location (fileid, pnn) SELECT 10301323, (SELECT id from wmbs_pnns where pnn='T1_FR_CCIN2P3_Disk') FROM DUAL;

amaltaro commented 1 year ago

And submit7 had this workflow with files without any location:

+--------------+----------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| subscription | fileid   | name    | task                                                                                                                                                                                     |
+--------------+----------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|        85490 | 12127202 | Cleanup | /cmsunified_task_B2G-RunIISummer20UL17wmLHEGEN-03621__v1_T_221221_123945_866/B2G-RunIISummer20UL17wmLHEGEN-03621_0/B2G-RunIISummer20UL17MiniAODv2-02311_0CleanupUnmergedMINIAODSIMoutput |
|        85491 | 12127202 | Merge   | /cmsunified_task_B2G-RunIISummer20UL17wmLHEGEN-03621__v1_T_221221_123945_866/B2G-RunIISummer20UL17wmLHEGEN-03621_0/B2G-RunIISummer20UL17MiniAODv2-02311_0MergeMINIAODSIMoutput           |
+--------------+----------+---------+------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.016 sec)

which has been updated to:

MariaDB [wmagent]> INSERT INTO wmbs_file_location (fileid, pnn) SELECT 12127202, (SELECT id from wmbs_pnns where pnn='T2_DE_RWTH');

amaltaro commented 1 year ago

And hopefully the last such incident to be reported here, this time on submit8:

+--------------+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| subscription | fileid  | name    | task                                                                                                                                                                                      |
+--------------+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|        10613 | 2549632 | Cleanup | /cmsunified_task_SMP-RunIISummer20UL17wmLHEGEN-00533__v1_T_220929_194847_9948/SMP-RunIISummer20UL17wmLHEGEN-00533_0/SMP-RunIISummer20UL17MiniAODv2-00254_0CleanupUnmergedMINIAODSIMoutput |
|        10614 | 2549632 | Merge   | /cmsunified_task_SMP-RunIISummer20UL17wmLHEGEN-00533__v1_T_220929_194847_9948/SMP-RunIISummer20UL17wmLHEGEN-00533_0/SMP-RunIISummer20UL17MiniAODv2-00254_0MergeMINIAODSIMoutput           |
+--------------+---------+---------+-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 rows in set (0.080 sec)

which has been fixed with:

MariaDB [wmagent]> INSERT INTO wmbs_file_location (fileid, pnn) SELECT 2549632, (SELECT id from wmbs_pnns where pnn='T1_US_FNAL_Disk');

amaltaro commented 2 months ago

In case it helps, here is a log file: https://amaltaro.web.cern.ch/forWMCore/Issue_11232/Job_3956040.tar.bz2

where the first job attempt failed in JobAccountant with:

2024-08-07 16:11:58,578:140543013328640:WARNING:AccountantWorker:Job 3956040 accepted for multi-step CMSSW, even though the expected outputModules does not match content of the FWJR.
2024-08-07 16:11:58,579:140543013328640:WARNING:AccountantWorker:The following file does not have any location: {'lfn': '/store/unmerged/Run3Summer22DRPremix/JPsiMuMuMuMu_JPsiNoFilter_4MuPtEtaFilter_TuneCP5_13p6TeV-pythia8-evtgen/AODSIM/124X_mcRun3_2022_realistic_v12-v3/2810016/087a60ad-655e-4eb8-94a3-6d07d088a9a1.root', 'size': 38562758, 'events': 88, 'checksums': {'adler32': 'f8d1041e', 'cksum': '3468691647'}, 'runs': {<WMCore.DataStructs.Run.Run object at 0x7fd2afeb6670>}, 'merged': False, 'last_event': 0, 'first_event': 0, 'locations': set(), 'parents': set(), 'pfn': '/srv/job/WMTaskSpace/cmsRun3/AODSIMoutput.root', 'branches': [], 'input': [''], 'inputpfns': ['file:../cmsRun2/PREMIXRAWoutput.root'], 'branch_hash': '66aedf5878c8cb3b708d7fca0fa6bce1', 'catalog': '', 'guid': '087a60ad-655e-4eb8-94a3-6d07d088a9a1', 'module_label': 'AODSIMoutput', 'dataset': {'applicationName': 'cmsRun', 'applicationVersion': 'CMSSW_12_4_16', 'primaryDataset': 'JPsiMuMuMuMu_JPsiNoFilter_4MuPtEtaFilter_TuneCP5_13p6TeV-pythia8-evtgen', 'processedDataset': 'Run3Summer22DRPremix-124X_mcRun3_2022_realistic_v12-v3', 'dataTier': 'AODSIM'}, 'acquisitionEra': 'Run3Summer22DRPremix', 'processingVer': 3, 'validStatus': 'PRODUCTION', 'globalTag': '124X_mcRun3_2022_realistic_v12', 'prep_id': 'BPH-Run3Summer22DRPremix-00185', 'configURL': 'https://cmsweb.cern.ch/couchdb;;reqmgr_config_cache;;3c9eba5165582b80e1c9b828b7b0945d', 'inputPath': None, 'outputModule': 'AODSIMoutput', 'fileRef': <WMCore.Configuration.ConfigSection object at 0x7fd2aef2a730>}
2024-08-07 16:11:58,579:140543013328640:WARNING:AccountantWorker:Job 3956040 , bad jobReport, failing job

while the next retry went through (but at a different site).