Looking at the ParentlessMergeBySize algorithm, I noticed it doesn't see any of those files as available. The reason is that those 2 files have no location associated with them: they do not show up in the query below because of the INNER JOIN on wmbs_file_location.
select * from wmbs_sub_files_available
INNER JOIN wmbs_file_runlumi_map ON wmbs_sub_files_available.fileid = wmbs_file_runlumi_map.fileid
INNER JOIN wmbs_file_location ON wmbs_sub_files_available.fileid = wmbs_file_location.fileid
where subscription = 73398;
Going deeper into one of those job reports, there is indeed no location, and the Report getAllFiles() method returns the following data:
>>> fileList = jobReport.getAllFiles()
>>> pprint(fileList)
[{'acquisitionEra': 'RunIIFall18GS',
'branch_hash': '6853840b45c575051050a6e81a59f2d0',
'branches': [],
'catalog': '',
'checksums': {'adler32': '6335e421', 'cksum': '1333373001'},
'configURL': 'https://cmsweb.cern.ch/couchdb;;reqmgr_config_cache;;1126296323ab39f89913b87b1825c605',
'dataset': {'applicationName': 'cmsRun',
'applicationVersion': 'CMSSW_10_2_15_patch2',
'dataTier': 'GEN-SIM',
'primaryDataset': 'MinBias_TuneCP5_13TeV_NeutronHP-pythia8',
'processedDataset': 'RunIIFall18GS-SNB_102X_upgrade2018_realistic_v17-v1'},
'events': 195,
'fileRef': <WMCore.Configuration.ConfigSection object at 0x7f750fcd0890>,
'first_event': 0,
'globalTag': '102X_upgrade2018_realistic_v17',
'guid': '9C86D47B-110A-8446-BF89-4698B179E5E0',
'input': [],
'inputPath': None,
'inputpfns': [],
'last_event': 0,
'lfn': '/store/unmerged/RunIIFall18GS/MinBias_TuneCP5_13TeV_NeutronHP-pythia8/GEN-SIM/SNB_102X_upgrade2018_realistic_v17-v1/40008/9C86D47B-110A-8446-BF89-4698B179E5E0.root',
'locations': set([None]),
'merged': False,
'module_label': 'RAWSIMoutput',
'outputModule': 'RAWSIMoutput',
'parents': set([]),
'pfn': '/srv/job/WMTaskSpace/cmsRun1/RAWSIMoutput.root',
'prep_id': 'MUO-RunIIFall18GS-00027',
'processingVer': 1,
'runs': set([<WMCore.DataStructs.Run.Run object at 0x7f750fcd0390>]),
'size': 68012023,
'validStatus': 'PRODUCTION'},
{'acquisitionEra': None,
'branch_hash': None,
'branches': [],
'catalog': '',
'checksums': {'adler32': 'f5148aaa', 'cksum': '206908913'},
'configURL': None,
'dataset': {},
'events': 0,
'fileRef': <WMCore.Configuration.ConfigSection object at 0x7f750fcd0ed0>,
'first_event': 0,
'globalTag': None,
'guid': '',
'input': [],
'inputPath': None,
'inputpfns': [],
'last_event': 0,
'lfn': '/store/unmerged/logs/prod/2019/7/12/pdmvserv_task_MUO-RunIIFall18GS-00027__v1_T_190627_203802_3573/MUO-RunIIFall18GS-00027_0/0008/1/0fa2ce62-7299-459c-a995-bb47080bd0eb-854-1-logArchive.tar.gz',
'locations': set(['T2_CH_CSCS']),
'merged': False,
'module_label': 'logArchive',
'outputModule': 'logArchive',
'parents': set([]),
'pfn': 'srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/trivcat/store/unmerged/logs/prod/2019/7/12/pdmvserv_task_MUO-RunIIFall18GS-00027__v1_T_190627_203802_3573/MUO-RunIIFall18GS-00027_0/0008/1/0fa2ce62-7299-459c-a995-bb47080bd0eb-854-1-logArchive.tar.gz',
'prep_id': None,
'processingVer': None,
'runs': set([]),
'size': 0,
'validStatus': None}]
>>>
Note that the unmerged ROOT file has locations set to set([None]), while the logArchive tarball has the proper location set (puzzling!).
I compared this job report against those of other successful jobs and it is indeed missing the location attribute; that is why the subscription got stuck and no jobs get created.
Now looking at the runtime code, we do set the location attribute there:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/Steps/Executors/StageOut.py#L202
which uses the PNN value set here:
https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Storage/StageOutMgr.py#L205
So I fail to see why it was not set for those 2 jobs...
Another possible fix would be to guarantee that the JobAccountant fails such jobs if any output file is missing its location. I'll check that approach.
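For illustration, here is a minimal sketch of the kind of check that could be run against a job report before accepting it, using only the Report API shown above; the function name and the report path are made up, and this is not the actual WMCore/JobAccountant implementation:
from WMCore.FwkJobReport.Report import Report

def outputFilesWithoutLocation(reportPath):
    """Return the LFNs of output files whose 'locations' set is empty or only holds None."""
    report = Report()
    report.load(reportPath)
    badLfns = []
    for fileInfo in report.getAllFiles():
        realLocations = set(loc for loc in fileInfo.get('locations', set()) if loc)
        if not realLocations:
            badLfns.append(fileInfo['lfn'])
    return badLfns

# a report like the one above would be flagged here and the job could then be failed/retried
if outputFilesWithoutLocation("./Report.1.pkl"):
    print("Job has output file(s) with no location, do not accept it")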
I just found the same problem, for the same workflow and the same site, on vocms0280. The job is: install/wmagent/JobArchiver/logDir/p/pdmvserv_task_MUO-RunIIFall18GS-00027__v1_T_190627_203802_3573/JobCluster_547/Job_547963.tar.bz2
I have set the output unmerged file location to T2_CH_CSCS and a job got created.
While trying to get vocms0283 completely drained, I noticed that the agent has a few merge files available for job creation (and a bunch of others for other subscriptions) for this workflow: pdmvserv_task_SMP-RunIISummer15GS-00269__v1_T_190813_093712_4558, which has 40 LQE in Running state at the moment.
Apparently the bug fix in #9354 is only available in the latest stable release (1.2.6).
So either I'm following the wrong approach, or this process of mapping an output file back to its job is overly complicated (one could possibly use WMArchive as well).
Here is what I did on vocms0283.
Get a map of workflow task, subscription id, fileset, number of files available for each subscription, and so on (one has to provide the workflow name in the query):
SELECT wmbs_subscription.id AS subId, wmbs_workflow.task AS wfName, wmbs_workflow.injected As wfInj, wmbs_fileset.name AS fsetName,
wmbs_fileset.open AS fsetOpen, wmbs_subscription.finished AS subFin, wmbs_sub_types.name AS subType, count(wmbs_sub_files_available.fileid) AS filesAvail
FROM wmbs_subscription
INNER JOIN wmbs_sub_types ON wmbs_subscription.subtype = wmbs_sub_types.id
INNER JOIN wmbs_workflow ON wmbs_subscription.workflow = wmbs_workflow.id
INNER JOIN wmbs_fileset ON wmbs_subscription.fileset = wmbs_fileset.id
INNER JOIN wmbs_sub_files_available ON wmbs_sub_files_available.subscription = wmbs_subscription.id
WHERE wmbs_workflow.name='pdmvserv_task_SMP-RunIISummer15GS-00269__v1_T_190813_093712_4558'
AND wmbs_subscription.finished = 0
GROUP BY wmbs_subscription.id, wmbs_workflow.task, wmbs_workflow.injected, wmbs_fileset.name, wmbs_fileset.open,
wmbs_subscription.finished, wmbs_sub_types.name;
We're mostly interested in the Merge subscription type, so we look at that row. From the fileset name
/pdmvserv_task_SMP-RunIISummer15GS-00269__v1_T_190813_093712_4558/SMP-RunIISummer15GS-00269_0/SMP-RunIISummer16DR80-00067_0/SMP-RunIISummer16DR80-00067_1/unmerged-AODSIMoutputAODSIM
we extract the task that created the files that are now available; the task path is:
/pdmvserv_task_SMP-RunIISummer15GS-00269__v1_T_190813_093712_4558/SMP-RunIISummer15GS-00269_0/SMP-RunIISummer16DR80-00067_0/SMP-RunIISummer16DR80-00067_1
Using the subscription id above, we list the file id and run/lumi information:
select wsfa.fileid, wfrm.run, wfrm.lumi from wmbs_sub_files_available wsfa
inner join wmbs_file_details ON wmbs_file_details.id = wsfa.fileid
inner join wmbs_file_runlumi_map wfrm ON wfrm.fileid = wmbs_file_details.id
where wsfa.subscription=33544 ORDER BY wfrm.lumi;
Now the painful part: given the task path and one of the lumis from the previous command, we can check for jobs that processed lumis within a given range, e.g.:
SELECT wmbs_job_mask.*, wmbs_workflow.task FROM wmbs_job_mask
INNER JOIN wmbs_job ON wmbs_job.id = wmbs_job_mask.job
INNER JOIN wmbs_jobgroup ON wmbs_job.jobgroup = wmbs_jobgroup.id
INNER JOIN wmbs_subscription ON wmbs_jobgroup.subscription = wmbs_subscription.id
INNER JOIN wmbs_workflow ON wmbs_subscription.workflow = wmbs_workflow.id
WHERE wmbs_workflow.task = '/pdmvserv_task_SMP-RunIISummer15GS-00269__v1_T_190813_093712_4558/SMP-RunIISummer15GS-00269_0/SMP-RunIISummer16DR80-00067_0/SMP-RunIISummer16DR80-00067_1'
AND wmbs_job_mask.firstlumi >= 8541 AND wmbs_job_mask.lastlumi <= 8557;
Then we finally match the lumis of the available files against the lumi ranges assigned to jobs.
So I started looking for a job that processed the following fileid / lumi:
10268555 8541
10268555 8542
which mapped to job 3550823.
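This matching step can also be scripted instead of done by eye. Here is a small sketch where the (fileid, lumi) pairs and job mask rows are hand-copied from the query outputs above; the second mask row is made up just to show a non-matching range:
# (fileid, lumi) pairs from the wmbs_sub_files_available / wmbs_file_runlumi_map query
fileLumis = [(10268555, 8541), (10268555, 8542)]

# (job id, firstlumi, lastlumi) rows from the wmbs_job_mask query (second row is fake)
jobMasks = [(3550823, 8541, 8542), (9999999, 8543, 8557)]

# map each available file to the job(s) whose lumi mask covers its lumis
for fileId, lumi in fileLumis:
    for jobId, firstLumi, lastLumi in jobMasks:
        if firstLumi <= lumi <= lastLumi:
            print("fileid %s (lumi %s) was created by job %s" % (fileId, lumi, jobId))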
Looking at the job wmagentJob.log, it wrote to T2_CH_CSCS:
2019-09-06 23:10:15,907:INFO:StageOutMgr:===> Stage Out Successful: {'StageOutReport': [], 'LFN': '/store/unmerged/RunIISummer16DR80/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8/AODSIM/NoPU_80X_mcRun2_asymptotic_2016_TrancheIV_v6_ext1-v2/130010/3461849D-FAD0-E911-8FE8-00010100093D.root', 'PFN': 'srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/trivcat/store/unmerged/RunIISummer16DR80/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8/AODSIM/NoPU_80X_mcRun2_asymptotic_2016_TrancheIV_v6_ext1-v2/130010/3461849D-FAD0-E911-8FE8-00010100093D.root', 'PNN': 'T2_CH_CSCS', 'Checksums': {'adler32': '3fb0d152', 'cksum': '2258890055'}, 'StageOutCommand': 'gfal2'}
and looking into the job report:
>>> import pickle
>>> from WMCore.FwkJobReport.Report import Report
>>> from pprint import pprint
>>> jr = Report()
>>> jr.load("./Job_3550823/Report.1.pkl")
>>> pprint(jr.getAllFiles())
and voila
...
'locations': set([None]),
...
There must be a problem with that site that generates these no-location jobs...
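To check whether other job reports on the agent are affected, one can loop over the pickled reports with the same Report API; a small sketch, assuming the reports were unpacked somewhere under a reports/ directory (the glob pattern is illustrative):
import glob
from WMCore.FwkJobReport.Report import Report

# scan every pickled framework job report and print output files with no real location
for reportPath in glob.glob("reports/*/Report.*.pkl"):
    jr = Report()
    jr.load(reportPath)
    for fileInfo in jr.getAllFiles():
        if not any(fileInfo.get('locations', set())):
            print("%s %s %s" % (reportPath, fileInfo['lfn'], fileInfo['locations']))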
I haven't found a smart way to insert multiple rows based on a WHERE IN clause, so here it goes: find the PNN id and the file IDs to be inserted with:
SELECT id from wmbs_pnns WHERE pnn='T2_CH_CSCS';
SELECT fileid FROM wmbs_sub_files_available WHERE subscription=33544;
then inserting it
INSERT ALL
INTO wmbs_file_location (fileid, pnn) VALUES (10267223, 13)
INTO wmbs_file_location (fileid, pnn) VALUES (10268555, 13)
INTO wmbs_file_location (fileid, pnn) VALUES (10274827, 13)
INTO wmbs_file_location (fileid, pnn) VALUES (10276197, 13)
INTO wmbs_file_location (fileid, pnn) VALUES (10277661, 13)
INTO wmbs_file_location (fileid, pnn) VALUES (10284803, 13)
SELECT * FROM dual;
commit;
I just fixed a few other files that were missing the T2_CH_CSCS location as well.
Now, a much easier (and generic) way to find files sitting in available without a corresponding location in wmbs_file_location (e.g. on submit7):
MariaDB [wmagent]> select * from wmbs_sub_files_available wsfa
-> WHERE NOT EXISTS
-> ( SELECT null FROM wmbs_file_location wfl WHERE wfl.fileid = wsfa.fileid);
+--------------+----------+
| subscription | fileid |
+--------------+----------+
| 24931 | 10697320 |
| 24933 | 10697320 |
+--------------+----------+
2 rows in set (0.17 sec)
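For the record, the detection query and the fix can be combined in a short Python helper; this is just a sketch assuming a DB-API 2.0 connection to the agent database, a %s parameter style (it would be :1, :2 for Oracle), and that all location-less files really belong to T2_CH_CSCS:
def fixMissingLocations(dbConn, pnnName="T2_CH_CSCS"):
    """Attach pnnName to every available file that has no row in wmbs_file_location."""
    cursor = dbConn.cursor()

    # resolve the PNN id
    cursor.execute("SELECT id FROM wmbs_pnns WHERE pnn = %s", (pnnName,))
    pnnId = cursor.fetchone()[0]

    # find available files with no location at all (same query as above)
    cursor.execute("""SELECT DISTINCT wsfa.fileid FROM wmbs_sub_files_available wsfa
                      WHERE NOT EXISTS
                        (SELECT NULL FROM wmbs_file_location wfl WHERE wfl.fileid = wsfa.fileid)""")
    missingFileIds = [row[0] for row in cursor.fetchall()]

    # insert the missing location rows and commit
    for fileId in missingFileIds:
        cursor.execute("INSERT INTO wmbs_file_location (fileid, pnn) VALUES (%s, %s)",
                       (fileId, pnnId))
    dbConn.commit()
    return missingFileIds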
Files fixed on submit6 as well:
INSERT INTO wmbs_file_location (fileid, pnn)
VALUES
(6581639, (SELECT id from wmbs_pnns WHERE pnn='T2_CH_CSCS')),
(12719113, (SELECT id from wmbs_pnns WHERE pnn='T2_CH_CSCS')),
(12719153, (SELECT id from wmbs_pnns WHERE pnn='T2_CH_CSCS')),
(13259926, (SELECT id from wmbs_pnns WHERE pnn='T2_CH_CSCS')),
(14261861, (SELECT id from wmbs_pnns WHERE pnn='T2_CH_CSCS')),
(14262255, (SELECT id from wmbs_pnns WHERE pnn='T2_CH_CSCS')),
(14319865, (SELECT id from wmbs_pnns WHERE pnn='T2_CH_CSCS'));
Ok, in principle everything is under control now. However, we have yet to fully understand how this happens. So my suggestion would be to run a small workflow at T2_CH_CSCS and check whether it's reproducible or not. If so, keep debugging to see why the ROOT file has no location while the logArchive does.
Just a note, the new agent (1.2.6) should fail such bad jobs at JobAccountant level.
While the root cause hasn't been completely understood - and we were not able to reproduce this error at T2_CH_CSCS - a couple of workarounds have been implemented over the past years. The latest one is: https://github.com/dmwm/WMCore/pull/12093
With the patch above (and the one before it), we identify such cases in JobAccountant and make sure such jobs are not accepted in the agent, triggering a job retry if possible.
I am closing this issue out.
Impact of the bug: It affects the workflow pdmvserv_task_MUO-RunIIFall18GS-00027__v1_T_190627_203802_3573 on vocms0254. Apparently other agents are fine.
Describe the bug: While draining the agent, I noticed it still has GQE/LQE in the Running state and workflow/subscriptions unfinished. It still has to create jobs for these 3 subscriptions:
Those 2 files available for the merge subscription correspond to:
and, playing around with Oracle, I managed to find that those 2 files were created by job IDs 2069009 and 1893563.
How to reproduce it: Steps to reproduce the behavior:
Expected behavior: Get jobs created, get those files completed, and the workflow completed. OR, check whether output files have a valid location and, if not, fail the job.
Additional context and error message: A map of workflow/subscription/fileset can be found here