dmwm / WMCore

Core workflow management components for CMS.

Files stuck in available for a merge subscription #9353

Closed: amaltaro closed this issue 2 months ago

amaltaro commented 5 years ago

Impact of the bug: It affects the workflow pdmvserv_task_MUO-RunIIFall18GS-00027__v1_T_190627_203802_3573 on vocms0254. Apparently other agents are fine.

Describe the bug: While draining the agent, I noticed it still has GQE/LQE in the running state and unfinished workflows/subscriptions. It still has to create jobs for these 3 subscriptions:

*** SUBSCRIPTIONS: found 3 files available in WMBS (waiting for job creation):
[{'count(*)': 2, 'subscription': 73400}, {'count(*)': 2, 'subscription': 73398}, {'count(*)': 479, 'subscription': 73399}]
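
For context, a summary like the one above can be obtained with a simple aggregation over the WMBS available-files table (a sketch only; the actual drain-check script may use a different query):

    SELECT subscription, COUNT(*) AS filesAvail
      FROM wmbs_sub_files_available
     GROUP BY subscription;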

Those 2 files available for the merge subscription correspond to:

ID             LFN
4575115 /store/unmerged/RunIIFall18GS/MinBias_TuneCP5_13TeV_NeutronHP-pythia8/GEN-SIM/SNB_102X_upgrade2018_realistic_v17-v1/40003/2588AC8C-A7CD-9B46-A140-22789FB9ED4F.root
5087610 /store/unmerged/RunIIFall18GS/MinBias_TuneCP5_13TeV_NeutronHP-pythia8/GEN-SIM/SNB_102X_upgrade2018_realistic_v17-v1/40008/9C86D47B-110A-8446-BF89-4698B179E5E0.root

and playing around with Oracle I managed to find that those 2 files were created by job IDs 2069009 and 1893563.

How to reproduce it: steps to reproduce the behavior were not provided.

Expected behavior: Get jobs created, those files completed, and the workflow completed. OR, check whether output files have a valid location and, if not, fail the job.

Additional context and error message: A map of workflow/subscription/fileset can be found below:

73399   /pdmvserv_task_MUO-RunIIFall18GS-00027__v1_T_190627_203802_3573/MUO-RunIIFall18GS-00027_0/MUO-RunIIFall18GS-00027_0MergeRAWSIMoutput/MUO-RunIIFall18GS-00027_0RAWSIMoutputMergeLogCollect   1   /pdmvserv_task_MUO-RunIIFall18GS-00027__v1_T_190627_203802_3573/MUO-RunIIFall18GS-00027_0/MUO-RunIIFall18GS-00027_0MergeRAWSIMoutput/merged-logArchive  1   0   LogCollect
73398   /pdmvserv_task_MUO-RunIIFall18GS-00027__v1_T_190627_203802_3573/MUO-RunIIFall18GS-00027_0/MUO-RunIIFall18GS-00027_0MergeRAWSIMoutput    1   /pdmvserv_task_MUO-RunIIFall18GS-00027__v1_T_190627_203802_3573/MUO-RunIIFall18GS-00027_0/unmerged-RAWSIMoutputGEN-SIM  0   0   Merge
73400   /pdmvserv_task_MUO-RunIIFall18GS-00027__v1_T_190627_203802_3573/MUO-RunIIFall18GS-00027_0/MUO-RunIIFall18GS-00027_0CleanupUnmergedRAWSIMoutput  1   /pdmvserv_task_MUO-RunIIFall18GS-00027__v1_T_190627_203802_3573/MUO-RunIIFall18GS-00027_0/unmerged-RAWSIMoutputGEN-SIM  0   0   Cleanup
amaltaro commented 5 years ago

Looking at the ParentlessMergeBySize algorithm, I noticed it doesn't see any of those files as available. The reason is that those 2 files have no location associated with them, so they simply drop out of the inner join with wmbs_file_location in the query below:

select * from wmbs_sub_files_available
  INNER JOIN wmbs_file_runlumi_map ON wmbs_sub_files_available.fileid = wmbs_file_runlumi_map.fileid
  INNER JOIN wmbs_file_location ON wmbs_sub_files_available.fileid = wmbs_file_location.fileid
  where subscription = 73398;

Going deeper into one of those job reports, there is indeed no location; the Report getAllFiles() method returns the following data:

>>> fileList = jobReport.getAllFiles()
>>> pprint(fileList)
[{'acquisitionEra': 'RunIIFall18GS',
  'branch_hash': '6853840b45c575051050a6e81a59f2d0',
  'branches': [],
  'catalog': '',
  'checksums': {'adler32': '6335e421', 'cksum': '1333373001'},
  'configURL': 'https://cmsweb.cern.ch/couchdb;;reqmgr_config_cache;;1126296323ab39f89913b87b1825c605',
  'dataset': {'applicationName': 'cmsRun',
              'applicationVersion': 'CMSSW_10_2_15_patch2',
              'dataTier': 'GEN-SIM',
              'primaryDataset': 'MinBias_TuneCP5_13TeV_NeutronHP-pythia8',
              'processedDataset': 'RunIIFall18GS-SNB_102X_upgrade2018_realistic_v17-v1'},
  'events': 195,
  'fileRef': <WMCore.Configuration.ConfigSection object at 0x7f750fcd0890>,
  'first_event': 0,
  'globalTag': '102X_upgrade2018_realistic_v17',
  'guid': '9C86D47B-110A-8446-BF89-4698B179E5E0',
  'input': [],
  'inputPath': None,
  'inputpfns': [],
  'last_event': 0,
  'lfn': '/store/unmerged/RunIIFall18GS/MinBias_TuneCP5_13TeV_NeutronHP-pythia8/GEN-SIM/SNB_102X_upgrade2018_realistic_v17-v1/40008/9C86D47B-110A-8446-BF89-4698B179E5E0.root',
  'locations': set([None]),
  'merged': False,
  'module_label': 'RAWSIMoutput',
  'outputModule': 'RAWSIMoutput',
  'parents': set([]),
  'pfn': '/srv/job/WMTaskSpace/cmsRun1/RAWSIMoutput.root',
  'prep_id': 'MUO-RunIIFall18GS-00027',
  'processingVer': 1,
  'runs': set([<WMCore.DataStructs.Run.Run object at 0x7f750fcd0390>]),
  'size': 68012023,
  'validStatus': 'PRODUCTION'},
 {'acquisitionEra': None,
  'branch_hash': None,
  'branches': [],
  'catalog': '',
  'checksums': {'adler32': 'f5148aaa', 'cksum': '206908913'},
  'configURL': None,
  'dataset': {},
  'events': 0,
  'fileRef': <WMCore.Configuration.ConfigSection object at 0x7f750fcd0ed0>,
  'first_event': 0,
  'globalTag': None,
  'guid': '',
  'input': [],
  'inputPath': None,
  'inputpfns': [],
  'last_event': 0,
  'lfn': '/store/unmerged/logs/prod/2019/7/12/pdmvserv_task_MUO-RunIIFall18GS-00027__v1_T_190627_203802_3573/MUO-RunIIFall18GS-00027_0/0008/1/0fa2ce62-7299-459c-a995-bb47080bd0eb-854-1-logArchive.tar.gz',
  'locations': set(['T2_CH_CSCS']),
  'merged': False,
  'module_label': 'logArchive',
  'outputModule': 'logArchive',
  'parents': set([]),
  'pfn': 'srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/trivcat/store/unmerged/logs/prod/2019/7/12/pdmvserv_task_MUO-RunIIFall18GS-00027__v1_T_190627_203802_3573/MUO-RunIIFall18GS-00027_0/0008/1/0fa2ce62-7299-459c-a995-bb47080bd0eb-854-1-logArchive.tar.gz',
  'prep_id': None,
  'processingVer': None,
  'runs': set([]),
  'size': 0,
  'validStatus': None}]
>>> 

while the logArchive tarball has the proper location set (puzzled!!!).

amaltaro commented 5 years ago

I compared the job report against other successful jobs and indeed it's missing the location attribute; that's why that subscription got stuck and no jobs got created.

Now looking at the runtime code, we do set the location attribute there: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/Steps/Executors/StageOut.py#L202

which uses the PNN value set here: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/Storage/StageOutMgr.py#L205

So I fail to see why it was not set for those 2 jobs...

Another possible fix would be to guarantee that JobAccountant fails such jobs if any output file is missing its location. I'll check that approach.
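
For illustration, such a check would only need to scan the job report for output files without a valid location; a minimal sketch, assuming a Report object loaded as shown later in this thread (the helper name and the failure hook are made up, not the actual JobAccountant code):

    def filesMissingLocation(jobReport):
        """Return LFNs of output files in a job report that carry no valid location."""
        badLfns = []
        for outputFile in jobReport.getAllFiles():
            locations = outputFile.get('locations') or set()
            # a location set that is empty or only contains None (as in the dump above)
            # means the stage-out step never recorded a PNN for this file
            if not any(loc for loc in locations):
                badLfns.append(outputFile['lfn'])
        return badLfns

    # e.g. in an accountant-style check, fail the job instead of letting it hang:
    #   if filesMissingLocation(jobReport):
    #       markJobAsFailed(job)   # hypothetical hook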

amaltaro commented 5 years ago

I just found the same problem, for the same workflow, for the same site, on vocms0280; the job is: install/wmagent/JobArchiver/logDir/p/pdmvserv_task_MUO-RunIIFall18GS-00027__v1_T_190627_203802_3573/JobCluster_547/Job_547963.tar.bz2

I have set the output unmerged file location to T2_CH_CSCS and a job got created.

amaltaro commented 5 years ago

While trying to get vocms0283 completely drained, I noticed that the agent has a few merge files available for job creation (and a bunch of others for other subscriptions) for this workflow: pdmvserv_task_SMP-RunIISummer15GS-00269__v1_T_190813_093712_4558

which has 40 LQE in Running state at the moment. Apparently the bug fix in #9354 is only available in the latest stable release (1.2.6).

amaltaro commented 5 years ago

So, either I'm following the wrong approach, or this process to map an output file to a job is overly complicated (one could possibly use WMArchive as well).

Here is what I did on vocms0283.

  1. Get a map of workflow task, subscription id, fileset, number of files available for each subscription and so on (one has to provide the workflow name in the query):

    SELECT wmbs_subscription.id AS subId, wmbs_workflow.task AS wfName, wmbs_workflow.injected As wfInj, wmbs_fileset.name AS fsetName,
    wmbs_fileset.open AS fsetOpen, wmbs_subscription.finished AS subFin, wmbs_sub_types.name AS subType, count(wmbs_sub_files_available.fileid) AS filesAvail
    FROM wmbs_subscription
    INNER JOIN wmbs_sub_types ON wmbs_subscription.subtype = wmbs_sub_types.id
    INNER JOIN wmbs_workflow ON wmbs_subscription.workflow = wmbs_workflow.id
    INNER JOIN wmbs_fileset ON wmbs_subscription.fileset = wmbs_fileset.id
    INNER JOIN wmbs_sub_files_available ON wmbs_sub_files_available.subscription = wmbs_subscription.id
    WHERE wmbs_workflow.name='pdmvserv_task_SMP-RunIISummer15GS-00269__v1_T_190813_093712_4558'
    AND wmbs_subscription.finished = 0
    GROUP BY wmbs_subscription.id, wmbs_workflow.task, wmbs_workflow.injected, wmbs_fileset.name, wmbs_fileset.open,
    wmbs_subscription.finished, wmbs_sub_types.name;
  2. We're mostly interested in the Merge subscription type, so we look at that row and, from the fileset name

    /pdmvserv_task_SMP-RunIISummer15GS-00269__v1_T_190813_093712_4558/SMP-RunIISummer15GS-00269_0/SMP-RunIISummer16DR80-00067_0/SMP-RunIISummer16DR80-00067_1/unmerged-AODSIMoutputAODSIM

    we extract the task that created those files that are now available; the task path is:

    /pdmvserv_task_SMP-RunIISummer15GS-00269__v1_T_190813_093712_4558/SMP-RunIISummer15GS-00269_0/SMP-RunIISummer16DR80-00067_0/SMP-RunIISummer16DR80-00067_1
  3. Using the subscription id above, we list the file id and run/lumi information:

    select wsfa.fileid, wfrm.run, wfrm.lumi  from wmbs_sub_files_available wsfa
    inner join wmbs_file_details ON wmbs_file_details.id = wsfa.fileid
    inner join wmbs_file_runlumi_map wfrm ON wfrm.fileid = wmbs_file_details.id
    where wsfa.subscription=33544 ORDER BY wfrm.lumi;
  4. Now the painful part: given the task path and one of the lumis from the previous command, we can check for jobs that were processing lumis within a given range, e.g.:

    SELECT wmbs_job_mask.*, wmbs_workflow.task FROM wmbs_job_mask
    INNER JOIN wmbs_job ON wmbs_job.id = wmbs_job_mask.job
    INNER JOIN wmbs_jobgroup ON wmbs_job.jobgroup = wmbs_jobgroup.id
    INNER JOIN wmbs_subscription ON wmbs_jobgroup.subscription = wmbs_subscription.id
    INNER JOIN wmbs_workflow ON wmbs_subscription.workflow = wmbs_workflow.id
    WHERE wmbs_workflow.task = '/pdmvserv_task_SMP-RunIISummer15GS-00269__v1_T_190813_093712_4558/SMP-RunIISummer15GS-00269_0/SMP-RunIISummer16DR80-00067_0/SMP-RunIISummer16DR80-00067_1'
    AND wmbs_job_mask.firstlumi >= 8541 AND wmbs_job_mask.lastlumi <= 8557;

Then we finally match the lumis of the available files against the lumi ranges assigned to jobs.

So I started looking for a job that processed the following fileid / lumi:

10268555    8541
10268555    8542

which mapped to job 3550823.
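
As an aside, this last matching step is easy to script once the two query results are exported; a minimal sketch, where only the fileid, lumis and job id come from this comment (the mask boundaries below are illustrative):

    # (fileid, lumi) pairs from the files-available query in step 3
    files_avail = [(10268555, 8541), (10268555, 8542)]
    # (job id, firstlumi, lastlumi) rows from the job-mask query in step 4
    job_masks = [(3550823, 8541, 8557)]

    for fileid, lumi in files_avail:
        for job, first, last in job_masks:
            if first <= lumi <= last:
                print("fileid %s (lumi %s) falls within the mask of job %s" % (fileid, lumi, job))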

Looking at the job wmagentJob.log, it wrote to T2_CH_CSCS:

2019-09-06 23:10:15,907:INFO:StageOutMgr:===> Stage Out Successful: {'StageOutReport': [], 'LFN': '/store/unmerged/RunIISummer16DR80/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8/AODSIM/NoPU_80X_mcRun2_asymptotic_2016_TrancheIV_v6_ext1-v2/130010/3461849D-FAD0-E911-8FE8-00010100093D.root', 'PFN': 'srm://storage01.lcg.cscs.ch:8443/srm/managerv2?SFN=/pnfs/lcg.cscs.ch/cms/trivcat/store/unmerged/RunIISummer16DR80/JpsiToMuMu_JpsiPt8_TuneCUEP8M1_13TeV-pythia8/AODSIM/NoPU_80X_mcRun2_asymptotic_2016_TrancheIV_v6_ext1-v2/130010/3461849D-FAD0-E911-8FE8-00010100093D.root', 'PNN': 'T2_CH_CSCS', 'Checksums': {'adler32': '3fb0d152', 'cksum': '2258890055'}, 'StageOutCommand': 'gfal2'}

and looking into the job report

>>> import pickle
>>> from WMCore.FwkJobReport.Report import Report
>>> from pprint import pprint
>>> jr = Report()
>>> jr.load("./Job_3550823/Report.1.pkl")
>>> pprint(jr.getAllFiles())

and voila

...
  'locations': set([None]),

...

There must be a problem with that site that generates those no-location jobs...

amaltaro commented 5 years ago

I haven't found a smart way to insert multiple rows based on a WHERE IN clause, so here it goes... find the PNN id and the file IDs to be inserted, from:

SELECT id from wmbs_pnns WHERE pnn='T2_CH_CSCS';
SELECT fileid FROM wmbs_sub_files_available WHERE subscription=33544;

then inserting them:

INSERT ALL 
  INTO wmbs_file_location (fileid, pnn) VALUES (10267223, 13)
  INTO wmbs_file_location (fileid, pnn) VALUES (10268555, 13)
  INTO wmbs_file_location (fileid, pnn) VALUES (10274827, 13)
  INTO wmbs_file_location (fileid, pnn) VALUES (10276197, 13)
  INTO wmbs_file_location (fileid, pnn) VALUES (10277661, 13)
  INTO wmbs_file_location (fileid, pnn) VALUES (10284803, 13)
SELECT * FROM dual;
commit;
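
For future reference, the row-by-row enumeration can likely be avoided with a single INSERT ... SELECT that combines the two lookup queries above (a sketch only, assuming the same schema; not validated against a production agent):

    -- attach the T2_CH_CSCS location to every file still available for subscription 33544,
    -- skipping files that already have that location recorded
    INSERT INTO wmbs_file_location (fileid, pnn)
    SELECT wsfa.fileid, wpnn.id
      FROM wmbs_sub_files_available wsfa
     CROSS JOIN wmbs_pnns wpnn
     WHERE wsfa.subscription = 33544
       AND wpnn.pnn = 'T2_CH_CSCS'
       AND NOT EXISTS (SELECT 1 FROM wmbs_file_location wfl
                       WHERE wfl.fileid = wsfa.fileid AND wfl.pnn = wpnn.id);
    commit;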
amaltaro commented 5 years ago

I just fixed a few other files that were missing the T2_CH_CSCS location as well.

Now a much easier (and generic) way to find files sitting in available without a corresponding location in wmbs_file_location (e.g. on submit7):

MariaDB [wmagent]> select * from wmbs_sub_files_available wsfa
    ->  WHERE NOT EXISTS 
    ->    ( SELECT null FROM wmbs_file_location wfl WHERE wfl.fileid = wsfa.fileid);
+--------------+----------+
| subscription | fileid   |
+--------------+----------+
|        24931 | 10697320 |
|        24933 | 10697320 |
+--------------+----------+
2 rows in set (0.17 sec)
amaltaro commented 5 years ago

Files fixed on submit6 as well:

INSERT INTO wmbs_file_location (fileid, pnn) 
  VALUES 
    (6581639, (SELECT id from wmbs_pnns WHERE pnn='T2_CH_CSCS')),
    (12719113, (SELECT id from wmbs_pnns WHERE pnn='T2_CH_CSCS')),
    (12719153, (SELECT id from wmbs_pnns WHERE pnn='T2_CH_CSCS')),
    (13259926, (SELECT id from wmbs_pnns WHERE pnn='T2_CH_CSCS')),
    (14261861, (SELECT id from wmbs_pnns WHERE pnn='T2_CH_CSCS')),
    (14262255, (SELECT id from wmbs_pnns WHERE pnn='T2_CH_CSCS')),
    (14319865, (SELECT id from wmbs_pnns WHERE pnn='T2_CH_CSCS'));   
amaltaro commented 5 years ago

Ok, in principle everything is under control now. However, we have yet to completely understand how it happens. So my suggestion would be to run a small workflow at T2_CH_CSCS and check whether it's reproducible or not. If so, keep debugging to see why the ROOT file has no locations while the logArchive does.

Just a note: the new agent (1.2.6) should fail such bad jobs at the JobAccountant level.

amaltaro commented 2 months ago

While the root cause hasn't been completely understood (and we were not able to reproduce this error at T2_CH_CSCS), a couple of workarounds have been implemented over the past years. The latest one is: https://github.com/dmwm/WMCore/pull/12093

With the patch above (and the one before it), we identify such cases in JobAccountant and make sure such jobs are not accepted in the agent, triggering a job retry if possible.

I am closing this issue out.