dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0

Report input files and lumi range for failed jobs in (T0) wmstats #9043

Open andresfelquintero opened 5 years ago

andresfelquintero commented 5 years ago

Slava requested that we change this splitting to include the run/lumi information in the job name, to help with debugging. He also requested some easier way to display run/lumi information for jobs in WMStats in general, such as having a lumi number or range in the failed job IDs. This has been discussed in JIRA ticket https://its.cern.ch/jira/browse/CMSTZ-248.

amaltaro commented 5 years ago

Andres, you basically have two requests in this GH issue:

a) change the job id to add run/lumi information to it: this is not going to work, because we don't control how many lumi sections and/or lumi ranges each job can process, so it would make the job ID length variable on a scale that we do not control.

b) display the run/lumi information in wmstats (or via a wmstats REST API): this one looks more reasonable, and I actually thought we had this information already; however, I see that information empty in the Production WMStats.

Do we have such information in the T0 wmstats? Can you point me to a workflow with paused jobs in a replay instance?

BTW, Repack jobs don't have job Mask information, so we don't know which lumi sections and/or events will come out of those jobs:

'mask': {'LastRun': None, 'LastLumi': None, 'FirstRun': None, 'inclusivemask': True, 'runAndLumis': {}, 'LastEvent': None, 'FirstEvent': None, 'jobID': 1051, 'FirstLumi': None}
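
For contrast, and purely as an illustration (the values below are made up), a job that does carry a lumi mask would have its runAndLumis populated along these lines, with each run mapping to a list of [firstLumi, lastLumi] ranges:

'mask': {'LastRun': None, 'LastLumi': None, 'FirstRun': None, 'inclusivemask': True, 'runAndLumis': {349840: [[1, 5], [10, 12]]}, 'LastEvent': None, 'FirstEvent': None, 'jobID': 1052, 'FirstLumi': None}
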
amaltaro commented 5 years ago

WMStats has both Input files and Lumis fields to report such information. We have to investigate why it's not displaying those details and fix it. Maybe that would be enough for a start.

hufnagel commented 5 years ago

Option a) would only be a workaround, to be considered if option b) is not feasible or takes too long. We could probably devise a "uuid-lumiinfo" naming scheme that would work for option a) (who cares how long the job name is and whether it's variable length or fixed...), but if option b) is in principle supposed to be available, I'd rather go that direction.
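
Just to make option a) concrete, a minimal sketch of such a "uuid-lumiinfo" scheme could look like the snippet below; the helper name and the exact encoding are made up for illustration, nothing like this exists in WMCore today.

import uuid

def lumi_aware_job_name(run, lumi_ranges):
    """Hypothetical helper: encode run/lumi info into a uuid-based job name.

    lumi_ranges is a list of (firstLumi, lastLumi) pairs; note that the
    resulting name length grows with the number of ranges.
    """
    lumi_part = "_".join("%d-%d" % (first, last) for first, last in lumi_ranges)
    return "%s-r%d-l%s" % (uuid.uuid4(), run, lumi_part)

# e.g. lumi_aware_job_name(349840, [(1, 3399)])
# -> something like '124e73e3-1c2f-48f3-8947-8352367bf54e-r349840-l1-3399'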

We mostly care about PromptReco jobs here, but we can also fix the Repack (and maybe Express) jobs to add a lumi mask if that makes the monitoring more consistent.

amaltaro commented 5 years ago

(who cares how long the job name is and whether it's variable length or fixed...)

Don't forget we still use a relational database as a backend and it defines a schema for the wmbs_job table.
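
As a back-of-the-envelope illustration of that constraint, using the hypothetical helper sketched above (255 is an assumed column width for illustration, not the actual wmbs_job schema value):

# Every extra lumi range adds roughly 10-15 characters to the name, so a job
# processing many scattered ranges quickly outgrows a fixed-width column.
name = lumi_aware_job_name(349840, [(i, i) for i in range(1, 40)])
print(len(name), len(name) <= 255)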

BTW, when would it be desirable to have this feature in the system? Is it only for Run3?

hufnagel commented 5 years ago

Slava asked for it and it would be to help debugging Tier0 jobs. So yes, mostly Run3. Could also be useful for debugging ReReco jobs though, which would mean later this year. But maybe this already works in WMStats and for some reason just not in the Tier0 WMStats?

amaltaro commented 5 years ago

Naah, I see the same problem in the production wmstats. However, I'm pretty sure there are cases where that information gets properly displayed too. Thanks, Dirk. I'm setting its milestone to around the middle of this year.

amaltaro commented 5 years ago

Dirk and Andres, I'm updating the subject of this issue to reflect what was discussed here.

For the record, the request amaltaro_TaskChain_PUMCRecyc_HG1805_Validation_180426_130328_6844 in testbed has "valid" content in the Input files and Lumis fields in WMStats. The only possible problem is that it belongs to a successful job (after a retry). Check that for further info...

And this workflow amaltaro_StepChain_ReDigi3_HG1903_Validation_190304_090531_6089 also has the correct data in there, but it prints the PU input files as well. Maybe we could drop those somehow from the WMStats job report.
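
If we do want to drop them, a minimal sketch of such a filter is shown below; note that the 'input_type' marker and the 'pileup' value are assumptions made for illustration, not confirmed fields of the WMStats job records:

def primary_input_files(inputfiles):
    """Drop pileup files from a job's input-file list (illustrative only).

    Assumes each record is a dict with an 'lfn' and a hypothetical
    'input_type' field that distinguishes primary input from pileup.
    """
    return [rec for rec in inputfiles
            if rec.get("input_type", "primarySource") != "pileup"]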

todor-ivanov commented 1 year ago

Hi @germanfgv @jhonatanamado @amaltaro,

Let me see if I can grasp the goal of this issue correctly. Here is one T0 PromptReco request, PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703, which has failed jobs in it. One may look at the CouchDB record for its failed jobs at this link:

https://cmsweb-testbed.cern.ch/t0_reqmon/data/jobdetail/PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703

And as far as I can see, every failed job has a lumis field left blank, in which you'd want to have the information for all the lumis this job was working on. Is my understanding correct so far?
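
For reference, this is roughly how one can pull that record programmatically and check the lumis field; it assumes a grid certificate/key pair valid for CMSWEB authentication (placeholder file names below):

import requests

cert = ("usercert.pem", "userkey.pem")  # placeholder grid credentials
url = ("https://cmsweb-testbed.cern.ch/t0_reqmon/data/jobdetail/"
       "PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703")

resp = requests.get(url, cert=cert, headers={"Accept": "application/json"},
                    verify=False)  # or point verify= at the CERN CA bundle
resp.raise_for_status()
data = resp.json()
# Inspect the per-job records: for the failed jobs of this workflow the
# 'lumis' and 'inputfiles' fields come back empty, as shown further below.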

germanfgv commented 1 year ago

yes, that's correct @todor-ivanov

todor-ivanov commented 1 year ago

Hi German, while working on this and trying to observe the issue with an agent in production, I found that this feature is actually working well in the production system. E.g.: https://cmsweb.cern.ch/wmstatsserver/data/jobdetail/cmsunified_task_EGM-Run3Winter23Digi-00057__v1_T_230511_075100_3810

It seems the lumi lists for the failed jobs are present in this case.

amaltaro commented 1 year ago

I just noticed this hasn't been considered in Q2, so we should probably pause this investigation for now and re-evaluate it for Q3. @todor-ivanov

todor-ivanov commented 1 year ago

Hi @germanfgv, while working with the above-mentioned workflow, I can confirm it is indeed missing the lumi lists for broken jobs in t0_reqmon: [1]

But it seems to have them all listed in the workflow summary here: [2]

Wouldn't that suffice?

FYI: @amaltaro

[1] https://cmsweb-testbed.cern.ch/t0_reqmon/data/jobdetail/PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703

PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703:

    /PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703/Reco:
        jobfailed:
            8020:
                T2_CH_CERN:
                    errorCount: 2024
                    samples:
                            _id: "124e73e3-1c2f-48f3-8947-8352367bf54e-0"
                            _rev: "19-9e43904494b761cfd799a1d893253270"
                            wmbsid: 22748
                            type: "jobsummary"
                            retrycount: 3
                            workflow: "PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703"
                            task: "/PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703/Reco"
                            jobtype: "Processing"
                            state: "jobfailed"
...
                            lumis:
                            outputdataset:
                            inputfiles: 

[2] https://cmsweb-testbed.cern.ch/couchdb/t0_workloadsummary/_design/WorkloadSummary/_show/histogramByWorkflow/PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703

PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703 Summary
No Output
Histogram :
Errors:

    /PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703/Reco
        cmsRun1
                exit code: 8020
                details:

                 An exception of category 'FileOpenError' occurred while
                   [0] Constructing the EventProcessor
                   [1] Constructing input source of type PoolSource
                   [2] Calling RootInputFileSequence::initTheFile()
                   [3] Calling StorageFactory::open()
                   [4] Calling XrdFile::open()
                Exception Message:
                Failed to open the file 'root://eoscms.cern.ch//eos/cms/tier0/store/backfill/1/data/Tier0_REPLAY_2022/Cosmics/RAW/v429/000/349/840/00000/a7460ec2-1a13-4f4b-8493-28ee236c5422.root?eos.app=cmst0'
                   Additional Info:
                      [a] Input file root://eoscms.cern.ch//eos/cms/tier0/store/backfill/1/data/Tier0_REPLAY_2022/Cosmics/RAW/v429/000/349/840/00000/a7460ec2-1a13-4f4b-8493-28ee236c5422.root?eos.app=cmst0 could not be opened.
                      [b] XrdCl::File::Open(name='root://eoscms.cern.ch//eos/cms/tier0/store/backfill/1/data/Tier0_REPLAY_2022/Cosmics/RAW/v429/000/349/840/00000/a7460ec2-1a13-4f4b-8493-28ee236c5422.root?eos.app=cmst0', flags=0x10, permissions=0660) => error '[ERROR] Server responded with an error: [3011] Unable to open file /eos/cms/tier0/store/backfill/1/data/Tier0_REPLAY_2022/Cosmics/RAW/v429/000/349/840/00000/a7460ec2-1a13-4f4b-8493-28ee236c5422.root; No such file or directory
                ' (errno=3011, code=400). No additional data servers were found.
                      [c] Last URL tried: root://eoscms.cern.ch:1094//eos/cms/tier0/store/backfill/1/data/Tier0_REPLAY_2022/Cosmics/RAW/v429/000/349/840/00000/a7460ec2-1a13-4f4b-8493-28ee236c5422.root?eos.app=cmst0&tried=&xrdcl.requuid=ed0fe7eb-03a1-4548-92c7-19268716c3b1
                      [d] Problematic data server: eoscms.cern.ch:1094
                      [e] Disabled source: eoscms.cern.ch:1094

                type:

                 Fatal Exception

                jobs: 12221
                run and lumi range
                    349840
                        lumi range: [1,3399 - 1,3399] 
                input
                    : /store/backfill/1/data/Tier0_REPLAY_2022/Cosmics/RAW/v429/000/349/840/00000/a7460ec2-1a13-4f4b-8493-28ee236c5422.root
                    : /store/backfill/1/data/Tier0_REPLAY_2022/Cosmics/RAW/v429/000/349/840/00000/f1877c7b-c345-45bf-8cc8-c47bdc76715a.root
                    : /store/backfill/1/data/Tier0_REPLAY_2022/Cosmics/RAW/v429/000/349/840/00000/d69380e2-fc23-4df6-9893-80b6475cee8e.root
...
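
For what it's worth, the workload summary behind [2] is a plain CouchDB document, so the same information can also be pulled without the HTML _show view. A rough sketch, reusing the placeholder credentials from the earlier snippet (the exact nesting inside the document is not reproduced here, so inspect it interactively):

import requests

workflow = ("PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022"
            "_ID220511165314_v429_220511_1703")
url = "https://cmsweb-testbed.cern.ch/couchdb/t0_workloadsummary/" + workflow

resp = requests.get(url, cert=("usercert.pem", "userkey.pem"), verify=False)
summary = resp.json()
# The _show/histogramByWorkflow page above is rendered from this document;
# the run/lumi ranges and input files per error live somewhere under its
# errors section, so start from the top-level keys and drill down.
print(list(summary))
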
todor-ivanov commented 1 year ago

Here follow a few more observations, plus one helpful document added to the troubleshooting wiki of WMCore: [1]

While working with the T0 workflows I also checked the Production Validation and made an interesting discovery:

[1] https://github.com/dmwm/WMCore/wiki/trouble-shooting#unpikling-a-failed-job-pset-file-from-logreports

[2] https://cmsweb-testbed.cern.ch/wmstatsserver/data/jobdetail/tivanov_SC_LumiMask_Rules_June2023_Val_230705_172547_6951
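
For convenience, the gist of the unpickling trick from [1] is roughly the following; it assumes a CMSSW Python environment and that the failed job's log archive has already been unpacked so that its PSet.pkl is available locally:

import pickle

# The pickle contains a cms.Process object, so FWCore (i.e. a CMSSW
# environment) must be importable for this to work.
with open("PSet.pkl", "rb") as fh:
    process = pickle.load(fh)

# Input files and lumi mask the failed job was actually configured with:
print(process.source.fileNames)
print(getattr(process.source, "lumisToProcess", "no lumi mask set"))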

germanfgv commented 1 year ago

But it seems to have them all listed in the workflow summary here: [2] Wouldn't that suffice?

That's exactly the information we need. How can we get that info visualized in WMStats?

  • While at the same time the equivalent JavaScript visualization that is supposed to display the failed job summary in wmstats was giving some inadequate lists of [0] (see the attached screenshot)

Exactly, most of the time we get no lumis info in WMStats, but sometimes we get these lists of [0] that don't offer much info.

todor-ivanov commented 1 year ago

Exactly, most of the time we get no lumis info in WMStats, but sometimes we get these lists of [0] that don't offer much info.

I am starting to suspect that the way this module behaves strongly depends on the type of failure and the job stage at which it happens.

todor-ivanov commented 1 year ago

Just for logging purposes:

I have double-checked all the couch views and couchapps, in order to prove there is no problem with how we fetch the job-detail information from central CouchDB. And I can tell for sure now: the lumis list is simply not uploaded to central couch, neither for failed nor for successful jobs. At least not until the workflow is completed and the workload summary is generated.

For that purpose I instantiated a WMStatsReader against cmsweb-testbed:

In [1]: from WMCore.Services.WMStats.WMStatsReader import WMStatsReader

In [2]: reqdb_url = 'https://cmsweb-testbed.cern.ch/couchdb/t0_request'

In [3]: wmstats_url = 'https://cmsweb-testbed.cern.ch/couchdb/tier0_wmstats'

In [4]: wmstats = WMStatsReader(wmstats_url, reqdbURL=reqdb_url, reqdbCouchApp="T0Request")

And then I directly queried for the job info, with a slight modification to the couch view options here, such that I print the full view for every job, with no aggregation by error etc.:

In [5]: requestName = 'PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703'

In [6]: options = {'include_docs': True, 'reduce': False, 'startkey': [requestName], 'endkey': [requestName, {}]}

In [7]: results = wmstats._getCouchView("jobsByStatusWorkflow", options)

Out[7]:
{'offset': 336299,
 'rows': [{'doc': {'_id': '124e73e3-1c2f-48f3-8947-8352367bf54e-0',
                   '_rev': '19-9e43904494b761cfd799a1d893253270',
                   'acdc_url': 'http://localhost:5984/acdcserver',
                   'agent_name': 'vocms0500.cern.ch',
                   'cms_location': 'T2_CH_CERN',
                   'eos_log_url': 'https://eoscmsweb.cern.ch/eos/cms/store/logs/prod/recent/PromptReco/PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703/Reco/vocms0500.cern.ch-22748-3-log.tar.gz',
                   'errors': {'cmsRun1': [{'details': 'An exception of '
                                                      'category '
                                                      "'FileOpenError' "
...
                                           'exitCode': 8020,
                                           'type': 'Fatal Exception'}],
                              'logArch1': [],
                              'stageOut1': []},
                   'exitcode': 8020,
                   'inputfiles': [],
                   'jobtype': 'Processing',
                   'lumis': [],
                   'output': [{'checksums': {'adler32': '2b344c72',
                                             'cksum': '884341033'},
                               'lfn': '/store/unmerged/data/logs/prod/2022/5/11/PromptReco_Run349840_Cosmics_Tier0_REPLAY_2022_ID220511165314_v429_220511_1703/Reco/0000/3/124e73e3-1c2f-48f3-8947-8352367bf54e-0-3-logArchive.tar.gz',
                               'location': 'T0_CH_CERN_Disk',
                               'size': 0,
                               'type': 'logArchive'}],
                   'outputdataset': {},
                   'retrycount': 3,
                   'site': 'T2_CH_CERN',
                   'state': 'jobfailed',
                   'state_history': [{'location': 'T2_CH_CERN',
                                      'newstate': 'jobcooloff',
                                      'oldstate': 'jobfailed',
                                      'timestamp': 1652288000},
                                     {'location': 'T2_CH_CERN',
                                      'newstate': 'jobcooloff',
...