dmwm / WMCore

Core workflow management components for CMS.
Apache License 2.0
46 stars 107 forks source link

Error running cmsRun but CMSSW exit code 0 #10749

Open amaltaro opened 3 years ago

amaltaro commented 3 years ago

Impact of the bug WMAgent

Describe the bug Filing this issue because it's the second time I spot it (in different releases), apparently not related to the python3 agent.

Problem is with the cmsRun subprocess, which returns with an exit code different than 0, but when we parse the job report for the exit code and error message, the code it finds is actually 0 (successful cmsRun process!). Here is a snippet of the job logs [1].

How to reproduce it I do not think it's reproducible, but this issue popped up with Py3 WMAgent 1.5.2.pre2 and the test workflow template DMWM/TC_PY3_ProdTTbar.json

Expected behavior If the subprocess call returned with an exit code != 0, we should make sure this gets properly reported. Then we can also parse the job report and report the exit code from there as well.

We might decide to run some debugging code to catch this inconsistency as well, such that we can find what the actual root cause is.

This is likely the place to have these changes done: https://github.com/dmwm/WMCore/blob/master/src/python/WMCore/WMSpec/Steps/Executors/CMSSW.py#L280

Additional context and error message [1]

2021-08-13 04:27:51,402:INFO:Watchdog:Set Watchdog interval to 300
2021-08-13 04:27:51,402:INFO:Watchdog:Initializing monitor PerformanceMonitor
2021-08-13 04:27:51,421:INFO:WMTweak:Not resizing job
2021-08-13 04:27:51,421:INFO:Watchdog:Watchdog modified: False. Final settings:
2021-08-13 04:27:51,421:INFO:Watchdog:  hardTimeout: 129900
2021-08-13 04:27:51,421:INFO:Watchdog:  softTimeout: 129600
2021-08-13 04:27:51,421:INFO:Watchdog:  maxPSS: 6000
2021-08-13 04:27:51,421:INFO:Watchdog:  cores: 4
2021-08-13 04:27:51,421:INFO:Watchdog:MonitorThread: JobStarted
2021-08-13 04:27:51,434:INFO:Executor:Steps.Executor logging started
2021-08-13 04:27:51,459:INFO:CMSSW:Steps.Executors.CMSSW.pre called
2021-08-13 04:27:51,482:INFO:CMSSW:Steps.Executors.CMSSW.execute called
2021-08-13 04:27:51,483:INFO:CMSSW:User files are
2021-08-13 04:27:51,483:INFO:CMSSW:User sandboxes are
2021-08-13 04:27:51,484:INFO:CMSSW:CMSSW configured for 4 cores and 2 event streams
2021-08-13 04:27:51,485:INFO:CMSSW:Executing CMSSW step
2021-08-13 04:27:51,485:INFO:CMSSW:Runing SCRAM
2021-08-13 04:27:57,574:INFO:CMSSW:Running PRE scripts
2021-08-13 04:27:57,577:INFO:CMSSW:    Invoking command:
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH
/cvmfs/cms.cern.ch/COMP/slc7_amd64_gcc630/external/python3/3.8.2-comp/bin/python3 -m WMCore.WMRuntime.ScriptInvoke WMTaskSpace.cmsRun1 SetupCMSSWPset

2021-08-13 04:31:34,208:INFO:CMSSW:RUNNING SCRAM SCRIPTS
2021-08-13 04:31:34,209:INFO:CMSSW:Executing CMSSW. args: ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'slc7_amd64_gcc900', 'scramv1', 'CMSSW', 'CMSSW_11_3_0_pre2_PY3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '', '', '']
2021-08-13 04:32:51,586:INFO:PerformanceMonitor:PSS: 858786; RSS: 855572; PCPU: 75.4; PMEM: 2.8
2021-08-13 04:37:53,203:INFO:PerformanceMonitor:PSS: 4658643; RSS: 4655624; PCPU: 246; PMEM: 15.5
2021-08-13 04:42:55,663:INFO:PerformanceMonitor:PSS: 4649259; RSS: 4651844; PCPU: 307; PMEM: 15.5
2021-08-13 04:47:56,901:INFO:PerformanceMonitor:PSS: 4650546; RSS: 4647136; PCPU: 330; PMEM: 15.4
2021-08-13 04:52:58,158:INFO:PerformanceMonitor:PSS: 4672596; RSS: 4675060; PCPU: 342; PMEM: 15.5
2021-08-13 04:57:58,419:INFO:PerformanceMonitor:PSS: 3547527; RSS: 3546960; PCPU: 343; PMEM: 11.8
2021-08-13 04:58:02,405:INFO:Report:addOutputFile method called with outputModule: DQMoutput, aFile: None
2021-08-13 04:58:02,406:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2021-08-13 04:58:02,406:INFO:Report:addOutputFile method called with outputModule: MINIAODSIMoutput, aFile: None
2021-08-13 04:58:02,406:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2021-08-13 04:58:02,407:INFO:Report:addOutputFile method called with outputModule: RECOSIMoutput, aFile: None
2021-08-13 04:58:02,407:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2021-08-13 04:58:02,422:CRITICAL:CMSSW:Error running cmsRun
{'arguments': ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'slc7_amd64_gcc900', 'scramv1', 'CMSSW', 'CMSSW_11_3_0_pre2_PY3', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '', '', '']}
CMSSW Return code: 0

2021-08-13 04:58:02,422:CRITICAL:CMSSW:Error message: None
2021-08-13 04:58:02,641:INFO:Report:addOutputFile method called with outputModule: DQMoutput, aFile: None
2021-08-13 04:58:02,642:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2021-08-13 04:58:02,642:INFO:Report:addOutputFile method called with outputModule: MINIAODSIMoutput, aFile: None
2021-08-13 04:58:02,642:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2021-08-13 04:58:02,643:INFO:Report:addOutputFile method called with outputModule: RECOSIMoutput, aFile: None
2021-08-13 04:58:02,643:INFO:Report:addOutputFile method fileRef: , whole tree: {}
2021-08-13 04:58:02,649:INFO:CMSSW:Steps.Executors.CMSSW.post called
2021-08-13 04:58:02,649:INFO:ExecuteMaster:Post Executor Task Change: logArch1
2021-08-13 04:58:02,650:INFO:ExecuteMaster:StepName: cmsRun1, StepType: CMSSW, with result: 'logArch1'
2021-08-13 04:58:02,760:INFO:Executor:Steps.Executor logging started
2021-08-13 04:58:02,768:INFO:StageOut:Steps.Executors.StageOut.pre called
amaltaro commented 3 years ago

This error seems to be quite rare, from my perspective, hence not adding it to the Q3 and keeping it as medium priority.