Open kskovpen opened 1 year ago
A new Issue was created by @kskovpen .
@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
Assign reconstruction
New categories assigned: reconstruction
@mandrenguyen,@clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks
A bit more context from the link
Fatal Exception (Exit code: 8001)
An exception of category 'DeepTauId' occurred while
[0] Processing Event run: 357696 lumi: 141 event: 162424424 stream: 0
[1] Running path 'MINIAODoutput_step'
[2] Prefetching for module PoolOutputModule/'MINIAODoutput'
[3] Prefetching for module PATTauIDEmbedder/'slimmedTaus'
[4] Calling method for module DeepTauId/'deepTau2017v2p1ForMini'
Exception Message:
invalid prediction = nan for tau_index = 0, pred_index = 0
The links to logs do not seem to work.
Just to add that this might be related to this issue: https://github.com/cms-sw/cmssw/issues/28358. In that case, the problems were found to be non-reproducible crashes due to problems on the site.
Can this issue be closed? Or is the issue related to something else?
Would it be possible to obtain details on the machine where the crash occurred?
It looks like this issue is a bottleneck in finishing the 2023 data reprocessing. The fraction of failures is not negligible, as can be seen here (look for 8001 error codes). Is there still a way to implement a protection for these failures?
@VinInn we are trying to get this info; will let you know if we manage to dig it out.
Now, also looking at the discussion, which happened in https://github.com/cms-sw/cmssw/issues/40733 - is our understanding correct that this issue is potentially fixed in 12_6_0_pre5?
Now, also looking at the discussion, which happened in #40733 - is our understanding correct that this issue is potentially fixed in 12_6_0_pre5?
No, it is still an exception https://github.com/cms-sw/cmssw/blob/8c3dad4257c96be93fc3c62bd42b83d2207e22f7/RecoTauTag/RecoTau/plugins/DeepTauId.cc#L1296-L1297
@kskovpen Do you have any pointers to the logs of the 8001 failures?
Here it is.
Thanks. This failure occurred on an Intel(R) Xeon(R) CPU E5645 @ 2.40GHz, which is of the Westmere microarchitecture, i.e. SSE-only.
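(For reference, a minimal sketch, not part of any CMS tooling, that flags such SSE-only nodes by the absence of the avx flag in /proc/cpuinfo:)
#!/usr/bin/env python3
# Illustration only: Westmere-era CPUs (E5645, X5650, L5640, ...) lack AVX,
# so a missing 'avx' flag in /proc/cpuinfo marks the SSE-only nodes discussed here.
with open("/proc/cpuinfo") as f:
    flags = set()
    for line in f:
        if line.startswith("flags"):
            flags.update(line.split(":", 1)[1].split())
            break
print("AVX available" if "avx" in flags else "SSE-only (no AVX)")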
This issue appears to be the main bottleneck in the current Run 3 data reprocessing. The initial cause of the issue could be the excessive memory usage of the deepTau-related modules. The issue happens all over the place and there are many examples, e.g. https://cms-unified.web.cern.ch/cms-unified/showlog/?search=ReReco-Run2022E-ZeroBias-27Jun2023-00001#DataProcessing:50660. Was memory profiling done for the latest deepTau implementation in CMSSW?
Trying to look at the logs for 50660, I see the
invalid prediction = -nan for tau_index = 0, pred_index = 0
exception, on an Intel(R) Xeon(R) CPU X5650, which is of the Westmere microarchitecture, i.e. SSE-only. The wmagentJob.log has
2023-06-28 05:13:16,497:INFO:CMSSW:Executing CMSSW. args: ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'el8_amd64_gcc10', 'scramv1', 'CMSSW', 'CMSSW_12_4_14_patch1', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '']
2023-06-28 05:14:12,850:INFO:PerformanceMonitor:PSS: 1015060; RSS: 759448; PCPU: 50.5; PMEM: 1.1
2023-06-28 05:19:13,174:INFO:PerformanceMonitor:PSS: 12010830; RSS: 6378532; PCPU: 123; PMEM: 9.7
2023-06-28 05:19:13,175:ERROR:PerformanceMonitor:Error in CMSSW step cmsRun1
Number of Cores: 4
Job has exceeded maxPSS: 10000 MB
Job has PSS: 12010 MB
2023-06-28 05:19:13,176:ERROR:PerformanceMonitor:Attempting to kill step using SIGUSR2
which is interesting because it claims PSS larger than RSS. I don't understand how that could be. The RSS is reasonable for a 4-core job, and quite consistent with the RSS reported in the CMSSW log. If the timestamps of the wmagentJob.log and cmsRun1-stdout.log can be correlated (i.e. their clocks are close enough), the error by WM above is noticed at the time CMSSW has already terminated the data processing loop and is shutting down. The wmagentJob.log shows that CMSSW terminated with exit code 8001.
For a second job from the same error category, the wmagentJob.log has
2023-06-28 06:02:34,738:INFO:PerformanceMonitor:PSS: 549635; RSS: 668964; PCPU: 38.0; PMEM: 0.2
2023-06-28 06:07:34,962:INFO:PerformanceMonitor:PSS: 9424486; RSS: 7618620; PCPU: 233; PMEM: 2.8
2023-06-28 06:12:35,200:INFO:PerformanceMonitor:PSS: 10024095; RSS: 7686548; PCPU: 308; PMEM: 2.9
2023-06-28 06:12:35,200:ERROR:PerformanceMonitor:Error in CMSSW step cmsRun1
Number of Cores: 4
Job has exceeded maxPSS: 10000 MB
Job has PSS: 10024 MB
2023-06-28 06:12:35,201:ERROR:PerformanceMonitor:Attempting to kill step using SIGUSR2
again showing PSS larger than RSS, and RSS being somewhat compatible with the RSS reported in the CMSSW log. Correlating the wmagentJob.log and cmsRun1-stdout.log, it seems that CMSSW shut itself down after the SIGUSR2 signal. The CMSSW log itself shows no issues; RSS fluctuates between 7.0 and 7.5 GiB.
With these two logs alone I don't see any evidence that deepTau would cause memory issues. The main weirdness to me seems to be PSS becoming larger than RSS, leading to WM asking CMSSW to stop processing (in addition to the exception from deepTau).
The 8001 log (https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022E_ZeroBias_27Jun2023_230627_121019_4766/8001/DataProcessing/04ae17ee-b07e-4ca6-b32e-4e51ee944afe-36-0-logArchive/job/) shows the job throwing the exception was also run on an Intel(R) Xeon(R) CPU X5650, i.e. Westmere and SSE-only.
do we have any such reported failures at CERN? (eg, why does the T0 not see this same issue)
The failures are strongly correlated with the sites. The highest failure rates are observed at T2_BE_IIHE, T2_US_Nebraska, and T2_US_Caltech.
Another example of highly failing workflow: https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_Run2022B_HLTPhysics_27Jun2023_230627_115530_2305
@makortel Hi Matti, for the test workflow that I was running, which reported issues like:
invalid prediction = -nan for tau_index = 0, pred_index = 0
it happened at 2 sites only (13 failures at MIT and 3 at Nebraska).
A couple of logs for MIT are available in CERN EOS:
/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-2055-0-log.tar.gz
and
/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-206-0-log.tar.gz
while for Nebraska they are:
/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-1662-0-log.tar.gz
and
/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-1668-0-log.tar.gz
Thanks @amaltaro for the logs.
A couple of logs for MIT are available in CERN EOS:
/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-2055-0-log.tar.gz
This job failed because the input data from
file:/mnt/hadoop/cms/store/data/Run2022D/MuonEG/RAW/v1/000/357/688/00000/819bbdb2-43c0-4393-aa84-4f0a81ad5f9e.root
was corrupted; the CMSSW log has
R__unzipLZMA: error 9 in lzma_code
----- Begin Fatal Exception 01-Jul-2023 05:39:45 EDT-----------------------
An exception of category 'FileReadError' occurred while
[0] Processing Event run: 357688 lumi: 48 event: 85850129 stream: 2
[1] Running path 'AODoutput_step'
[2] Prefetching for module PoolOutputModule/'AODoutput'
[3] While reading from source GlobalObjectMapRecord hltGtStage2ObjectMap '' HLT
[4] Rethrowing an exception that happened on a different read request.
[5] Processing Event run: 357688 lumi: 48 event: 86242374 stream: 0
[6] Running path 'dqmoffline_17_step'
[7] Prefetching for module CaloTowersAnalyzer/'AllCaloTowersDQMOffline'
[8] Prefetching for module CaloTowersCreator/'towerMaker'
[9] Prefetching for module HBHEPhase1Reconstructor/'hbhereco@cpu'
[10] Prefetching for module HcalRawToDigi/'hcalDigis'
[11] While reading from source FEDRawDataCollection rawDataCollector '' LHC
[12] Reading branch FEDRawDataCollection_rawDataCollector__LHC.
Additional Info:
[a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
fNbytes = 2802592, fKeylen = 115, fObjlen = 4817986, noutot = 0, nout=0, nin=2802477, nbuf=4817986
----- End Fatal Exception -------------------------------------------------
I'm puzzled why the FileReadError resulted in exit code 8001 instead of 8021, but I'll open a separate issue for that (https://github.com/cms-sw/cmssw/issues/42179).
/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-206-0-log.tar.gz
This file doesn't seem to exist.
while for Nebraska they are:
/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-1662-0-log.tar.gz /eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-1668-0-log.tar.gz
These jobs failed with the invalid prediction = nan for tau_index = 0, pred_index = 0 exception. The node had an Intel(R) Xeon(R) CPU X5650, i.e. SSE-only (and thus consistent with the discussion above).
Another example of highly failing workflow: https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_Run2022B_HLTPhysics_27Jun2023_230627_115530_2305
Here
https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022B_HLTPhysics_27Jun2023_230627_115530_2305/50660/DataProcessing/1a49954c-1f29-41bd-aa2f-b19144eead34-0-0-logArchive/
shows the combination of the invalid prediction = nan for tau_index = 0, pred_index = 0 exception (the CPU is an Intel(R) Xeon(R) CPU X5650, i.e. SSE-only) and WM seeing PSS going over the limit, while RSS is much smaller and reasonable for a 4-core job.
In https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022B_HLTPhysics_27Jun2023_230627_115530_2305/50660/DataProcessing/1a49954c-1f29-41bd-aa2f-b19144eead34-2-0-logArchive/ WM sees PSS going over the limit, while RSS is smaller and reasonable for a 4-core job.
The main weirdness to me seems to be PSS becoming larger than RSS, leading to WM asking CMSSW to stop processing (in addition to the exception from deepTau).
Poking into the WM code, I see that the PSS is read from /proc/<PID>/smaps and the RSS from ps (https://github.com/dmwm/WMCore/blob/762bae943528241f67625016fd019ebcd0014af1/src/python/WMCore/WMRuntime/Monitors/PerformanceMonitor.py#L242). IIUC ps uses /proc/PID/stat (which is also what CMSSW's SimpleMemoryCheck printout uses), and apparently stat and smaps are known to report different numbers (e.g. https://unix.stackexchange.com/questions/56469/rssresident-set-size-is-differ-when-use-pmap-and-ps-command).
But is such a large (~3 GB, ~30 %) difference expected? (OK, we don't know what the RSS as reported by smaps would be.)
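For concreteness, a minimal sketch (assuming Linux /proc semantics; this is not the WMCore code) of the two readings being compared here: PSS summed from /proc/<PID>/smaps, versus RSS taken from /proc/<PID>/stat as ps reports it.
#!/usr/bin/env python3
# Illustration only: compare PSS summed from /proc/<pid>/smaps (similar to what
# the WM PerformanceMonitor linked above does) with RSS from /proc/<pid>/stat.
import os
import sys

def pss_from_smaps(pid):
    # Sum all 'Pss:' entries (in kB) over the mappings of the process.
    total_kb = 0
    with open(f"/proc/{pid}/smaps") as f:
        for line in f:
            if line.startswith("Pss:"):
                total_kb += int(line.split()[1])
    return total_kb

def rss_from_stat(pid):
    # The rss field of /proc/<pid>/stat is in pages; convert to kB as ps does.
    with open(f"/proc/{pid}/stat") as f:
        fields = f.read().rsplit(")", 1)[1].split()  # drop the 'pid (comm)' prefix
    return int(fields[21]) * os.sysconf("SC_PAGE_SIZE") // 1024

if __name__ == "__main__":
    pid = int(sys.argv[1])
    print(f"PSS (smaps): {pss_from_smaps(pid)} kB")
    print(f"RSS (stat):  {rss_from_stat(pid)} kB")
For a fixed set of processes sampled at the same moment, the summed PSS can never exceed the summed RSS, which is why the WM numbers above look suspicious unless the two values were sampled at different times.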
In case it's useful to correlate sites and CPU types - this is what's been running recently...
https://gist.github.com/davidlange6/74232d064422e036c176fb992d90357e
I looked at the ClassAd monitoring for the Run2022* rereco, explicitly at exit code 8001. The only CPU models with a particularly high job failure rate (>5%) are below - the last number is the number of jobs in the monitoring info I looked at. There were a total of 571k jobs and an average rate of 0.5% of failures with exit code 8001.
lots of 2009/2010 era processors.
65.8 %  Intel(R) Xeon(R) CPU L5520 @ 2.27GHz          41
38.5 %  Intel(R) Xeon(R) CPU L5640 @ 2.27GHz         651
30.7 %  AMD EPYC 7702P 64-Core Processor              39
18.2 %  Intel(R) Xeon(R) CPU E5520 @ 2.27GHz        1582
18.1 %  Intel(R) Xeon(R) CPU E5-2650L v4 @ 1.70GHz   232
17.8 %  Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz    73
16.2 %  Intel(R) Xeon(R) CPU X5650 @ 2.67GHz        9707
 7.3 %  Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz   383
and more generally - models with a high rate of non-zero exit codes for the rereco are below (total of 583k jobs and a 2.5% failure rate)
66.6 %  AMD EPYC 7702P 64-Core Processor               81
65.8 %  Intel(R) Xeon(R) CPU L5520 @ 2.27GHz           41
62.1 %  Intel(R) Xeon(R) CPU E5-2650L v4 @ 1.70GHz    502
53.5 %  Intel(R) Xeon(R) CPU E5-2650L v4@ 1.70GHz      56
49.5 %  Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz    119
38.5 %  Intel(R) Xeon(R) CPU L5640 @ 2.27GHz          651
19.7 %  AMD EPYC 7452 32-Core Processor              1107
18.9 %  Intel(R) Xeon(R) CPU E5520 @ 2.27GHz         1596
18.5 %  Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz    436
16.4 %  Intel(R) Xeon(R) CPU X5650 @ 2.67GHz         9735
16.1 %  Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz     420
15.2 %  Intel(R) Xeon(R) CPU E5-2618L v4 @ 2.20GHz    491
11.5 %  Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz      716
 9.2 %  Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz    1173
 8.6 %  Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz    173
 7.5 %  Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz      519
 5.9 %  Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz     952
 5.8 %  AMD EPYC 7282 16-Core Processor             11747
 5.2 %  Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz    6758
@davidlange6 what is the normalization for those "%"?
Percentage of jobs that ran on that kind of cpu which failed
2/3 of the jobs running on the 7702P failing? Is it not a "storm" on a few faulty nodes? According to your previous table they are only at T1_RU_JINR (the others are 7702, no P); same for the L5520, only at T2_BE_IIHE.
Cannot eliminate that. I can try to find out how many nodes are involved in the non-zero exit code jobs (I think that can be derived).
Waiting for spark to cooperate to at least break it down by site - but e.g. the 7702P in the table above was just 81 out of 583k jobs, so the fact that 2/3rds failed probably doesn't matter too much...
Here it is by site - so yes, all the 7702P jobs are at Bari.
+------------------+----------------------------------------------+----------------+------------+
| Failure rate (%) | CPU | Site | Total jobs |
+------------------+----------------------------------------------+----------------+------------+
| 100.0 | Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz | T1_US_FNAL | 2 |
| 100.0 | Intel(R) Xeon(R) Platinum 8368 CPU @ 2.40GHz | T2_DE_DESY | 2 |
| 100.0 | Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz | T2_US_Caltech | 4 |
| 100.0 | Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz | T2_DE_DESY | 2 |
| 100.0 | Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz | T1_US_FNAL | 17 |
| 100.0 | Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz | T2_US_MIT | 2 |
| 100.0 | Intel(R) Xeon(R) CPU E5-2650L v4@ 1.70GHz | T2_US_Nebraska | 2 |
| 100.0 | Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz | T2_US_MIT | 2 |
| 100.0 | Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz | T1_IT_CNAF | 2 |
| 100.0 | Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz | T2_US_MIT | 14 |
| 100.0 | Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz | T2_US_Caltech | 14 |
| 100.0 | Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz | T2_US_Nebraska | 2 |
| 100.0 | Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | T1_US_FNAL | 2 |
| 100.0 | AMD EPYC 7702 64-Core Processor | T1_IT_CNAF | 3 |
| 100.0 | AMD EPYC 7551 32-Core Processor | T2_US_Nebraska | 2 |
| 100.0 | AMD EPYC 7551 32-Core Processor | T2_US_Caltech | 2 |
| 100.0 | AMD EPYC 7551 32-Core Processor | T1_IT_CNAF | 2 |
| 100.0 | AMD EPYC 7542 32-Core Processor | T2_US_Nebraska | 2 |
| 100.0 | AMD EPYC 7542 32-Core Processor | T2_DE_DESY | 2 |
| 100.0 | AMD EPYC 7542 32-Core Processor | T1_DE_KIT | 4 |
| 100.0 | AMD EPYC 7443 24-Core Processor | T1_US_FNAL | 14 |
| 100.0 | AMD EPYC 7351 16-Core Processor | T1_DE_KIT | 2 |
| 100.0 | AMD EPYC 7282 16-Core Processor | T2_US_Caltech | 2 |
| 66.6 | AMD EPYC 7702P 64-Core Processor | T2_IT_Bari | 81 |
| 65.8 | Intel(R) Xeon(R) CPU L5520 @ 2.27GHz | T2_BE_IIHE | 41 |
| 62.1 | Intel(R) Xeon(R) CPU E5-2650L v4 @ 1.70GHz | T2_BE_IIHE | 502 |
| 51.8 | Intel(R) Xeon(R) CPU E5-2650L v4@ 1.70GHz | T2_BE_IIHE | 54 |
| 47.8 | Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz | T2_BE_IIHE | 115 |
| 38.5 | Intel(R) Xeon(R) CPU L5640 @ 2.27GHz | T2_US_Caltech | 651 |
| 25.9 | AMD EPYC 7351 16-Core Processor | T2_IT_Legnaro | 216 |
| 19.9 | Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz | T2_IT_Legnaro | 341 |
| 19.7 | AMD EPYC 7452 32-Core Processor | T2_BE_IIHE | 1107 |
| 18.9 | Intel(R) Xeon(R) CPU E5520 @ 2.27GHz | T2_US_Nebraska | 1596 |
| 18.8 | Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz | T2_IT_Legnaro | 281 |
| 17.3 | Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz | T2_IT_Legnaro | 230 |
| 16.4 | Intel(R) Xeon(R) CPU X5650 @ 2.67GHz | T2_US_Nebraska | 9735 |
| 15.7 | Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz | T1_IT_CNAF | 418 |
| 15.2 | Intel(R) Xeon(R) CPU E5-2618L v4 @ 2.20GHz | T1_IT_CNAF | 491 |
| 15.0 | Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz | T2_IT_Legnaro | 179 |
| 13.6 | Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz | T2_BE_IIHE | 95 |
| 13.0 | AMD EPYC 7282 16-Core Processor | T2_IT_Legnaro | 3010 |
| 11.5 | Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz | T2_IT_Legnaro | 716 |
| 10.4 | AMD EPYC 7313 16-Core Processor | T1_IT_CNAF | 1465 |
| 8.9 | Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz | T2_BE_IIHE | 146 |
| 7.5 | Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz | T2_US_Caltech | 519 |
| 7.3 | Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz | T2_DE_DESY | 941 |
| 6.9 | AMD Opteron(tm) Processor 6378 | T2_DE_DESY | 810 |
| 5.9 | Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz | T2_DE_DESY | 952 |
| 5.3 | AMD Opteron(tm) Processor 6376 | T1_US_FNAL | 3364 |
| 5.1 | Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz | T2_DE_DESY | 6743 |
+------------------+----------------------------------------------+----------------+------------+
btw, I've made several tests with various CMS jobs and never seen any inconsistency in the memory reports.
I used this (pid set by hand..):
awk '/^Rss/ {rss += $2} END {print rss}' /proc/164691/smaps ; awk '/^Pss/ {pss += $2} END {print pss}' /proc/164691/smaps ; ps -v 164691
PSS can be larger than RSS only because they are read at different times: not by 3 GB! So the report from one of the jobs above is really weird (a consistent report of PSS >> RSS) and must be investigated. It may be worth asking WM to dump the whole content of smaps (and statm) in the event a job is going to be killed.
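As a sketch of that suggestion (purely illustrative, not a WMCore patch; the file names and the hook point are assumptions), the dump itself is just a copy of the raw /proc files:
#!/usr/bin/env python3
# Illustration only: save the raw /proc memory accounting of a process for
# post-mortem inspection, e.g. right before the monitor kills a job over maxPSS.
import sys

def dump_proc_memory(pid, outdir="."):
    for name in ("smaps", "statm", "status"):
        src = f"/proc/{pid}/{name}"
        dst = f"{outdir}/memdump-{pid}-{name}.txt"
        try:
            with open(src) as fsrc, open(dst, "w") as fdst:
                fdst.write(fsrc.read())
        except OSError as err:
            print(f"could not dump {src}: {err}", file=sys.stderr)

if __name__ == "__main__":
    dump_proc_memory(int(sys.argv[1]))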
I looked at a random failure and it seems to me that PSS vs RSS is a red herring: https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022D_JetMET_27Jun2023_230627_120337_7589/50660/DataProcessing/a9b71227-daa7-471e-935e-ac0a7b906e50-93-1-logArchive/wmagentJob.log The job used about 10GB of RSS when it reached 10GB of PSS.
Is the memory consumption in the current round of re-reco reproducible?
I was talking to Brian B. at the OSG meeting today. He suggested a mechanism that would explain the semi-reproducible nature of this problem, and why it happens more at Nebraska. At Nebraska they have set up their cgroups such that as soon as the process uses one byte over what was requested, it gets killed. I know from the past that other sites are more forgiving, and it has been possible that, if the watcher process doesn't check often enough, a process can spike up in memory use, release it, and avoid being killed. Brian told me that Nebraska does this specifically to make problems more reproducible.
I agree with Vincenzo that the RSS / PSS inconsistency can only be due to reading them at different times during the process. However, is 3 GB really so unbelievably large? How big is the DeepTau ML model? Can someone in the CORE group say how that is accessed during the event processing? I second the call for a dump of smaps right before the kill; maybe send one of the failures to Nebraska so that we know it will happen?
@drkovalskyi : is there a recipe on how to reproduce exactly that job?
@z4027163 should be able to provide more information
From the TauPOG side we will discuss this in our meeting this afternoon to try and find a solution. In the meantime, do you know if the crashes are always/usually related to DeepTauv2p1? We have a more up-to-date version (v2p5) but were keeping v2p1 as a backup. If it is really problematic we could consider removing v2p1, although if v2p5 has the same problems then there might not be much point.
Job details are at: https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022D_JetMET_27Jun2023_230627_120337_7589/50660/DataProcessing/a9b71227-daa7-471e-935e-ac0a7b906e50-93-1-logArchive/job/WMTaskSpace/cmsRun1/ @z4027163 should be able to provide more information
A ready to run conf file would be useful.
the PSet.py in CMSSW_12_4_14_patch1 should be ready to run (modulo input data availability..)
I copied the PSet.py here for convenience: /afs/cern.ch/work/m/mnguyen//public/tauCrash/
Running it on lxplus I didn't find a particularly high RSS from the simple memory checker. Removing deepTau from miniAOD didn't seem to change the memory footprint quoted by that tool much. Based on comments above though, I guess a more granular tool is needed.
igprof?
# ${1}: config file name without .py; ${2}: a label used in the output file names
set ver=${2}_${CMSSW_VERSION}_${SCRAM_ARCH}
rm ${1}_${ver}_mem.out
#taskset -c 2
# run the job under the igprof memory profiler
igprof -mp -z -o ${1}_${ver}_mem.gz -t cmsRunGlibC cmsRunGlibC ${1}.py >& ${1}_${ver}_mem.out
# total allocations and peak live memory, dumped into sqlite for browsing
igprof-analyse --value normal -s -v -d --gdb ${1}_${ver}_mem.gz | sqlite3 ${1}_${ver}_memTot.sql3
igprof-analyse --value peak -r MEM_LIVE -s -v -d --gdb ${1}_${ver}_mem.gz | sqlite3 ${1}_${ver}_memPeak.sql3
mv ${1}_${ver}_mem*.sql3 ~/www/perfResults/data/.
The point is to also look at PSS. No one has seen a problem with RSS.
@davidlange6 my example has ~10GB RSS
@mandrenguyen what memory consumption do you see? Have you been able to process both files (/store/data/Run2022D/JetMET/RAW/v1/000/357/899/00000/553156f5-b08e-4fcf-9318-a72b73572c76.root and /store/data/Run2022D/JetMET/RAW/v1/000/357/899/00000/c02e754c-9125-4a23-a14b-114ecf4697b2.root)?
Overall I don't see any obvious issue besides us using more memory than was requested. Tier0 was able to handle the data without any issues with 8 cores and 16 GB. 4 cores with 10 GB may simply not be enough.
I'm trying to process those files on lxplus8; it is ok now (apparently my first voms_init did not work).
It did nothing! I modified the PSet from @mandrenguyen
cat PSet.py
import FWCore.ParameterSet.Config as cms
import pickle

with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)

process.MINIAODoutput.outputCommands.extend(['drop *_slimmedTaus_*_*','keep *_slimmedTausBoosted_*_*'])
process.source.fileNames = cms.untracked.vstring(
    '/store/data/Run2022D/JetMET/RAW/v1/000/357/899/00000/553156f5-b08e-4fcf-9318-a72b73572c76.root',
    '/store/data/Run2022D/JetMET/RAW/v1/000/357/899/00000/c02e754c-9125-4a23-a14b-114ecf4697b2.root'
)
and it did nothing, it just opened and closed the files.
If the crashes are due to high memory consumption, could it be related to whether DeepTau is evaluated for all taus in the event at once or one tau at a time? The batch evaluation was introduced in this PR https://github.com/cms-sw/cmssw/pull/28128 to improve the timing performance. The downside of this modification is that it requires approximately n_taus times more memory to store the inputs. Could this be an issue for busy events?
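To illustrate the scaling being asked about (the grid and feature counts below are made-up placeholders, not the actual DeepTau v2p1 input layout):
import numpy as np

# Illustration only: how a batched per-event input grows with the number of tau
# candidates when all taus are evaluated at once instead of one at a time.
N_CELLS = 11 * 11      # hypothetical grid cells per tau
N_FEATURES = 200       # hypothetical features per cell

def batched_input_bytes(n_taus, dtype=np.float32):
    # Memory needed to hold the inputs for all taus of one event simultaneously.
    return n_taus * N_CELLS * N_FEATURES * np.dtype(dtype).itemsize

for n_taus in (1, 5, 20):
    mib = batched_input_bytes(n_taus) / 1024**2
    print(f"{n_taus:3d} taus -> {mib:.2f} MiB of inputs")
Whether this per-event growth actually matters depends on the real grid sizes and on how many input tensors are kept alive at the same time.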
@drkovalskyi's example is killed just after opening the second file....
Indeed, your example is not representative of the issue that this thread has discussed... Maybe there is an example of one with a divergent rss and pss?
2023-06-30 06:05:59,681:INFO:PerformanceMonitor:PSS: 10208008; RSS: 10214928; PCPU: 266; PMEM: 1.9
Otherwise, you raise a good point - why 4 cores?
We are seeing failures in the ongoing Run 3 data reprocessing, presumably related to the DeepTau implementation. Here is just one example of the failure: https://cms-unified.web.cern.ch/cms-unified/report/haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171338_6693
The crash message is:
PdmV