cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/

Failures in Run 3 data reprocessing #40437

Open kskovpen opened 1 year ago

kskovpen commented 1 year ago

We are seeing failures in the ongoing Run 3 data reprocessing, presumably related to the DeepTau implementation. Here is just one example of the failure: https://cms-unified.web.cern.ch/cms-unified/report/haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171338_6693

The crash message is:

Exception Message: invalid prediction = nan for tau_index = 0, pred_index = 0

PdmV

cmsbuild commented 1 year ago

A new Issue was created by @kskovpen .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

Dr15Jones commented 1 year ago

Assign reconstruction

cmsbuild commented 1 year ago

New categories assigned: reconstruction

@mandrenguyen, @clacaputo you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 1 year ago

A bit more context from the link

Fatal Exception (Exit code: 8001)
An exception of category 'DeepTauId' occurred while
[0] Processing Event run: 357696 lumi: 141 event: 162424424 stream: 0
[1] Running path 'MINIAODoutput_step'
[2] Prefetching for module PoolOutputModule/'MINIAODoutput'
[3] Prefetching for module PATTauIDEmbedder/'slimmedTaus'
[4] Calling method for module DeepTauId/'deepTau2017v2p1ForMini'
Exception Message:
invalid prediction = nan for tau_index = 0, pred_index = 0

The links to logs do not seem to work.

danielwinterbottom commented 1 year ago

Just to add that this might be related to this issue: https://github.com/cms-sw/cmssw/issues/28358. In that case, the problems were found to be non-reproducible crashes due to problems on the site.

srimanob commented 1 year ago

Can this issue be closed? Or is the issue related to something else?

VinInn commented 1 year ago

Would it be possible to obtain details on the machine where the crash occurred?

kskovpen commented 1 year ago

It looks like this issue is a bottleneck in finishing the 2023 data reprocessing. The fraction of failures is not negligible, as can be seen here (look for 8001 error codes). Is there still a way to implement a protection against these failures?

@VinInn we are trying to get this info; will let you know, if we manage to dig it out.

kskovpen commented 1 year ago

Now, also looking at the discussion, which happened in https://github.com/cms-sw/cmssw/issues/40733 - is our understanding correct that this issue is potentially fixed in 12_6_0_pre5?

makortel commented 1 year ago

Now, also looking at the discussion, which happened in #40733 - is our understanding correct that this issue is potentially fixed in 12_6_0_pre5?

No, it is still an exception https://github.com/cms-sw/cmssw/blob/8c3dad4257c96be93fc3c62bd42b83d2207e22f7/RecoTauTag/RecoTau/plugins/DeepTauId.cc#L1296-L1297
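
(The check at those lines just rejects non-finite network outputs. A rough Python/numpy rendering of the same guard, purely illustrative since the actual code is C++:)

import numpy as np

def check_outputs(pred):
    # pred: (n_taus, n_classes) array of raw network outputs for one event
    for tau_index, row in enumerate(pred):
        for pred_index, value in enumerate(row):
            if not np.isfinite(value):
                raise RuntimeError(
                    f"invalid prediction = {value} for tau_index = {tau_index}, "
                    f"pred_index = {pred_index}")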

makortel commented 1 year ago

@kskovpen Do you have any pointers to the logs of the 8001 failures?

kskovpen commented 1 year ago

@kskovpen Do you have any pointers to the logs of the 8001 failures?

Here it is.

makortel commented 1 year ago

@kskovpen Do you have any pointers to the logs of the 8001 failures?

Here it is.

Thanks. This failure occurred on Intel(R) Xeon(R) CPU E5645 @ 2.40GHz, which is of Westmere microarchitecture, i.e. SSE-only.
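
(As a quick way to tell which hosts are SSE-only, one can look at the instruction-set flags in /proc/cpuinfo. A small standalone sketch, not part of any workflow:)

def host_simd_level(cpuinfo_path="/proc/cpuinfo"):
    # return the highest SIMD level advertised in the CPU flags
    flags = set()
    with open(cpuinfo_path) as f:
        for line in f:
            if line.startswith("flags"):
                flags.update(line.split(":", 1)[1].split())
    for level in ("avx512f", "avx2", "avx", "sse4_2", "sse2"):
        if level in flags:
            return level
    return "unknown"

print(host_simd_level())  # Westmere-era Xeons report sse4_2 at best, no AVX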

kskovpen commented 1 year ago

This issue appears to be the main bottleneck in the current Run 3 data reprocessing. The initial cause could be the excessive memory usage of the deepTau-related modules. The issue happens all over the place and there are many examples, e.g. https://cms-unified.web.cern.ch/cms-unified/showlog/?search=ReReco-Run2022E-ZeroBias-27Jun2023-00001#DataProcessing:50660. Was memory profiling done for the latest deepTau implementation in CMSSW?

makortel commented 1 year ago

https://cms-unified.web.cern.ch/cms-unified/showlog/?search=ReReco-Run2022E-ZeroBias-27Jun2023-00001#DataProcessing:50660

Trying to look at the logs for 50660, I see

With these two logs alone I don't see any evidence that deepTau would cause memory issues. The main weirdness to me seems to be PSS becoming larger than RSS, leading to WM asking CMSSW to stop processing (in addition to the exception from deepTau).

makortel commented 1 year ago

The 8001 log (https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022E_ZeroBias_27Jun2023_230627_121019_4766/8001/DataProcessing/04ae17ee-b07e-4ca6-b32e-4e51ee944afe-36-0-logArchive/job/) shows the job throwing the exception was also run on Intel(R) Xeon(R) CPU X5650, i.e. Westmere and SSE-only.

davidlange6 commented 1 year ago

Do we have any such reported failures at CERN? (E.g., why does the T0 not see this same issue?)

kskovpen commented 1 year ago

The failures are strongly correlated with the sites. The highest failure rates are observed at T2_BE_IIHE, T2_US_Nebraska, and T2_US_Caltech.

kskovpen commented 1 year ago

Another example of highly failing workflow: https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_Run2022B_HLTPhysics_27Jun2023_230627_115530_2305

amaltaro commented 1 year ago

@makortel Hi Matti, for the test workflow that I was running and reported issues like:

invalid prediction = -nan for tau_index = 0, pred_index = 0

it happened at 2 sites only (13 failures at MIT and 3 at Nebraska).

A couple of logs for MIT are available in CERN EOS:

/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-2055-0-log.tar.gz
and
/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-206-0-log.tar.gz

while for Nebraska they are:

/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-1662-0-log.tar.gz
and
/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-1668-0-log.tar.gz

makortel commented 1 year ago

Thanks @amaltaro for the logs.

A couple of logs for MIT are available in CERN EOS:

/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-2055-0-log.tar.gz

This job failed because the input data from file:/mnt/hadoop/cms/store/data/Run2022D/MuonEG/RAW/v1/000/357/688/00000/819bbdb2-43c0-4393-aa84-4f0a81ad5f9e.root was corrupted; the CMSSW log has

R__unzipLZMA: error 9 in lzma_code
----- Begin Fatal Exception 01-Jul-2023 05:39:45 EDT-----------------------
An exception of category 'FileReadError' occurred while
   [0] Processing  Event run: 357688 lumi: 48 event: 85850129 stream: 2
   [1] Running path 'AODoutput_step'
   [2] Prefetching for module PoolOutputModule/'AODoutput'
   [3] While reading from source GlobalObjectMapRecord hltGtStage2ObjectMap '' HLT
   [4] Rethrowing an exception that happened on a different read request.
   [5] Processing  Event run: 357688 lumi: 48 event: 86242374 stream: 0
   [6] Running path 'dqmoffline_17_step'
   [7] Prefetching for module CaloTowersAnalyzer/'AllCaloTowersDQMOffline'
   [8] Prefetching for module CaloTowersCreator/'towerMaker'
   [9] Prefetching for module HBHEPhase1Reconstructor/'hbhereco@cpu'
   [10] Prefetching for module HcalRawToDigi/'hcalDigis'
   [11] While reading from source FEDRawDataCollection rawDataCollector '' LHC
   [12] Reading branch FEDRawDataCollection_rawDataCollector__LHC.
   Additional Info:
      [a] Fatal Root Error: @SUB=TBasket::ReadBasketBuffers
fNbytes = 2802592, fKeylen = 115, fObjlen = 4817986, noutot = 0, nout=0, nin=2802477, nbuf=4817986

----- End Fatal Exception -------------------------------------------------

I'm puzzled why the FileReadError resulted in 8001 exit code instead of 8021, but I'll open a separate issue for that (https://github.com/cms-sw/cmssw/issues/42179).

/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-206-0-log.tar.gz

This file doesn't seem to exist.

while for Nebraska they are:

/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-1662-0-log.tar.gz
/eos/cms/store/logs/prod/recent/TESTBED/amaltaro_ReReco_2022DLumiMask_June2023_Val_230628_222652_3143/DataProcessing/vocms0193.cern.ch-1668-0-log.tar.gz

These jobs failed with the invalid prediction = nan for tau_index = 0, pred_index = 0 exception. The node had an Intel(R) Xeon(R) CPU X5650, i.e. SSE-only (and thus consistent with the discussion above).

makortel commented 1 year ago

Another example of highly failing workflow: https://cms-unified.web.cern.ch/cms-unified/report/pdmvserv_Run2022B_HLTPhysics_27Jun2023_230627_115530_2305

Here

https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022B_HLTPhysics_27Jun2023_230627_115530_2305/50660/DataProcessing/1a49954c-1f29-41bd-aa2f-b19144eead34-0-0-logArchive/ shows the combination of the invalid prediction = nan for tau_index = 0, pred_index = 0 exception (the CPU is an Intel(R) Xeon(R) CPU X5650, i.e. SSE-only) and WM seeing PSS going over the limit, while RSS is much smaller and reasonable for a 4-core job.

https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022B_HLTPhysics_27Jun2023_230627_115530_2305/50660/DataProcessing/1a49954c-1f29-41bd-aa2f-b19144eead34-2-0-logArchive/ shows WM seeing PSS going over the limit, while RSS is smaller and reasonable for a 4-core job.

makortel commented 1 year ago

The main weirdness to me seems to be PSS becoming larger than RSS leading to WM asking CMSSW to stop processing (in addition of the exception from deepTau).

Poking into the WM code, I see the PSS is read from /proc/<PID>/smaps, and RSS from ps https://github.com/dmwm/WMCore/blob/762bae943528241f67625016fd019ebcd0014af1/src/python/WMCore/WMRuntime/Monitors/PerformanceMonitor.py#L242. IIUC ps uses /proc/<PID>/stat (which is also what CMSSW's SimpleMemoryCheck printouts use), and apparently stat and smaps are known to report different numbers (e.g. https://unix.stackexchange.com/questions/56469/rssresident-set-size-is-differ-when-use-pmap-and-ps-command).

But is this large (~3 GB, ~30 %) difference expected? (OK, we don't know what the RSS as reported by smaps would be.)
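
(For reference, a minimal sketch of reading the two quantities the same way they are compared here: PSS summed from /proc/<pid>/smaps, as in the WMCore monitor, and RSS from /proc/<pid>/stat, as in ps. The helper names are mine and this is only an illustration:)

import os

def pss_kb(pid):
    # sum the Pss: lines of /proc/<pid>/smaps, i.e. what PerformanceMonitor uses
    total = 0
    with open(f"/proc/{pid}/smaps") as f:
        for line in f:
            if line.startswith("Pss:"):
                total += int(line.split()[1])
    return total

def rss_kb(pid):
    # RSS in pages is field 24 of /proc/<pid>/stat, i.e. what ps reports
    # (naive split: assumes the process name contains no spaces)
    with open(f"/proc/{pid}/stat") as f:
        fields = f.read().split()
    return int(fields[23]) * os.sysconf("SC_PAGE_SIZE") // 1024

pid = os.getpid()
print(f"PSS: {pss_kb(pid)} kB, RSS: {rss_kb(pid)} kB")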

davidlange6 commented 1 year ago

In case it's useful to correlate sites and CPU types, this is what's been running recently...

https://gist.github.com/davidlange6/74232d064422e036c176fb992d90357e

davidlange6 commented 1 year ago

I looked at the ClassAd monitoring for the Run2022* rereco, specifically at exit code 8001. The only CPU models with a particularly high job failure rate (>5%) are listed below; the last number is the number of jobs in the monitoring info I looked at. There were a total of 571k jobs and an average rate of 0.5% of jobs ending with exit code 8001.

Lots of 2009/2010-era processors.

65.8 %  Intel(R) Xeon(R) CPU L5520 @ 2.27GHz         41
38.5 %  Intel(R) Xeon(R) CPU L5640 @ 2.27GHz         651
30.7 %  AMD EPYC 7702P 64-Core Processor             39
18.2 %  Intel(R) Xeon(R) CPU E5520 @ 2.27GHz         1582
18.1 %  Intel(R) Xeon(R) CPU E5-2650L v4 @ 1.70GHz   232
17.8 %  Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz   73
16.2 %  Intel(R) Xeon(R) CPU X5650 @ 2.67GHz         9707
 7.3 %  Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz   383

davidlange6 commented 1 year ago

More generally, CPU models with a generally high non-zero exit code rate for the rereco are below (total of 583k jobs and a 2.5% failure rate).

66.6 %  AMD EPYC 7702P 64-Core Processor             81
65.8 %  Intel(R) Xeon(R) CPU L5520 @ 2.27GHz         41
62.1 %  Intel(R) Xeon(R) CPU E5-2650L v4 @ 1.70GHz   502
53.5 %  Intel(R) Xeon(R) CPU E5-2650L v4@ 1.70GHz    56
49.5 %  Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz   119
38.5 %  Intel(R) Xeon(R) CPU L5640 @ 2.27GHz         651
19.7 %  AMD EPYC 7452 32-Core Processor              1107
18.9 %  Intel(R) Xeon(R) CPU E5520 @ 2.27GHz         1596
18.5 %  Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz   436
16.4 %  Intel(R) Xeon(R) CPU X5650 @ 2.67GHz         9735
16.1 %  Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz    420
15.2 %  Intel(R) Xeon(R) CPU E5-2618L v4 @ 2.20GHz   491
11.5 %  Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz     716
 9.2 %  Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz    1173
 8.6 %  Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz   173
 7.5 %  Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz     519
 5.9 %  Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz    952
 5.8 %  AMD EPYC 7282 16-Core Processor              11747
 5.2 %  Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz    6758

VinInn commented 1 year ago

@davidlange6 what is the normalization for those "%"?

davidlange6 commented 1 year ago

The percentage of jobs that ran on that kind of CPU and failed.

VinInn commented 1 year ago

Two thirds of the jobs running on the 7702P failing? Is it not a "storm" on a few faulty nodes? According to your previous table they are only at T1_RU_JINR (the others are 7702, no P); the same for the L5520, only at T2_BE_IIHE.

davidlange6 commented 1 year ago

Can't eliminate that possibility. I can try to find out how many nodes are involved in the non-zero exit code jobs (I think that can be derived).

davidlange6 commented 1 year ago

Waiting for Spark to cooperate to at least break this down by site, but e.g. the 7702P in the table above was just 81 out of 583k jobs, so the fact that 2/3 of them failed probably doesn't matter too much...

davidlange6 commented 1 year ago

Here it is by site; so yes, all the 7702P jobs are at Bari.

+------------------+----------------------------------------------+----------------+------------+
| Failure rate (%) |                     CPU                      |      Site      | Total jobs |
+------------------+----------------------------------------------+----------------+------------+
|      100.0       |  Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz  |   T1_US_FNAL   |     2      |
|      100.0       | Intel(R) Xeon(R) Platinum 8368 CPU @ 2.40GHz |   T2_DE_DESY   |     2      |
|      100.0       |  Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz   | T2_US_Caltech  |     4      |
|      100.0       |  Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz   |   T2_DE_DESY   |     2      |
|      100.0       |  Intel(R) Xeon(R) CPU E5-2697 v4 @ 2.30GHz   |   T1_US_FNAL   |     17     |
|      100.0       |  Intel(R) Xeon(R) CPU E5-2670 v3 @ 2.30GHz   |   T2_US_MIT    |     2      |
|      100.0       |  Intel(R) Xeon(R) CPU E5-2650L v4@ 1.70GHz   | T2_US_Nebraska |     2      |
|      100.0       |  Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz  |   T2_US_MIT    |     2      |
|      100.0       |  Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz  |   T1_IT_CNAF   |     2      |
|      100.0       |  Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz   |   T2_US_MIT    |     14     |
|      100.0       |  Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz   | T2_US_Caltech  |     14     |
|      100.0       |  Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz   | T2_US_Nebraska |     2      |
|      100.0       |  Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz   |   T1_US_FNAL   |     2      |
|      100.0       |       AMD EPYC 7702 64-Core Processor        |   T1_IT_CNAF   |     3      |
|      100.0       |       AMD EPYC 7551 32-Core Processor        | T2_US_Nebraska |     2      |
|      100.0       |       AMD EPYC 7551 32-Core Processor        | T2_US_Caltech  |     2      |
|      100.0       |       AMD EPYC 7551 32-Core Processor        |   T1_IT_CNAF   |     2      |
|      100.0       |       AMD EPYC 7542 32-Core Processor        | T2_US_Nebraska |     2      |
|      100.0       |       AMD EPYC 7542 32-Core Processor        |   T2_DE_DESY   |     2      |
|      100.0       |       AMD EPYC 7542 32-Core Processor        |   T1_DE_KIT    |     4      |
|      100.0       |       AMD EPYC 7443 24-Core Processor        |   T1_US_FNAL   |     14     |
|      100.0       |       AMD EPYC 7351 16-Core Processor        |   T1_DE_KIT    |     2      |
|      100.0       |       AMD EPYC 7282 16-Core Processor        | T2_US_Caltech  |     2      |
|       66.6       |       AMD EPYC 7702P 64-Core Processor       |   T2_IT_Bari   |     81     |
|       65.8       |     Intel(R) Xeon(R) CPU L5520 @ 2.27GHz     |   T2_BE_IIHE   |     41     |
|       62.1       |  Intel(R) Xeon(R) CPU E5-2650L v4 @ 1.70GHz  |   T2_BE_IIHE   |    502     |
|       51.8       |  Intel(R) Xeon(R) CPU E5-2650L v4@ 1.70GHz   |   T2_BE_IIHE   |     54     |
|       47.8       |  Intel(R) Xeon(R) CPU E5-2650L v3 @ 1.80GHz  |   T2_BE_IIHE   |    115     |
|       38.5       |     Intel(R) Xeon(R) CPU L5640 @ 2.27GHz     | T2_US_Caltech  |    651     |
|       25.9       |       AMD EPYC 7351 16-Core Processor        | T2_IT_Legnaro  |    216     |
|       19.9       |  Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz  | T2_IT_Legnaro  |    341     |
|       19.7       |       AMD EPYC 7452 32-Core Processor        |   T2_BE_IIHE   |    1107    |
|       18.9       |     Intel(R) Xeon(R) CPU E5520 @ 2.27GHz     | T2_US_Nebraska |    1596    |
|       18.8       |  Intel(R) Xeon(R) CPU E5-2650 v3 @ 2.30GHz   | T2_IT_Legnaro  |    281     |
|       17.3       |  Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz   | T2_IT_Legnaro  |    230     |
|       16.4       |     Intel(R) Xeon(R) CPU X5650 @ 2.67GHz     | T2_US_Nebraska |    9735    |
|       15.7       |  Intel(R) Xeon(R) CPU E5-2630 v3 @ 2.40GHz   |   T1_IT_CNAF   |    418     |
|       15.2       |  Intel(R) Xeon(R) CPU E5-2618L v4 @ 2.20GHz  |   T1_IT_CNAF   |    491     |
|       15.0       |  Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz   | T2_IT_Legnaro  |    179     |
|       13.6       |  Intel(R) Xeon(R) Silver 4114 CPU @ 2.20GHz  |   T2_BE_IIHE   |     95     |
|       13.0       |       AMD EPYC 7282 16-Core Processor        | T2_IT_Legnaro  |    3010    |
|       11.5       |   Intel(R) Xeon(R) Gold 5120 CPU @ 2.20GHz   | T2_IT_Legnaro  |    716     |
|       10.4       |       AMD EPYC 7313 16-Core Processor        |   T1_IT_CNAF   |    1465    |
|       8.9        |  Intel(R) Xeon(R) Silver 4116 CPU @ 2.10GHz  |   T2_BE_IIHE   |    146     |
|       7.5        |   Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz   | T2_US_Caltech  |    519     |
|       7.3        |  Intel(R) Xeon(R) CPU E5-2640 v3 @ 2.60GHz   |   T2_DE_DESY   |    941     |
|       6.9        |        AMD Opteron(tm) Processor 6378        |   T2_DE_DESY   |    810     |
|       5.9        |  Intel(R) Xeon(R) CPU E5-2450 v2 @ 2.50GHz   |   T2_DE_DESY   |    952     |
|       5.3        |        AMD Opteron(tm) Processor 6376        |   T1_US_FNAL   |    3364    |
|       5.1        |  Intel(R) Xeon(R) CPU E5-2640 v4 @ 2.40GHz   |   T2_DE_DESY   |    6743    |
+------------------+----------------------------------------------+----------------+------------+

VinInn commented 1 year ago

BTW, I've made several tests with various CMS jobs and never seen any inconsistency in the memory report. I used this (PID set by hand):

awk '/^Rss/ {rss += $2} END {print rss}' /proc/164691/smaps ; awk '/^Pss/ {pss += $2} END {print pss}' /proc/164691/smaps ; ps -v 164691

PSS can be larger than RSS only because they are read at different times: not by 3 GB! So the report from one of the jobs above is really weird (a consistent report of PSS >> RSS) and must be investigated. It may be worth asking WM to dump the whole content of smaps (and statm) when a job is about to be killed.
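
(Along those lines, a minimal sketch of what such a dump on the WM side could look like, assuming the monitor knows the cmsRun PID; the file names and layout are illustrative only:)

import time

def dump_proc_memory(pid, outdir="."):
    # copy the kernel's per-process memory accounting before the job is killed,
    # so PSS (smaps) and RSS (statm) can be compared offline
    stamp = time.strftime("%Y%m%d-%H%M%S")
    for name in ("smaps", "statm", "status"):
        try:
            with open(f"/proc/{pid}/{name}") as src:
                data = src.read()
            with open(f"{outdir}/{name}.{pid}.{stamp}", "w") as dst:
                dst.write(data)
        except OSError as err:
            print(f"could not read /proc/{pid}/{name}: {err}")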

drkovalskyi commented 1 year ago

I looked at a random failure and it seems to me that PSS vs RSS is a red herring: https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022D_JetMET_27Jun2023_230627_120337_7589/50660/DataProcessing/a9b71227-daa7-471e-935e-ac0a7b906e50-93-1-logArchive/wmagentJob.log The job used about 10GB of RSS when it reached 10GB of PSS.

Is the memory consumption in the current round of re-reco reproducible?

sextonkennedy commented 1 year ago

I was talking to Brian B. at the OSG meeting today. He suggested a mechanism that would explain the semi-reproducible nature of this problem, and why it happens more at Nebraska. At Nebraska they have set up their cgroups such that as soon as the process uses one byte over what was requested, it gets killed. I know from the past that other sites are more forgiving, and it has been possible that, if the watcher process doesn't check often enough, a process can spike up in memory use, release it, and avoid being killed. Brian told me that Nebraska does this specifically to make problems more reproducible.
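
(For what it's worth, a job can inspect the memory limit it is running under. A small sketch that checks the usual cgroup v1 and v2 locations; the paths are the common defaults, not anything Nebraska-specific:)

def cgroup_memory_limit_bytes():
    # cgroup v2 and v1 expose the memory limit in different files
    candidates = (
        "/sys/fs/cgroup/memory.max",                    # cgroup v2
        "/sys/fs/cgroup/memory/memory.limit_in_bytes",  # cgroup v1
    )
    for path in candidates:
        try:
            with open(path) as f:
                value = f.read().strip()
            return None if value == "max" else int(value)
        except OSError:
            continue
    return None

print(cgroup_memory_limit_bytes())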

sextonkennedy commented 1 year ago

I agree with Vincenzo that the RSS/PSS inconsistency can only be due to reading them at different times during the process. However, is 3 GB really so unbelievably large? How big is the DeepTau ML model? Can someone in the CORE group say how it is accessed during event processing? I second the call for a dump of smaps right before the kill; maybe send one of the failures to Nebraska so that we know it will happen?

VinInn commented 1 year ago

@drkovalskyi: is there a recipe for how to reproduce exactly that job?

drkovalskyi commented 1 year ago

Job details are at: https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022D_JetMET_27Jun2023_230627_120337_7589/50660/DataProcessing/a9b71227-daa7-471e-935e-ac0a7b906e50-93-1-logArchive/job/WMTaskSpace/cmsRun1/

@z4027163 should be able to provide more information

danielwinterbottom commented 1 year ago

From the TauPOG side we will discuss this in our meeting this afternoon to try to find a solution. In the meantime, do you know if the crashes are always/usually related to DeepTau v2p1? We have a more up-to-date version (v2p5) but were keeping v2p1 as a backup. If v2p1 is really problematic we could consider removing it, although if v2p5 has the same problems then there might not be much point.

VinInn commented 1 year ago

Job details are at: https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022D_JetMET_27Jun2023_230627_120337_7589/50660/DataProcessing/a9b71227-daa7-471e-935e-ac0a7b906e50-93-1-logArchive/job/WMTaskSpace/cmsRun1/ @z4027163 should be able to provide more information

A ready to run conf file would be useful.

davidlange6 commented 1 year ago

The PSet.py in CMSSW_12_4_14_patch1 should be ready to run (modulo input data availability).

mandrenguyen commented 1 year ago

I copied the PSet.py here for convenience: /afs/cern.ch/work/m/mnguyen//public/tauCrash/

Running it on lxplus I didn't find a particularly high RSS from the simple memory checker. Removing deepTau from miniAOD didn't seem to change the memory footprint quoted by that tool much. Based on comments above though, I guess a more granular tool is needed.
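
(For reference, the per-module printouts of that service can be switched on in the config. A sketch of the usual knobs, assuming the standard SimpleMemoryCheck service and parameter names as in recent releases; verify against the service description before relying on them:)

import FWCore.ParameterSet.Config as cms
import pickle

with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)

process.SimpleMemoryCheck = cms.Service("SimpleMemoryCheck",
    ignoreTotal = cms.untracked.int32(1),           # skip the first event when reporting increases
    moduleMemorySummary = cms.untracked.bool(True)  # print per-module memory deltas at end of job
)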

VinInn commented 1 year ago

igprof?

# csh snippet: $1 = config file base name, $2 = a label for the output files
set ver=${2}_${CMSSW_VERSION}_${SCRAM_ARCH}
rm ${1}_${ver}_mem.out
#taskset -c 2
# run the job under the igprof memory profiler
igprof -mp -z -o ${1}_${ver}_mem.gz -t cmsRunGlibC cmsRunGlibC ${1}.py > &  ${1}_${ver}_mem.out
# turn the profile into sqlite3 databases: total allocations and peak live memory
igprof-analyse --value normal -s -v -d --gdb ${1}_${ver}_mem.gz | sqlite3 ${1}_${ver}_memTot.sql3
igprof-analyse --value peak -r MEM_LIVE -s -v -d --gdb ${1}_${ver}_mem.gz | sqlite3 ${1}_${ver}_memPeak.sql3
mv ${1}_${ver}_mem*.sql3 ~/www/perfResults/data/.

davidlange6 commented 1 year ago

The issue is to also look at PSS. No one has seen a problem with RSS.

drkovalskyi commented 1 year ago

@davidlange6 my example has ~10GB RSS

@mandrenguyen what memory consumption do you see? Have you been able to process both files (/store/data/Run2022D/JetMET/RAW/v1/000/357/899/00000/553156f5-b08e-4fcf-9318-a72b73572c76.root and /store/data/Run2022D/JetMET/RAW/v1/000/357/899/00000/c02e754c-9125-4a23-a14b-114ecf4697b2.root)?

Overall I don't see any obvious issue besides us using more memory than was requested. Tier0 was able to handle the data without any issues with 8 cores and 16 GB. 4 cores with 10 GB may simply not be enough.

VinInn commented 1 year ago

I'm trying to process those files on lxplus8; now it is ok (apparently my first voms_init did not work).

VinInn commented 1 year ago

It did nothing! I modified the PSet from @mandrenguyen

cat PSet.py
import FWCore.ParameterSet.Config as cms
import pickle
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)

# drop slimmedTaus from the MiniAOD output (slimmedTausBoosted is kept)
process.MINIAODoutput.outputCommands.extend(['drop *_slimmedTaus_*_*','keep *_slimmedTausBoosted_*_*'])
process.source.fileNames = cms.untracked.vstring(
'/store/data/Run2022D/JetMET/RAW/v1/000/357/899/00000/553156f5-b08e-4fcf-9318-a72b73572c76.root'
,'/store/data/Run2022D/JetMET/RAW/v1/000/357/899/00000/c02e754c-9125-4a23-a14b-114ecf4697b2.root'
)

and it did nothing, just opened and closed the files.

kandrosov commented 1 year ago

If crashes are due to high memory consumption, could it be related to whether DeepTau is evaluated for all taus in the event at once or one tau at a time? The batch evaluation was introduced in this PR https://github.com/cms-sw/cmssw/pull/28128 to improve timing performance. The downside of this modification is that it requires approximately n_taus times more memory to store the inputs. Could this be an issue for busy events?
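
(To illustrate the scaling being described: with batched evaluation the input tensors hold all taus of the event at once, so their size grows linearly with the tau multiplicity. A toy numpy sketch with made-up dimensions, not the actual DeepTau input layout:)

import numpy as np

n_features = 10_000   # made-up size of the flattened per-tau input
n_taus = 20           # a busy event

# batched evaluation: one (n_taus, n_features) block lives in memory at once
batched = np.zeros((n_taus, n_features), dtype=np.float32)

# per-tau evaluation: a single row, reused for each tau in turn
per_tau = np.zeros((1, n_features), dtype=np.float32)

print(f"batched inputs: {batched.nbytes / 1e6:.1f} MB")
print(f"per-tau inputs: {per_tau.nbytes / 1e6:.3f} MB")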

VinInn commented 1 year ago

The @drkovalskyi example is killed just after opening the second file...

davidlange6 commented 1 year ago

Indeed, your example is not representative of the issue that this thread has discussed... Maybe there is an example of one with divergent RSS and PSS?

2023-06-30 06:05:59,681:INFO:PerformanceMonitor:PSS: 10208008; RSS: 10214928; PCPU: 266; PMEM: 1.9

Otherwise, you raise a good point - why 4 cores?