cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/

Failures in Run 3 data reprocessing #40437

Open kskovpen opened 1 year ago

kskovpen commented 1 year ago

We are seeing failures in the ongoing Run 3 data reprocessing, presumably related to the DeepTau implementation. Here is just one example of the failure: https://cms-unified.web.cern.ch/cms-unified/report/haozturk_ACDC0_Run2022D_BTagMu_10Dec2022_221221_171338_6693

The crash message is:

Exception Message: invalid prediction = nan for tau_index = 0, pred_index = 0


mandrenguyen commented 1 year ago

@davidlange6 my example has ~10GB RSS

@mandrenguyen what memory consumption do you see? Have you been able to process both files (/store/data/Run2022D/JetMET/RAW/v1/000/357/899/00000/553156f5-b08e-4fcf-9318-a72b73572c76.root and /store/data/Run2022D/JetMET/RAW/v1/000/357/899/00000/c02e754c-9125-4a23-a14b-114ecf4697b2.root)?

Overall I don't see any obvious issue besides us using more memory than was requested. Tier0 was able to handle the data without any issues with 8 cores and 16 GB; 4 cores with 10 GB may simply not be enough.

Ah, I neglected that this was a 4-core job. Here's what I saw from the memory report:

MemoryReport> Peak virtual size 11058.1 Mbytes
Key events increasing vsize:
[2] run: 355381 lumi: 163 event: 248769791 vsize = 10786.1 deltaVsize = 0.0234375 rss = 5850.15 delta = 102.191
[9] run: 355381 lumi: 163 event: 247962262 vsize = 10898.1 deltaVsize = 112 rss = 5984 delta = 133.844
[46] run: 355381 lumi: 163 event: 248157109 vsize = 11058.1 deltaVsize = 160 rss = 5596.11 delta = -387.883
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[0] run: 0 lumi: 0 event: 0 vsize = 0 deltaVsize = 0 rss = 0 delta = 0
[48] run: 355381 lumi: 163 event: 248125758 vsize = 11058.1 deltaVsize = 0 rss = 5615.21 delta = 19.0977
[47] run: 355381 lumi: 163 event: 248187497 vsize = 11058.1 deltaVsize = 0 rss = 5603.02 delta = 6.90625
[46] run: 355381 lumi: 163 event: 248157109 vsize = 11058.1 deltaVsize = 160 rss = 5596.11 delta = -374.398

davidlange6 commented 1 year ago

It's a kill due to excess memory usage: https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022D_JetMET_27Jun2023_230627_120337_7589/50660/DataProcessing/a9b71227-daa7-471e-935e-ac0a7b906e50-93-1-logArchive/wmagentJob.log

mandrenguyen commented 1 year ago

It did nothing! I modified the PSet from @mandrenguyen

cat PSet.py
import FWCore.ParameterSet.Config as cms
import pickle

# Load the pickled process configuration from the production job
with open('PSet.pkl', 'rb') as handle:
    process = pickle.load(handle)

# Added by hand to test removing DeepTau (see below); comment out to restore the default behavior
process.MINIAODoutput.outputCommands.extend(['drop *_slimmedTaus_*_*','keep *_slimmedTausBoosted_*_*'])

# Run over the two input RAW files
process.source.fileNames = cms.untracked.vstring(
    '/store/data/Run2022D/JetMET/RAW/v1/000/357/899/00000/553156f5-b08e-4fcf-9318-a72b73572c76.root',
    '/store/data/Run2022D/JetMET/RAW/v1/000/357/899/00000/c02e754c-9125-4a23-a14b-114ecf4697b2.root'
)

and it did nothing, just opened and closed the files.

Not sure why it's not running, but now I recall that I added by hand process.MINIAODoutput.outputCommands.extend(['drop *_slimmedTaus_*_*','keep *_slimmedTausBoosted_*_*']) to test removing DeepTau. Comment out that line to restore the default behavior.

davidlange6 commented 1 year ago

https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022D_JetMET_27Jun2023_230627_120337_7589/50660/DataProcessing/a9b71227-daa7-471e-935e-ac0a7b906e50-93-1-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log

reports RSS up to about 8.7GB. (definitely lower than the computing infrastructure sees, but not 5.6GB either)

VinInn commented 1 year ago

@mandrenguyen I did comment it out and it is still not processing events. Where did you take PSet.pkl from?

VinInn commented 1 year ago

ah need to change

lumisToProcess = cms.untracked.VLuminosityBlockRange("355381:163-355381:163"),

as well

mandrenguyen commented 1 year ago

ah need to change

lumisToProcess = cms.untracked.VLuminosityBlockRange("355381:163-355381:163"),

as well

Right, I just copied it directly from one of the examples above. Apologies, I was trying to be useful and just made a mess. The best is to just copy it directly for each event.

VinInn commented 1 year ago

OK, I just removed the lumi limits. It is now running; it started with just 6.6 GB, fully consistent PSS, RSS, etc.

VinInn commented 1 year ago

I mean: it starts at 6.6 GB and gets killed at 10 GB. I would call it a memory leak.

drkovalskyi commented 1 year ago

https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022D_JetMET_27Jun2023_230627_120337_7589/50660/DataProcessing/a9b71227-daa7-471e-935e-ac0a7b906e50-93-1-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log

reports RSS up to about 8.7GB. (definitely lower than the computing infrastructure sees, but not 5.6GB either)

in https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022D_JetMET_27Jun2023_230627_120337_7589/50660/DataProcessing/a9b71227-daa7-471e-935e-ac0a7b906e50-93-1-logArchive/wmagentJob.log

I see:

2023-06-30 06:05:59,681:INFO:PerformanceMonitor:PSS: 10208008; RSS: 10214928; PCPU: 266; PMEM: 1.9

So according to the WMAgent log the job reached 10 GB RSS. Do you consider the RSS reported in the cmsRun output more reliable?

Dr15Jones commented 1 year ago

cmsRun only reads RSS after a module completes its work. When running multi-threaded, that can be semi-random. Therefore I'd say cmsRun is a bit biased so I'd be more interested in the periodic checks done by WMAgent instead.

valsdav commented 1 year ago

Running the MEM_TOTAL igprof report on the example from @mandrenguyen, I see this about TensorFlow sessions: https://dvalsecc.web.cern.ch/cgi-bin/igprof-navigator/DeepTau/memTot/47



Largest single allocation, MEM_MAX https://dvalsecc.web.cern.ch/cgi-bin/igprof-navigator/DeepTau/memMAX/674



Given these profiles, DeepTau doesn't seem to have a large memory footprint. Adding debug info to the job, I saw that DeepTau is evaluated in only a few events (usually for at most 2 candidates). Am I missing something trivial?

slava77 commented 1 year ago

Is there a MEM_LIVE variant? MEM_TOTAL is the sum of allocations without subtracting deallocations; MEM_MAX is the largest single allocation by a function.

valsdav commented 1 year ago

and this is the MEM_LIVE report: https://dvalsecc.web.cern.ch/cgi-bin/igprof-navigator/DeepTau/memPeak/575



These are the reports for the tensorflow session execution part. Then I think I have also found the function that allocates the buffers for the TF outputs and operations.

danielwinterbottom commented 1 year ago

Do the above checks indicate that deepTau is not causing the memory issues? The reason I ask is that when we looked into this previously we determined that the exception message 'invalid prediction = nan for tau_index = 0, pred_index = 0' occurs when there is excessive memory usage, causing TensorFlow to return NaN values, but we never confirmed that the deepTau modules were responsible for the excessive memory usage in the first place. Is it possible that some other module(s) are responsible and the exception message is a red herring, as the deepTau module tends to be the first one to crash when there are memory issues? Is it possible to check the memory usage of all other modules to confirm this?

valsdav commented 1 year ago

I ran igprof on the MiniAOD job linked above, including all the modules. I'm now trying to separate the memory allocations for the TF graph of each producer. I will post an update ASAP.


Here are the commands I used to produce the sqlite profiles (for experts to verify):

igprof -mp -z -o mem.gz -t cmsRunGlibC cmsRunGlibC PSet.py >  mem.out
igprof-analyse --value peak -r MEM_LIVE -s -v -d --gdb mem.gz | sqlite3 memPeak.sql3
igprof-analyse -r MEM_TOTAL -s -v -d --gdb mem.gz | sqlite3 memTOTAL.sql3
igprof-analyse -r MEM_MAX -s -v -d --gdb mem.gz | sqlite3 memMAX.sql3

valsdav commented 1 year ago

Adding the MEM_TOTAL profile for all the producers


valsdav commented 1 year ago

Do the above checks indicate that deepTau is not causing the memory issues? The reason I ask is that when we looked into this previously we determined that the exception message 'invalid prediction = nan for tau_index = 0, pred_index = 0' occurs when there is excessive memory usage, causing TensorFlow to return NaN values, but we never confirmed that the deepTau modules were responsible for the excessive memory usage in the first place. Is it possible that some other module(s) are responsible and the exception message is a red herring, as the deepTau module tends to be the first one to crash when there are memory issues? Is it possible to check the memory usage of all other modules to confirm this?

If we follow this reasoning, and if we make the hypothesis that DeepMET has a large memory footprint, we may have an explanation for it. In the DeepMET producer in 12_4_X the TF session is created in the constructor of each stream instance, i.e. once per thread (see here). This increases the memory consumption, since the TF graph (in the global cache) needs to be copied into each session.

We solved this issue in December 2022 in https://github.com/cms-sw/cmssw/pull/40284, but that change only entered 13_X.

I'm backporting those changes to test whether they reduce the memory pressure. It may not be the solution to the problem, but it could alleviate the issue.

slava77 commented 1 year ago

and this is the MEM_LIVE report: https://dvalsecc.web.cern.ch/cgi-bin/igprof-navigator/DeepTau/memPeak/575


These are the reports for the tensorflow session execution part. Then I think I have also found the function that allocates the buffers for the TF outputs and operations.

* MEM_TOTAL: https://dvalsecc.web.cern.ch/cgi-bin/igprof-navigator/DeepTau/memTot/46
  ![image](https://user-images.githubusercontent.com/1934093/252874600-2342391a-3330-44b6-869f-747702c715c9.png)

* MEM_LIVE: https://dvalsecc.web.cern.ch/cgi-bin/igprof-navigator/DeepTau/memPeak/554
  ![image](https://user-images.githubusercontent.com/1934093/252874561-c57ad999-ab54-47fa-ba59-75a7d270b5ac.png)

It's unclear where this report is collected; from the summary it looks like the end of the job. The customisation from Validation.Performance.IgProfInfo import customise is needed, as can be seen from https://twiki.cern.ch/twiki/bin/viewauth/CMS/RecoIntegration#Run_profiler_igprof, and we need to look at the end of the event, if triggering a dump during TF calls is harder.

Another thought: if single large allocations are possibly to blame, perhaps MEM_LIVE_PEAK is the best option (https://igprof.org/analysis.html), but I can't recall ever using it; it is not obvious that it is actually functional.

About NaNs in DeepTau, I seem to recall that it happens on SSE-only machines. What is used here? TF executes different code depending on the architecture.

valsdav commented 1 year ago

Hi @slava77 indeed this is collected at the end of the job. I'm rerunning now with a dump every event.

I agree that MEM_LIVE_PEAK seems to be what we want to analyze, but I get an empty report from it if I try to run: igprof-analyse -r MEM_LIVE_PEAK -s -v -d --gdb mem_355381_163.gz > mem_LIVE_PEAK_355381_163.txt

slava77 commented 1 year ago

I agree that MEM_LIVE_PEAK seems to be what we want to analyze, but I get an empty report from it if I try to run:

indeed, it's not in the igprof code https://github.com/search?q=repo%3Acms-externals%2Figprof%20MEM_LIVE_PEAK&type=code

yongbinfeng commented 1 year ago

To make things more confusing: we have selected a few events in the ZeroBias dataset which triggered Exception Message: invalid prediction = nan for DeepTauId, took the same CMSSW version and config file, and reran the workflow.

Some of them even have a tau collection size of ZERO, so in principle the inference should not be run at all:

https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_14_patch1/RecoTauTag/RecoTau/plugins/DeepTauId.cc#L1241

but somehow, from these logs, it seems the inference was running and producing NaN values...

Example events are Event run: 360019 lumi: 401 event: 818626714 from here and Event run: 360116 lumi: 112 event: 146864354 from here

(The first log also shows it is not just deepTau2017v2p1; deepTau2018v2p5 is also producing NaN.)

VinInn commented 1 year ago

I ran valgrind on 100 events of the file /store/data/Run2022D/JetMET/RAW/v1/000/357/899/00000/553156f5-b08e-4fcf-9318-a72b73572c76.root

The log file is on lxplus803 in /tmp/innocent/tauCrash/val.log and in /afs/cern.ch/user/i/innocent/public/val.log.xz

Some info:

==1366973== HEAP SUMMARY:
==1366973==     in use at exit: 1,183,553,439 bytes in 2,844,156 blocks
==1366973==   total heap usage: 2,278,256,235 allocs, 2,275,412,084 frees, 415,964,964,962 bytes allocated
.....
==1366973== LEAK SUMMARY:
==1366973==    definitely lost: 49,312,741 bytes in 350,998 blocks
==1366973==    indirectly lost: 435,535,371 bytes in 1,533,442 blocks
==1366973==      possibly lost: 95,928,756 bytes in 100,332 blocks
==1366973==    still reachable: 586,860,032 bytes in 728,642 blocks
==1366973==                       of which reachable via heuristic:
==1366973==                         newarray           : 1,409,984 bytes in 4,695 blocks
==1366973==                         multipleinheritance: 1,782,944 bytes in 22,254 blocks
==1366973==         suppressed: 15,916,539 bytes in 130,742 blocks
==1366973== Reachable blocks (those to which a pointer was found) are not shown.

Plenty of leaks from TensorFlow, ONNX, and ROOT.

There are things from DQM like:

==1366973== 8,992,190 (679,968 direct, 8,312,222 indirect) bytes in 4,722 blocks are definitely lost in loss record 253,176 of 253,186
==1366973==    at 0x403AF01: operator new(unsigned long) (in /cvmfs/cms.cern.ch/el8_amd64_gcc10/external/valgrind/3.17.0-12ba960b086b024cce391ad228855798/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1366973==    by 0x106E33C1: dqm::implementation::IBooker::bookME(TString const&, MonitorElementData::Kind, std::function<TH1* ()>, bool) (in /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_14/lib/el8_amd64_gcc
10/libDQMServicesCore.so)
==1366973==    by 0x8546EED1: HistogramManager::book(dqm::implementation::IBooker&, edm::EventSetup const&) (in /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_14/lib/el8_amd64_gcc10/libDQMSiPixelPhase1Common.so
)
==1366973==    by 0x8547B7D4: SiPixelPhase1Base::bookHistograms(dqm::implementation::IBooker&, edm::Run const&, edm::EventSetup const&) (in /cvmfs/cms.cern.ch/el8_amd64_gcc10/cms/cmssw/CMSSW_12_4_14/lib/el8_amd64_gcc10/
libDQMSiPixelPhase1Common.so)

and then from the EventSetup and even the OutputModule.

VinInn commented 1 year ago

To make things more confusing: we have selected a few events in the ZeroBias dataset which triggered Exception Message: invalid prediction = nan for DeepTauId; some of them even have a tau collection size of ZERO, so in principle the inference should not be run at all: https://github.com/cms-sw/cmssw/blob/CMSSW_12_4_14_patch1/RecoTauTag/RecoTau/plugins/DeepTauId.cc#L1241 but somehow it seems the inference is running and producing NaN values. Example events are Event run: 360019 lumi: 401 event: 818626714 from here and Event run: 360116 lumi: 112 event: 146864354 from here

run in gdb... (add cout...)

VinInn commented 1 year ago

How can I reproduce it? (How many events should one skip?)

yongbinfeng commented 1 year ago

How can I reproduce it? (How many events should one skip?)

I am directly running on that specific single event after edmCopyPickMerge and checking the tau collection size of the event.
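
(For reference, a minimal FWLite sketch of such a size check; this is not what was actually run, the file name is a placeholder, the slimmedTaus label is an assumption, and it requires a file in MiniAOD format rather than the RAW picked events:)

import ROOT
from DataFormats.FWLite import Events, Handle

# Placeholder file name; the actual picked-event files are listed below
events = Events('pickevents.root')
tauHandle = Handle('std::vector<pat::Tau>')

for ievent, event in enumerate(events):
    aux = event.eventAuxiliary()
    # 'slimmedTaus' is an assumption; use the label actually consumed by DeepTauId
    event.getByLabel('slimmedTaus', tauHandle)
    print('run %d lumi %d event %d : nTaus = %d' % (
        aux.run(), aux.luminosityBlock(), aux.event(), tauHandle.product().size()))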

VinInn commented 1 year ago

can you share the file?

yongbinfeng commented 1 year ago

/afs/cern.ch/work/y/yofeng/public/tests/taucrash2/CMSSW_12_4_14_patch1/src/test7/pickevents_*.root

and the pset config file is in the same directory.

(Sorry, I was not saying I can reproduce the problem. I was saying the tau collection seems to be empty, or at least what showed up in the tau collection in the DeepTauId producer is empty, so I was really confused how the inference could run and produce NaNs during the production...)

VinInn commented 1 year ago

(Sorry, I was not saying I can reproduce the problem. I was saying the tau collection seems to be empty, or at least what showed up in the tau collection in the DeepTauId producer is empty, so I was really confused how the inference could run and produce NaNs during the production...)

Memory corruption of some sort. The size may be "negative", for instance.

v.

VinInn commented 1 year ago

in any case one can run with

process.source = cms.Source("PoolSource",
    fileNames = cms.untracked.vstring(
        '/store/data/Run2022E/ZeroBias/RAW/v1/000/360/019/00000/7facb45d-b9d0-4b36-86bd-51a5571d61d9.root'
    ),
    lumisToProcess = cms.untracked.VLuminosityBlockRange("360019:401-360019:401"),
    secondaryFileNames = cms.untracked.vstring()
)

and the event in question is one of the first to be processed. And of course it will not crash on you...
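
(To narrow the source above down to the single suspect event rather than the whole lumisection, one could presumably also restrict eventsToProcess; a sketch, untested here:)

# Sketch: process only run 360019, event 818626714
process.source.eventsToProcess = cms.untracked.VEventRange('360019:818626714')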

I personally do not believe in cosmic-ray-induced memory errors, still...

VinInn commented 1 year ago

Anyhow, I suggest bailing out immediately if tau->size == 0 in the DeepTau producer... (not that it will save us from memory corruption, but it still will not invoke the TensorFlow init for each event!)

VinInn commented 1 year ago

I'm now running the job pointed out by @drkovalskyi https://cms-unified.web.cern.ch/cms-unified/joblogs/pdmvserv_Run2022D_JetMET_27Jun2023_230627_120337_7589/50660/DataProcessing/a9b71227-daa7-471e-935e-ac0a7b906e50-93-1-logArchive/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log

I'm now around event 201853029, and on lxplus8 (an old Broadwell, btw) I read VSIZE 13201.7 0 RSS 5949.8, while in the log above it is VSIZE 14227.7 0 RSS 7896.56.

They already started out different by 1 GB or so...

VinInn commented 1 year ago

on lxplus top reports

1892343 innocent  20   0   13.1g   5.3g  72040 R 396.0  18.6 824:33.14  3   2.7g   12.0g 144k    0 -          cmsRun

If some nodes are configured with ZERO swap, I cannot exclude that the 2.7g shown in swap is kept in memory...

VinInn commented 1 year ago

Can we also check if there is any correlation with running el8 on el7?

VinInn commented 1 year ago

The two NaNs reported above are both on X5650 (Westmere, SSE-only). We do not run tests or relvals on those.

VinInn commented 1 year ago

Back to @drkovalskyi's job: it starts to open the second file:

%MSG
30-Jun-2023 05:42:41 CEST  Closed file root://xrootd-cms.infn.it//store/data/Run2022D/JetMET/RAW/v1/000/357/899/00000/553156f5-b08e-4fcf-9318-a72b73572c76.root
30-Jun-2023 05:42:41 CEST  Initiating request to open file root://cmsxrootd-kit.gridka.de:1094//store/data/Run2022D/JetMET/RAW/v1/000/357/899/00000/c02e754c-9125-4a23-a14b-114ecf4697b2.root
%MSG-w XrdAdaptorInternal:  file_open 30-Jun-2023 05:42:42 CEST  Run: 357899 Event: 202675427
Failed to open file at URL root://cmsxrootd-kit.gridka.de:1094//store/data/Run2022D/JetMET/RAW/v1/000/357/899/00000/c02e754c-9125-4a23-a14b-114ecf4697b2.root?xrdcl.requuid=43e755d1-185e-4922-92a5-70f7811c71bb.
%MSG
%MSG-w XrdAdaptorInternal:  file_open 30-Jun-2023 05:42:42 CEST  Run: 357899 Event: 202675427
Failed to open file at URL root://cmsxrootd-kit.gridka.de:1094//store/data/Run2022D/JetMET/RAW/v1/000/357/899/00000/c02e754c-9125-4a23-a14b-114ecf4697b2.root?tried=&xrdcl.requuid=4b89a586-aefd-490a-9080-46241e8a97fb.
%MSG
30-Jun-2023 05:42:42 CEST  Initiating request to open file root://xrootd-cms.infn.it//store/data/Run2022D/JetMET/RAW/v1/000/357/899/00000/c02e754c-9125-4a23-a14b-114ecf4697b2.root
%MSG-w XrdAdaptor:  file_open 30-Jun-2023 05:42:44 CEST  Run: 357899 Event: 202675427
Data is served from lnl.infn.it instead of original site T2_FR_GRIF_LLR
%MSG
%MSG-w XrdAdaptor:  (NoModuleName) 30-Jun-2023 05:43:01 CEST pre-events
Data is now served from lnl.infn.it, fnal.gov instead of previous lnl.infn.it
%MSG
%MSG-w XrdAdaptor:  file_open 30-Jun-2023 05:43:02 CEST  Run: 357899 Event: 202675427
Data is now served from fnal.gov instead of previous lnl.infn.it, fnal.gov
%MSG
30-Jun-2023 05:43:35 CEST  Successfully opened file root://xrootd-cms.infn.it//store/data/Run2022D/JetMET/RAW/v1/000/357/899/00000/c02e754c-9125-4a23-a14b-114ecf4697b2.root
[2023-06-30 05:47:58.011953 +0200][Error  ][PostMaster        ][  596] [t2-xrdcms.lnl.infn.it:7070] Forcing error on disconnect: [ERROR] Operation interrupted.
[2023-06-30 05:48:14.006501 +0200][Error  ][PostMaster        ][  596] [cmssrv608.fnal.gov:1093] Forcing error on disconnect: [ERROR] Operation interrupted.
[2023-06-30 05:51:09.066730 +0200][Error  ][PostMaster        ][  596] [cmsxrootd-site3.fnal.gov:1093] Forcing error on disconnect: [ERROR] Operation interrupted.
[2023-06-30 05:53:11.009284 +0200][Error  ][PostMaster        ][  596] [cmsxrootd-site2.fnal.gov:1093] Forcing error on disconnect: [ERROR] Operation interrupted.
[2023-06-30 05:59:19.001600 +0200][Error  ][PostMaster        ][  596] [cmsxrootd-site2.fnal.gov:1093] Forcing error on disconnect: [ERROR] Operation interrupted.
[2023-06-30 06:01:20.016348 +0200][Error  ][PostMaster        ][  596] [cmsxrootd-site1.fnal.gov:1093] Forcing error on disconnect: [ERROR] Operation interrupted.
[2023-06-30 06:02:57.004479 +0200][Error  ][PostMaster        ][  596] [cmsxrootd-kit.gridka.de:1094] Forcing error on disconnect: [ERROR] Operation interrupted.
[2023-06-30 06:02:59.001902 +0200][Error  ][PostMaster        ][  596] [t2-xrdcms.lnl.infn.it:1094] Forcing error on disconnect: [ERROR] Operation interrupted.
[2023-06-30 06:03:12.001522 +0200][Error  ][PostMaster        ][  596] [cmssrv608.fnal.gov:1094] Forcing error on disconnect: [ERROR] Operation interrupted.
[2023-06-30 06:03:17.002084 +0200][Error  ][PostMaster        ][  596] [xrootd.ba.infn.it:1094] Forcing error on disconnect: [ERROR] Operation interrupted.
[2023-06-30 06:03:17.002140 +0200][Error  ][PostMaster        ][  596] [xrootd.ba.infn.it:1194] Forcing error on disconnect: [ERROR] Operation interrupted.
[2023-06-30 06:03:24.022258 +0200][Error  ][PostMaster        ][  596] [cmsxrootd-site3.fnal.gov:1093] Forcing error on disconnect: [ERROR] Operation interrupted.
[2023-06-30 06:03:51.003394 +0200][Error  ][PostMaster        ][  596] [xrootd-redic.pi.infn.it:1094] Forcing error on disconnect: [ERROR] Operation interrupted.
[2023-06-30 06:03:51.003456 +0200][Error  ][PostMaster        ][  596] [xrootd-redic.pi.infn.it:1194] Forcing error on disconnect: [ERROR] Operation interrupted.

A fatal system signal has occurred: external termination request
The following is the call stack containing the origin of the signal.

In the meantime, RSS grows:

2023-06-30 05:45:59,289:INFO:PerformanceMonitor:PSS: 8957639; RSS: 9113572; PCPU: 279; PMEM: 1.7
2023-06-30 05:50:59,397:INFO:PerformanceMonitor:PSS: 9293479; RSS: 9449436; PCPU: 275; PMEM: 1.7
2023-06-30 05:55:59,494:INFO:PerformanceMonitor:PSS: 9479881; RSS: 9486300; PCPU: 272; PMEM: 1.7
2023-06-30 06:00:59,588:INFO:PerformanceMonitor:PSS: 9875799; RSS: 9882528; PCPU: 269; PMEM: 1.8
2023-06-30 06:05:59,681:INFO:PerformanceMonitor:PSS: 10208008; RSS: 10214928; PCPU: 266; PMEM: 1.9
2023-06-30 06:05:59,683:ERROR:PerformanceMonitor:Error in CMSSW step cmsRun1
Number of Cores: 4
Job has exceeded maxPSS: 10000 MB
Job has PSS: 10208 MB

2023-06-30 06:05:59,696:ERROR:PerformanceMonitor:Attempting to kill step using SIGUSR2
2023-06-30 06:10:59,790:INFO:PerformanceMonitor:PSS: 10592602; RSS: 10599420; PCPU: 263; PMEM: 2.0
2023-06-30 06:10:59,790:ERROR:PerformanceMonitor:Attempting to kill step using SIGTERM
2023-06-30 06:11:11,999:CRITICAL:CMSSW:Error running cmsRun
{'arguments': ['/bin/bash', '/srv/job/WMTaskSpace/cmsRun1/cmsRun1-main.sh', '', 'el8_amd64_gcc10', 'scramv1', 'CMSSW', 'CMSSW_12_4_14_patch1', 'FrameworkJobReport.xml', 'cmsRun', 'PSet.py', '']}
Linux Return code: 143

any explanation?
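
(As a side note, the growth visible in those PerformanceMonitor lines can be quantified with a short script; a sketch that only assumes the wmagentJob.log format quoted above:)

import re
from datetime import datetime

# Extract (timestamp, PSS in GB) pairs from lines like
# "2023-06-30 06:05:59,681:INFO:PerformanceMonitor:PSS: 10208008; RSS: 10214928; ..."
pattern = re.compile(r'^([\d-]+ [\d:]+),\d+:INFO:PerformanceMonitor:PSS: (\d+);')

samples = []
with open('wmagentJob.log') as logfile:
    for line in logfile:
        match = pattern.match(line)
        if match:
            stamp = datetime.strptime(match.group(1), '%Y-%m-%d %H:%M:%S')
            samples.append((stamp, int(match.group(2)) / 1024. / 1024.))  # kB -> GB

if len(samples) > 1:
    hours = (samples[-1][0] - samples[0][0]).total_seconds() / 3600.
    growth = samples[-1][1] - samples[0][1]
    print('PSS grew by %.2f GB over %.2f h (%.2f GB/h)' % (growth, hours, growth / hours))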

slava77 commented 1 year ago

BTW, in #33442 I observed a 50x slowdown in TF inference (with some operations 1000x slower) on SSE-only machines. I'm curious how bad it is here. The slowdown likely also means that a crash in some thread can much more frequently have TF running in another thread.

VinInn commented 1 year ago

This is in contrast to what happened on lxplus8:

%MSG-w MemoryCheck:  PATPackedCandidateProducer:packedPFCandidates  13-Jul-2023 14:25:40 CEST Run: 357899 Event: 202444811
MemoryCheck: module PATPackedCandidateProducer:packedPFCandidates VSIZE 13457.7 0 RSS 5933.24 0.257812
--
%MSG
13-Jul-2023 15:16:49 CEST  Closed file root://xrootd-cms.infn.it//store/data/Run2022D/JetMET/RAW/v1/000/357/899/00000/553156f5-b08e-4fcf-9318-a72b73572c76.root
--
%MSG-w MemoryCheck:  source 13-Jul-2023 15:17:01 CEST PreSource
MemoryCheck: module source:source VSIZE 13457.8 0.0859375 RSS 5478.31 -454.934
--
%MSG-w MemoryCheck:  source 13-Jul-2023 15:17:01 CEST PreSource
MemoryCheck: module source:source VSIZE 13457.8 0 RSS 5478.74 0.429688
--
%MSG-w MemoryCheck:  SiStripFEDCheckPlugin:siStripFEDCheck  13-Jul-2023 15:17:01 CEST Run: 357899 Event: 203313754
MemoryCheck: module SiStripFEDCheckPlugin:siStripFEDCheck VSIZE 13457.8 0 RSS 5481.58 2.84375

In the meanwhile it reached VSIZE 14097.8 0 RSS 5807.15, and top shows:

1892343 innocent  20   0   13.8g   5.5g 291460 R 398.7  19.3 1012:52  3   2.4g   12.6g 199k    0 -          cmsRun

VinInn commented 1 year ago

I'm now running "on Metal" on a machine with an enormous amount of memory. The job just started and:

VSIZE 10905.5 0 RSS 7460.48
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND SWAP    DATA
3534859 innocent  20   0   11.2g   7.2g 731588 R 397.7   0.7  26:09.56 cmsRun 0   10.0g

So if VSS becomes as large as 15 GB and the machine CAN keep a large fraction of it resident, it will (and we kill the job).

VinInn commented 1 year ago

uhmmm

[innocent@lxplus803 tauCrash]$ cat /proc/sys/vm/swappiness
30
[innocent@olspx-01 tauCrash]$ cat /proc/sys/vm/swappiness
10
[innocent@patatrack02 ~]$ cat /proc/sys/vm/swappiness
60

60 seems the default (according to https://www.howtogeek.com/449691/what-is-swapiness-on-linux-and-how-to-change-it/ )

VinInn commented 1 year ago

I think we should change the PSS memory limit to a fraction (99%) of the memory available per core, times the number of cores. In this case:

[innocent@lxplus803 tauCrash]$ grep MemTotal /proc/meminfo
MemTotal:       29938092 kB
[innocent@lxplus803 tauCrash]$ lscpu | egrep '^CPU\(s\)'
CPU(s):              10

is 3*4 = 12 GB

and on this other machine:

[innocent@olspx-01 tauCrash]$ grep MemTotal /proc/meminfo
MemTotal:       1056224784 kB
[innocent@olspx-01 tauCrash]$ lscpu | egrep '^CPU\(s\)'
CPU(s):              224

4.7*4 = 18.8 GB
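
(A sketch of that proposal in Python; the 99% fraction and the 4-core job size come from the message above, and os.cpu_count() stands in for lscpu:)

import os

def suggested_pss_limit_gb(ncores=4, fraction=0.99):
    # MemTotal is reported in kB in /proc/meminfo, as in the grep above
    with open('/proc/meminfo') as meminfo:
        for line in meminfo:
            if line.startswith('MemTotal:'):
                memtotal_kb = int(line.split()[1])
                break
    per_core_gb = memtotal_kb / 1024. / 1024. / os.cpu_count()
    return fraction * per_core_gb * ncores

print('suggested PSS limit: %.1f GB' % suggested_pss_limit_gb())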

sextonkennedy commented 1 year ago

@valsdav Now that we know that DeepMET is taking the most memory by far, I'd like to ask: how is the backport of #40284 that you mentioned going? Dima reported that, due to an error, much of the work already done on this workflow has to be redone.

VinInn commented 1 year ago

I ran over the two files for a total of 16262 events:

[innocent@olspx-01 tauCrash]$ grep 'RSS' fullJob.log | tail
MemoryCheck: module EcalUncalibRecHitProducer:ecalMultiFitUncalibRecHit@cpu VSIZE 13409.8 0 RSS 9396.96 0.257812
MemoryCheck: module PFRecHitProducer:particleFlowRecHitHBHE VSIZE 13409.8 0 RSS 9397.22 0.257812
MemoryCheck: module JetCoreClusterSplitter:siPixelClusters VSIZE 13409.8 0 RSS 9397.48 0.257812
MemoryCheck: module MeasurementTrackerEventProducer:MeasurementTrackerEvent VSIZE 13409.8 0 RSS 9397.73 0.257812
MemoryCheck: module MkFitEventOfHitsProducer:mkFitEventOfHits VSIZE 13409.8 0 RSS 9397.99 0.257812
MemoryCheck: module HitPairEDProducer:initialStepHitDoublets VSIZE 13409.8 0 RSS 9400.25 2.25781
MemoryCheck: module PFMultiDepthClusterProducer:particleFlowClusterHCAL VSIZE 13409.8 0 RSS 9400.5 0.253906
MemoryCheck: module MkFitProducer:initialStepTrackCandidatesMkFit VSIZE 13409.8 0 RSS 9410 9.49609
MemoryCheck: module MkFitOutputConverter:initialStepTrackCandidates VSIZE 13409.8 0 RSS 9412.58 2.57812
MemoryCheck: module PFClusterProducer:particleFlowClusterHF VSIZE 13409.8 0 RSS 9416.44 3.86328

Definitely, 10 GB of PSS is too small to be safe for a 4-thread reco job.

Dr15Jones commented 1 year ago

@VinInn would it be possible for you to pull the RSS info from the log and try to graph the change of RSS over time within the job based on the log info? That might help pinpoint if there is a sudden jump somewhere.

VinInn commented 1 year ago

On lxplus I even have RSS, PSS, and VSS every 30 seconds. I can run it on the big machine as well (on lxplus it aggressively moves memory to swap).

This is the result of grep 'RSS' fullJob.log | awk '{print $8 }' | tr '\n' ',' > RSS.csv, so the x-axis is just the printout occurrence:

[plot of RSS vs. printout occurrence]

Anyhow, no big jump (except one?). I think it is just memory fragmentation, and we keep allocating contiguous memory elsewhere...
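
(A Python equivalent of the grep/awk extraction above, which also makes the plot; a sketch that assumes the MemoryCheck printout format shown earlier in this thread, with RSS in MB as the 8th whitespace-separated field:)

import matplotlib
matplotlib.use('Agg')  # no display needed on lxplus
import matplotlib.pyplot as plt

# Same selection as: grep 'RSS' fullJob.log | awk '{print $8}'
rss = []
with open('fullJob.log') as logfile:
    for line in logfile:
        fields = line.split()
        if line.startswith('MemoryCheck: module') and len(fields) >= 8:
            rss.append(float(fields[7]))  # RSS in MB

plt.plot(rss)
plt.xlabel('MemoryCheck printout occurrence')
plt.ylabel('RSS [MB]')
plt.savefig('rss_vs_occurrence.png')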

Dr15Jones commented 1 year ago

@VinInn Thanks for the plot! That looks really, really bad. I normally would expect the plot to look like what is shown between 0 and 8000 (I assume x axis is seconds?). Is there anything in the log pointing to what happened around the 8000 mark?

VinInn commented 1 year ago

No, x is just the printout occurrence of the MemoryCheck.

VinInn commented 1 year ago

Here is a big jump:

%MSG-w MemoryCheck:  CkfTrackCandidateMaker:tobTecStepTrackCandidates 14-Jul-2023 11:07:06 CEST Run: 357899 Event: 201371551
MemoryCheck: module CkfTrackCandidateMaker:tobTecStepTrackCandidates VSIZE 12505.8 0 RSS 8363.38 0.105469
%MSG
Begin processing the 1803rd record. Run 357899, Event 201986242, LumiSection 149 on stream 3 at 14-Jul-2023 11:07:06.042 CEST

%MSG
%MSG-w MemoryCheck:  PoolOutputModule:MINIAODoutput 14-Jul-2023 11:07:15 CEST Run: 357899 Event: 202535253
MemoryCheck: module PoolOutputModule:MINIAODoutput VSIZE 12921.8 416 RSS 8480.61 117.227
%MSG
%MSG-w MemoryCheck:  EndPathStatusInserter:MINIAODoutput_step 14-Jul-2023 11:07:15 CEST Run: 357899 Event: 202535253
MemoryCheck: module EndPathStatusInserter:MINIAODoutput_step VSIZE 12921.8 0 RSS 8480.73 0.125
%MSG

VinInn commented 1 year ago

There are other occurrences of the OutputModule with no jump.

Dr15Jones commented 1 year ago

There are other occurrences of the OutputModule with no jump.

Unfortunately, the module associated with a memory check log message isn't a very reliable indicator when using multiple threads. The log message just says which module finished but the increase in memory could be caused by another module running concurrently with the ending of the reporting module.