cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/

Memory consumption too high in PromptReco on EGamma1, ParkingDoubleElectronLowMass, and JetMET0 during replay with CMSSW 13_0_4 #41457

Closed (malbouis closed this issue 1 year ago)

malbouis commented 1 year ago

During a replay with the new CMSSW release 13_0_4, we observed a crash due to excessive memory consumption in PromptReco for the datasets EGamma1, ParkingDoubleElectronLowMass, and JetMET0.

The tarball for this crash can be found here: /afs/cern.ch/user/c/cmst0/public/PausedJobs/Replay13_0_4/Memory/job_1467

For more details, please refer to this cmsTalk post.

cmsbuild commented 1 year ago

A new Issue was created by @malbouis .

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

germanfgv commented 1 year ago

The issue is also appearing in other PDs: ReservedDoubleMuonLowMass, Muon[0,1], DisplacedJet and ZeroBias

makortel commented 1 year ago

I'm trying to take a look with a heap profiler (before that it is difficult to assign).

makortel commented 1 year ago

assign reconstruction, dqm

(before that it is difficult to assign)

Well, the cause is likely either in RECO or in DQM, so maybe useful to assign early anyway.

cmsbuild commented 1 year ago

New categories assigned: dqm,reconstruction

@micsucmed,@rvenditti,@mandrenguyen,@emanueleusai,@syuvivida,@clacaputo,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

malbouis commented 1 year ago

Hi @cms-sw/reconstruction-l2 and @cms-sw/dqm-l2 , I know there are a lot of issues going on in this ramp-up period but did you have the chance to look into this issue? It should definitely be investigated as it showed up in the most recent replay we did and we are about to launch a new one to test CMSSW_13_0_5.

makortel commented 1 year ago

Here is a plot extracted from the log file showing the RSS as a function of the timestamp of the printout (plot: simplememory_rss). Towards the end of the job there is a rapid increase of ~4 GB, which then decreases.
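
(For reference, a minimal sketch of how such a plot can be extracted from the log, assuming the MemoryCheck printout format quoted further down in this thread; the log file name is a placeholder.)

import re
from datetime import datetime
import matplotlib.pyplot as plt

timestamps, rss = [], []
ts = None
with open("cmsRun1-stdout.log") as f:  # placeholder log file name
    for line in f:
        m = re.search(r"%MSG-w MemoryCheck:.*?(\d{2}-\w{3}-\d{4} \d{2}:\d{2}:\d{2})", line)
        if m:
            ts = datetime.strptime(m.group(1), "%d-%b-%Y %H:%M:%S")
            continue
        m = re.search(r"MemoryCheck: module \S+ VSIZE \S+ \S+ RSS (\S+) \S+", line)
        if m and ts is not None:
            timestamps.append(ts)
            rss.append(float(m.group(1)))  # current RSS (in MiB); the last number on the line is the per-module delta
            ts = None

plt.plot(timestamps, rss, ".")
plt.xlabel("timestamp of MemoryCheck printout")
plt.ylabel("RSS [MiB]")
plt.savefig("simplememory_rss.png")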

While doing the plot I noticed the job processed 73 events on 8 threads, yet it took the job about an hour and 15 minutes to process the data. The end-of-job time report shows

 Time Summary: 
 - Min event:   107.015
 - Max event:   2101.15
 - Avg event:   461.379
 - Total loop:  4550.8
 - Total init:  169.891
 - Total job:   4739.79
 - EventSetup Lock: 0
 - EventSetup Get:  0
 Event Throughput: 0.0160412 ev/s
 CPU Summary: 
 - Total loop:     33277.5
 - Total init:     155.67
 - Total extra:    0
 - Total children: 324.451
 - Total job:      33446.1

i.e. the average time to process an event was ~7.5 minutes, with the maximum being 35 minutes! (even the minimum was almost 2 minutes)
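
(A quick back-of-the-envelope check of these numbers, using the values from the report above; this is just a sketch, not part of the original report.)

avg_event, max_event, min_event = 461.379, 2101.15, 107.015  # seconds
total_loop_wall, total_loop_cpu = 4550.8, 33277.5            # seconds
n_events, n_threads = 73, 8

print(avg_event / 60, max_event / 60, min_event / 60)  # ~7.7, ~35.0, ~1.8 minutes
print(n_events / total_loop_wall)                      # ~0.016 ev/s, matching the reported throughput
print(total_loop_cpu / (n_threads * total_loop_wall))  # ~0.91, i.e. the threads were mostly busy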

Are these events particularly heavy? Or is there some runaway module?

makortel commented 1 year ago

Are these events particularly heavy?

On this line of thought, I see the log has 401 occurrences of

%MSG-e TooManyPairs:  HitPairEDProducer:pixelPairStepHitDoublets  27-Apr-2023 20:25:56 CEST Run: 366498 Event: 277351
number of pairs exceed maximum, no pairs produced
%MSG

from pixelPairElectronHitDoublets, pixelPairStepHitDoublets, and stripPairElectronHitDoublets.
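
(A small sketch of how such occurrences can be counted per module, assuming the %MSG-e format quoted above; the log file name is a placeholder.)

import re
from collections import Counter

counts = Counter()
with open("cmsRun1-stdout.log") as f:  # placeholder log file name
    for line in f:
        m = re.search(r"%MSG-e TooManyPairs:\s+HitPairEDProducer:(\S+)", line)
        if m:
            counts[m.group(1)] += 1

for label, n in counts.most_common():
    print(label, n)  # e.g. pixelPairStepHitDoublets and its count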

mandrenguyen commented 1 year ago

I believe the pixel-pair step comes into play when there's an inactive region of the pixel tracker. Could there be a larger than usual pixel dead area in these events? Maybe @mmusich or @slava77 have some insight? There are also plenty of warnings of the following type from the pixel-pair step track propagation:

%MSG-w BasicTrajectoryState:   CkfTrackCandidateMaker:pixelPairStepTrackCandidates 01-May-2023 16:18:56 CEST  Run: 366498 Event: 615981
local error not pos-def
mmusich commented 1 year ago

Could there be a larger than usual pixel dead area in these events?

I am not aware of particularly large new dead regions, but I haven't yet checked in detail. By the way, I have seen plenty of these when running checks for the low-pT electron issue, but I am wondering if this is a red herring. Do I understand correctly that these high-memory jobs occurred in a replay with 13_0_4, while they didn't occur in real Prompt in 13_0_3? Should we not focus on what changed in between?

makortel commented 1 year ago

I ran the job on one thread. Here are the top-10 reported memory increases

%MSG-w MemoryCheck:   SeedCreatorFromRegionConsecutiveHitsEDProducer:stripPairElectronSeeds  01-May-2023 19:12:42 CEST Run: 366498 Event: 210235
MemoryCheck: module SeedCreatorFromRegionConsecutiveHitsEDProducer:stripPairElectronSeeds VSIZE 14910 0 RSS 6711.33 1700.12

%MSG-w MemoryCheck:   SeedCreatorFromRegionConsecutiveHitsEDProducer:stripPairElectronSeeds  01-May-2023 15:54:33 CEST Run: 366498 Event: 1043611
MemoryCheck: module SeedCreatorFromRegionConsecutiveHitsEDProducer:stripPairElectronSeeds VSIZE 12841.9 896 RSS 5974.74 1105.32

%MSG-w MemoryCheck:   SeedCreatorFromRegionConsecutiveHitsEDProducer:pixelPairElectronSeeds  01-May-2023 15:37:08 CEST Run: 366498 Event: 277351
MemoryCheck: module SeedCreatorFromRegionConsecutiveHitsEDProducer:pixelPairElectronSeeds VSIZE 9113.46 768 RSS 4795.14 962.957

%MSG-w MemoryCheck:   SeedCreatorFromRegionConsecutiveHitsEDProducer:pixelPairElectronSeeds  01-May-2023 15:59:42 CEST Run: 366498 Event: 219521
MemoryCheck: module SeedCreatorFromRegionConsecutiveHitsEDProducer:pixelPairElectronSeeds VSIZE 12842 0 RSS 4873.21 940.566

%MSG-w MemoryCheck:   SeedCreatorFromRegionConsecutiveHitsEDProducer:pixelPairElectronSeeds  01-May-2023 15:42:02 CEST Run: 366498 Event: 268247
MemoryCheck: module SeedCreatorFromRegionConsecutiveHitsEDProducer:pixelPairElectronSeeds VSIZE 9129.86 0 RSS 4791.96 896.84

%MSG-w MemoryCheck:  CAHitQuadrupletEDProducer:detachedQuadStepHitQuadruplets  01-May-2023 19:08:34 CEST Run: 366498 Event: 210235
MemoryCheck: module CAHitQuadrupletEDProducer:detachedQuadStepHitQuadruplets VSIZE 14910 0 RSS 4494.42 803.926

%MSG-w MemoryCheck:   SeedCreatorFromRegionConsecutiveHitsEDProducer:pixelPairElectronSeeds  01-May-2023 15:54:18 CEST Run: 366498 Event: 1043611
MemoryCheck: module SeedCreatorFromRegionConsecutiveHitsEDProducer:pixelPairElectronSeeds VSIZE 11945.9 0 RSS 4863.84 763.105

%MSG-w MemoryCheck:   SeedCreatorFromRegionConsecutiveHitsEDProducer:stripPairElectronSeeds  01-May-2023 16:16:21 CEST Run: 366498 Event: 649939
MemoryCheck: module SeedCreatorFromRegionConsecutiveHitsEDProducer:stripPairElectronSeeds VSIZE 12842 0 RSS 5590.77 474.164

%MSG-w MemoryCheck:   SeedCreatorFromRegionConsecutiveHitsEDProducer:pixelPairElectronSeeds  01-May-2023 19:12:21 CEST Run: 366498 Event: 210235
MemoryCheck: module SeedCreatorFromRegionConsecutiveHitsEDProducer:pixelPairElectronSeeds VSIZE 14910 0 RSS 4963.2 468.773

%MSG-w MemoryCheck:  CAHitQuadrupletEDProducer:detachedQuadStepHitQuadruplets  01-May-2023 15:52:07 CEST Run: 366498 Event: 1043611
MemoryCheck: module CAHitQuadrupletEDProducer:detachedQuadStepHitQuadruplets VSIZE 11945.9 1536 RSS 4100.73 331.551

The last number of the printout is the RSS increase by the module.
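
(The ranking above can be reproduced roughly along these lines; a sketch, with the log file name as a placeholder.)

import heapq
import re

deltas = []
with open("cmsRun1-stdout.log") as f:  # placeholder log file name
    for line in f:
        m = re.search(r"MemoryCheck: module (\S+) VSIZE \S+ \S+ RSS \S+ (\S+)", line)
        if m:
            deltas.append((float(m.group(2)), m.group(1)))  # (RSS increase in MiB, ModuleType:label)

for delta, module in heapq.nlargest(10, deltas):
    print(f"{delta:10.3f}  {module}")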

malbouis commented 1 year ago

FYI, we are observing this crash in the replay that is currently running for CMSSW_13_0_5. There are a few paused jobs and I hear from @germanfgv that they are crashing due to this issue.

makortel commented 1 year ago

Would it be feasible to have a 13_0_3 replay, or at minimum get the corresponding PSet, on these data? (the PSet from 13_0_4 does not work in 13_0_3)

dan131riley commented 1 year ago

One thread is certainly easier to interpret. I've been looking for modules with long run times and large RSS deltas, and came up with some of the same suspects:

%MSG-w MemoryCheck:   SeedCreatorFromRegionConsecutiveHitsEDProducer:pixelPairElectronSeeds  27-Apr-2023 21:30:02 CEST Run: 366498 Event: 1181675
MemoryCheck: module SeedCreatorFromRegionConsecutiveHitsEDProducer:pixelPairElectronSeeds VSIZE 30694.3 0 RSS 13348.7 93.0898

%MSG-w MemoryCheck:   SeedCreatorFromRegionConsecutiveHitsEDProducer:stripPairElectronSeeds  27-Apr-2023 21:30:24 CEST Run: 366498 Event: 1181675
MemoryCheck: module SeedCreatorFromRegionConsecutiveHitsEDProducer:stripPairElectronSeeds VSIZE 31078.3 0 RSS 16184.9 697.008
mmusich commented 1 year ago

Would it be feasible to have a 13_0_3 replay, or at minimum get the corresponding PSet, on these data? (the PSet from 13_0_4 does not work in 13_0_3)

Out of curiosity can someone try to run this job with era Run3 instead of Run3_2023?

mmusich commented 1 year ago

Out of curiosity can someone try to run this job with era Run3 instead of Run3_2023?

Answering my own question: I tried to reproduce the configuration leading to the issue (in CMSSW_13_0_4) with

python3 Configuration/DataProcessing/test/RunPromptReco.py --scenario ppEra_Run3_2023 --reco --dqmio --dqmSeq=@common+@ecal+@egamma+@L1TEgamma --aod --global-tag 130X_dataRun3_Prompt_Candidate_2023_03_09_09_47_16 --lfn /store/backfill/1/data/Tier0_REPLAY_2023/EGamma1/RAW/v27184538/000/366/498/00000/694408a4-44b2-4a22-8fa5-7c68890bf99b.root --alcareco EcalUncalZElectron+EcalUncalWElectron+HcalCalIterativePhiSym+HcalCalIsoTrkProducerFilter+EcalESAlign --PhysicsSkims=@EGamma0

and compared that with what I obtain with the old setting (ppEra_Run3):

python3 Configuration/DataProcessing/test/RunPromptReco.py --scenario ppEra_Run3 --reco --dqmio --dqmSeq=@common+@ecal+@egamma+@L1TEgamma --aod --global-tag 130X_dataRun3_Prompt_Candidate_2023_03_09_09_47_16 --lfn /store/backfill/1/data/Tier0_REPLAY_2023/EGamma1/RAW/v27184538/000/366/498/00000/694408a4-44b2-4a22-8fa5-7c68890bf99b.root --alcareco EcalUncalZElectron+EcalUncalWElectron+HcalCalIterativePhiSym+HcalCalIsoTrkProducerFilter+EcalESAlign --PhysicsSkims=@EGamma0

Running on 5 events of run 366498, this is the RSS profile I get:

(image: RSS profiles for ppEra_Run3_2023 vs ppEra_Run3)

germanfgv commented 1 year ago

I'll try 13_0_5_patch1 with scenario ppEra_Run3 on some of the affected datasets.

mmusich commented 1 year ago

Would it be feasible to have a 13_0_3 replay, or at minimum get the corresponding PSet, on these data? (the PSet from 13_0_4 does not work in 13_0_3)

running

python3 Configuration/DataProcessing/test/RunPromptReco.py --scenario ppEra_Run3 --reco --dqmio --dqmSeq=@common+@ecal+@egamma+@L1TEgamma --aod --global-tag 130X_dataRun3_Prompt_Candidate_2023_03_09_09_47_16 --lfn /store/backfill/1/data/Tier0_REPLAY_2023/EGamma1/RAW/v27184538/000/366/498/00000/694408a4-44b2-4a22-8fa5-7c68890bf99b.root --alcareco EcalUncalZElectron+EcalUncalWElectron+HcalCalIterativePhiSym+HcalCalIsoTrkProducerFilter+EcalESAlign --PhysicsSkims=@EGamma

in a 13_0_3, the picture is not dramatically different:

(image: RSS profile in 13_0_3)

slava77 commented 1 year ago

#41265 was added in 13_0_4; perhaps it bumped up the baseline memory use enough to take everything over the threshold.

Are we running with concurrent lumis now, compared to 12_4 for 2022? Or did the change happen earlier? I can think of a downside: heavy-event crowding is now more likely (such events take a long time to process and also come with memory-use peaks; lumi partitioning prevents more of them from being processed at the same time). @makortel @Dr15Jones

mandrenguyen commented 1 year ago

I've been running for a couple of hours on lxplus8 using the tarball from the OP, with the SimpleMemoryCheck service enabled. I'm more than 300 events in and I don't see the total memory much exceeding 14 GB, while the original log shows the RSS exceeding 16 GB with the same tool around event 73. Has anyone been able to reproduce an RSS approaching the limit of 16 GB?
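
(For reference, a minimal sketch of how the memory reporting can be enabled in a cmsRun configuration; the parameters shown are commonly used ones and not necessarily what the Tier-0 PSet uses.)

import FWCore.ParameterSet.Config as cms

process = cms.Process("RECO")  # in practice, append to the job's existing PSet
process.SimpleMemoryCheck = cms.Service("SimpleMemoryCheck",
    ignoreTotal = cms.untracked.int32(1),            # skip the first report
    moduleMemorySummary = cms.untracked.bool(True),  # per-module summary at end of job
)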

makortel commented 1 year ago

Are we running with concurrent lumis now compared to 12_4 for 2022? or did the change happen earlier.

Concurrent lumis were enabled already before 12_4.

germanfgv commented 1 year ago

The scenario doesn't seem to make any difference. Trying scenario ppEra_Run3 with CMSSW_13_0_5_patch1, we see the same high-memory issues. You can find the tarball for this latest occurrence of the problem here:

/afs/cern.ch/user/c/cmst0/public/PausedJobs/Replay13_0_5_patch1/job_25

You can see PerformanceMonitor tried to kill the job starting at 15:37:11 when it reached a PSS of

2023-05-02 15:37:11,149:INFO:PerformanceMonitor:PSS: 17238627; RSS: 17046588; PCPU: 767; PMEM: 8.6
2023-05-02 15:37:11,150:ERROR:PerformanceMonitor:Error in CMSSW step cmsRun1
Number of Cores: 8
Job has exceeded maxPSS: 16000 MB
Job has PSS: 17238 MB
slava77 commented 1 year ago

The scenario doesn't seem to make any difference. Trying scenario ppEra_Run3 with CMSSW_13_0_5_patch1, we see the same high-memory issues. You can find the tarball for this latest occurrence of the problem here:

/afs/cern.ch/user/c/cmst0/public/PausedJobs/Replay13_0_5_patch1/job_25

You can see PerformanceMonitor tried to kill the job starting at 15:37:11 when it reached a PSS of

2023-05-02 15:37:11,149:INFO:PerformanceMonitor:PSS: 17238627; RSS: 17046588; PCPU: 767; PMEM: 8.6
2023-05-02 15:37:11,150:ERROR:PerformanceMonitor:Error in CMSSW step cmsRun1
Number of Cores: 8
Job has exceeded maxPSS: 16000 MB
Job has PSS: 17238 MB

Is it possible to increase the memory limit just to know how much the job will need? (I'm not proposing to make it the default.)

germanfgv commented 1 year ago

Actually, the job finished. It seems the SIGUSR2 signal that the wrapper uses to kill the job did not work, because the job log continues for several minutes after that and ends with exit code 0, as you can see in

/afs/cern.ch/user/c/cmst0/public/PausedJobs/Replay13_0_5_patch1/job_25/job/WMTaskSpace/cmsRun1/cmsRun1-stdout.log

I'll try to increase the limit anyway, if anything, simply to check how the wrapper memory measurements compare to the internal measurements.

I have a question related to this. In the MemoryCheck messages I see the following:

%MSG-w MemoryCheck:  JetAnalyzer:jetDQMAnalyzerAk4PFUncleaned  02-May-2023 15:37:08 CEST Run: 366498 Event: 61264126
MemoryCheck: module JetAnalyzer:jetDQMAnalyzerAk4PFUncleaned VSIZE 31993.2 0 RSS 16330 0.246094

Are those RSS values in kB or in kiB? I'm assuming/hoping they are in kB.

Dr15Jones commented 1 year ago

A SIGUSR2 signal is caught by the framework and is used to stop the job early. With such a signal, the job will still exit with a value of 0 since it shut down cleanly. So the job 'finished' but probably didn't process all the events in the input.

makortel commented 1 year ago

Are those RSS values in kB or in kiB? I'm assuming/hoping they are in kB.

The RSS (and VSIZE) are in MiB.
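
(A rough reconciliation of the two tools, assuming the PerformanceMonitor values are in kB, which matches the "Job has PSS: 17238 MB" line quoted above; just a sketch.)

pss_kb, rss_kb = 17238627, 17046588    # PerformanceMonitor PSS / RSS from the log above
print(pss_kb // 1000, rss_kb // 1000)  # 17238, 17046 -> reproduces "Job has PSS: 17238 MB"
print(16330 * 1024**2 / 1e9)           # the MemoryCheck RSS of 16330 MiB quoted earlier is ~17.1 GB, same ballpark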

germanfgv commented 1 year ago

A SIGUSR2 signal is caught by the framework and is used to stop the job early. With such a signal, the job will still exit with a value of 0 since it shut down cleanly. So the job 'finished' but probably didn't process all the events in the input.

Ohh ok. Thanks for the clarification. I increased the limit to 20GB. I'll share the output when I get it.

germanfgv commented 1 year ago

I'm running a small replay with the current production configuration (CMSSW_13_0_3), only with dataset JetMET0, so we can compare it with this:

The scenario doesn't seem to make any difference. Trying scenario ppEra_Run3 with CMSSW_13_0_5_patch1, we see the same high-memory issues. You can find the tarball for this latest occurrence of the problem here:

/afs/cern.ch/user/c/cmst0/public/PausedJobs/Replay13_0_5_patch1/job_25

You can see PerformanceMonitor tried to kill the job starting at 15:37:11 when it reached a PSS of

2023-05-02 15:37:11,149:INFO:PerformanceMonitor:PSS: 17238627; RSS: 17046588; PCPU: 767; PMEM: 8.6
2023-05-02 15:37:11,150:ERROR:PerformanceMonitor:Error in CMSSW step cmsRun1
Number of Cores: 8
Job has exceeded maxPSS: 16000 MB
Job has PSS: 17238 MB
makortel commented 1 year ago

Has anyone been able to reproduce an RSS approaching the limit of 16 GB?

I ran a test of the 73 events (that were processed in the original job) on an slc7 machine; it reached 14.5 GB.

Longest running modules (> 5 sec) were

TimeReport  63.196841    63.196841    63.196841  lowPtTripletStepHitTriplets
TimeReport  57.752829    57.752829    57.752829  highPtTripletStepHitTriplets
TimeReport  49.104601    49.104601    49.104601  detachedQuadStepHitQuadruplets
TimeReport  35.766882    35.766882    35.766882  detachedTripletStepHitTriplets
TimeReport  27.754519    27.754519    27.754519  lowPtQuadStepHitQuadruplets
TimeReport  21.177794    21.177794    21.177794  initialStepHitQuadrupletsPreSplitting
TimeReport  21.139621    21.139621    21.139621  initialStepHitQuadruplets
TimeReport  15.396089    15.396089    15.396089  pixelPairElectronSeeds
TimeReport   7.218820     7.218820     7.218820  pixelPairStepTrackCandidates
TimeReport   6.141264     6.141264     6.141264  stripPairElectronSeeds
mandrenguyen commented 1 year ago

@makortel That seems consistent with what I found on lxplus8, but that's well below the max RSS in the original log file, and well within the limit that was set for the T0. Odd, no?

slava77 commented 1 year ago

@makortel That seems consistent with what I found on lxplus8, but that's well below the max RSS in the original log file, and well within the limit that was set for the T0. Odd, no?

Memory-use peaks are stochastic in multithreaded running; I doubt a single run would show this conclusively.

Dr15Jones commented 1 year ago

So I skipped the job_25 input forward 158 events and started processing there. On the 15th event I hit an RSS of 15253.8 MiB. It is probable that there is some hysteresis in the job (e.g. ROOT I/O buffers), so that could be showing the issue. I was running CMSSW_13_0_4 on an el8 machine using 8 threads.
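
(For anyone repeating this, a minimal sketch of skipping forward in the input with PoolSource; the file name is a placeholder, and in practice one would edit the job's existing PSet.)

import FWCore.ParameterSet.Config as cms

process = cms.Process("RECO")  # in practice, modify the existing Tier-0 PSet
process.source = cms.Source("PoolSource",
    fileNames  = cms.untracked.vstring("file:input_RAW.root"),  # placeholder input file
    skipEvents = cms.untracked.uint32(158),                     # start at the 159th event of the input
)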

makortel commented 1 year ago

@Dr15Jones noticed that despite pixelPairElectronHitDoublets reporting "no pairs produced", the consuming module pixelPairElectronSeeds sees a SeedingHitSet with a substantial number of elements. Further investigation revealed that the printout came from https://github.com/cms-sw/cmssw/blob/85b455d63c5685b15564a5e0804565583e8b05ee/RecoTracker/TkHitPairs/src/HitPairGeneratorFromLayerPair.cc#L85-L88 and resulted in no pairs being produced for this specific layer pair, while the sum of pairs over all layer pairs in https://github.com/cms-sw/cmssw/blob/85b455d63c5685b15564a5e0804565583e8b05ee/RecoTracker/TkHitPairs/plugins/HitPairEDProducer.cc#L108-L122 does not exceed maxElementsTotal.

I wonder if it makes sense for HitPairEDProducer to produce hit pairs only for some layer pairs (or regions), or would it make more sense for the module to "abort" and produce empty products immediately when some layer pair (for some region) results in more than maxElements hit pairs? @cms-sw/tracking-pog-l2 @cms-sw/egamma-pog-l2
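
(To make the two options concrete, here is a schematic in Python pseudocode of the per-layer-pair and total limits as described above; this is not the actual C++ implementation, and make_doublets is a hypothetical stand-in for the real doublet generation.)

def produce_doublets(regions, layer_pairs, make_doublets,
                     max_elements, max_elements_total, abort_on_overflow=False):
    output = []
    for region in regions:
        for pair in layer_pairs:
            doublets = make_doublets(region, pair)
            if len(doublets) > max_elements:
                # per-layer-pair limit hit ("TooManyPairs ... no pairs produced")
                if abort_on_overflow:
                    return []  # proposed alternative: give up on the whole event immediately
                continue       # current behaviour: drop only this layer pair, keep going
            output.extend(doublets)
            if len(output) > max_elements_total:
                return []      # total limit exceeded: the whole product comes out empty
    return output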

swagata87 commented 1 year ago

would it make more sense for the module to "abort" and produce empty products immediately when some layer pair (for some region) results in more than maxElements hit pairs

This might lead to an efficiency loss, no? Does this change help with fixing the memory issue?

germanfgv commented 1 year ago

I'm running a small replay with the current production configuration (CMSSW_13_0_3), only with dataset JetMET0, so we can compare it with this:

Running the small replay with the same configuration as in production, we got the memory issues again. This is particularly strange given that this run was processed with that configuration in production without any errors. For example, here you can find logs for one of the production jobs:

/afs/cern.ch/user/c/cmst0/public/PausedJobs/Replay13_0_5_patch1/ProductionRun

This is running through the same lumis as the job in:

/afs/cern.ch/user/c/cmst0/public/PausedJobs/Replay13_0_5_patch1/job_25

But the production job has a peak RSS of 8942.01 MB, while the 13_0_5_patch1 replay version was over 16000 MB when it was killed. The 13_0_3 replay version of this job is still running.

I don't understand why this can be happening. I have asked ORM to pick a different run to test, as suggested by @mmusich yesterday.

mandrenguyen commented 1 year ago

would it make more sense for the module to "abort" and produce empty products immediately when some layer pair (for some region) results in more than maxElements hit pairs

this might lead to efficiency loss, no? does this change help with fixing the memory issue?

Regarding the efficiency loss, I suppose that depends on whether we are exceeding the limit in real collision events or in beam background events.

mmusich commented 1 year ago

beam background events.

It would be nice to confirm that 366498 indeed has beam background. Unfortunately, it seems the offline DQM is not equipped to pick that up. I guess an offline analysis would be in order.

mandrenguyen commented 1 year ago

Some additional info: in our profiling we have a ttbar single-threaded wf (11834.21) with --era Run3 and the 2022 GT.
I see no significant increase in RSS for the reco, mini, or nano steps between 13_0_0 and 13_0_4. It's not proof that the increase is not tied to the release, but it narrows the phase space.

EDIT: I think I misread @germanfgv's comment... Do we see excessive memory in 13_0_3 replays?

makortel commented 1 year ago

would it make more sense for the module to "abort" and produce empty products immediately when some layer pair (for some region) results in more than maxElements hit pairs

this might lead to efficiency loss, no? does this change help with fixing the memory issue?

Very likely, but IIUC these limits exist to protect the data-processing infrastructure from excessive resource usage. I'd be tempted to argue that already losing all hit doublets from 5 layer pairs (like in 366498:1:210235) leads to such an efficiency loss that the event might not be useful for physics, and if that is the case, could we just avoid processing bigger parts of the event?

In a sense I'd say that 366498:1:210235 (probably along with others) is close to being unprocessable with the current reconstruction. Adjusting the "maximum limit" behavior could be a quick way to work around the problem.

swagata87 commented 1 year ago

already losing all hit doublets from 5 layer pairs (like in 366498:1:210235) leads to such efficiency loss in a way that the event might not be useful for physics

Okay, I understand now. In that case, if it is decided to put this change into CMSSW, then from the egamma side we will keep an eye on release validation (especially electron-track-related quantities) just to make sure that there is no or minimal effect.

makortel commented 1 year ago

I wonder if it makes sense for HitPairEDProducer to produce hit pairs only for some layer pairs (or regions), or would it make more sense for the module to "abort" and produce empty products immediately when some layer pair (for some region) results in more than maxElements hit pairs?

I took the liberty of opening an RFC PR (exceptionally in 13_0_X branch directly) along this line in https://github.com/cms-sw/cmssw/pull/41514. In a quick test

Before proceeding further (with the PR to master) I'd like to hear at least from @cms-sw/tracking-pog-l2 if this approach would be viable.

slava77 commented 1 year ago

@makortel is early deletion working aggressively enough? Could it be that the module scheduling is spreading calls too thin, so that the products that could be deleted actually stay in memory much longer?

Dr15Jones commented 1 year ago

is early deletion working aggressively enough?

The data products in question are not on the early-delete list. I've been testing to see if adding them makes a difference. The results are not fully in, but preliminary ones seem to indicate it is insufficient.

slava77 commented 1 year ago

is early deletion working aggressively enough?

The data products in question are not on the early-delete list. I've been testing to see if adding them makes a difference. The results are not fully in, but preliminary ones seem to indicate it is insufficient.

I guess I was looking in the wrong place; in the MC setup, process.options.canDeleteEarly includes RegionsSeedingHitSets_pixelPairElectronHitDoublets__RECO, and similarly for many more *Doublets.
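
(A quick way to inspect what is on the early-delete list in a given configuration; a sketch, where PSet is a placeholder for the job's dumped configuration module.)

from PSet import process  # placeholder: the job's dumped configuration defining `process`

for branch in process.options.canDeleteEarly:  # branch names like Type_label_instance_process
    if "Doublets" in branch or "Seeds" in branch:
        print(branch)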

Dr15Jones commented 1 year ago

@slava77 It doesn't appear to be the *Doublets which are causing the problem, but instead what reads the Doublets.

slava77 commented 1 year ago

@slava77 It doesn't appear to be the *Doublets which are causing the problem, but instead what reads the Doublets.

Ah, I misunderstood then; because the proposed solution was to reduce the size of the *Doublets.

Which modules reading the Doublets are a problem? Is it something specific (pixelPairElectronSeeds) or in general?

makortel commented 1 year ago

Ah, I misunderstood then; because the proposed solution was to reduce the size of the *Doublets.

Really, to make the size of the *Doublets 0, in which case the code reading the *Doublets should do only very little work.

germanfgv commented 1 year ago

We launched a new replay yesterday, using post-scrubbing runs and CMSSW_13_0_5_patch1. The replay is almost over and we don't see any memory-usage errors. Not only that, but the jobs are finishing much faster. So now it feels like we just wasted a lot of people's time.

I still don't understand how this particular run, 366498, successfully processed in production, all of a sudden cannot be reconstructed in replays with the same configuration. What about scrubbing changes the way we reconstruct the data?

mmusich commented 1 year ago

successfully processed in production,

this, I also don't understand :(

What about scrubbing changes the way we reconstruct the data?

Well, after scrubbing we'll have much less beam-induced background, which in turn will lower the creation of large numbers of spurious tracking seeds (though this hasn't been confirmed yet, as far as I can tell; see https://github.com/cms-sw/cmssw/issues/41457#issuecomment-1532587959).

Dr15Jones commented 1 year ago

So I tried to add the data products made by the modules consuming the most memory to the 'delete early' list:

process.options.canDeleteEarly.append('TrajectorySeeds_pixelPairElectronSeeds__RECO')
process.options.canDeleteEarly.append('TrajectorySeeds_stripPairElectronSeeds__RECO')

However, whatever minor improvement this made was dwarfed by the variability in memory usage seen when running the same multi-threaded job multiple times.