cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.07k stars 4.28k forks source link

HGCAL in phase-2 premix workflow .999: missing HEB rechits, low response in HEF and EE #32050

Open slava77 opened 3 years ago

slava77 commented 3 years ago

followed by recent (e.g. https://github.com/cms-sw/cmssw/pull/31906#issuecomment-722578846 or https://github.com/cms-sw/cmssw/pull/32022#issuecomment-722025986) it looks like some of the premixing is not working in phase 2

here are a few plots comparing in CMSSW_11_2_X_2020-11-04-1100 23234.0 in black with 23434.999 in red (10 events; presumably the same sim), using the matrix outputs from the baseline jenkins job

all_baseline-remapVSbaseline_TTbar14TeV2026D49wf23234p0c_log10HGCRecHitsSorted_HGCalRecHit_HGCHEFRecHits_RECO_obj_obj_energy

all_baseline-remapVSbaseline_TTbar14TeV2026D49wf23234p0c_log10HGCRecHitsSorted_HGCalRecHit_HGCEERecHits_RECO_obj_obj_energy

HGCHEBRecHits are empty in 23434.999

slava77 commented 3 years ago

assign upgrade

cmsbuild commented 3 years ago

New categories assigned: upgrade

@kpedro88 you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 3 years ago

A new Issue was created by @slava77 Slava Krutelyov.

@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

kpedro88 commented 3 years ago

@makortel @adas1994 please take a look if you have time

makortel commented 3 years ago

@makortel @adas1994 please take a look if you have time

Don't expect anything from me (beyond looking specific code someone else points to).

adas1994 commented 3 years ago

What's in the x-axis ? Is it Energy or Eta ? @kpedro88

kpedro88 commented 3 years ago

@adas1994 these plots show RecHit energy

adas1994 commented 3 years ago

@adas1994 these plots show RecHit energy

I don't understand the negative energy. But, give me this weekend. Let me see if I can find something out.

kpedro88 commented 3 years ago

Sorry, it's log10(energy), to display the low values better.

adas1994 commented 3 years ago

Last week I was almost off work and facing a huge backlog of work to push through. I know I promised to get something about it by this Monday but I failed. I am hoping to get something done about it within a couple of days though. I am really sorry for the delay @kpedro88 .

adas1994 commented 3 years ago

@kpedro88 I have looked into the plots in detail using K0_L thrown directly at HGCAL at 5 GeV. And it is a pretty serious issue actually. I don't have a clue at present why the energy distribution would be different from the reference, as I changed nothing on the accumulated charge but changed only how timing is computed. I would have to look into the code for possible bugs I might have introduced.

Here are some plots. RecHits_HGCEE_energy RecHits_HGCHEB_energy RecHits_HGCHEF_energy

kpedro88 commented 3 years ago

@adas1994 are you using minbias that you generated yourself, or the official sample used by runTheMatrix?

adas1994 commented 3 years ago

@kpedro88 I used official Sample used by runTheMatrix. One quick question - this is a very legit open issue and needs to be addressed ASAP. But I will remain extremely busy for the next few months. I was just wondering if there is any deadline before which this needs to be addressed. I would have to reschedule stuff in that case.

kpedro88 commented 3 years ago

@adas1994 it might be worth trying a new sample generated in the same release before digging through the code. it's possible the old sample is no longer 100% compatible.

adas1994 commented 3 years ago

@kpedro88 I will try new samples and update you on it as soon as I can.

adas1994 commented 3 years ago

@kpedro88 - sorry to bother again, but I am a little confused as to how is it possible to compare plots from workflows 23234.0 and 23434.999. The workflow 23234.0 has no pileups, but 23434.999 has.

kpedro88 commented 3 years ago

@adas1994 that's true, I read too fast: 23434.0 (PU workflow, needs to be modified to set PU to 50) should be compared to 23434.999. @slava77 FYI

adas1994 commented 3 years ago

@adas1994 that's true, I read too fast: 23434.0 (PU workflow, needs to be modified to set PU to 50) should be compared to 23434.999. @slava77 FYI

Cool. I will ping you once I stumble upon something concrete.

slava77 commented 3 years ago

@kpedro88 - sorry to bother again, but I am a little confused as to how is it possible to compare plots from workflows 23234.0 and 23434.999. The workflow 23234.0 has no pileups, but 23434.999 has.

more pileup -> more hits are expected, but we get less. That's a good enough comparison to show that there are problems.

adas1994 commented 3 years ago

I used similar new samples for StdMix-PreMix comparison this time - Blue is for Reference StandardMix, Pink is for PreMix, Red is for difference. RecHits_HGCEE_energy RecHits_HGCHEB_energy RecHits_HGCHEF_energy

The distribution and multiplicity of the rechits energy histograms are significantly reduced, just for changing CMSSW release area. Any possible explanation - @kpedro88 ?

kpedro88 commented 3 years ago

@adas1994 can you clarify exactly which CMSSW release you used and what samples you produced?

adas1994 commented 3 years ago

For yesterday's plots- I had used CMSSW_11_1_0_pre8. PU samples - /store/relval/CMSSW_10_6_0_patch2/RelValMinBias_14TeV/GEN-SIM/106X_upgrade2023_reali\ stic_v3_2023D41noPU-v1/10000/*.root

For today's plots - CMSSW_11_2_0_pre8 and PU samples - /store/relval/CMSSW_11_0_0_pre13/RelValMinBias_14TeV/GEN-SIM/110X_mcRun4_realistic_v\ 2_2026D49noPU-v1/20000/*.root Underlying Generator Event same for Both cases. process.generator = cms.EDProducer("FlatRandomPtGunProducer", AddAntiParticle = cms.bool(True), PGunParameters = cms.PSet( MaxEta = cms.double(2.75), MaxPhi = cms.double(3.14159265359), MaxPt = cms.double(5.0), MinEta = cms.double(1.75), MinPhi = cms.double(-3.14159265359), MinPt = cms.double(5.0), PartID = cms.vint32(130) ), Verbosity = cms.untracked.int32(0), firstRun = cms.untracked.uint32(1), psethack = cms.string('single K0_L pt 5') )

Is this info sufficient @kpedro88 ?

kpedro88 commented 3 years ago

@adas1994 I'm not sure we learn much moving from 10_6 to 11_0 MinBias samples. They could both be considered out of date. My suggestion was to generate your own MinBias sample using 11_2_0_pre8, to ensure consistency.

adas1994 commented 3 years ago

@adas1994 I'm not sure we learn much moving from 10_6 to 11_0 MinBias samples. They could both be considered out of date. My suggestion was to generate your own MinBias sample using 11_2_0_pre8, to ensure consistency.

I don't know how to generate my own minbias samples that using runThematrix.py workflows. Can you provide an example code for generating my own minbias samples?

kpedro88 commented 3 years ago

@adas1994 there is a minbias workflow for D49: runTheMatrix.py -w upgrade -l 23240.0 --dryRun

That will print the cmsDriver commands for the workflow. You just need to run the GEN-SIM step (probably with more events than the default 10). Then use the resulting root file as the --pileup_input for the classical mixing and premixing workflows.

adas1994 commented 3 years ago

@adas1994 there is a minbias workflow for D49: runTheMatrix.py -w upgrade -l 23240.0 --dryRun

That will print the cmsDriver commands for the workflow. You just need to run the GEN-SIM step (probably with more events than the default 10). Then use the resulting root file as the --pileup_input for the classical mixing and premixing workflows.

And How do I specify the number of PUs, like 50 or 200 and whether I want a Poisson distribution of mixing or whether do I want fixed number of pileups in this Gen-SIm step?

kpedro88 commented 3 years ago

in 23434.999 this is already specified with the cmsDriver argument: --pileup AVE_50_BX_25ns_m3p3. Just make sure to use the same argument for classical mixing.

adas1994 commented 3 years ago

@kpedro88 , with the freshly made minbias samples, the rechits enrgy profile are similar as my last comparison plots, i.e. completely different for stdmix and premix. I guess I have no other choice than delve into the code.

slava77 commented 3 years ago

@kpedro88 @adas1994 what is the status of this issue?

from the comments in #32397 it may seem that the problem is resolved, in IBs starting from 12-16-2300

adas1994 commented 3 years ago

I would say it is partly resolved. The current IB at least solves the urgent issue of stark rechits mismatch between StdMix-PreMix. I had been working on resolving some of the lurking bugs. But I will be on vacation till 4th Jan, 2021. So, I will be able to update you after that.

thanks,

thanks,

On Fri, Dec 18, 2020 at 9:37 PM Slava Krutelyov notifications@github.com wrote:

@kpedro88 https://github.com/kpedro88 @adas1994 https://github.com/adas1994 what is the status of this issue?

from the comments in #32397 https://github.com/cms-sw/cmssw/pull/32397 it may seem that the problem is resolved, in IBs starting from 12-16-2300

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-748407140, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4OAJKGZZKORVQ4RWJTSVQGWVANCNFSM4TM2PLZA .

-- Sincerely,

Abhishek Das.

srimanob commented 3 years ago

Hi @adas1994 Could you please update the status on further bug fix (if plan), or further investigation on this premixing issue on your side? Thanks in advance.

adas1994 commented 3 years ago

Sorry for the late reply Phat. The status is that I am stuck in the testing phase of my freshly written code. The issue is that - for high PU, the code is facing segmentation fault in the DIGI step. But when I try running it using Valgrind to figure out potential memory leak issue, it does run smoothly without any segmentation violation. So, I have not been able to pinpoint the issue so far.

On Sun, Jan 17, 2021 at 12:17 AM Phat Srimanobhas notifications@github.com wrote:

Hi @adas1994 https://github.com/adas1994 Could you please update the status on further bug fix (if plan), or further investigation on this premixing issue on your side? Thanks in advance.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-761734149, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4KGN5BDJIGNNJCGSD3S2JXFJANCNFSM4TM2PLZA .

-- Sincerely,

Abhishek Das.

srimanob commented 3 years ago

Thanks, @adas1994 for updating us the status. Is the issue come with a specific event, or since the first event? If you control the set of random numbers, it may help to pinpoint the issue.

adas1994 commented 3 years ago

When I run without Valgrind - It often crashes via segmentation fault on the 2nd event; sometimes in the 3rd or 4th or 5th event #PU=200. With Valgrind - I ran upto 20 events with #PU=400 without a crash.

My best guess thus far is that - I modified the class PHGCSimAccumulator in https://github.com/adas1994/cmssw/blob/segfault_branch/DataFormats/HGCDigi/interface/PHGCSimAccumulator.h which is very slowly leaking memory when being saved in a root file. But so far I can't detect it.

I did not understand how to pinpoint the issue by controlling the set of random numbers.

thanks for the help,

On Mon, Jan 18, 2021 at 4:52 AM Phat Srimanobhas notifications@github.com wrote:

Thanks, @adas1994 https://github.com/adas1994 for updating us the status. Is the issue come with a specific event, or since the first event? If you control the set of random numbers, it may help to pinpoint the issue.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-762128908, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4M5RTQSHVM342ELEX3S2QAE7ANCNFSM4TM2PLZA .

-- Sincerely,

Abhishek Das.

kpedro88 commented 3 years ago

@adas1994 valgrind is often not the most helpful tool to debug segmentation violations. Instead, it's better to recompile with debug symbols and run using gdb to get more insight into the stack trace.

To recompile with debug symbols:

scram b clean
scram b USER_CXXFLAGS="-g"

To run with GDB:

gdb cmsRun
run config.py
(wait for segfault)
backtrace
(navigate to frames, print variable values, etc.)
adas1994 commented 3 years ago

@kpedro88 Does this "backtrace" command create some outputfile that I would have to parse in the next step for navigating to frames etc.?

On Mon, Jan 18, 2021 at 12:24 PM Kevin Pedro notifications@github.com wrote:

@adas1994 https://github.com/adas1994 valgrind is often not the most helpful tool to debug segmentation violations. Instead, it's better to recompile with debug symbols and run using gdb to get more insight into the stack trace.

To recompile with debug symbols:

scram b clean scram b USER_CXXFLAGS="-g"

To run with GDB:

gdb cmsRun run config.py (wait for segfault) backtrace (navigate to frames, print variable values, etc.)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-762382042, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4KHLTVKM62ENG4TXHTS2RVDTANCNFSM4TM2PLZA .

-- Sincerely,

Abhishek Das.

kpedro88 commented 3 years ago

backtrace will print text similar to what you see from a regular stack trace from a segfault; the number of each line in the backtrace corresponds to its frame. There are various gdb tutorials online.

srimanob commented 3 years ago

@adas1994 Could you please update the status? Thanks.

adas1994 commented 3 years ago

To my best effort, I could not solve the segmentation fault issue. But I had to stop working on it because my University's PhD candidacy exam is coming up. I know I'm asking too much, but would it be possible for someone to take it up, if I upload my code in my github account?

On Thu, Feb 11, 2021 at 5:56 AM Phat Srimanobhas notifications@github.com wrote:

@adas1994 https://github.com/adas1994 Could you please update the status? Thanks.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-777362100, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4IU2WBSEH2WEP5JQC3S6OZVLANCNFSM4TM2PLZA .

-- Sincerely,

Abhishek Das.

kpedro88 commented 3 years ago

@adas1994 yes, please share your branch

adas1994 commented 3 years ago

I will clean up my print statements and upload the code sometime before today's evening and notify you.

thanks,

On Thu, Feb 11, 2021 at 9:36 AM Kevin Pedro notifications@github.com wrote:

@adas1994 https://github.com/adas1994 yes, please share your branch

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-777505606, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4K6LAERXXBON3AD3NLS6PTPJANCNFSM4TM2PLZA .

-- Sincerely,

Abhishek Das.

adas1994 commented 3 years ago

This is the branch where I have uploaded my changes : https://github.com/adas1994/cmssw/tree/segfault_branch The two main files that I have been modifying are - 1. https://github.com/adas1994/cmssw/blob/segfault_branch/SimCalorimetry/HGCalSimProducers/src/HGCDigitizer.cc 2. https://github.com/adas1994/cmssw/blob/segfault_branch/DataFormats/HGCDigi/interface/PHGCSimAccumulator.h

And my best guess is that I made some problematic change in https://github.com/adas1994/cmssw/blob/segfault_branch/DataFormats/HGCDigi/interface/PHGCSimAccumulator.h that is causing segmentation fault for high PU. But I really can't find what that mistake is.

Please let me know what you think.

thanks,

On Thu, Feb 11, 2021 at 9:38 AM Abhishek Das adas@nd.edu wrote:

I will clean up my print statements and upload the code sometime before today's evening and notify you.

thanks,

On Thu, Feb 11, 2021 at 9:36 AM Kevin Pedro notifications@github.com wrote:

@adas1994 https://github.com/adas1994 yes, please share your branch

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-777505606, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4K6LAERXXBON3AD3NLS6PTPJANCNFSM4TM2PLZA .

-- Sincerely,

Abhishek Das.

-- Sincerely,

Abhishek Das.

kpedro88 commented 3 years ago

@adas1994 can you provide some more details:

  1. which CMSSW version
  2. which workflow/step is crashing
adas1994 commented 3 years ago
  1. CMSSW_11_1_0_pre8
  2. The crash is happening in the 2nd step of any generic Premix workflow such as 23434.99 using PU no. 200, 300 etc. I tried both available minbias dataset and self-produced minbias dataset using workflow npo. 23240.0

On Thu, Feb 11, 2021 at 5:33 PM Kevin Pedro notifications@github.com wrote:

@adas1994 https://github.com/adas1994 can you provide some more details:

  1. which CMSSW version
  2. which workflow/step is crashing

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-777837295, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4IMUO4A6EE7MSCTVNLS6RLMRANCNFSM4TM2PLZA .

-- Sincerely,

Abhishek Das.

kpedro88 commented 3 years ago

@adas1994 I was able to reproduce the segmentation violation. The traceback in gdb shows that a pointer accessed in SimTracker/SiPhase2Digitizer/plugins/Phase2TrackerDigitizer.cc ends up with an inaccessible memory address. However, I think this is a side effect of the actual problem; since your branch didn't modify the tracker code*, it's more likely that your code is writing to the wrong memory address somewhere.

I've preliminarily confirmed this with valgrind, which reports invalid writes coming from: https://github.com/adas1994/cmssw/blob/7c3420d6fc5162ffc22c724ff497257da0fc1a8c/SimCalorimetry/HGCalSimProducers/src/HGCDigitizer.cc#L636-L637

My immediate suspicion is that this can be an invalid write if findPos is the beginning of the map.

* please be advised that the branch you provided is not directly usable, because https://github.com/cms-sw/cmssw/commit/69f2b97ba85c35ec4dc388a6f22622461389b9ae deletes a bunch of tracker files. I had to rebase out this commit in order to run the tests.

adas1994 commented 3 years ago

Can you share the valgrind output file that tells you that it is probably coming from https://github.com/adas1994/cmssw/blob/7c3420d6fc5162ffc22c724ff497257da0fc1a8c/SimCalorimetry/HGCalSimProducers/src/HGCDigitizer.cc#L636-L637 ? That was the main bottleneck for me that I could not meaningfully understand it tries to point me to.

thanks Kevin,

On Fri, Feb 12, 2021 at 2:03 PM Kevin Pedro notifications@github.com wrote:

@adas1994 https://github.com/adas1994 I was able to reproduce the segmentation violation. The traceback in gdb shows that a pointer accessed in SimTracker/SiPhase2Digitizer/plugins/Phase2TrackerDigitizer.cc ends up with an inaccessible memory address. However, I think this is a side effect of the actual problem; since your branch didn't modify the tracker code*, it's more likely that your code is writing to the wrong memory address somewhere.

I've preliminarily confirmed this with valgrind, which reports invalid writes coming from:

https://github.com/adas1994/cmssw/blob/7c3420d6fc5162ffc22c724ff497257da0fc1a8c/SimCalorimetry/HGCalSimProducers/src/HGCDigitizer.cc#L636-L637

My immediate suspicion is that this can be an invalid write if findPos is the beginning of the map.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-778388856, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4IO7KJYRUWYZYA7ZD3S6V3QTANCNFSM4TM2PLZA .

-- Sincerely,

Abhishek Das.

kpedro88 commented 3 years ago

The relevant section of the log is attached. (The valgrind job is still running; I want to see if finds anything else.) log_valgrind_excerpt1.log

adas1994 commented 3 years ago

Got it. Thank you. Will go through the code again and make necessary modifications.

On Fri, Feb 12, 2021 at 2:25 PM Kevin Pedro notifications@github.com wrote:

The relevant section of the log is attached. (The valgrind job is still running; I want to see if finds anything else.) log_valgrind_excerpt1.log https://github.com/cms-sw/cmssw/files/5973925/log_valgrind_excerpt1.log

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-778402994, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4LNUGTJUS5LDQLN3T3S6V6DVANCNFSM4TM2PLZA .

-- Sincerely,

Abhishek Das.

adas1994 commented 3 years ago

@kpedro88 I also remember that you may not get any crash when running Valgrind. I ran valgrind with Avg. PU 500 and it did not crash. That's why I thought may be Valgrind output is not that helpful.

On Fri, Feb 12, 2021 at 2:31 PM Abhishek Das adas@nd.edu wrote:

Got it. Thank you. Will go through the code again and make necessary modifications.

On Fri, Feb 12, 2021 at 2:25 PM Kevin Pedro notifications@github.com wrote:

The relevant section of the log is attached. (The valgrind job is still running; I want to see if finds anything else.) log_valgrind_excerpt1.log https://github.com/cms-sw/cmssw/files/5973925/log_valgrind_excerpt1.log

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-778402994, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4LNUGTJUS5LDQLN3T3S6V6DVANCNFSM4TM2PLZA .

-- Sincerely,

Abhishek Das.

-- Sincerely,

Abhishek Das.

kpedro88 commented 3 years ago

yes, valgrind may rearrange memory such that an overwrite will happen in a different place, and may not cause the program to crash. however, valgrind will still report the invalid write.