Open slava77 opened 4 years ago
assign upgrade
New categories assigned: upgrade
@kpedro88 you have been requested to review this Pull request/Issue and eventually sign? Thanks
A new Issue was created by @slava77 Slava Krutelyov.
@Dr15Jones, @dpiparo, @silviodonato, @smuzaffar, @makortel, @qliphy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
@makortel @adas1994 please take a look if you have time
@makortel @adas1994 please take a look if you have time
Don't expect anything from me (beyond looking specific code someone else points to).
What's in the x-axis ? Is it Energy or Eta ? @kpedro88
@adas1994 these plots show RecHit energy
@adas1994 these plots show RecHit energy
I don't understand the negative energy. But, give me this weekend. Let me see if I can find something out.
Sorry, it's log10(energy), to display the low values better.
Last week I was almost off work and facing a huge backlog of work to push through. I know I promised to get something about it by this Monday but I failed. I am hoping to get something done about it within a couple of days though. I am really sorry for the delay @kpedro88 .
@kpedro88 I have looked into the plots in detail using K0_L thrown directly at HGCAL at 5 GeV. And it is a pretty serious issue actually. I don't have a clue at present why the energy distribution would be different from the reference, as I changed nothing on the accumulated charge but changed only how timing is computed. I would have to look into the code for possible bugs I might have introduced.
Here are some plots.
@adas1994 are you using minbias that you generated yourself, or the official sample used by runTheMatrix?
@kpedro88 I used official Sample used by runTheMatrix. One quick question - this is a very legit open issue and needs to be addressed ASAP. But I will remain extremely busy for the next few months. I was just wondering if there is any deadline before which this needs to be addressed. I would have to reschedule stuff in that case.
@adas1994 it might be worth trying a new sample generated in the same release before digging through the code. it's possible the old sample is no longer 100% compatible.
@kpedro88 I will try new samples and update you on it as soon as I can.
@kpedro88 - sorry to bother again, but I am a little confused as to how is it possible to compare plots from workflows 23234.0 and 23434.999. The workflow 23234.0 has no pileups, but 23434.999 has.
@adas1994 that's true, I read too fast: 23434.0 (PU workflow, needs to be modified to set PU to 50) should be compared to 23434.999. @slava77 FYI
@adas1994 that's true, I read too fast: 23434.0 (PU workflow, needs to be modified to set PU to 50) should be compared to 23434.999. @slava77 FYI
Cool. I will ping you once I stumble upon something concrete.
@kpedro88 - sorry to bother again, but I am a little confused as to how is it possible to compare plots from workflows 23234.0 and 23434.999. The workflow 23234.0 has no pileups, but 23434.999 has.
more pileup -> more hits are expected, but we get less. That's a good enough comparison to show that there are problems.
I used similar new samples for StdMix-PreMix comparison this time - Blue is for Reference StandardMix, Pink is for PreMix, Red is for difference.
The distribution and multiplicity of the rechits energy histograms are significantly reduced, just for changing CMSSW release area. Any possible explanation - @kpedro88 ?
@adas1994 can you clarify exactly which CMSSW release you used and what samples you produced?
For yesterday's plots- I had used CMSSW_11_1_0_pre8. PU samples - /store/relval/CMSSW_10_6_0_patch2/RelValMinBias_14TeV/GEN-SIM/106X_upgrade2023_reali\ stic_v3_2023D41noPU-v1/10000/*.root
For today's plots -
CMSSW_11_2_0_pre8 and PU samples -
/store/relval/CMSSW_11_0_0_pre13/RelValMinBias_14TeV/GEN-SIM/110X_mcRun4_realistic_v\
2_2026D49noPU-v1/20000/*.root
Underlying Generator Event same for Both cases.
process.generator = cms.EDProducer("FlatRandomPtGunProducer", AddAntiParticle = cms.bool(True), PGunParameters = cms.PSet( MaxEta = cms.double(2.75), MaxPhi = cms.double(3.14159265359), MaxPt = cms.double(5.0), MinEta = cms.double(1.75), MinPhi = cms.double(-3.14159265359), MinPt = cms.double(5.0), PartID = cms.vint32(130) ), Verbosity = cms.untracked.int32(0), firstRun = cms.untracked.uint32(1), psethack = cms.string('single K0_L pt 5') )
Is this info sufficient @kpedro88 ?
@adas1994 I'm not sure we learn much moving from 10_6 to 11_0 MinBias samples. They could both be considered out of date. My suggestion was to generate your own MinBias sample using 11_2_0_pre8, to ensure consistency.
@adas1994 I'm not sure we learn much moving from 10_6 to 11_0 MinBias samples. They could both be considered out of date. My suggestion was to generate your own MinBias sample using 11_2_0_pre8, to ensure consistency.
I don't know how to generate my own minbias samples that using runThematrix.py workflows. Can you provide an example code for generating my own minbias samples?
@adas1994 there is a minbias workflow for D49: runTheMatrix.py -w upgrade -l 23240.0 --dryRun
That will print the cmsDriver commands for the workflow. You just need to run the GEN-SIM step (probably with more events than the default 10). Then use the resulting root file as the --pileup_input for the classical mixing and premixing workflows.
@adas1994 there is a minbias workflow for D49:
runTheMatrix.py -w upgrade -l 23240.0 --dryRun
That will print the cmsDriver commands for the workflow. You just need to run the GEN-SIM step (probably with more events than the default 10). Then use the resulting root file as the --pileup_input for the classical mixing and premixing workflows.
And How do I specify the number of PUs, like 50 or 200 and whether I want a Poisson distribution of mixing or whether do I want fixed number of pileups in this Gen-SIm step?
in 23434.999
this is already specified with the cmsDriver argument: --pileup AVE_50_BX_25ns_m3p3
. Just make sure to use the same argument for classical mixing.
@kpedro88 , with the freshly made minbias samples, the rechits enrgy profile are similar as my last comparison plots, i.e. completely different for stdmix and premix. I guess I have no other choice than delve into the code.
@kpedro88 @adas1994 what is the status of this issue?
from the comments in #32397 it may seem that the problem is resolved, in IBs starting from 12-16-2300
I would say it is partly resolved. The current IB at least solves the urgent issue of stark rechits mismatch between StdMix-PreMix. I had been working on resolving some of the lurking bugs. But I will be on vacation till 4th Jan, 2021. So, I will be able to update you after that.
thanks,
thanks,
On Fri, Dec 18, 2020 at 9:37 PM Slava Krutelyov notifications@github.com wrote:
@kpedro88 https://github.com/kpedro88 @adas1994 https://github.com/adas1994 what is the status of this issue?
from the comments in #32397 https://github.com/cms-sw/cmssw/pull/32397 it may seem that the problem is resolved, in IBs starting from 12-16-2300
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-748407140, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4OAJKGZZKORVQ4RWJTSVQGWVANCNFSM4TM2PLZA .
-- Sincerely,
Abhishek Das.
Hi @adas1994 Could you please update the status on further bug fix (if plan), or further investigation on this premixing issue on your side? Thanks in advance.
Sorry for the late reply Phat. The status is that I am stuck in the testing phase of my freshly written code. The issue is that - for high PU, the code is facing segmentation fault in the DIGI step. But when I try running it using Valgrind to figure out potential memory leak issue, it does run smoothly without any segmentation violation. So, I have not been able to pinpoint the issue so far.
On Sun, Jan 17, 2021 at 12:17 AM Phat Srimanobhas notifications@github.com wrote:
Hi @adas1994 https://github.com/adas1994 Could you please update the status on further bug fix (if plan), or further investigation on this premixing issue on your side? Thanks in advance.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-761734149, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4KGN5BDJIGNNJCGSD3S2JXFJANCNFSM4TM2PLZA .
-- Sincerely,
Abhishek Das.
Thanks, @adas1994 for updating us the status. Is the issue come with a specific event, or since the first event? If you control the set of random numbers, it may help to pinpoint the issue.
When I run without Valgrind - It often crashes via segmentation fault on the 2nd event; sometimes in the 3rd or 4th or 5th event #PU=200. With Valgrind - I ran upto 20 events with #PU=400 without a crash.
My best guess thus far is that - I modified the class PHGCSimAccumulator in https://github.com/adas1994/cmssw/blob/segfault_branch/DataFormats/HGCDigi/interface/PHGCSimAccumulator.h which is very slowly leaking memory when being saved in a root file. But so far I can't detect it.
I did not understand how to pinpoint the issue by controlling the set of random numbers.
thanks for the help,
On Mon, Jan 18, 2021 at 4:52 AM Phat Srimanobhas notifications@github.com wrote:
Thanks, @adas1994 https://github.com/adas1994 for updating us the status. Is the issue come with a specific event, or since the first event? If you control the set of random numbers, it may help to pinpoint the issue.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-762128908, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4M5RTQSHVM342ELEX3S2QAE7ANCNFSM4TM2PLZA .
-- Sincerely,
Abhishek Das.
@adas1994 valgrind is often not the most helpful tool to debug segmentation violations. Instead, it's better to recompile with debug symbols and run using gdb to get more insight into the stack trace.
To recompile with debug symbols:
scram b clean
scram b USER_CXXFLAGS="-g"
To run with GDB:
gdb cmsRun
run config.py
(wait for segfault)
backtrace
(navigate to frames, print variable values, etc.)
@kpedro88 Does this "backtrace" command create some outputfile that I would have to parse in the next step for navigating to frames etc.?
On Mon, Jan 18, 2021 at 12:24 PM Kevin Pedro notifications@github.com wrote:
@adas1994 https://github.com/adas1994 valgrind is often not the most helpful tool to debug segmentation violations. Instead, it's better to recompile with debug symbols and run using gdb to get more insight into the stack trace.
To recompile with debug symbols:
scram b clean scram b USER_CXXFLAGS="-g"
To run with GDB:
gdb cmsRun run config.py (wait for segfault) backtrace (navigate to frames, print variable values, etc.)
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-762382042, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4KHLTVKM62ENG4TXHTS2RVDTANCNFSM4TM2PLZA .
-- Sincerely,
Abhishek Das.
backtrace will print text similar to what you see from a regular stack trace from a segfault; the number of each line in the backtrace corresponds to its frame. There are various gdb tutorials online.
@adas1994 Could you please update the status? Thanks.
To my best effort, I could not solve the segmentation fault issue. But I had to stop working on it because my University's PhD candidacy exam is coming up. I know I'm asking too much, but would it be possible for someone to take it up, if I upload my code in my github account?
On Thu, Feb 11, 2021 at 5:56 AM Phat Srimanobhas notifications@github.com wrote:
@adas1994 https://github.com/adas1994 Could you please update the status? Thanks.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-777362100, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4IU2WBSEH2WEP5JQC3S6OZVLANCNFSM4TM2PLZA .
-- Sincerely,
Abhishek Das.
@adas1994 yes, please share your branch
I will clean up my print statements and upload the code sometime before today's evening and notify you.
thanks,
On Thu, Feb 11, 2021 at 9:36 AM Kevin Pedro notifications@github.com wrote:
@adas1994 https://github.com/adas1994 yes, please share your branch
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-777505606, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4K6LAERXXBON3AD3NLS6PTPJANCNFSM4TM2PLZA .
-- Sincerely,
Abhishek Das.
This is the branch where I have uploaded my changes : https://github.com/adas1994/cmssw/tree/segfault_branch The two main files that I have been modifying are - 1. https://github.com/adas1994/cmssw/blob/segfault_branch/SimCalorimetry/HGCalSimProducers/src/HGCDigitizer.cc 2. https://github.com/adas1994/cmssw/blob/segfault_branch/DataFormats/HGCDigi/interface/PHGCSimAccumulator.h
And my best guess is that I made some problematic change in https://github.com/adas1994/cmssw/blob/segfault_branch/DataFormats/HGCDigi/interface/PHGCSimAccumulator.h that is causing segmentation fault for high PU. But I really can't find what that mistake is.
Please let me know what you think.
thanks,
On Thu, Feb 11, 2021 at 9:38 AM Abhishek Das adas@nd.edu wrote:
I will clean up my print statements and upload the code sometime before today's evening and notify you.
thanks,
On Thu, Feb 11, 2021 at 9:36 AM Kevin Pedro notifications@github.com wrote:
@adas1994 https://github.com/adas1994 yes, please share your branch
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-777505606, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4K6LAERXXBON3AD3NLS6PTPJANCNFSM4TM2PLZA .
-- Sincerely,
Abhishek Das.
-- Sincerely,
Abhishek Das.
@adas1994 can you provide some more details:
On Thu, Feb 11, 2021 at 5:33 PM Kevin Pedro notifications@github.com wrote:
@adas1994 https://github.com/adas1994 can you provide some more details:
- which CMSSW version
- which workflow/step is crashing
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-777837295, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4IMUO4A6EE7MSCTVNLS6RLMRANCNFSM4TM2PLZA .
-- Sincerely,
Abhishek Das.
@adas1994 I was able to reproduce the segmentation violation. The traceback in gdb shows that a pointer accessed in SimTracker/SiPhase2Digitizer/plugins/Phase2TrackerDigitizer.cc
ends up with an inaccessible memory address. However, I think this is a side effect of the actual problem; since your branch didn't modify the tracker code*, it's more likely that your code is writing to the wrong memory address somewhere.
I've preliminarily confirmed this with valgrind, which reports invalid writes coming from: https://github.com/adas1994/cmssw/blob/7c3420d6fc5162ffc22c724ff497257da0fc1a8c/SimCalorimetry/HGCalSimProducers/src/HGCDigitizer.cc#L636-L637
My immediate suspicion is that this can be an invalid write if findPos
is the beginning of the map.
* please be advised that the branch you provided is not directly usable, because https://github.com/cms-sw/cmssw/commit/69f2b97ba85c35ec4dc388a6f22622461389b9ae deletes a bunch of tracker files. I had to rebase out this commit in order to run the tests.
Can you share the valgrind output file that tells you that it is probably coming from https://github.com/adas1994/cmssw/blob/7c3420d6fc5162ffc22c724ff497257da0fc1a8c/SimCalorimetry/HGCalSimProducers/src/HGCDigitizer.cc#L636-L637 ? That was the main bottleneck for me that I could not meaningfully understand it tries to point me to.
thanks Kevin,
On Fri, Feb 12, 2021 at 2:03 PM Kevin Pedro notifications@github.com wrote:
@adas1994 https://github.com/adas1994 I was able to reproduce the segmentation violation. The traceback in gdb shows that a pointer accessed in SimTracker/SiPhase2Digitizer/plugins/Phase2TrackerDigitizer.cc ends up with an inaccessible memory address. However, I think this is a side effect of the actual problem; since your branch didn't modify the tracker code*, it's more likely that your code is writing to the wrong memory address somewhere.
I've preliminarily confirmed this with valgrind, which reports invalid writes coming from:
My immediate suspicion is that this can be an invalid write if findPos is the beginning of the map.
- please be advised that the branch you provided is not directly usable, because 69f2b97 https://github.com/cms-sw/cmssw/commit/69f2b97ba85c35ec4dc388a6f22622461389b9ae deletes a bunch of tracker files. I had to rebase out this commit in order to run the tests.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-778388856, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4IO7KJYRUWYZYA7ZD3S6V3QTANCNFSM4TM2PLZA .
-- Sincerely,
Abhishek Das.
The relevant section of the log is attached. (The valgrind job is still running; I want to see if finds anything else.) log_valgrind_excerpt1.log
Got it. Thank you. Will go through the code again and make necessary modifications.
On Fri, Feb 12, 2021 at 2:25 PM Kevin Pedro notifications@github.com wrote:
The relevant section of the log is attached. (The valgrind job is still running; I want to see if finds anything else.) log_valgrind_excerpt1.log https://github.com/cms-sw/cmssw/files/5973925/log_valgrind_excerpt1.log
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-778402994, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4LNUGTJUS5LDQLN3T3S6V6DVANCNFSM4TM2PLZA .
-- Sincerely,
Abhishek Das.
@kpedro88 I also remember that you may not get any crash when running Valgrind. I ran valgrind with Avg. PU 500 and it did not crash. That's why I thought may be Valgrind output is not that helpful.
On Fri, Feb 12, 2021 at 2:31 PM Abhishek Das adas@nd.edu wrote:
Got it. Thank you. Will go through the code again and make necessary modifications.
On Fri, Feb 12, 2021 at 2:25 PM Kevin Pedro notifications@github.com wrote:
The relevant section of the log is attached. (The valgrind job is still running; I want to see if finds anything else.) log_valgrind_excerpt1.log https://github.com/cms-sw/cmssw/files/5973925/log_valgrind_excerpt1.log
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/32050#issuecomment-778402994, or unsubscribe https://github.com/notifications/unsubscribe-auth/AG64M4LNUGTJUS5LDQLN3T3S6V6DVANCNFSM4TM2PLZA .
-- Sincerely,
Abhishek Das.
-- Sincerely,
Abhishek Das.
yes, valgrind may rearrange memory such that an overwrite will happen in a different place, and may not cause the program to crash. however, valgrind will still report the invalid write.
followed by recent (e.g. https://github.com/cms-sw/cmssw/pull/31906#issuecomment-722578846 or https://github.com/cms-sw/cmssw/pull/32022#issuecomment-722025986) it looks like some of the premixing is not working in phase 2
here are a few plots comparing in CMSSW_11_2_X_2020-11-04-1100 23234.0 in black with 23434.999 in red (10 events; presumably the same sim), using the matrix outputs from the baseline jenkins job
HGCHEBRecHits are empty in 23434.999