Open mrodozov opened 3 years ago
assign simulation
New categories assigned: simulation
@civanch,@mdhildreth you have been requested to review this Pull request/Issue and eventually sign? Thanks
A new Issue was created by @mrodozov Mircho Rodozov.
@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
@mrodozov what are the exact steps to reproduce the issues. Just to run specified workflows inside an UBSAN IB? Here I am primarily interested in the issue with PixelDigi.h
yes, just run the 11603.0 step 2 with the UBSAN IB, set SCRAM_ARCH=slc7_amd64_gcc10 , we use gcc10 for UBSAN setup the UBSAN IB, you may checkout DataFormats/SiPixelDigi if you want and then run the runTheMatrix -i all -l 11603.0 --ibeos -t4
@ferencek , have you any idea what happens?
@civanch, I plan to have a look this week.
@mrodozov, @civanch, I had a look at this but I am afraid this issue exceeds my level of expertise with C++. About two weeks ago I just had a quick look at the code and was quite confused because the problematic variable in https://github.com/cms-sw/cmssw/blob/CMSSW_12_1_0_pre1/DataFormats/SiPixelDigi/interface/PixelDigi.h#L75, PixelChannelIdentifier::thePacking.column_width
, should be properly initialized. This time I followed the following recipe
voms-proxy-init -rfc -voms cms
export SCRAM_ARCH=slc7_amd64_gcc10
cmsrel CMSSW_12_1_UBSAN_X_2021-09-13-1100
cd CMSSW_12_1_UBSAN_X_2021-09-13-1100/src
cmsenv
git cms-addpkg DataFormats/SiPixelDigi
git cms-addpkg DataFormats/SiPixelDetId
git cms-checkdeps -a -A
scram b -j12
runTheMatrix.py -i all -l 11603.0 --ibeos -t4 --maxSteps=2 -j 0
and ran step2 with /store/relval/CMSSW_12_0_0_pre4/RelValSingleElectronPt1000/GEN-SIM/120X_mcRun3_2021_realistic_v2-v1/00000/c3b64309-ec4a-4c5f-b7a6-c97ef7b031fa.root
as input.
I was able to reproduce the problem and for some reason it occurs for the first time in the 4th event. So for the subsequent tests I modified the cfg file to skip three and process only one event. The error was still there. I added a cout
statement to the pixelToChannel(int row, int col)
method in https://github.com/cms-sw/cmssw/blob/CMSSW_12_1_0_pre1/DataFormats/SiPixelDigi/interface/PixelDigi.h#L75 to debug the issue
static int pixelToChannel(int row, int col) {
std::cout << ">> column width: " << PixelChannelIdentifier::thePacking.column_width << std::endl;
return (row << PixelChannelIdentifier::thePacking.column_width) | col;
}
but the values printed out were always 10 as they should be. Here is an interesting portion of the printout
>> column width: 10
>> column width: 10
>> column width: 10
>> column width: 10
>> column width: 10
/home/ferencek/PixelOffline/UBSAN/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/DataFormats/SiPixelDigi/interface/PixelDigi.h:78:19: runtime error: left shift of negative value -1
>> column width: 10
>> column width: 10
>> column width: 10
>> column width: 10
>> column width: 10
I even modified the data members in https://github.com/cms-sw/cmssw/blob/CMSSW_12_1_0_pre1/DataFormats/SiPixelDetId/interface/PixelChannelIdentifier.h#L33-L35 from int
to unsigned int
but the error was still there.
Interestingly enough, when I set the cfg file to skip the first two events, I noticed the following error which otherwise does not appear
/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/a1dd9deda35a37ab828b89609a57944f/opt/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/RecoLocalTracker/SiPixelRecHits/interface/PixelCPEBase.h:45:10: runtime error: load of value 2446543800, which is not a valid value for type 'SubDetector'
Any suggestions on what to check next would be very welcome.
Just a quick follow-up on the issue with PixelCPEBase.h
I mentioned above. The first time I observed it was before I locally compiled all dependent packages (git cms-checkdeps -a -A
). After compiling locally, I checked the issue two more times and each time the reported value was different as you can see here
/home/ferencek/PixelOffline/UBSAN/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/RecoLocalTracker/SiPixelRecHits/interface/PixelCPEBase.h:45:10: runtime error: load of value 905974791, which is not a valid value for type 'SubDetector'
and here
/home/ferencek/PixelOffline/UBSAN/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/RecoLocalTracker/SiPixelRecHits/interface/PixelCPEBase.h:45:10: runtime error: load of value 905975319, which is not a valid value for type 'SubDetector'
So there seems to be some randomness in the loaded value.
/home/ferencek/PixelOffline/UBSAN/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/DataFormats/SiPixelDigi/interface/PixelDigi.h:78:19: runtime error: left shift of negative value -1
@ferencek , I think problem is row
being -1
here https://github.com/cms-sw/cmssw/blob/CMSSW_12_1_0_pre1/DataFormats/SiPixelDigi/interface/PixelDigi.h#L75
@smuzaffar, good point, thanks. I was interpreting the error message incorrectly. OK, this will be helpful for further debugging.
OK, so here is the culprit https://github.com/cms-sw/cmssw/blob/CMSSW_12_1_0_pre1/SimTracker/SiPixelDigitizer/plugins/SiPixelChargeReweightingAlgorithm.cc#L234 Now I need to figure out why the row coordinate gets evaluated to -1.
Just a quick follow-up on the issue with
PixelCPEBase.h
I mentioned above. The first time I observed it was before I locally compiled all dependent packages (git cms-checkdeps -a -A
). After compiling locally, I checked the issue two more times and each time the reported value was different as you can see here/home/ferencek/PixelOffline/UBSAN/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/RecoLocalTracker/SiPixelRecHits/interface/PixelCPEBase.h:45:10: runtime error: load of value 905974791, which is not a valid value for type 'SubDetector'
and here
/home/ferencek/PixelOffline/UBSAN/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/RecoLocalTracker/SiPixelRecHits/interface/PixelCPEBase.h:45:10: runtime error: load of value 905975319, which is not a valid value for type 'SubDetector'
So there seems to be some randomness in the loaded value.
The problem with the PixelCPEBase has already been reported in https://github.com/cms-sw/cmssw/issues/35036
+1
the issue was fixed in #35337
The UBSAN IB reports undefined behavior in 5 files, with example relval and step they appear in:
check the relval logs in here for the examples: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/ubsan_logs/relvals/