cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.08k stars 4.29k forks source link

[UBSAN] Undefined behavior in DataFormats/* simulation packages #35034

Open mrodozov opened 3 years ago

mrodozov commented 3 years ago

The UBSAN IB reports undefined behavior in 5 files, with example relval and step they appear in:

11603.0 step2 
DataFormats/SiPixelDigi/interface/PixelDigi.h:75:61: runtime error: left shift of negative value -1

136.882 step3
DataFormats/RPCDigi/interface/RecordBX.h:21:22: runtime error: left shift of negative value -1

1302.17 step1
DataFormats/HcalDetId/src/HcalCastorDetId.cc:11:49: runtime error: left shift of negative value -1

136.821 step3
DataFormats/Math/interface/approx_exp.h:162:32: runtime error: signed integer overflow: 2147483647 + 127 cannot be represented in type 'int'

11603.0 step2
DataFormats/Math/interface/libminifloat.h:74:78: runtime error: shift exponent -1 is negative

check the relval logs in here for the examples: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/ubsan_logs/relvals/

mrodozov commented 3 years ago

assign simulation

cmsbuild commented 3 years ago

New categories assigned: simulation

@civanch,@mdhildreth you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 3 years ago

A new Issue was created by @mrodozov Mircho Rodozov.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

ferencek commented 3 years ago

@mrodozov what are the exact steps to reproduce the issues. Just to run specified workflows inside an UBSAN IB? Here I am primarily interested in the issue with PixelDigi.h

mrodozov commented 3 years ago

yes, just run the 11603.0 step 2 with the UBSAN IB, set SCRAM_ARCH=slc7_amd64_gcc10 , we use gcc10 for UBSAN setup the UBSAN IB, you may checkout DataFormats/SiPixelDigi if you want and then run the runTheMatrix -i all -l 11603.0 --ibeos -t4

civanch commented 3 years ago

@ferencek , have you any idea what happens?

ferencek commented 3 years ago

@civanch, I plan to have a look this week.

ferencek commented 3 years ago

@mrodozov, @civanch, I had a look at this but I am afraid this issue exceeds my level of expertise with C++. About two weeks ago I just had a quick look at the code and was quite confused because the problematic variable in https://github.com/cms-sw/cmssw/blob/CMSSW_12_1_0_pre1/DataFormats/SiPixelDigi/interface/PixelDigi.h#L75, PixelChannelIdentifier::thePacking.column_width, should be properly initialized. This time I followed the following recipe

voms-proxy-init -rfc -voms cms

export SCRAM_ARCH=slc7_amd64_gcc10
cmsrel CMSSW_12_1_UBSAN_X_2021-09-13-1100
cd CMSSW_12_1_UBSAN_X_2021-09-13-1100/src
cmsenv
git cms-addpkg DataFormats/SiPixelDigi
git cms-addpkg DataFormats/SiPixelDetId
git cms-checkdeps -a -A
scram b -j12

runTheMatrix.py -i all -l 11603.0 --ibeos -t4 --maxSteps=2 -j 0

and ran step2 with /store/relval/CMSSW_12_0_0_pre4/RelValSingleElectronPt1000/GEN-SIM/120X_mcRun3_2021_realistic_v2-v1/00000/c3b64309-ec4a-4c5f-b7a6-c97ef7b031fa.root as input.

I was able to reproduce the problem and for some reason it occurs for the first time in the 4th event. So for the subsequent tests I modified the cfg file to skip three and process only one event. The error was still there. I added a cout statement to the pixelToChannel(int row, int col) method in https://github.com/cms-sw/cmssw/blob/CMSSW_12_1_0_pre1/DataFormats/SiPixelDigi/interface/PixelDigi.h#L75 to debug the issue

  static int pixelToChannel(int row, int col) { 
      std::cout << ">> column width: " << PixelChannelIdentifier::thePacking.column_width << std::endl;
      return (row << PixelChannelIdentifier::thePacking.column_width) | col;
  }

but the values printed out were always 10 as they should be. Here is an interesting portion of the printout

>> column width: 10
>> column width: 10
>> column width: 10
>> column width: 10
>> column width: 10
/home/ferencek/PixelOffline/UBSAN/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/DataFormats/SiPixelDigi/interface/PixelDigi.h:78:19: runtime error: left shift of negative value -1
>> column width: 10
>> column width: 10
>> column width: 10
>> column width: 10
>> column width: 10

I even modified the data members in https://github.com/cms-sw/cmssw/blob/CMSSW_12_1_0_pre1/DataFormats/SiPixelDetId/interface/PixelChannelIdentifier.h#L33-L35 from int to unsigned int but the error was still there.

Interestingly enough, when I set the cfg file to skip the first two events, I noticed the following error which otherwise does not appear

/data/cmsbld/jenkins/workspace/build-any-ib/w/tmp/BUILDROOT/a1dd9deda35a37ab828b89609a57944f/opt/cmssw/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/RecoLocalTracker/SiPixelRecHits/interface/PixelCPEBase.h:45:10: runtime error: load of value 2446543800, which is not a valid value for type 'SubDetector'

Any suggestions on what to check next would be very welcome.

ferencek commented 3 years ago

Just a quick follow-up on the issue with PixelCPEBase.h I mentioned above. The first time I observed it was before I locally compiled all dependent packages (git cms-checkdeps -a -A). After compiling locally, I checked the issue two more times and each time the reported value was different as you can see here

/home/ferencek/PixelOffline/UBSAN/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/RecoLocalTracker/SiPixelRecHits/interface/PixelCPEBase.h:45:10: runtime error: load of value 905974791, which is not a valid value for type 'SubDetector'

and here

/home/ferencek/PixelOffline/UBSAN/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/RecoLocalTracker/SiPixelRecHits/interface/PixelCPEBase.h:45:10: runtime error: load of value 905975319, which is not a valid value for type 'SubDetector'

So there seems to be some randomness in the loaded value.

smuzaffar commented 3 years ago

/home/ferencek/PixelOffline/UBSAN/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/DataFormats/SiPixelDigi/interface/PixelDigi.h:78:19: runtime error: left shift of negative value -1

@ferencek , I think problem is row being -1 here https://github.com/cms-sw/cmssw/blob/CMSSW_12_1_0_pre1/DataFormats/SiPixelDigi/interface/PixelDigi.h#L75

ferencek commented 3 years ago

@smuzaffar, good point, thanks. I was interpreting the error message incorrectly. OK, this will be helpful for further debugging.

ferencek commented 3 years ago

OK, so here is the culprit https://github.com/cms-sw/cmssw/blob/CMSSW_12_1_0_pre1/SimTracker/SiPixelDigitizer/plugins/SiPixelChargeReweightingAlgorithm.cc#L234 Now I need to figure out why the row coordinate gets evaluated to -1.

ferencek commented 3 years ago

Just a quick follow-up on the issue with PixelCPEBase.h I mentioned above. The first time I observed it was before I locally compiled all dependent packages (git cms-checkdeps -a -A). After compiling locally, I checked the issue two more times and each time the reported value was different as you can see here

/home/ferencek/PixelOffline/UBSAN/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/RecoLocalTracker/SiPixelRecHits/interface/PixelCPEBase.h:45:10: runtime error: load of value 905974791, which is not a valid value for type 'SubDetector'

and here

/home/ferencek/PixelOffline/UBSAN/CMSSW_12_1_UBSAN_X_2021-09-13-1100/src/RecoLocalTracker/SiPixelRecHits/interface/PixelCPEBase.h:45:10: runtime error: load of value 905975319, which is not a valid value for type 'SubDetector'

So there seems to be some randomness in the loaded value.

The problem with the PixelCPEBase has already been reported in https://github.com/cms-sw/cmssw/issues/35036

civanch commented 2 years ago

+1

the issue was fixed in #35337