cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.08k stars 4.29k forks source link

[UBSAN] Undefined behavior in DataFormats/* reco packages #35033

Open mrodozov opened 3 years ago

mrodozov commented 3 years ago

The UBSAN IB reports undefined behavior in 5 files, with example relval and step they appear in:

1302.17 step3
DataFormats/BTauReco/interface/TemplatedSecondaryVertexTagInfo.h:38:21: runtime error: load of value 4, which is not a valid value for type 'Status'

159.3 step3
DataFormats/CaloTowers/interface/CaloTower.h:26:7: runtime error: load of value 967445059, which is not a valid value for type 'HcalSubdetector'

136.885 step3
DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h:17:7: runtime error: load of value 72, which is not a valid value for type 'bool'

38634.0 step3
DataFormats/TrackReco/src/HitPattern.cc:102:42: runtime error: left shift of negative value -1

138.2 step2
DataFormats/TrajectorySeed/interface/TrajectorySeed.h:44:3: runtime error: load of value 10967, which is not a valid value for type 'PropagationDirection'

check the relval logs in here for the examples: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/ubsan_logs/relvals/

mrodozov commented 3 years ago

assign reconstruction

cmsbuild commented 3 years ago

New categories assigned: reconstruction

@slava77,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 3 years ago

A new Issue was created by @mrodozov Mircho Rodozov.

@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

slava77 commented 3 years ago

checklist

The UBSAN IB reports undefined behavior in 5 files, with example relval and step they appear in:

1302.17 step3 DataFormats/BTauReco/interface/TemplatedSecondaryVertexTagInfo.h:38:21: runtime error: load of value 4, which is not a valid value for type 'Status'

     enum Status { trackSelected = 0, trackUsedForVertexFit, trackAssociatedToVertex };

      inline bool associatedToVertex(unsigned int index) const {
        return (int)svStatus == (int)trackAssociatedToVertex + (int)index;//<== HERE
      }

      Status svStatus;

the warning points to a use case; I guess that the store may have happened here https://github.com/cms-sw/cmssw/blob/f0b288d78a08889965c93fc57409da6ea09359c0/RecoBTag/SecondaryVertex/plugins/TemplatedSecondaryVertexProducer.cc#L950-L951 and this conflicts with the 3-value enum type @emilbols

159.3 step3 DataFormats/CaloTowers/interface/CaloTower.h:26:7: runtime error: load of value 967445059, which is not a valid value for type 'HcalSubdetector'

the lineno is not particularly helpful class CaloTower : public reco::LeafCandidate { I can guess that this is a caloTower with uninitialized HcalSubdetector subdet_ (created without a subsequent setHcalSubdet) @cms-sw/hcal-dpg-l2

136.885 step3 DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h:17:7: runtime error: load of value 72, which is not a valid value for type 'bool'

this points to class CTPPSPixelRecHit {, which is not very helpful.

@cms-sw/ctpps-dpg-l2

38634.0 step3 DataFormats/TrackReco/src/HitPattern.cc:102:42: runtime error: left shift of negative value -1

this is apparently for

case MuonSubdetId::GEM: {
GEMDetId gemid(id.rawId());
layer = ((gemid.station() - 1) << 2); // <== HERE
layer |= abs(gemid.layer() - 1);

the workflow is D86 D86 = T24+C17+M10+I14+O8+F6 with an era Phase2C11I13M9 @jshlee @watson-ij please remind if the gemid.station() here going to be above 0? If it stays at zero, we need a fix for this new ME0 in HitPattern.cc

138.2 step2 DataFormats/TrajectorySeed/interface/TrajectorySeed.h:44:3: runtime error: load of value 10967, which is not a valid value for type 'PropagationDirection'

this points to TrajectorySeed(TrajectorySeed const& o) = default; I guess the original is a default-constructed TrajectorySeed, which has TrajectorySeed() {} In case we decide to use uninitialized , what is the way to make UBSAN to ignore this case? @VinInn I recall that you were in favor of not initializing data in some cases; is it one of these cases? @mtosi @vmariani @mmusich

check the relval logs in here for the examples: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/ubsan_logs/relvals/

@mrodozov

jshlee commented 3 years ago

@slava77 for ME0 station we would like to keep it as 0.

mrodozov commented 3 years ago

@slava77 until saturday/sunday (don't remember in which day we publish the next week) it's in this week: /cvmfs/cms-ib.cern.ch/nweek-02694/slc7_amd64_gcc10/cms/cmssw which will be substituted at the end of the current week by nweek-02696 We can install the same IB elsewhere more permanently, but the root files we don't keep, they are gone once the IB is built Would installing the same IB on more permanent place be sufficient ?

@VinInn I recall that you were in favor of not initializing data in some cases; is it one of these cases

Here is one case on insisting on lazy init to keep a struct as а POD https://github.com/cms-sw/cmssw/pull/33742

mseidel42 commented 3 years ago
159.3 step3

DataFormats/CaloTowers/interface/CaloTower.h:26:7: runtime error: load of value 967445059, which is not a valid value for type 'HcalSubdetector'

the lineno is not particularly helpful class CaloTower : public reco::LeafCandidate { I can guess that this is a caloTower with uninitialized HcalSubdetector subdet_ (created without a subsequent setHcalSubdet) @cms-sw/hcal-dpg-l2

@abdoulline Do you have an idea where this happens? Should we initialize this in the CaloTower constructor?

fabferro commented 3 years ago

All this looks weird to me. As to CTPPSPixelRecHits, I've never seen this error before. Any reason why these errors are happening only in this IB? BTW, I tried to reproduce the error running the matrix for 136.885 but I just get OSError: [Errno 28] No space left on device: '/cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_UBSAN_X_2021-08-20-2300/python/PhysicsTools/PatAlgos/slimming/isolatedTracks_cfi.py'

mrodozov commented 3 years ago

This IB turns the Undefined behavior sanitizer (-fsanitize=undefined flag), which searches for such errors: https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html

OSError: [Errno 28] No space left on device:

could you share what command you are using to run the relval and on what machine ?

fabferro commented 3 years ago

This IB turns the Undefined behavior sanitizer (-fsanitize=undefined flag), which searches for such errors: https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html

OSError: [Errno 28] No space left on device:

could you share what command you are using to run the relval and on what machine ?

lxplus. Just (naively) installing cmsrel CMSSW_12_1_UBSAN_X_2021-08-20-2300 and then running runTheMatrix.py -l 136.885

mrodozov commented 3 years ago

would you try from lxplus, log onto cmsdev* machine and retry

ssh cmsdev25
mkdir -p /build/$USER ; cd /build/$USER
export SCRAM_ARCH=slc7_amd64_gcc10;
scram p CMSSW_12_1_UBSAN_X_2021-08-20-2300
cd CMSSW_12_1_UBSAN_X_2021-08-20-2300 ; cmsenv ;
runTheMatrix.py -l 136.885 -i all --ibeos
fabferro commented 3 years ago

It took ages but I could run 136.885, confirming the problem for CTPPSPixelRecHits. With @jan-kaspar we worked out a possible solution: removing line https://github.com/cms-sw/cmssw/blob/master/DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h#L19 and adding default values here https://github.com/cms-sw/cmssw/blob/master/DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h#L21-L22

@slava77 Shall I submit a PR?

mrodozov commented 3 years ago

Did you ran it multiple times until it presented itself ? I have troubles reproducing some of the reported failures

slava77 commented 3 years ago

@slava77 Shall I submit a PR?

@fabferro yes, please. I think that for PPS hit multiplicities there is no need to worry about the cost of initializing a hit data if it will not be used.

slava77 commented 3 years ago

Did you ran it multiple times until it presented itself ? I have troubles reproducing some of the reported failures

That's expected for most cases in this and other reco-related UBSAN issues: the symptoms are always for uninitialized values (which are also apparently not used at least in a few cases, beyond a load). The uninit values are often random (and not repeatable).

fabferro commented 3 years ago

Did you ran it multiple times until it presented itself ? I have troubles reproducing some of the reported failures

No. I ran it once before changing the code and the error was there, I ran once after the changes and the error disappeared. Actually, for some reason, it took hours to run 136.885, so it would have been impossible to repeate it several times...

abdoulline commented 3 years ago
159.3 step3

DataFormats/CaloTowers/interface/CaloTower.h:26:7: runtime error: load of value 967445059, which is not a valid value for type 'HcalSubdetector'

the lineno is not particularly helpful class CaloTower : public reco::LeafCandidate { I can guess that this is a caloTower with uninitialized HcalSubdetector subdet_ (created without a subsequent setHcalSubdet) @cms-sw/hcal-dpg-l2

@abdoulline Do you have an idea where this happens? Should we initialize this in the CaloTower constructor?

@mseidel42 I don't immediately see a reason of the issue, which might have been present there since > 10 years. It requires some investigation (via calotowermaker), which is beyond of what I can do at the moment from vacation. Hopefully I'll have a possibility to do it some time next week, still being on vacation.

NB: there seems to be yet another issue in CaloTower (along with the aforementioned 'HcalSubdetector' one): https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/ubsan_logs/relvals/159.3_ZMM_14_HI_2021+ZMM_14_HI_2021+DIGIHI2021PPRECO+RECOHI2021PPRECO+HARVESTHI2021PPRECO/step3_ZMM_14_HI_2021+ZMM_14_HI_2021+DIGIHI2021PPRECO+RECOHI2021PPRECO+HARVESTHI2021PPRECO.log

-> CaloTower.h:26:7: runtime error: load of value 32, which is not a valid value for type 'bool'

mrodozov commented 3 years ago

present there since > 10 years

yes this issues were existing for long time that's why cleaning them out will narrow consequent bugs that pop up for no obvious reasons. although for all bugs we know we have some idea non of which is related to this reports (afaik) in short now that we know the bugs are there lets do something

slava77 commented 3 years ago

@slava77 for ME0 station we would like to keep it as 0.

@jshlee please check the code implementation in DataFormats/TrackReco/src/HitPattern.cc so that station -1 is not used in this case.

slava77 commented 3 years ago

@emilbols @vmariani @mmusich

this is a kind ping on the pending issues as detailed in https://github.com/cms-sw/cmssw/issues/35033#issuecomment-906831823

jshlee commented 3 years ago

@slava77 for ME0 station we would like to keep it as 0.

@jshlee please check the code implementation in DataFormats/TrackReco/src/HitPattern.cc so that station -1 is not used in this case.

hi @slava77, where is the code implemented?

slava77 commented 3 years ago

@slava77 for ME0 station we would like to keep it as 0.

@jshlee please check the code implementation in DataFormats/TrackReco/src/HitPattern.cc so that station -1 is not used in this case.

hi @slava77, where is the code implemented?

https://github.com/cms-sw/cmssw/blob/6a5dc2ebb1ddf1611f28cfb1a22a5af8b41a3b9a/DataFormats/TrackReco/src/HitPattern.cc#L100-L104

emilbols commented 3 years ago

@slava77 @mrodozov So if i understand correctly the error in 1302.17 appears because svStatus is set to a value beyond the enumeration range? Obviously this code was written a long time ago, but as far as i can see, this is the intended behavior. An svStatus=2 indicates a match to the vertex with index 0, svStatus=3 is a match with vertex with index 1, etc. . If we dont want to use enum objects like this, i guess the solution could be to change this object to an integer.

I tried to run this on cmsdev25 following the recipe above with workflow 1302.17 and switching to CMSSW_12_1_UBSAN_X_2021-09-03-2300, but i dont get the error. If i understand the error correctly, it is most likely because it only appears when there are more than 1 SV in the event which the track is being associated to, so there is probably some randomness in whether you get it

slava77 commented 3 years ago

@slava77 @mrodozov So if i understand correctly the error in 1302.17 appears because svStatus is set to a value beyond the enumeration range? Obviously this code was written a long time ago, but as far as i can see, this is the intended behavior. An svStatus=2 indicates a match to the vertex with index 0, svStatus=3 is a match with vertex with index 1, etc. . If we dont want to use enum objects like this, i guess the solution could be to change this object to an integer.

@emilbols thank you for checking the code and clarifying its intents in using the variable. I think that the design span of values should match the type. So, if a change to an int is needed, let's update it. It was not obvious to me in this case, but perhaps this case also implies that a new variable is needed instead instead of an "easy" overload of what was originally designed to be an enum.

As for reproducibility, you may want to enable more threads (the reference job used 4 threads) and add more events.

slava77 commented 2 years ago

+reconstruction

the summary was updated at https://github.com/cms-sw/cmssw/issues/35033#issuecomment-906831823