Open mrodozov opened 3 years ago
assign reconstruction
New categories assigned: reconstruction
@slava77,@jpata you have been requested to review this Pull request/Issue and eventually sign? Thanks
A new Issue was created by @mrodozov Mircho Rodozov.
@Dr15Jones, @perrotta, @dpiparo, @makortel, @smuzaffar, @qliphy can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
checklist
The UBSAN IB reports undefined behavior in 5 files, with example relval and step they appear in:
1302.17 step3 DataFormats/BTauReco/interface/TemplatedSecondaryVertexTagInfo.h:38:21: runtime error: load of value 4, which is not a valid value for type 'Status'
enum Status { trackSelected = 0, trackUsedForVertexFit, trackAssociatedToVertex };
inline bool associatedToVertex(unsigned int index) const {
return (int)svStatus == (int)trackAssociatedToVertex + (int)index;//<== HERE
}
Status svStatus;
the warning points to a use case; I guess that the store may have happened here https://github.com/cms-sw/cmssw/blob/f0b288d78a08889965c93fc57409da6ea09359c0/RecoBTag/SecondaryVertex/plugins/TemplatedSecondaryVertexProducer.cc#L950-L951 and this conflicts with the 3-value enum type @emilbols
159.3 step3 DataFormats/CaloTowers/interface/CaloTower.h:26:7: runtime error: load of value 967445059, which is not a valid value for type 'HcalSubdetector'
the lineno is not particularly helpful
class CaloTower : public reco::LeafCandidate {
I can guess that this is a caloTower with uninitializedHcalSubdetector subdet_
(created without a subsequentsetHcalSubdet
) @cms-sw/hcal-dpg-l2136.885 step3 DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h:17:7: runtime error: load of value 72, which is not a valid value for type 'bool'
this points to class CTPPSPixelRecHit {
, which is not very helpful.
@cms-sw/ctpps-dpg-l2
38634.0 step3 DataFormats/TrackReco/src/HitPattern.cc:102:42: runtime error: left shift of negative value -1
this is apparently for
case MuonSubdetId::GEM: { GEMDetId gemid(id.rawId()); layer = ((gemid.station() - 1) << 2); // <== HERE layer |= abs(gemid.layer() - 1);
the workflow is D86
D86 = T24+C17+M10+I14+O8+F6
with an eraPhase2C11I13M9
@jshlee @watson-ij please remind if thegemid.station()
here going to be above 0? If it stays at zero, we need a fix for this new ME0 in HitPattern.cc138.2 step2 DataFormats/TrajectorySeed/interface/TrajectorySeed.h:44:3: runtime error: load of value 10967, which is not a valid value for type 'PropagationDirection'
this points to
TrajectorySeed(TrajectorySeed const& o) = default;
I guess the original is a default-constructedTrajectorySeed
, which hasTrajectorySeed() {}
In case we decide to use uninitialized , what is the way to make UBSAN to ignore this case? @VinInn I recall that you were in favor of not initializing data in some cases; is it one of these cases? @mtosi @vmariani @mmusichcheck the relval logs in here for the examples: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/ubsan_logs/relvals/
@mrodozov
@slava77 for ME0 station we would like to keep it as 0.
@slava77 until saturday/sunday (don't remember in which day we publish the next week)
it's in this week:
/cvmfs/cms-ib.cern.ch/nweek-02694/slc7_amd64_gcc10/cms/cmssw
which will be substituted at the end of the current week by nweek-02696
We can install the same IB elsewhere more permanently, but the root files we don't keep, they are gone once the IB is built
Would installing the same IB on more permanent place be sufficient ?
@VinInn I recall that you were in favor of not initializing data in some cases; is it one of these cases
Here is one case on insisting on lazy init to keep a struct as а POD https://github.com/cms-sw/cmssw/pull/33742
159.3 step3
DataFormats/CaloTowers/interface/CaloTower.h:26:7: runtime error: load of value 967445059, which is not a valid value for type 'HcalSubdetector'
the lineno is not particularly helpful class CaloTower : public reco::LeafCandidate { I can guess that this is a caloTower with uninitialized HcalSubdetector subdet_ (created without a subsequent setHcalSubdet) @cms-sw/hcal-dpg-l2
@abdoulline Do you have an idea where this happens? Should we initialize this in the CaloTower constructor?
All this looks weird to me. As to CTPPSPixelRecHits, I've never seen this error before. Any reason why these errors are happening only in this IB? BTW, I tried to reproduce the error running the matrix for 136.885 but I just get OSError: [Errno 28] No space left on device: '/cvmfs/cms-ib.cern.ch/week0/slc7_amd64_gcc10/cms/cmssw/CMSSW_12_1_UBSAN_X_2021-08-20-2300/python/PhysicsTools/PatAlgos/slimming/isolatedTracks_cfi.py'
This IB turns the Undefined behavior sanitizer (-fsanitize=undefined flag), which searches for such errors: https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html
OSError: [Errno 28] No space left on device:
could you share what command you are using to run the relval and on what machine ?
This IB turns the Undefined behavior sanitizer (-fsanitize=undefined flag), which searches for such errors: https://clang.llvm.org/docs/UndefinedBehaviorSanitizer.html
OSError: [Errno 28] No space left on device:
could you share what command you are using to run the relval and on what machine ?
lxplus. Just (naively) installing cmsrel CMSSW_12_1_UBSAN_X_2021-08-20-2300
and then running runTheMatrix.py -l 136.885
would you try from lxplus, log onto cmsdev* machine and retry
ssh cmsdev25
mkdir -p /build/$USER ; cd /build/$USER
export SCRAM_ARCH=slc7_amd64_gcc10;
scram p CMSSW_12_1_UBSAN_X_2021-08-20-2300
cd CMSSW_12_1_UBSAN_X_2021-08-20-2300 ; cmsenv ;
runTheMatrix.py -l 136.885 -i all --ibeos
It took ages but I could run 136.885, confirming the problem for CTPPSPixelRecHits. With @jan-kaspar we worked out a possible solution: removing line https://github.com/cms-sw/cmssw/blob/master/DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h#L19 and adding default values here https://github.com/cms-sw/cmssw/blob/master/DataFormats/CTPPSReco/interface/CTPPSPixelRecHit.h#L21-L22
@slava77 Shall I submit a PR?
Did you ran it multiple times until it presented itself ? I have troubles reproducing some of the reported failures
@slava77 Shall I submit a PR?
@fabferro yes, please. I think that for PPS hit multiplicities there is no need to worry about the cost of initializing a hit data if it will not be used.
Did you ran it multiple times until it presented itself ? I have troubles reproducing some of the reported failures
That's expected for most cases in this and other reco-related UBSAN issues: the symptoms are always for uninitialized values (which are also apparently not used at least in a few cases, beyond a load). The uninit values are often random (and not repeatable).
Did you ran it multiple times until it presented itself ? I have troubles reproducing some of the reported failures
No. I ran it once before changing the code and the error was there, I ran once after the changes and the error disappeared. Actually, for some reason, it took hours to run 136.885, so it would have been impossible to repeate it several times...
159.3 step3
DataFormats/CaloTowers/interface/CaloTower.h:26:7: runtime error: load of value 967445059, which is not a valid value for type 'HcalSubdetector'
the lineno is not particularly helpful class CaloTower : public reco::LeafCandidate { I can guess that this is a caloTower with uninitialized HcalSubdetector subdet_ (created without a subsequent setHcalSubdet) @cms-sw/hcal-dpg-l2
@abdoulline Do you have an idea where this happens? Should we initialize this in the CaloTower constructor?
@mseidel42 I don't immediately see a reason of the issue, which might have been present there since > 10 years. It requires some investigation (via calotowermaker), which is beyond of what I can do at the moment from vacation. Hopefully I'll have a possibility to do it some time next week, still being on vacation.
NB: there seems to be yet another issue in CaloTower (along with the aforementioned 'HcalSubdetector' one): https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/ubsan_logs/relvals/159.3_ZMM_14_HI_2021+ZMM_14_HI_2021+DIGIHI2021PPRECO+RECOHI2021PPRECO+HARVESTHI2021PPRECO/step3_ZMM_14_HI_2021+ZMM_14_HI_2021+DIGIHI2021PPRECO+RECOHI2021PPRECO+HARVESTHI2021PPRECO.log
-> CaloTower.h:26:7: runtime error: load of value 32, which is not a valid value for type 'bool'
present there since > 10 years
yes this issues were existing for long time that's why cleaning them out will narrow consequent bugs that pop up for no obvious reasons. although for all bugs we know we have some idea non of which is related to this reports (afaik) in short now that we know the bugs are there lets do something
@slava77 for ME0 station we would like to keep it as 0.
@jshlee
please check the code implementation in DataFormats/TrackReco/src/HitPattern.cc so that station -1
is not used in this case.
@emilbols @vmariani @mmusich
this is a kind ping on the pending issues as detailed in https://github.com/cms-sw/cmssw/issues/35033#issuecomment-906831823
@slava77 for ME0 station we would like to keep it as 0.
@jshlee please check the code implementation in DataFormats/TrackReco/src/HitPattern.cc so that
station -1
is not used in this case.
hi @slava77, where is the code implemented?
@slava77 for ME0 station we would like to keep it as 0.
@jshlee please check the code implementation in DataFormats/TrackReco/src/HitPattern.cc so that
station -1
is not used in this case.hi @slava77, where is the code implemented?
@slava77 @mrodozov So if i understand correctly the error in 1302.17 appears because svStatus is set to a value beyond the enumeration range? Obviously this code was written a long time ago, but as far as i can see, this is the intended behavior. An svStatus=2 indicates a match to the vertex with index 0, svStatus=3 is a match with vertex with index 1, etc. . If we dont want to use enum objects like this, i guess the solution could be to change this object to an integer.
I tried to run this on cmsdev25 following the recipe above with workflow 1302.17 and switching to CMSSW_12_1_UBSAN_X_2021-09-03-2300, but i dont get the error. If i understand the error correctly, it is most likely because it only appears when there are more than 1 SV in the event which the track is being associated to, so there is probably some randomness in whether you get it
@slava77 @mrodozov So if i understand correctly the error in 1302.17 appears because svStatus is set to a value beyond the enumeration range? Obviously this code was written a long time ago, but as far as i can see, this is the intended behavior. An svStatus=2 indicates a match to the vertex with index 0, svStatus=3 is a match with vertex with index 1, etc. . If we dont want to use enum objects like this, i guess the solution could be to change this object to an integer.
@emilbols thank you for checking the code and clarifying its intents in using the variable. I think that the design span of values should match the type. So, if a change to an int is needed, let's update it. It was not obvious to me in this case, but perhaps this case also implies that a new variable is needed instead instead of an "easy" overload of what was originally designed to be an enum.
As for reproducibility, you may want to enable more threads (the reference job used 4 threads) and add more events.
+reconstruction
the summary was updated at https://github.com/cms-sw/cmssw/issues/35033#issuecomment-906831823
The UBSAN IB reports undefined behavior in 5 files, with example relval and step they appear in:
check the relval logs in here for the examples: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/ubsan_logs/relvals/