cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0
1.07k stars 4.29k forks source link

Muon does not reproduce #18605

Closed VinInn closed 7 years ago

VinInn commented 7 years ago

so in CMSSW_9_2_X_2017-05-06-1100 I run runTheMatrix.py --list=10007.0 -w upgrade --nThreads=8 --command "-n 1000" > & sm10.log & then one after the other

cmsRun step3_RAW2DIGI_L1Reco_RECO_EI_PAT_VALIDATION_DQM.py > & reco.log &
cmsRun step4_HARVESTING.py
mv DQM_V0001_R000000001__Global__CMSSW_X_Y_Z__RECO.root DQM_V0001_R000000001__RelValMuPt10__CMSSW_9_2_X__ORI1.root
cmsRun step3_RAW2DIGI_L1Reco_RECO_EI_PAT_VALIDATION_DQM.py > & reco2.log &
cmsRun step4_HARVESTING.py
mv DQM_V0001_R000000001__Global__CMSSW_X_Y_Z__RECO.root DQM_V0001_R000000001__RelValMuPt10__CMSSW_9_2_X__ORI2.root

and differences are visible (see below)

cmsbuild commented 7 years ago

A new Issue was created by @VinInn Vincenzo Innocente.

@davidlange6, @Dr15Jones, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

VinInn commented 7 years ago

image

image

VinInn commented 7 years ago

image

image

VinInn commented 7 years ago

image

VinInn commented 7 years ago

@slava77 @perrotta @makortel @ebrondol @HuguesBrun

Dr15Jones commented 7 years ago

assign reconstruction

cmsbuild commented 7 years ago

New categories assigned: reconstruction

@slava77,@perrotta you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 7 years ago

Would it make sense to eventually run this kind of test as part of IB tests?

slava77 commented 7 years ago

@makortel @VinInn do you have a valgrind result for this job?

VinInn commented 7 years ago

On 8 May, 2017, at 3:49 PM, Slava Krutelyov notifications@github.com wrote:

@makortel @VinInn do you have a valgrind result for this job?

running at event 140

found only this
==75124== Thread 4:
==75124== Invalid read of size 4
==75124==    at 0x2AAD312B: JetCorrectorParametersHelper::binIndexN(std::vector<float, std::allocator<float> > const&, std::vector<JetCorrectorParameters::Record, std::allocator<JetCorr
ectorParameters::Record> > const&) const (in /cvmfs/cms-ib.cern.ch/nweek-02470/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-06-1100/lib/slc7_amd64_gcc630/libCondFormatsJetMETObjects.
so)
==75124==    by 0x2AAB33EC: SimpleJetCorrector::correction(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&) const (in /cvmfs/cms-ib.
cern.ch/nweek-02470/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-06-1100/lib/slc7_amd64_gcc630/libCondFormatsJetMETObjects.so)
==75124==    by 0x2AA955FF: FactorizedJetCorrectorCalculator::getSubCorrections(FactorizedJetCorrectorCalculator::VariableValues&) const (in /cvmfs/cms-ib.cern.ch/nweek-02470/slc7_amd64
_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-06-1100/lib/slc7_amd64_gcc630/libCondFormatsJetMETObjects.so)
==75124==    by 0x2AAD212F: FactorizedJetCorrector::getSubCorrections() (in /cvmfs/cms-ib.cern.ch/nweek-02470/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-06-1100/lib/slc7_amd64_gcc6
30/libCondFormatsJetMETObjects.so)
==75124==    by 0x4C5A8201: pat::JetCorrFactorsProducer::evaluate(boost::iterators::indirect_iterator<__gnu_cxx::__normal_iterator<reco::Jet const* const*, std::vector<reco::Jet const*,
std::allocator<reco::Jet const*> > >, boost::iterators::use_default, boost::iterators::use_default, boost::iterators::use_default, boost::iterators::use_default>&, pat::JetCorrFactors:
:Flavor const&, int) (JetCorrFactorsProducer.cc:172)
==75124==    by 0x4C5A9BB2: pat::JetCorrFactorsProducer::produce(edm::Event&, edm::EventSetup const&) (JetCorrFactorsProducer.cc:265)
==75124==    by 0x4C501F2: edm::stream::EDProducerAdaptorBase::doEvent(edm::EventPrincipal const&, edm::EventSetup const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (in 
/cvmfs/cms-ib.cern.ch/nweek-02470/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-06-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so)
==75124==    by 0x4C21441: edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventPrincipal const&, edm::EventSetup const&, edm::ModuleCallingContext const*) (in /cvmfs/cms-
ib.cern.ch/nweek-02470/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-06-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so)
==75124==    by 0x4B307C6: decltype ({parm#1}()) edm::convertException::wrap<bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::Occu
rrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrinci
pal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<e
dm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::Bra
nchActionType)1>::Context const*)::{lambda()#1}) (in /cvmfs/cms-ib.cern.ch/nweek-02470/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-06-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.s
o)
==75124==
VinInn commented 7 years ago

I would ask muon to 1) check if happens also in single-threaded 2) if not, search somewhere in muon seed creation for a producer data member that somehow carries history from one event to the next.... (and affects mostly the overlap region)

makortel commented 7 years ago

From my earlier debugging I didn't manage to find anything useful. It was also difficult to find a small set of events that would trigger the problem as the differences (dis)appeared randomly when number of events decreased to few tens of events. (all my tests except valgrind were done multi-threaded)

VinInn commented 7 years ago

Does valgrind really run MT? I see only one thread with top -H, even if cmssw is 8threads

slava77 commented 7 years ago

On 5/8/17 7:06 AM, Vincenzo Innocente wrote:

I would ask muon to

  1. check if happens also in single-threaded

I have not seen non-reproducible single-thread in my recent extended tests. So, it's more likely MT-related

  1. if not, search somewhere in muon seed creation for a producer data member that somehow carries history from one event to the next....

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/18605#issuecomment-299876392, or mute the thread https://github.com/notifications/unsubscribe-auth/AEdcbgrV02m0swPfaxg5sQqbvpxhFKedks5r3yFLgaJpZM4NTLAv.

makortel commented 7 years ago

Does valgrind really run MT?

I don't remember if memcheck does (probably not), helgrind for sure not.

slava77 commented 7 years ago

@VinInn could you please run validateJR on the output that you have or post a full relmon comparison for the DQM file. Given that valgrind doesn't return anything relevant, maybe there is indeed some memory from past events. The more complete comparisons may help to see where the change happens first upstream.

VinInn commented 7 years ago

here are the dqm file

-rw-r--r--. 1 innocent zh 76180686 May  7 16:14 /afs/cern.ch/user/i/innocent/data/DQM_V0001_R000000001__RelValMuPt10__CMSSW_9_2_X__ORI1.root
-rw-r--r--. 1 innocent zh 76077752 May  7 16:32 /afs/cern.ch/user/i/innocent/data/DQM_V0001_R000000001__RelValMuPt10__CMSSW_9_2_X__ORI2.root

very sorry: I have overwritten the reco output

slava77 commented 7 years ago

Looking upstream (hopefully not ignoring important random differences due to DQM non-reproducibilities) the CSC segment plots look relevant

wf10007_csc_nhits

This monitoring code knows only about CSC local reco objects. Clearly, if CSC segments change, the downstream objects, like muons will be affected.

@ptcox you may be interested to also follow this thread. Given lack of issues from valgrind, there may be some event data caching or similar that makes the execution depend on the order of events processed (multithreaded tests used in this issue lead to randomization of the processed event order)

ptcox commented 7 years ago

I can't tell what this thread is about, even by reading it on github. I need more information to make any contribution. The start of the thread refers to SingleMuPt10. CSC digitization always changes between runs - there has never been any effort put into fixing random seeds etc. 

Tim

On May 8, 2017 at 18:10:28, Slava Krutelyov (notifications@github.com) wrote:

Looking upstream (hopefully not ignoring important random differences due to DQM non-reproducibilities) the CSC segment plots look relevant

This monitoring code knows only about CSC local reco objects. Clearly, if CSC segments change, the downstream objects, like muons will be affected.

@ptcox you may be interested to also follow this thread. Given lack of issues from valgrind, there may be some event data caching or similar that makes the execution depend on the order of events processed (multithreaded tests used in this issue lead to randomization of the processed event order)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

slava77 commented 7 years ago

On 5/8/17 9:21 AM, ptcox wrote:

I can't tell what this thread is about, even by reading it on github. I need more information to make any contribution. The start of the thread refers to SingleMuPt10. CSC digitization always changes between runs - there has never been any effort put into fixing random seeds etc.

Tim

Hi Tim The instructions at the top of the thread show that by rerunning RECO (step3) multiple times, the results are not reproducible. Digitization is not repeated in this case.

On May 8, 2017 at 18:10:28, Slava Krutelyov (notifications@github.com) wrote:

Looking upstream (hopefully not ignoring important random differences due to DQM non-reproducibilities) the CSC segment plots look relevant

This monitoring code knows only about CSC local reco objects. Clearly, if CSC segments change, the downstream objects, like muons will be affected.

@ptcox you may be interested to also follow this thread. Given lack of issues from valgrind, there may be some event data caching or similar that makes the execution depend on the order of events processed (multithreaded tests used in this issue lead to randomization of the processed event order)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/18605#issuecomment-299916036, or mute the thread https://github.com/notifications/unsubscribe-auth/AEdcbmrUoEvnTWVwk3VG4NjSh27_2FJHks5r30EigaJpZM4NTLAv.

ptcox commented 7 years ago

OK, thanks,. That's interesting. I have perhaps seen hints of this myself in relvals - I noticed in 910pre2 vs 910pre1 real data samples, where identical numbers of events are processed, that I saw identical digi and rechit distributions, but the segments seemed to show ratio plots that were not exactly 1.000. Tiny differences, but unexpected to me. Maybe there's some issue somehow in the CSC segment builder..

Regards, Tim

On May 8, 2017 at 18:30:18, Slava Krutelyov (notifications@github.com) wrote:

On 5/8/17 9:21 AM, ptcox wrote:

I can't tell what this thread is about, even by reading it on github. I need more information to make any contribution. The start of the thread refers to SingleMuPt10. CSC digitization always changes between runs - there has never been any effort put into fixing random seeds etc.

Tim

Hi Tim The instructions at the top of the thread show that by rerunning RECO (step3) multiple times, the results are not reproducible. Digitization is not repeated in this case.

On May 8, 2017 at 18:10:28, Slava Krutelyov (notifications@github.com) wrote:

Looking upstream (hopefully not ignoring important random differences due to DQM non-reproducibilities) the CSC segment plots look relevant

This monitoring code knows only about CSC local reco objects. Clearly, if CSC segments change, the downstream objects, like muons will be affected.

@ptcox you may be interested to also follow this thread. Given lack of issues from valgrind, there may be some event data caching or similar that makes the execution depend on the order of events processed (multithreaded tests used in this issue lead to randomization of the processed event order)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/cms-sw/cmssw/issues/18605#issuecomment-299916036, or mute the thread https://github.com/notifications/unsubscribe-auth/AEdcbmrUoEvnTWVwk3VG4NjSh27_2FJHks5r30EigaJpZM4NTLAv.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

VinInn commented 7 years ago

not sure if in single threaded I add

std::vector<CSCSegment> CSCSegAlgoRU::buildSegments(const ChamberHitContainer& urechits) {
  std::cout << "chi2Norm_2D_  " << doCollisions << '  ' <<std::hexfloat << chi2Norm_2D_ << std::endl;

I see that in some events it starts with a different value...

TimeModule> 12 1 csc2DRecHits CSCRecHitDProducer 0.00110602
chi2Norm_2D_  1 0x1.18p+5
--
TimeModule> 6 1 csc2DRecHits CSCRecHitDProducer 0.000298977
chi2Norm_2D_  1 0x1.4p+4
--
TimeModule> 5 1 csc2DRecHits CSCRecHitDProducer 0.000840187
chi2Norm_2D_  1 0x1.18p+5
--
TimeModule> 13 1 csc2DRecHits CSCRecHitDProducer 0.00111008
chi2Norm_2D_  1 0x1.18p+5
-

and in a given event it starts with different values

TimeModule> 2101 1 csc2DRecHits CSCRecHitDProducer 0.292553
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.ep+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.ep+5
chi2Norm_2D_  1 0x1.ep+5
chi2Norm_2D_  1 0x1.ep+5
chi2Norm_2D_  1 0x1.ep+5
chi2Norm_2D_  1 0x1.ep+5
chi2Norm_2D_  1 0x1.ep+5
TimeModule> 2101 1 cscSegments CSCSegmentProducer 0.00168395
....
TimeModule> 2102 1 csc2DRecHits CSCRecHitDProducer 0.00205708
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.ep+5
chi2Norm_2D_  1 0x1.ep+5
chi2Norm_2D_  1 0x1.ep+5
chi2Norm_2D_  1 0x1.ep+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.18p+5
chi2Norm_2D_  1 0x1.ep+5
chi2Norm_2D_  1 0x1.ep+5
chi2Norm_2D_  1 0x1.ep+5
chi2Norm_2D_  1 0x1.ep+5
chi2Norm_2D_  1 0x1.ep+5
chi2Norm_2D_  1 0x1.ep+5
chi2Norm_2D_  1 0x1.ep+5
chi2Norm_2D_  1 0x1.ep+5
TimeModule> 2102 1 cscSegments CSCSegmentProducer 0.00195408

maybe the logic in the loop

for (int ipass = 0; ipass < npass; ++ipass) 

is resilient still a bit strange (given that doCollisions is true at entrance)

davidlange6 commented 7 years ago

this class naively misses any sort of reset function for its internal data.. so this can have some interesting consequences (maybe by luck not).

Particularly:

https://github.com/cms-sw/cmssw/blob/09c3fce6626f70fd04223e7dacebf0b485f73f54/RecoLocalMuon/CSCSegment/src/CSCSegAlgoRU.cc#L121

depends on some set of assumptions from the previous call that may or may not be valid

[and I notice some number of obsolete member data - theChi2, theOrigin, etc]

On May 12, 2017, at 3:37 PM, Vincenzo Innocente notifications@github.com wrote:

not sure if in single threaded I add

std::vector CSCSegAlgoRU::buildSegments(const ChamberHitContainer& urechits) { std::cout << "chi2Norm2D " << std::hexfloat << chi2Norm2D << std::endl;

I see that in some events it starts with a different value...

TimeModule> 12 1 csc2DRecHits CSCRecHitDProducer 0.00110602 chi2Norm2D 1 0x1.18p+5

TimeModule> 6 1 csc2DRecHits CSCRecHitDProducer 0.000298977 chi2Norm2D 1 0x1.4p+4

TimeModule> 5 1 csc2DRecHits CSCRecHitDProducer 0.000840187 chi2Norm2D 1 0x1.18p+5

TimeModule> 13 1 csc2DRecHits CSCRecHitDProducer 0.00111008 chi2Norm2D 1 0x1.18p+5

and

TimeModule> 2101 1 csc2DRecHits CSCRecHitDProducer 0.292553 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 TimeModule> 2101 1 cscSegments CSCSegmentProducer 0.00168395 .... TimeModule> 2102 1 csc2DRecHits CSCRecHitDProducer 0.00205708 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 TimeModule> 2102 1 cscSegments CSCSegmentProducer 0.00195408

maybe the logic in the loop

for (int ipass = 0; ipass < npass; ++ipass)

is resilient still a bit strange (given that doCollisions is true at entrance)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

ptcox commented 7 years ago

Thanks. I'm forwarding to the developers Vladimir.Paltchik@cern.ch and nikolay.voytishin@cern.ch On May 12, 2017 at 16:28:19, David Lange (notifications@github.com) wrote:

this class naively misses any sort of reset function for its internal data.. so this can have some interesting consequences (maybe by luck not).

Particularly:

https://github.com/cms-sw/cmssw/blob/09c3fce6626f70fd04223e7dacebf0b485f73f54/RecoLocalMuon/CSCSegment/src/CSCSegAlgoRU.cc#L121

depends on some set of assumptions from the previous call that may or may not be valid

[and I notice some number of obsolete member data - theChi2, theOrigin, etc]

On May 12, 2017, at 3:37 PM, Vincenzo Innocente notifications@github.com wrote:

not sure if in single threaded I add

std::vector CSCSegAlgoRU::buildSegments(const ChamberHitContainer& urechits) { std::cout << "chi2Norm2D " << std::hexfloat << chi2Norm2D << std::endl;

I see that in some events it starts with a different value...

TimeModule> 12 1 csc2DRecHits CSCRecHitDProducer 0.00110602 chi2Norm2D 1 0x1.18p+5

TimeModule> 6 1 csc2DRecHits CSCRecHitDProducer 0.000298977 chi2Norm2D 1 0x1.4p+4

TimeModule> 5 1 csc2DRecHits CSCRecHitDProducer 0.000840187 chi2Norm2D 1 0x1.18p+5

TimeModule> 13 1 csc2DRecHits CSCRecHitDProducer 0.00111008 chi2Norm2D 1 0x1.18p+5

and

TimeModule> 2101 1 csc2DRecHits CSCRecHitDProducer 0.292553 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 TimeModule> 2101 1 cscSegments CSCSegmentProducer 0.00168395 .... TimeModule> 2102 1 csc2DRecHits CSCRecHitDProducer 0.00205708 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.18p+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 chi2Norm2D 1 0x1.ep+5 TimeModule> 2102 1 cscSegments CSCSegmentProducer 0.00195408

maybe the logic in the loop

for (int ipass = 0; ipass < npass; ++ipass)

is resilient still a bit strange (given that doCollisions is true at entrance)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

VinInn commented 7 years ago

I would suggest to have a finite number of params as instances of a struct (in an array for instance) start ALWAYS with say "auto current = &params[0]'" and then change "current" according to whatever logic this algo implements.... try to keep "current" local and all algo attribute const. Actually this should be normal practice: producer and algo attributes should be all const. any "transient" value should be passed in the function calls

mind: flip-flopping with * and / may not produce always the previous number and the value of the parameters may easily start drifting....

VinInn commented 7 years ago

I would like to stress that this issue is a BLOCKER. no large scale production (involving reco) can start until this is fixed. There will also be the suspicion that any discrepancy in any muon related validation could have been causes by it.

VinInn commented 7 years ago

@cerminar, please PPD take note

davidlange6 commented 7 years ago

we can just revert #17543...lets plan to do that for 9_1_0 unless fixed.

On May 13, 2017, at 9:01 AM, Vincenzo Innocente notifications@github.com wrote:

I would like to stress that this issue is a BLOCKER. no large scale production (involving reco) can start until this is fixed. There will also be the suspicion that any discrepancy in any muon related validation could have been causes by it.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

VinInn commented 7 years ago

I am wondering how it got such a friendly review with so many obvious issues in the code (at least compared to other cases where even comments are argued about...)

davidlange6 commented 7 years ago

the issues appear to predate the PR (the PR activates them..)

On May 13, 2017, at 12:02 PM, Vincenzo Innocente notifications@github.com wrote:

I am wondering how it got such a friendly review with so many obvious issues in the code (at least compared to other cases where even comments are argued about...)

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

slava77 commented 7 years ago

@ptcox @nvoytish did you have a chance to follow up on this issue? current "easy" alternative is to go back to the old CSC segment algorithm.

nvoytish commented 7 years ago

Hi all!

The chi2Norm2D parameter differs for diiferent CSC chambers (see https://github.com/cms-sw/cmssw/blob/09c3fce6626f70fd04223e7dacebf0b485f73f54/RecoLocalMuon/CSCSegment/python/CSCSegmentAlgorithmRU_cfi.py) so there is nothing wrong when the value differs from one event to enother.

Best regards, Nikolay

ptcox commented 7 years ago

Hi Nikolay,

Even if the chi2norm2D values are OK, see the plot of the no. of hits per segment in ME1/3 that Slava included in the thread. It should be identical in both reco passes, of course. I see similar problems in the regular relvals when the results from real data samples should be identical. They ARE for Digis and Rechits, and APPEAR to be for Segments. But when examining the ratios of the segment plots I see occasional bins where the ratio is not exactly 1. The ratios are exactly 1 for Digis and Rechits, and for the old segment builder the Segment ratios were exactly 1, for all bins. So something is not right somewhere.

Can you try running on some real data sample a few times and compare the segments and their properties? I think we're all worried that there will be a few differences arising just from repeated running of the identical code.

Regards,

Tim

P.S. Slava, David, Vincenzo - if any of you have hard evidence of a real problem in the code, please let Nikolay know. It was not clear to me from the github thread.

On May 17, 2017, at 8:45 AM, nvoytish notifications@github.com<mailto:notifications@github.com> wrote:

Hi all!

The chi2Norm2D parameter differs for diiferent CSC chambers (see https://github.com/cms-sw/cmssw/blob/09c3fce6626f70fd04223e7dacebf0b485f73f54/RecoLocalMuon/CSCSegment/python/CSCSegmentAlgorithmRU_cfi.py) so there is nothing wrong when the value differs from one event to enother.

Best regards, Nikolay

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/cms-sw/cmssw/issues/18605#issuecomment-302000768, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AE2Fnmpgld85yAGEmi-KZ6mnV5ogqAalks5r6pdpgaJpZM4NTLAv.

davidlange6 commented 7 years ago

On May 17, 2017, at 9:50 AM, ptcox notifications@github.com wrote:

Hi Nikolay,

Even if the chi2norm2D values are OK, see the plot of the no. of hits per segment in ME1/3 that Slava included in the thread. It should be identical in both reco passes, of course. I see similar problems in the regular relvals when the results from real data samples should be identical. They ARE for Digis and Rechits, and APPEAR to be for Segments. But when examining the ratios of the segment plots I see occasional bins where the ratio is not exactly 1. The ratios are exactly 1 for Digis and Rechits, and for the old segment builder the Segment ratios were exactly 1, for all bins. So something is not right somewhere.

Can you try running on some real data sample a few times and compare the segments and their properties? I think we're all worried that there will be a few differences arising just from repeated running of the identical code.

Regards,

Tim

P.S. Slava, David, Vincenzo - if any of you have hard evidence of a real problem in the code, please let Nikolay know. It was not clear to me from the github thread.

I think we've pointed out one clear problem, yes. But probably better that the experts evaluate how things do or don't get properly reset from event to event in the CSCSegAlgoRU code.

On May 17, 2017, at 8:45 AM, nvoytish notifications@github.com<mailto:notifications@github.com> wrote:

Hi all!

The chi2Norm2D parameter differs for diiferent CSC chambers (see https://github.com/cms-sw/cmssw/blob/09c3fce6626f70fd04223e7dacebf0b485f73f54/RecoLocalMuon/CSCSegment/python/CSCSegmentAlgorithmRU_cfi.py) so there is nothing wrong when the value differs from one event to enother.

Best regards, Nikolay

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHubhttps://github.com/cms-sw/cmssw/issues/18605#issuecomment-302000768, or mute the threadhttps://github.com/notifications/unsubscribe-auth/AE2Fnmpgld85yAGEmi-KZ6mnV5ogqAalks5r6pdpgaJpZM4NTLAv. — You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

nvoytish commented 7 years ago

Hi, Tim

I run the code on 400 events from Collisions 2016H dataset twice. My root files with lots of hists and the log files are identical in file length and content.

VinInn commented 7 years ago

On 17 May, 2017, at 10:40 AM, nvoytish notifications@github.com wrote:

I run the code on 400 Collisions 2016H dataset twice. My root files with lots of hists and the log files are identical in file length and content.

Run with 8 threads please. v.

davidlange6 commented 7 years ago

did you use multiple threads (such that the order of events processed changes)?

On May 17, 2017, at 10:40 AM, nvoytish notifications@github.com wrote:

I run the code on 400 Collisions 2016H dataset twice. My root files with lots of hists and the log files are identical in file length and content.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.

nvoytish commented 7 years ago

I redid the test with 8 threads - same results, but this can be caused by small statistics (400 events).

Dr15Jones commented 7 years ago

What configuration parameters did you use to setup the threading?

nvoytish commented 7 years ago

Isn't "cmsRun -n 8 MYconfig.py" enough? I am not quite informed about how to use multithreading in cmssw.

Dr15Jones commented 7 years ago

I'm afraid that isn't sufficient. That makes cmsRun use 8 threads but it does not increase the number of concurrent events it will use. It is the concurrent events which will change the processing order.

To look for the effect, you need to edit your configuration and add

process.options = cms.untracked.PSet(
                    numberOfThreads = cms.untracked.uint32(8),
                    numberOfStreams = cms.untracked.uint32(0)
)                 

assuming process.options has not already been created in your configuration. If it has you can do

process.options.numberOfThreads = cms.untracked.uint32(8)
process.options.numberOfStreams = cms.untracked.uint32(0)
VinInn commented 7 years ago

check that cmsRun prints

%MSG-i ThreadSetup: (NoModuleName) 16-May-2017 11:23:45 CEST pre-events
setting # threads 8
%MSG
%MSG-i StreamSetup: (NoModuleName) 16-May-2017 11:23:45 CEST pre-events
setting # streams 8
%MSG
nvoytish commented 7 years ago

check that cmsRun prints

%MSG-i ThreadSetup: (NoModuleName) 16-May-2017 11:23:45 CEST pre-events setting # threads 8 %MSG

it printed exactly this message

VinInn commented 7 years ago

the relevant one is

%MSG-i StreamSetup: (NoModuleName) 16-May-2017 11:23:45 CEST pre-events
setting # streams 8
%MSG
nvoytish commented 7 years ago

I got the different results with the advices that @Dr15Jones suggested. I'll try to fix this.

VinInn commented 7 years ago

any news?

nvoytish commented 7 years ago

The initialization of the params is fixed, but I still sometimes get different chi2 values of a segment candidate in the same event. It looks like I am using CSCSegFit not in the right way. I am working on this.

@davidlange6 mentioned:

and I notice some number of obsolete member data - theChi2, theOrigin

can anybody tell me what is wrong with that? this means that these can not be used anymore?

nvoytish commented 7 years ago

about theChi2, theOrigin - my bad. I really don't use them anymore after switching to CSCSegFit. These variables will be omitted from the code.

nvoytish commented 7 years ago

Hi all!

I ran the fixed code on 2016H collisions - no difference (interactive and crab jobs). What are the further steps?

davidlange6 commented 7 years ago

make a pull request against the cmssw master branch. thanks

On May 31, 2017, at 12:47 PM, nvoytish notifications@github.com wrote:

Hi all!

I ran the fixed code on 2016H collisions - no difference (interactive and crab jobs). What are the further steps?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub, or mute the thread.