Closed VinInn closed 7 years ago
A new Issue was created by @VinInn Vincenzo Innocente.
@davidlange6, @Dr15Jones, @smuzaffar can you please review it and eventually sign/assign? Thanks.
cms-bot commands are listed here
@slava77 @perrotta @makortel @ebrondol @HuguesBrun
assign reconstruction
New categories assigned: reconstruction
@slava77,@perrotta you have been requested to review this Pull request/Issue and eventually sign? Thanks
Would it make sense to eventually run this kind of test as part of IB tests?
@makortel @VinInn do you have a valgrind result for this job?
running at event 140
found only this:

==75124== Thread 4:
==75124== Invalid read of size 4
==75124==    at 0x2AAD312B: JetCorrectorParametersHelper::binIndexN(std::vector<float, std::allocator<float> > const&, std::vector<JetCorrectorParameters::Record, std::allocator<JetCorrectorParameters::Record> > const&) const (in /cvmfs/cms-ib.cern.ch/nweek-02470/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-06-1100/lib/slc7_amd64_gcc630/libCondFormatsJetMETObjects.so)
==75124==    by 0x2AAB33EC: SimpleJetCorrector::correction(std::vector<float, std::allocator<float> > const&, std::vector<float, std::allocator<float> > const&) const (in /cvmfs/cms-ib.cern.ch/nweek-02470/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-06-1100/lib/slc7_amd64_gcc630/libCondFormatsJetMETObjects.so)
==75124==    by 0x2AA955FF: FactorizedJetCorrectorCalculator::getSubCorrections(FactorizedJetCorrectorCalculator::VariableValues&) const (in /cvmfs/cms-ib.cern.ch/nweek-02470/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-06-1100/lib/slc7_amd64_gcc630/libCondFormatsJetMETObjects.so)
==75124==    by 0x2AAD212F: FactorizedJetCorrector::getSubCorrections() (in /cvmfs/cms-ib.cern.ch/nweek-02470/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-06-1100/lib/slc7_amd64_gcc630/libCondFormatsJetMETObjects.so)
==75124==    by 0x4C5A8201: pat::JetCorrFactorsProducer::evaluate(boost::iterators::indirect_iterator<__gnu_cxx::__normal_iterator<reco::Jet const* const*, std::vector<reco::Jet const*, std::allocator<reco::Jet const*> > >, boost::iterators::use_default, boost::iterators::use_default, boost::iterators::use_default, boost::iterators::use_default>&, pat::JetCorrFactors::Flavor const&, int) (JetCorrFactorsProducer.cc:172)
==75124==    by 0x4C5A9BB2: pat::JetCorrFactorsProducer::produce(edm::Event&, edm::EventSetup const&) (JetCorrFactorsProducer.cc:265)
==75124==    by 0x4C501F2: edm::stream::EDProducerAdaptorBase::doEvent(edm::EventPrincipal const&, edm::EventSetup const&, edm::ActivityRegistry*, edm::ModuleCallingContext const*) (in /cvmfs/cms-ib.cern.ch/nweek-02470/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-06-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so)
==75124==    by 0x4C21441: edm::WorkerT<edm::stream::EDProducerAdaptorBase>::implDo(edm::EventPrincipal const&, edm::EventSetup const&, edm::ModuleCallingContext const*) (in /cvmfs/cms-ib.cern.ch/nweek-02470/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-06-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so)
==75124==    by 0x4B307C6: decltype ({parm#1}()) edm::convertException::wrap<bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}>(bool edm::Worker::runModule<edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1> >(edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::MyPrincipal const&, edm::EventSetup const&, edm::StreamID, edm::ParentContext const&, edm::OccurrenceTraits<edm::EventPrincipal, (edm::BranchActionType)1>::Context const*)::{lambda()#1}) (in /cvmfs/cms-ib.cern.ch/nweek-02470/slc7_amd64_gcc630/cms/cmssw/CMSSW_9_2_X_2017-05-06-1100/lib/slc7_amd64_gcc630/libFWCoreFramework.so)
==75124==
I would ask the muon group to 1) check if it also happens single-threaded, and 2) if not, search somewhere in muon seed creation for a producer data member that somehow carries history from one event to the next... (and affects mostly the overlap region)
From my earlier debugging I didn't manage to find anything useful. It was also difficult to find a small set of events that would trigger the problem, as the differences (dis)appeared randomly when the number of events was decreased to a few tens. (All my tests except valgrind were done multi-threaded.)
Does valgrind really run MT? I see only one thread with top -H, even if cmssw is set to 8 threads.
On 5/8/17 7:06 AM, Vincenzo Innocente wrote:
I would ask muon to
- check if happens also in single-threaded
I have not seen non-reproducible results single-threaded in my recent extended tests. So it's more likely MT-related.
- if not, search somewhere in muon seed creation for a producer data member that somehow carries history from one event to the next....
Does valgrind really run MT?
I don't remember if memcheck does (probably not), helgrind for sure not.
@VinInn could you please run validateJR on the output that you have or post a full relmon comparison for the DQM file. Given that valgrind doesn't return anything relevant, maybe there is indeed some memory from past events. The more complete comparisons may help to see where the change happens first upstream.
here are the DQM files:
-rw-r--r--. 1 innocent zh 76180686 May 7 16:14 /afs/cern.ch/user/i/innocent/data/DQM_V0001_R000000001__RelValMuPt10__CMSSW_9_2_X__ORI1.root
-rw-r--r--. 1 innocent zh 76077752 May 7 16:32 /afs/cern.ch/user/i/innocent/data/DQM_V0001_R000000001__RelValMuPt10__CMSSW_9_2_X__ORI2.root
very sorry: I have overwritten the reco output
Looking upstream (hopefully not ignoring important random differences due to DQM non-reproducibilities) the CSC segment plots look relevant
This monitoring code knows only about CSC local reco objects. Clearly, if CSC segments change, the downstream objects, like muons will be affected.
@ptcox you may be interested to also follow this thread. Given lack of issues from valgrind, there may be some event data caching or similar that makes the execution depend on the order of events processed (multithreaded tests used in this issue lead to randomization of the processed event order)
I can't tell what this thread is about, even by reading it on github. I need more information to make any contribution. The start of the thread refers to SingleMuPt10. CSC digitization always changes between runs - there has never been any effort put into fixing random seeds etc.
Tim
Hi Tim, the instructions at the top of the thread show that rerunning RECO (step3) multiple times gives non-reproducible results. Digitization is not repeated in this case.
OK, thanks. That's interesting. I have perhaps seen hints of this myself in relvals - I noticed in 910pre2 vs 910pre1 real data samples, where identical numbers of events are processed, that I saw identical digi and rechit distributions, but the segments seemed to show ratio plots that were not exactly 1.000. Tiny differences, but unexpected to me. Maybe there's some issue somehow in the CSC segment builder...
Regards, Tim
Not sure about single-threaded. If I add

std::vector<CSCSegment> CSCSegAlgoRU::buildSegments(const ChamberHitContainer& urechits) {
  std::cout << "chi2Norm_2D_ " << doCollisions << ' ' << std::hexfloat << chi2Norm_2D_ << std::endl;
I see that in some events it starts with a different value...
TimeModule> 12 1 csc2DRecHits CSCRecHitDProducer 0.00110602
chi2Norm_2D_ 1 0x1.18p+5
--
TimeModule> 6 1 csc2DRecHits CSCRecHitDProducer 0.000298977
chi2Norm_2D_ 1 0x1.4p+4
--
TimeModule> 5 1 csc2DRecHits CSCRecHitDProducer 0.000840187
chi2Norm_2D_ 1 0x1.18p+5
--
TimeModule> 13 1 csc2DRecHits CSCRecHitDProducer 0.00111008
chi2Norm_2D_ 1 0x1.18p+5
-
and in a given event it starts with different values
TimeModule> 2101 1 csc2DRecHits CSCRecHitDProducer 0.292553
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.ep+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.ep+5
chi2Norm_2D_ 1 0x1.ep+5
chi2Norm_2D_ 1 0x1.ep+5
chi2Norm_2D_ 1 0x1.ep+5
chi2Norm_2D_ 1 0x1.ep+5
chi2Norm_2D_ 1 0x1.ep+5
TimeModule> 2101 1 cscSegments CSCSegmentProducer 0.00168395
....
TimeModule> 2102 1 csc2DRecHits CSCRecHitDProducer 0.00205708
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.ep+5
chi2Norm_2D_ 1 0x1.ep+5
chi2Norm_2D_ 1 0x1.ep+5
chi2Norm_2D_ 1 0x1.ep+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.18p+5
chi2Norm_2D_ 1 0x1.ep+5
chi2Norm_2D_ 1 0x1.ep+5
chi2Norm_2D_ 1 0x1.ep+5
chi2Norm_2D_ 1 0x1.ep+5
chi2Norm_2D_ 1 0x1.ep+5
chi2Norm_2D_ 1 0x1.ep+5
chi2Norm_2D_ 1 0x1.ep+5
chi2Norm_2D_ 1 0x1.ep+5
TimeModule> 2102 1 cscSegments CSCSegmentProducer 0.00195408
maybe the logic in the loop

for (int ipass = 0; ipass < npass; ++ipass)

is resilient; still, a bit strange (given that doCollisions is true on entry)
this class naively misses any sort of reset function for its internal data... so this can have some interesting consequences (maybe by luck not). Particularly:
- it depends on some set of assumptions from the previous call that may or may not be valid
- [and I notice some number of obsolete member data - theChi2, theOrigin, etc]
Thanks. I'm forwarding to the developers Vladimir.Paltchik@cern.ch and nikolay.voytishin@cern.ch
I would suggest to have a finite number of params as instances of a struct (in an array, for instance), start ALWAYS with say "auto current = &params[0];", and then change "current" according to whatever logic this algo implements... Try to keep "current" local and all algo attributes const. Actually this should be normal practice: producer and algo attributes should all be const; any "transient" value should be passed in the function calls.
mind: flip-flopping with * and / may not always reproduce the previous number, and the value of the parameters may easily start drifting...
I would like to stress that this issue is a BLOCKER. No large-scale production (involving reco) can start until this is fixed. There will also be the suspicion that any discrepancy in any muon-related validation could have been caused by it.
@cerminar, please PPD take note
we can just revert #17543... let's plan to do that for 9_1_0 unless fixed.
I am wondering how it got such a friendly review with so many obvious issues in the code (at least compared to other cases where even comments are argued about...)
the issues appear to predate the PR (the PR activates them..)
@ptcox @nvoytish did you have a chance to follow up on this issue? current "easy" alternative is to go back to the old CSC segment algorithm.
Hi all!
The chi2Norm2D parameter differs for different CSC chambers (see https://github.com/cms-sw/cmssw/blob/09c3fce6626f70fd04223e7dacebf0b485f73f54/RecoLocalMuon/CSCSegment/python/CSCSegmentAlgorithmRU_cfi.py), so there is nothing wrong when the value differs from one event to another.
Best regards, Nikolay
Hi Nikolay,
Even if the chi2norm2D values are OK, see the plot of the no. of hits per segment in ME1/3 that Slava included in the thread. It should be identical in both reco passes, of course. I see similar problems in the regular relvals when the results from real data samples should be identical. They ARE for Digis and Rechits, and APPEAR to be for Segments. But when examining the ratios of the segment plots I see occasional bins where the ratio is not exactly 1. The ratios are exactly 1 for Digis and Rechits, and for the old segment builder the Segment ratios were exactly 1, for all bins. So something is not right somewhere.
Can you try running on some real data sample a few times and compare the segments and their properties? I think we're all worried that there will be a few differences arising just from repeated running of the identical code.
Regards,
Tim
P.S. Slava, David, Vincenzo - if any of you have hard evidence of a real problem in the code, please let Nikolay know. It was not clear to me from the github thread.
I think we've pointed out one clear problem, yes. But probably better that the experts evaluate how things do or don't get properly reset from event to event in the CSCSegAlgoRU code.
Hi, Tim
I ran the code on 400 events from the Collisions 2016H dataset twice. My root files, with lots of hists, and the log files are identical in length and content.
Run with 8 threads please. v.
did you use multiple threads (such that the order of events processed changes)?
I redid the test with 8 threads - same results, but this could be due to the small statistics (400 events).
What configuration parameters did you use to setup the threading?
Isn't "cmsRun -n 8 MYconfig.py" enough? I am not quite informed about how to use multithreading in cmssw.
I'm afraid that isn't sufficient. That makes cmsRun use 8 threads but it does not increase the number of concurrent events it will use. It is the concurrent events which will change the processing order.
To look for the effect, you need to edit your configuration and add
process.options = cms.untracked.PSet(
    numberOfThreads = cms.untracked.uint32(8),
    numberOfStreams = cms.untracked.uint32(0)
)
assuming process.options has not already been created in your configuration. If it has, you can do
process.options.numberOfThreads = cms.untracked.uint32(8)
process.options.numberOfStreams = cms.untracked.uint32(0)
check that cmsRun prints
%MSG-i ThreadSetup: (NoModuleName) 16-May-2017 11:23:45 CEST pre-events
setting # threads 8
%MSG
%MSG-i StreamSetup: (NoModuleName) 16-May-2017 11:23:45 CEST pre-events
setting # streams 8
%MSG
check that cmsRun prints
%MSG-i ThreadSetup: (NoModuleName) 16-May-2017 11:23:45 CEST pre-events setting # threads 8 %MSG
it printed exactly this message
the relevant one is
%MSG-i StreamSetup: (NoModuleName) 16-May-2017 11:23:45 CEST pre-events
setting # streams 8
%MSG
I got different results with the advice that @Dr15Jones suggested. I'll try to fix this.
any news?
The initialization of the params is fixed, but I still sometimes get different chi2 values for a segment candidate in the same event. It looks like I am not using CSCSegFit in the right way. I am working on this.
@davidlange6 mentioned:
and I notice some number of obsolete member data - theChi2, theOrigin
Can anybody tell me what is wrong with that? Does this mean that these cannot be used anymore?
about theChi2, theOrigin - my bad. I really don't use them anymore after switching to CSCSegFit. These variables will be omitted from the code.
Hi all!
I ran the fixed code on 2016H collisions - no difference (interactive and crab jobs). What are the further steps?
make a pull request against the cmssw master branch. thanks
so in CMSSW_9_2_X_2017-05-06-1100 I run
runTheMatrix.py --list=10007.0 -w upgrade --nThreads=8 --command "-n 1000" > & sm10.log &
then one after the other, and differences are visible (see below)