cms-sw / cmssw

CMS Offline Software
http://cms-sw.github.io/
Apache License 2.0

Random comparison difference in HLT/SiStrip/ControlView #41200

Open makortel opened 1 year ago

makortel commented 1 year ago

The HLT/SiStrip/ControlView/{ClusterStoNCorr_OnTrack_FECCratevsFECSlot,ClusterStoNCorr_OnTrack_FECSlotVsFECRing_TECP} histograms showed differences in workflow 11634.911 in the PR tests of https://github.com/cms-sw/cmssw/pull/41186#issuecomment-1483895309 . The PR itself is very unlikely to be the cause of the differences. The differences have also not been visible in other recent PR tests, so they are likely of random origin. The purpose of this issue is nevertheless to document them, in case they show up in other tests later on.

The 11634.911 is the DD4Hep workflow that, IIUC, reads the geometry from the XML file instead of from the CondDB. These differences may be evidence of some rare non-reproducibility in the DD4Hep code path (which we have observed, but not really solved, before).

makortel commented 1 year ago

assign geometry

cmsbuild commented 1 year ago

New categories assigned: geometry

@mdhildreth,@Dr15Jones,@makortel,@bsunanda,@civanch you have been requested to review this Pull request/Issue and eventually sign? Thanks

cmsbuild commented 1 year ago

A new Issue was created by @makortel Matti Kortelainen.

@Dr15Jones, @perrotta, @dpiparo, @rappoccio, @makortel, @smuzaffar can you please review it and eventually sign/assign? Thanks.

cms-bot commands are listed here

VinInn commented 1 year ago

is architecture dependence excluded (aka INTEL vs AMD)?

makortel commented 1 year ago

Good point. I checked the PR test and baseline runTheMatrix output of https://github.com/cms-sw/cmssw/pull/41186#issuecomment-1483895309, and both were run on Intel(R) Xeon(R) CPU E5-2683 v4 @ 2.10GHz.

VinInn commented 1 year ago

I'm running valgrind on step1 and see a bunch of these

==1882951== Invalid read of size 8
==1882951==    at 0x40F3AA2E: vecgeom::cxx::CommonUnplacedVolumeImplHelper<vecgeom::cxx::PolyhedronImplementation<(EInnerRadii)0, (EPhiCutout)0>, vecgeom::cxx::VUnplacedVolume>::SafetyToIn(vecgeom::cxx::Vector3D<double> const&) const (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-03-28-1100/biglib/el8_amd64_gcc11/pluginSimulation.so)
==1882951==    by 0x40EF79F6: G4UAdapter<vecgeom::cxx::UnplacedPolyhedron>::DistanceToIn(CLHEP::Hep3Vector const&) const (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-03-28-1100/biglib/el8_amd64_gcc11/pluginSimulation.so)
==1882951==    by 0x40FC659C: G4VoxelNavigation::ComputeStep(CLHEP::Hep3Vector const&, CLHEP::Hep3Vector const&, double, double&, G4NavigationHistory&, bool&, CLHEP::Hep3Vector&, bool&, bool&, G4VPhysicalVolume**, int&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-03-28-1100/biglib/el8_amd64_gcc11/pluginSimulation.so)
==1882951==    by 0x40CDF04A: G4Navigator::ComputeStep(CLHEP::Hep3Vector const&, CLHEP::Hep3Vector const&, double, double&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-03-28-1100/biglib/el8_amd64_gcc11/pluginSimulation.so)
==1882951==    by 0x40E52FA6: G4Transportation::AlongStepGetPhysicalInteractionLength(G4Track const&, double, double, double&, G4GPILSelection*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-03-28-1100/biglib/el8_amd64_gcc11/pluginSimulation.so)
==1882951==    by 0x40E4B87B: G4TrackingManager::ProcessOneTrack(G4Track*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-03-28-1100/biglib/el8_amd64_gcc11/pluginSimulation.so)
==1882951==    by 0x40C38D19: G4EventManager::DoProcessing(G4Event*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-03-28-1100/biglib/el8_amd64_gcc11/pluginSimulation.so)
==1882951==    by 0x40989BB9: RunManagerMTWorker::produce(edm::Event const&, edm::EventSetup const&, RunManagerMT&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-03-28-1100/biglib/el8_amd64_gcc11/pluginSimulation.so)
==1882951==    by 0x40995831: omt::ThreadHandoff::Functor<OscarMTProducer::produce(edm::Event&, edm::EventSetup const&)::{lambda()#1}>::execute() (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-03-28-1100/biglib/el8_amd64_gcc11/pluginSimulation.so)
==1882951==    by 0x4097A919: omt::ThreadHandoff::threadLoop(void*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-03-28-1100/biglib/el8_amd64_gcc11/pluginSimulation.so)
==1882951==    by 0x70861C9: start_thread (in /usr/lib64/libpthread-2.28.so)
==1882951==    by 0x72D7E72: clone (in /usr/lib64/libc-2.28.so)
==1882951==  Address 0x54664188 is 24 bytes before an unallocated block of size 0 in arena "client"
==1882951==

It needs to be understood whether this is related specifically to DD4HEP.
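Since the log contains "a bunch of these", it may help to count the reports per innermost frame before comparing a DD4hep run with a non-DD4hep one. Below is a minimal sketch, assuming the memcheck output was saved to a text file and that each report follows the layout shown above (a headline, then an `at` frame, then `by` frames, ending with a bare `==PID==` line). The function name `summarize` and the regular expressions are illustrative and not part of valgrind or CMSSW.

```python
import re
from collections import Counter

# Memcheck prefixes every line with "==<pid>== "; the trailing space is
# absent on the bare separator line that ends a report.
PREFIX = re.compile(r"^==\d+== ?")
# A stack frame: indentation, "at" or "by", an address, then the symbol.
FRAME = re.compile(r"^\s+(at|by) 0x[0-9A-Fa-f]+: (.*)$")


def summarize(log_text):
    """Count reports by (headline, innermost frame symbol)."""
    counts = Counter()
    headline = None   # first line of the report currently being read
    counted = False   # True once this report's innermost frame was recorded
    for raw in log_text.splitlines():
        prefix = PREFIX.match(raw)
        if prefix is None:
            continue  # not valgrind output (e.g. cmsRun messages)
        line = raw[prefix.end():]
        if not line.strip():
            headline, counted = None, False  # separator: report finished
            continue
        frame = FRAME.match(line)
        if frame is None:
            # The first unindented line of a report is its headline;
            # indented lines such as " Address ..." are ignored.
            if headline is None and not line[0].isspace():
                headline = line.strip()
        elif frame.group(1) == "at" and headline is not None and not counted:
            # Drop the trailing " (in /path/lib.so)" or " (file.cpp:123)".
            symbol = frame.group(2).rsplit(" (", 1)[0]
            counts[(headline, symbol)] += 1
            counted = True
    return counts
```

Calling `summarize(open("valgrind.log").read()).most_common(10)` (the file name is a placeholder) would list the most frequent reports. Only the first `at` frame of each report is counted, so allocation or free stacks attached to the same report are not counted twice. Leak records from `--leak-check=full` follow the same layout and would be counted as well.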

VinInn commented 1 year ago

my valgrind command

valgrind --leak-check=full --show-leak-kinds=all --track-origins=yes --tool=memcheck \
--suppressions=$ROOTSYS/etc/valgrind-root.supp \
--suppressions=$CMSSW_RELEASE_BASE/src/Utilities/ReleaseScripts/data/cms-valgrind-memcheck.supp cmsRun $1

VinInn commented 1 year ago

not sure if the valgrind report is actually still related to this https://sft.its.cern.ch/jira/projects/VECGEOM/issues/VECGEOM-600?filter=allopenissues

civanch commented 1 year ago

@VinInn, this issue was understood as a compiler bug when -O3 optimisation is used. The solution was to use -O2 optimisation for VecGeom. However, I am not sure if the problem Matti is reporting here is the same.

VinInn commented 1 year ago

I got the report above in the latest 13_1_X nightly. Maybe understood, but apparently not solved. VecGeom not vectorized is a bit incongruous...

VinInn commented 1 year ago

one more valgrind message in step2 (most probably for a different issue)

==1891808== Invalid free() / delete / delete[] / realloc()
==1891808==    at 0x403BF6C: free (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/external/valgrind/3.17.0-7ca83817e7379e83453f913e11e14834/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==1891808==    by 0x48F90DDB: edm::Wrapper<ZVertexSoAHeterogeneousHost<131072> >::~Wrapper() (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-03-28-1100/lib/el8_amd64_gcc11/libCUDADataFormatsVertex.so)
==1891808==    by 0x48F90DF3: edm::Wrapper<ZVertexSoAHeterogeneousHost<131072> >::~Wrapper() (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-03-28-1100/lib/el8_amd64_gcc11/libCUDADataFormatsVertex.so)
==1891808==    by 0x4DD56FA: edm::productholderindexhelper::getContainedTypeFromWrapper(edm::TypeID const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-03-28-1100/lib/el8_amd64_gcc11/libDataFormatsProvenance.so)
==1891808==    by 0x4DDB31F: edm::ProductRegistry::initializeLookupTables(std::set<edm::TypeID, std::less<edm::TypeID>, std::allocator<edm::TypeID> > const*, std::set<edm::TypeID, std::less<edm::TypeID>, std::allocator<edm::TypeID> > const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-03-28-1100/lib/el8_amd64_gcc11/libDataFormatsProvenance.so)
==1891808==    by 0x4DD15EF: edm::ProductRegistry::setFrozen(std::set<edm::TypeID, std::less<edm::TypeID>, std::allocator<edm::TypeID> > const&, std::set<edm::TypeID, std::less<edm::TypeID>, std::allocator<edm::TypeID> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-03-28-1100/lib/el8_amd64_gcc11/libDataFormatsProvenance.so)
==1891808==    by 0x4C161B6: edm::Schedule::finishSetup(edm::ParameterSet&, edm::service::TriggerNamesService const&, edm::ProductRegistry&, edm::BranchIDListHelper&, edm::ProcessBlockHelperBase&, edm::ThinnedAssociationsHelper&, edm::SubProcessParentageHelper const*, std::shared_ptr<edm::ActivityRegistry>, std::shared_ptr<edm::ProcessConfiguration>, bool, edm::PreallocationConfiguration const&, edm::ProcessContext const*) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-03-28-1100/lib/el8_amd64_gcc11/libFWCoreFramework.so)
==1891808==    by 0x4C2676D: edm::ScheduleItems::finishSchedule(edm::ScheduleItems::MadeModules, edm::ParameterSet&, edm::service::TriggerNamesService const&, bool, edm::PreallocationConfiguration const&, edm::ProcessContext const*, edm::ProcessBlockHelperBase&) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-03-28-1100/lib/el8_amd64_gcc11/libFWCoreFramework.so)
==1891808==    by 0x4B68977: edm::EventProcessor::init(std::shared_ptr<edm::ProcessDesc>&, edm::ServiceToken const&, edm::serviceregistry::ServiceLegacy) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-03-28-1100/lib/el8_amd64_gcc11/libFWCoreFramework.so)
==1891808==    by 0x4B6BAD0: edm::EventProcessor::EventProcessor(std::shared_ptr<edm::ProcessDesc>, edm::ServiceToken const&, edm::serviceregistry::ServiceLegacy) (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-03-28-1100/lib/el8_amd64_gcc11/libFWCoreFramework.so)
==1891808==    by 0x40C0AC: tbb::detail::d1::task_arena_function<main::{lambda()#1}::operator()() const::{lambda()#1}, void>::operator()() const (in /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02778/el8_amd64_gcc11/cms/cmssw-patch/CMSSW_13_1_X_2023-03-28-1100/bin/el8_amd64_gcc11/cmsRun)
==1891808==    by 0x63D3846: tbb::detail::r1::task_arena_impl::execute(tbb::detail::d1::task_arena_base&, tbb::detail::d1::delegate_base&) (arena.cpp:694)
==1891808==  Address 0xaa644380 is in a rw- anonymous segment
==1891808==

makortel commented 1 year ago

Another occurrence in https://github.com/cms-sw/cmssw/pull/41274#issuecomment-1496498960, this time in workflow 12434.0

makortel commented 1 year ago

assign dqm

cmsbuild commented 1 year ago

New categories assigned: dqm

@micsucmed,@rvenditti,@emanueleusai,@syuvivida,@pmandrik you have been requested to review this Pull request/Issue and eventually sign? Thanks

makortel commented 1 year ago

FYI @cms-sw/trk-dpg-l2

makortel commented 1 year ago

Another occurrence in https://github.com/cms-sw/cmssw/pull/41328#issuecomment-1515595835, this time in workflow 12434.0

makortel commented 1 year ago

Another occurrence in https://github.com/cms-sw/cmssw/pull/41460#issuecomment-1530037180 in workflow 12434.0

makortel commented 1 year ago

Another occurrence in https://github.com/cms-sw/cmssw/pull/41876#issuecomment-1578065153 in workflow 12434.0

makortel commented 1 year ago

Another occurrence in https://github.com/cms-sw/cmsdist/pull/8545#issuecomment-1598402301 in workflow 12434.0 (although there, because of a compiler update, minor(?) differences in generated code cannot be excluded)

makortel commented 1 year ago

Another occurrence in https://github.com/cms-sw/cmssw/pull/42075#issuecomment-1605091855 in workflow 12434.0.

mmusich commented 1 year ago

type trk