AIDASoft / DD4hep

Detector Description Toolkit for High Energy Physics
http://dd4hep.cern.ch
GNU Lesser General Public License v3.0
50 stars 99 forks source link

DDG4/src/Geant4ParticleHandler issue? Geant4Particle delete when ref count already < 0 #883

Closed wdconinc closed 2 years ago

wdconinc commented 2 years ago

Standard info:

We (EIC/ATHENA) encounter an issue that can be reproduced with the following minimal example, based on the hepmc input event dd4hep_memory_issue.hepmc and the standard SiD.xml detector (the issue is in the Geant4Particle handling):

  1. standard ddsim running: (segfaults with dd4hep_memory_issue.log)
    ddsim --runType batch -v INFO --numberOfEvents 1 --compactFile $(spack location -i dd4hep)/DDDetectors/compact/SiD.xml --inputFiles dd4hep_memory_issue.hepmc --outputFile dd4hep_memory_issue.root
  2. valgrind with malloc: (points to Geant4Particle pointer release, with dd4hep_memory_issue.log)
    PYTHONMALLOC=malloc valgrind --suppressions=$ROOTSYS/etc/valgrind-root.supp --suppressions=$ROOTSYS/etc/valgrind-root-python.supp python /opt/local/bin/ddsim --runType batch -v INFO --numberOfEvents 1 --compactFile $(spack location -i dd4hep)/DDDetectors/compact/SiD.xml --inputFiles dd4hep_memory_issue.hepmc --outputFile dd4hep_memory_issue.root

It seems that somewhere the reference counting goes wrong. We've identified one area that may be to blame, in Geant4ParticleHandler, where a pointer is added without adding to the reference count if p->reason&G4PARTICLE_PRIMARY) == G4PARTICLE_PRIMARY. The following patch appears may resolve the memory issue (though we are continuing to run tests):

diff --git a/DDG4/src/Geant4ParticleHandler.cpp b/DDG4/src/Geant4ParticleHandler.cpp
index 82a4b702..779dd044 100644
--- a/DDG4/src/Geant4ParticleHandler.cpp
+++ b/DDG4/src/Geant4ParticleHandler.cpp
@@ -436,11 +436,8 @@ void Geant4ParticleHandler::rebaseSimulatedTracks(int )   {
   for(count = 0, iend=pm.end(), i=pm.begin(); i!=iend; ++i)  {
     Particle* p = (*i).second;
     orgParticles[p->id] = p->id;
-    finalParticles[p->id] = p;
+    finalParticles[p->id] = p->addRef();
     if ( p->id > count ) count = p->id;
-    if ( (p->reason&G4PARTICLE_PRIMARY) != G4PARTICLE_PRIMARY )  {
-      p->addRef();
-    }
   }
   // (1.1) Define the new particle mapping for the simulated tracks
   for(++count, iend=m_particleMap.end(), i=m_particleMap.begin(); i!=iend; ++i)  {
@@ -448,7 +445,8 @@ void Geant4ParticleHandler::rebaseSimulatedTracks(int )   {
     if ( (p->reason&G4PARTICLE_PRIMARY) != G4PARTICLE_PRIMARY )  {
       //if ( orgParticles.find(p->id) == orgParticles.end() )  {
       orgParticles[p->id] = count;
-      finalParticles[count] = p;
+      finalParticles[count]->release();
+      finalParticles[count] = p->addRef();
       p->id = count;
       ++count;
     }

Note: This event is obtained with standard Pythia8 and HepMC3 methods, but a crab-crossing vertex smearing is added that is larger than usual (see here for the code that generated the vertex smearing, with the additional caveat that the first V vertex position/time was adopted as E vertex position/time to avoid a separate issue with physics.zeroTimePDGs which include electrons PDG IDs that we're also investigating).

andresailer commented 2 years ago

physics.zeroTimePDGs which include electrons PDG IDs that we're also investigating

The electron in this list doesn't make a difference, because "zero life time" entries for stable particles, like electrons, are anyway not passed to geant4.

andresailer commented 2 years ago

I think you need to add 2101 to the list of rejectPDGs

SIM.physics.rejectPDGs = {1, 2, 3, 4, 5, 6, 21, 23, 24, 25, 2101}

That solves the segmentation violation for me. While the particle is known to geant4, I don't think it can handle it correctly.

wdconinc commented 2 years ago

Thanks, this does indeed solve the segfault for me as well and will allow us to proceed.

andresailer commented 2 years ago

What helps to find this is increasing the verbosity level to VERBOSE (-v VERBOSE), for the particleHandler (--output.part VERBOSE), the input handling (--output.inputStage VERBOSE) or all of them, so that the event information is printed.

Maybe there is a more general way to reject particles instead of PDG codes, but this is all very subtle and what works for one generator might break another. It could also be a bug in geant4 that this particle is considered stable? Or in the MC generator that produces the hepMC file?

Ultimately this seems to be a memory management issue that may be worked around by adding 2101 to the list of rejectPDGs but that is bound to crop up again in some other context.

I don't know if the memory issue is just a symptom of giving something to geant4 that cannot be simulate properly, that is junk-in-segfault-out. That would also mean that fixing the memory issue is not going to give you the simulated event you want to simulate.

andresailer commented 2 years ago

Are these vertex positions are correct? Should V -1 not be at the same place as V -4?

V -1 0 [1] @ 6.5696136664797733e-02 4.7527434732507192e-03 -1.3890240802882960e+01 8.7364505051563430e+00
P 2 -1 2 3.1058421480785581e-01 -2.9024364446547496e-01 4.0568315156884552e+01 4.0570542251687357e+01 0.0000000000000000e+00 61
P 3 -1 2101 -2.8799956615742088e+00 3.1144058180296552e-01 5.9179327434312199e+01 5.9253015020951608e+01 5.7933000000000001e-01 63
P 4 0 11 9.5272783170785580e-04 -1.0310103923304300e-03 -1.0005739163283749e+01 1.0005739274809379e+01 5.1099999999999995e-04 4
[snip]
V -4 0 [3] @ 6.6654555522693129e-02 3.9907684113166374e-03 -1.3877659476671274e+01 8.7164402483168075e+00
P 7 -4 2101 -2.8734832780485902e+00 3.0626301898750879e-01 5.9264816592707227e+01 5.9338055345639276e+01 5.7933000000000001e-01 72
wdconinc commented 2 years ago

The vertex positions are as given by Pythia8, with a single constant global shift applied to account for interaction vertex distribution. The interaction vertex distribution is correlated with interaction kinematics through beam divergences, so this is done in the Pythia8 event generation. If we could factorize the vertex distribution from the physics, we would simply do that, but we can't.

andresailer commented 2 years ago

But shouldn't that position be the same for these two HepMC vertices? This is a single interaction, or not? PS: the distance is basically inside a proton

andresailer commented 2 years ago

Here is the 2101 particle defined in geant4 https://gitlab.cern.ch/geant4/geant4/-/blob/master/source/particles/shortlived/src/G4ShortLivedConstructor.cc#L198-204

andresailer commented 2 years ago

Thanks for the PR @wdconinc