AIDASoft / DD4hep

Detector Description Toolkit for High Energy Physics
http://dd4hep.cern.ch
GNU Lesser General Public License v3.0

Memory Consumption of Sensitive Detectors #1285

Open s6anloes opened 3 weeks ago

s6anloes commented 3 weeks ago


Goal

I'm trying to understand the memory usage of sensitive volumes in dd4hep. I have a detector with a large number of sensitive volumes which seem to have a large impact on the memory consumption. More details are given below.

Operating System and Version

Centos 7

compiler

GCC 12.2.0

ROOT Version

6.28/10

DD4hep Version

1.28

Reproducer

To install the dual-readout calorimeter geometry (note: the export command needs to be executed in each new shell):

source /cvmfs/sw.hsf.org/key4hep/setup.sh
git clone --single-branch --branch dd4hep_github_issue https://github.com/s6anloes/DDDRCaloTubes.git
cd DDDRCaloTubes
mkdir build && cd build
cmake -DCMAKE_INSTALL_PREFIX=../install/ ..
make install -j6
cd ../install/
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$PWD/lib64

To run the simulation with all fibres marked as sensitive detector and monitor the memory usage via htop (note: with the full geometry, this will take ~10GB of memory and about 10 minutes to build the geometry; at around 3.5 minutes you should see the memory usage start to increase gradually):

cd ../DRdetector/DRcalo/compact
ddsim --compactFile DDDRCaloTubes.xml -N 1 -G --steeringFile steering.py --outputFile=test.root --part.userParticleHandler="" &
htop

Then, to run the simulation without the fibres being sensitive: in the DDDRCaloTubes.xml file, in lines 333 and 335, change the "sensitive" value to false and run with the same command. This should take less than 1GB of memory. Note: the simulation will take slightly longer, because optical photons are propagated instead of being killed as in the custom sensitive detector action.

Additional context

I have been trying to improve the memory consumption of the calorimeter for some time now and have had meetings with, and feedback from, some of the experts. In a recent FCC Full Sim Working Group meeting I presented some studies I did on the memory consumption. Mainly I show the CPU and memory usage as a function of time using the psrecord software. There you can also find one slide on the geometry and volume hierarchy of the detector.
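For readers without psrecord at hand, a memory-versus-time curve of this kind can be approximated with a few lines of Python. This is only a minimal sketch, not what psrecord does internally: the helper names `read_rss_kib` and `sample_rss` are made up for illustration, the `/proc` lookup is Linux-only, and child processes are not included.

```python
import subprocess
import sys
import time


def read_rss_kib(pid):
    """Return the resident set size in KiB from /proc (Linux only), or None."""
    try:
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])
    except OSError:
        return None
    return None


def sample_rss(command, interval_s=1.0):
    """Run `command` and return a list of (elapsed seconds, RSS in KiB) samples."""
    proc = subprocess.Popen(command)
    samples = []
    start = time.monotonic()
    while proc.poll() is None:
        rss = read_rss_kib(proc.pid)
        if rss is not None:
            samples.append((time.monotonic() - start, rss))
        time.sleep(interval_s)
    return samples


if __name__ == "__main__":
    # Stand-in command; replace with the ddsim invocation from the Reproducer.
    for elapsed, rss in sample_rss([sys.executable, "-c", "import time; time.sleep(2)"]):
        print(f"{elapsed:6.1f} s  {rss / 1000:8.1f} MB")
```

The samples can be written to a file and plotted to see at which point in the setup the memory starts to rise.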

The slides are mainly about trying different options to improve the memory consumption, but one important point was also the discrepancy in memory consumption between the ddsim and dd4hep2root commands. While running ddsim takes 10GB of memory, dd4hep2root takes less than 1GB.

In this meeting, a colleague working on the 'monolithic' version of the geometry suggested running the simulation without any volume marked as sensitive. And indeed, this seems to be the cause of the large discrepancy. A colleague said that in Geant4 the sensitive volume is linked to the logical volume, so even if the volume is placed many times (as is the case for the fibres in my geometry), there is still just one sensitive volume. It looks like this is not the case in dd4hep, where it seems that the sensitive volume is tied to the placed volume.

I wonder if this is something that can be changed, as it causes a problem with geometries with many small sensitive volumes.

Sidenote: this issue is somewhat related to issue #1173, where Sarah Eno was also looking at the memory consumption of the dual-readout calorimeter. While I think there is still some optimisation possible for my geometry (I'm not using all possible symmetries at the moment), this doesn't seem to be the main problem here.

andresailer commented 3 weeks ago

Hi @s6anloes ,

Can you reduce your example to something that still shows the scaling behaviour, but can be run in a much shorter time and uses no more than 1GB of memory, for example?

Thanks, Andre

s6anloes commented 3 weeks ago

Hi Andre, yes I can do this by reducing the number of towers placed. I have uploaded this to a branch called dd4hep_github_issue. It should take just around 1.2GB now and be done in two minutes. I have updated the Reproducer to clone this branch only.

While doing this I have found something that might be of interest:

When thinking about how to make a smaller scale example for you to test, there are essentially two ways to place just a small number of towers:

  1. Placing only one stave (fixed phi) over the full theta range. The geometry would look something like this (tower size increased for visualisation): [image: stave]
  2. Placing only one (or two) towers in the stave (effectively fixed theta) and repeatedly placing it in phi, such that you end up with a ring of towers, like this: [image: ring]

There is one significant difference between the two scenarios: the towers within one stave (so towers for different theta) are slightly different from one another geometry-wise (except for a forward-backward symmetry at eta=0, i.e. theta=90 deg). The enveloping trapezoid shape is for sure different for each tower. However, the tubes and fibres within the various towers are a bit of a different story. I try to reuse the volumes for the tubes and fibres throughout the simulation by creating them once and storing them in a map. If a tube of a given length needs to be placed, I first check if this volume already exists in the map and, if so, place it in the tower. For instance, at the centre of each tower we expect the tubes to all be of the same length across the towers, because the tubes reach all the way from the back to the front face. Only on the sides (the wings of the tower), where we need to stagger the tubes to get the overall shape, do we expect the lengths to differ from one tower to the next.
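The reuse scheme described above can be sketched as follows. This is an illustration only, not the code from DRconstructor.cpp: `TubeVolume`, `TubeCache` and the rounding precision are hypothetical stand-ins for the real volume objects and the map used there.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TubeVolume:
    """Hypothetical stand-in for a logical tube volume of a given length."""
    length_mm: float


class TubeCache:
    """Create each tube volume once and hand out the same object afterwards."""

    def __init__(self, precision=3):
        self._volumes = {}
        self._precision = precision  # rounding avoids near-duplicate float keys
        self.created = 0

    def get(self, length_mm):
        key = round(length_mm, self._precision)
        volume = self._volumes.get(key)
        if volume is None:
            volume = TubeVolume(key)
            self._volumes[key] = volume
            self.created += 1
        return volume


cache = TubeCache()
centre_a = cache.get(2000.0)  # central tube in one tower
centre_b = cache.get(2000.0)  # same length in another tower: reused
wing = cache.get(1987.5)      # staggered wing tube: new volume
print(centre_a is centre_b, cache.created)
```

Note that such a cache only limits the number of distinct volumes. It does not reduce the number of placements, which is what the rest of the discussion turns out to be about.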

How is this important?

Let's first look at Scenario 2, placing a ring of towers. When using 1deg by 1deg towers, we are placing one single stave volume 360 times in different phi rotations. The volumes are all identical. This is the geometry I have now pushed to the new branch for you to test. And from the plots below, you can see that there is still a significant difference between running the simulation with sensitive volumes or without. Running with sensitive volumes: [plot: plot_ddsim_onelayer]

Running without sensitive volumes: [plot: plot_ddsim_nosens_onelayer]

You can see from the blue line that after the geometry has been converted to Geant4 the memory rises much higher for the case with sensitive volumes (to a smaller scale now than with the full geometry).

In Scenario 1 (one full stave) things look different though. If you compare running with and without sensitive volumes, the memory consumption is the same. Running with sensitive volumes: [plot: plot_ddsim_onestave]

Running without sensitive volumes: [plot: plot_ddsim_nosens_onestave]

So what is different now? Well, probably that the towers are all different volumes. So the issue might be related to whether or not this volume already exists in memory. But in this case it would still be surprising to see no difference, since the volumes for the tubes and fibres are reused and therefore should already exist in memory. Also, it's wrong to say that all towers are different, because of the symmetry at theta=90deg. This should contribute a factor of two, since the tower is created once and placed twice within the stave. The only difference is the position and rotation.

I don't really know what this means however, or if this is even relevant to the underlying issue. I just thought I should share this in case I'm onto something.

andresailer commented 2 weeks ago

I think the issue is that there is an entry for each sensitive element with its unique path. https://github.com/AIDASoft/DD4hep/blob/3ccf9072b84f2e12dcec491d9078f13899bdd4f9/DDG4/src/Geant4VolumeManager.cpp#L180

MarkusFrankATcernch commented 2 weeks ago

Yes, I confirm this. There is an entry for each path to allow lookups using the touchable history. But how else would you perform the lookup? The only alternative is walking down the tree using strings, which has a huge run-time cost.

andresailer commented 2 weeks ago

I guess one shouldn't add a DetElement for each fiber, but depend on a segmentation to give a number to the fiber in the tower?

s6anloes commented 2 weeks ago

I'm not sure how this could work. Currently we need to mark the fibres as sensitive for signal generation. We have had some discussion with Sanghyun, whose geometry propagates optical photons, but it takes several minutes to simulate one event, which is something we would like to avoid.

MarkusFrankATcernch commented 2 weeks ago

This path reflects the path of volumes. Having a DetElement at each level makes things worse, but already the "unfolded" tree with all these little sensitive volumes makes a huge tree with vectors of volume IDs as lookup keys. It is well possible that one would have to somehow develop an alternative lookup mechanism for certain types of readouts.
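To make the scaling concrete, here is a small sketch of how the number of lookup entries grows when there is one entry per unique placement path, compared with the number of distinct volume types. The hierarchy and all copy numbers are hypothetical and chosen only for illustration. The dictionary is a toy model, not the data structure used in Geant4VolumeManager.

```python
from itertools import product
from math import prod


def count_paths(levels):
    """Number of unique placement paths down to the innermost level."""
    return prod(copies for _, copies in levels)


def build_lookup(levels):
    """Toy lookup table: one entry per unique path of copy numbers."""
    ranges = [range(copies) for _, copies in levels]
    return {path: index for index, path in enumerate(product(*ranges))}


# (level name, copies placed inside each parent); hypothetical numbers
small = [("stave", 4), ("tower", 1), ("tube", 3), ("core", 1)]
ring = [("stave", 360), ("tower", 1), ("tube", 1000), ("core", 1)]

lookup = build_lookup(small)
print(len(small), "volume types,", len(lookup), "lookup entries")
print(len(ring), "volume types,", count_paths(ring), "lookup entries")
```

The number of volume types stays at four in both cases, while the number of entries is the product of the copy numbers along the path.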

In DD4hep such situations are meant to be handled by relatively large sensitive volumes, with the little sensitive elements then handled by a segmentation. Example: a wafer is a sensitive volume; the pixels on the wafer are not sensitive volumes, but are handled by the segmentation.

In this case the envelope of the fibres would be the sensitive volume and the individual fibres would then be handled by a segmentation. Whether such an ad hoc approach is reasonable I cannot tell. Alternatively, one tries to find a model which describes such a setup efficiently.

s6anloes commented 2 weeks ago

I think I understand how this approach might work, although I have one question. You say the envelope of the fibre would be the sensitive volume; I guess this means the mother volume. For our geometry, the sensible choice for the sensitive volume would be three levels of hierarchy higher (the grand-grandmother volume), since each fibre core is placed within a cladding volume within a tube volume. The tower would then be the large sensitive volume which can be segmented. Would this approach still work? It is not clear to me how sensitive volumes treat daughter, grand-daughter and deeper volumes. Are these volumes no longer 'sensitive', in the sense that the sensitive detector action would not be called for steps in such a daughter volume?

MarkusFrankATcernch commented 2 weeks ago

This is sort of the idea behind the segmentation concept. You would get the energy deposit in the grand-grandmother volume and compute the fiber from the location of the energy deposit within this volume.

If this works for fibers (which I guess are thin cylinders) I cannot tell, because there is some space between them filled probably with some glue. This then would not be handled correctly by Geant4, because the glue has different material characteristics than the fibers.
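As a rough illustration of the segmentation idea, the sketch below maps the local position of an energy deposit inside a tower to a fibre number, and returns nothing when the deposit falls into the space between fibre cores. Everything here is an assumption made for the example: the function name `fibre_index`, the square grid, the pitch, the core radius and the grid size do not come from the DDDRCaloTubes geometry or from any DD4hep segmentation class.

```python
import math


def fibre_index(x_mm, y_mm, pitch_mm=2.0, core_radius_mm=0.5, n_cols=10, n_rows=10):
    """Return the fibre number for a local (x, y) hit, or None if no core is hit.

    Assumes a square grid of fibres whose lower-left cell starts at (0, 0).
    """
    col = math.floor(x_mm / pitch_mm)
    row = math.floor(y_mm / pitch_mm)
    if not (0 <= col < n_cols and 0 <= row < n_rows):
        return None  # outside the tower cross-section
    centre_x = (col + 0.5) * pitch_mm
    centre_y = (row + 0.5) * pitch_mm
    if math.hypot(x_mm - centre_x, y_mm - centre_y) > core_radius_mm:
        return None  # deposit in tube wall, cladding or gap, not in the core
    return row * n_cols + col


print(fibre_index(1.0, 1.0))  # centre of the first cell
print(fibre_index(3.0, 1.0))  # centre of the second cell in the first row
print(fibre_index(0.1, 0.1))  # corner of a cell, outside the core
```

This only covers attributing a deposit to a fibre at readout level. It does not address the concern raised above that the tracking itself needs the correct materials for core, cladding and filler.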

MarkusFrankATcernch commented 2 weeks ago

@s6anloes I tried to somewhat understand the code here: https://github.com/s6anloes/DDDRCaloTubes/blob/689347a36627b16471c012551fc3a7caa250bbe5/DRdetector/DRcalo/src/DRconstructor.cpp Depending on the granularity these are really a lot of volumes since apparently in theta things cannot be re-used, but must be recreated.

Nevertheless: do you know where the memory really goes?

andresailer commented 2 weeks ago

I ran heaptrack to monitor allocations

PYTHONMALLOC=malloc heaptrack  python `which ddsim` --compactFile ../DRdetector/DRcalo/compact/DDDRCaloTubes.xml -N1 -G --part.userParticleHandler=''

I ran it with and without the couple of volumes set as sensitive. The line I linked to above is the main difference between the two runs, as far as I can tell. This is a bit complicated because I have never used heaptrack before, and the recursion makes the call stacks a little bit broader.

s6anloes commented 2 weeks ago

@MarkusFrankATcernch

Hmm, these are really good questions that I wish I knew the answer to. I'm not really an expert on these things, so if you know any way I can figure this out, it would be greatly appreciated. The only thing I can tell you is that the steady and linear increase in memory starts the moment dd4hep prints the output "successfully converted geometry to Geant4...". Since not nearly as much memory is used by the dd4hep2root command, my understanding was that it was probably the Geant4 geometry, and not the ROOT geometry.

MarkusFrankATcernch commented 2 weeks ago

@s6anloes Well.... when it says "successfully converted geometry to Geant4..." I think Geant4 is far from having finished its setup. All the voxelization business and I do not know what other internal details are then still going on which may require loads and loads of memory. There are certainly internal caches to speed up tracking etc. What cannot be avoided is the fact that there are 2 geometries in memory: the TGeo geometry and the Geant4 geometry. All this will probably happen when the geometry gets closed just before the event simulation starts and probably is entirely independent of dd4hep.

Now for the facts:

So where does the memory go? One probably can only go through the main steps of setting up Geant4 with the debugger and see in the setup where the memory jumps....

MarkusFrankATcernch commented 2 weeks ago

@s6anloes , @andresailer I do not have /cvmfs/sw.hsf.org/ , but it should also run on any LCG view -- not?

Apparently the LCG views miss the Geant4 data tables:

#13 0x00007efdaad65299 in G4Exception (originOfException=originOfException@entry=0x7efdab1f8d2c "G4NuclideTable", exceptionCode=exceptionCode@entry=0x7efdab1f8d83 "PART70001", severity=severity@entry=FatalException, description=description@entry=0x7efdab1f8d66 "ENSDFSTATE.dat is not found.") at /build/jenkins/workspace/lcg_release_pipeline/build/projects/Geant4-11.2.1/src/Geant4/11.2.1/source/global/management/src/G4Exception.cc:115
#14 0x00007efdab189210 in G4NuclideTable::GenerateNuclide (this=this

Do I miss some environment or is LCG_106 incomplete?

andresailer commented 2 weeks ago

source /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/setup.sh
ddsim --compactFile $DD4hepINSTALL/DDDetectors/compact/SiD.xml -N 2 -G

Works for me on lxplus. Do you maybe also not have /cvmfs/geant4.cern.ch?

G4ENSDFSTATEDATA=/cvmfs/geant4.cern.ch/share/data/G4ENSDFSTATE2.3

MarkusFrankATcernch commented 2 weeks ago

Yes, this is the problem: /cvmfs/geant4.cern.ch is missing. I thought the idea of the LCG views was to have everything together in a compact form?

andresailer commented 2 weeks ago

It seems geant4 cvmfs is the hidden dependency. But you probably have those datafiles then on some LHCb CVMFS repo?

MarkusFrankATcernch commented 2 weeks ago

There are more problems. I tried to build on lxplus, but there I got a clash with python between system python and hsf python:

CMake Error at /cvmfs/sw.hsf.org/key4hep/releases/2024-03-10/x86_64-almalinux9-gcc11.3.1-opt/cmake/3.27.9-4qfmfr/share/cmake-3.27/Modules/FindPackageHandleStandardArgs.cmake:230 (message):
  Could NOT find Python: Found unsuitable version "3.10", but required is
  exact version "3.10.13" (found /usr/include/python3.11, )
Call Stack (most recent call first):
  /cvmfs/sw.hsf.org/key4hep/releases/2024-03-10/x86_64-almalinux9-gcc11.3.1-opt/cmake/3.27.9-4qfmfr/share/cmake-3.27/Modules/FindPackageHandleStandardArgs.cmake:598 (_FPHSA_FAILURE_MESSAGE)
  /cvmfs/sw.hsf.org/key4hep/releases/2024-03-10/x86_64-almalinux9-gcc11.3.1-opt/cmake/3.27.9-4qfmfr/share/cmake-3.27/Modules/FindPython/Support.cmake:3824 (find_package_handle_standard_args)
  /cvmfs/sw.hsf.org/key4hep/releases/2024-03-10/x86_64-almalinux9-gcc11.3.1-opt/cmake/3.27.9-4qfmfr/share/cmake-3.27/Modules/FindPython.cmake:574 (include)
  /cvmfs/sw.hsf.org/key4hep/releases/2024-03-10/x86_64-almalinux9-gcc11.3.1-opt/dd4hep/1.28-q6ea5f/cmake/DD4hepBuild.cmake:693 (FIND_PACKAGE)
  /cvmfs/sw.hsf.org/key4hep/releases/2024-03-10/x86_64-almalinux9-gcc11.3.1-opt/dd4hep/1.28-q6ea5f/cmake/DD4hepConfig.cmake:62 (DD4HEP_SETUP_ROOT_TARGETS)
  CMakeLists.txt:35 (find_package)

peterkostka commented 2 weeks ago

That is working for me:

echo "Sourcing environment dirs for lxplus9 [zsh|bash]"
echo "Sourcing environment dirs for AlmaLinux 9.4"

# ========================================================================================================================

export LIBGL_ALWAYS_INDIRECT=1

# Including key4hep (EDM4hep, podio)

source /cvmfs/fcc.cern.ch/sw/latest/setup.sh

source /cvmfs/sft.cern.ch/lcg/views/LCG_105b/x86_64-el9-gcc13-opt/setup.sh
export PATH=/cvmfs/sft.cern.ch/lcg/releases/LCG_105b/CMake/3.26.2/x86_64-el9-gcc13-opt/bin:$PATH
export PATH=/cvmfs/sft.cern.ch/lcg/releases/LCG_105b/ninja/1.10.0/x86_64-el9-gcc13-opt/bin:$PATH

export CMAKE_PREFIX_PATH=/cvmfs/sft.cern.ch/lcg/releases/cfitsio/3.48-e4bb8/x86_64-el9-gcc13-dbg/:$CMAKE_PREFIX_PATH
export CMAKE_PREFIX_PATH=/cvmfs/sft.cern.ch/lcg/releases/LCG_105b/hepmc3/3.2.7/x86_64-el9-gcc13-opt/:$CMAKE_PREFIX_PATH
export CMAKE_PREFIX_PATH=/cvmfs/sft.cern.ch/lcg/releases/LCG_105b/xrootd/5.6.3/x86_64-el9-gcc13-opt/:$CMAKE_PREFIX_PATH

export Python_ROOT_DIR=/cvmfs/sft.cern.ch/lcg/releases/LCG_105b/Python/3.9.12/x86_64-el9-gcc13-opt/
export Boost_DIR=/cvmfs/sft.cern.ch/lcg/releases/LCG_105b/Boost/1.82.0/x86_64-el9-gcc13-opt/
export LCIO_DIR=/cvmfs/sft.cern.ch/lcg/releases/LCG_105b/LCIO/02.20/x86_64-el9-gcc13-opt/
export Qt5_DIR=/cvmfs/sft.cern.ch/lcg/releases/LCG_105b/qt5/5.15.9/x86_64-el9-gcc13-opt/
export TBB_DIR=/cvmfs/sft.cern.ch/lcg/releases/LCG_105b/tbb/2021.10.0/x86_64-el9-gcc13-opt/
export VDT_DIR=/cvmfs/sft.cern.ch/lcg/releases/LCG_105b/vdt/0.4.4/x86_64-el9-gcc13-opt/
export Vc_DIR=/cvmfs/sft.cern.ch/lcg/releases/LCG_105b/Vc/1.4.4/x86_64-el9-gcc13-opt/
export HEPMC3=/cvmfs/sft.cern.ch/lcg/releases/LCG_105b/hepmc3/3.2.7/x86_64-el9-gcc13-opt/
export PYTHIA8=/cvmfs/sft.cern.ch/lcg/releases/MCGenerators/pythia8/310-2f242/x86_64-el9-gcc13-opt
export PYTHIA8DATA=/cvmfs/sft.cern.ch/lcg/releases/MCGenerators/pythia8/310-2f242/x86_64-el9-gcc13-opt/share/Pythia8/xmldoc

export XercesC_LIBRARY=/cvmfs/sft.cern.ch/lcg/releases/XercesC/3.2.4-9e637/x86_64-el9-gcc13-opt/lib/libxerces-c.so
export XercesC_INCLUDE_DIR=/cvmfs/sft.cern.ch/lcg/releases/XercesC/3.2.4-9e637/x86_64-el9-gcc13-opt//include/
export CLHEP_DIR=/cvmfs/sft.cern.ch/lcg/releases/LCG_105b/clhep/2.4.7.1/x86_64-el9-gcc13-opt/
export LCIO_DIR=/cvmfs/sft.cern.ch/lcg/releases/LCG_105b/LCIO/02.20/x86_64-el9-gcc13-opt/
export CMAKE_PREFIX_PATH=$Qt5_DIR:$VDT_DIR:$CMAKE_PREFIX_PATH

# ROOT

source /cvmfs/sft.cern.ch/lcg/releases/LCG_105b/ROOT/6.30.06/x86_64-el9-gcc13-opt/bin/thisroot.sh

# GEANT4

export Geant4_DIR=/cvmfs/sft.cern.ch/lcg/releases/LCG_105b/Geant4/11.2.0/x86_64-el9-gcc13-opt/
export G4INSTALL=$Geant4_DIR
source /cvmfs/sft.cern.ch/lcg/releases/LCG_105b/Geant4/11.2.0/x86_64-el9-gcc13-opt/share/Geant4/geant4make/geant4make.sh; cd -


MarkusFrankATcernch commented 2 weeks ago

Here are some results from simply using top:

Invocation of TGeo alone:

TGeo:  geoPluginRun -input /scratch/online/frankm/SW/DDDRCaloTubes/install/share/compact/DDDRCaloTubes.xml -interactive -ui

    PID    PPID  PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ USER      P COMMAND                                                                                           
1472811 1469076  20   0  744700 568888 432088 T   0.0   0.1   0:20.13 frankm   58 geoPluginRun -input /scratch/online/frankm/SW/DDDRCaloTubes/install/share/compact/DDDRCaloTubes.+ 

Virt: 700 MB Resident: 569 MB

Tests involving Geant4:

    PID    PPID  PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ USER      P COMMAND
Start of DetectorImp::init
1485953 1485861  20   0 1037628 760188 503960 t   0.0   0.1   0:10.87 frankm   47 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+ 

End of DetectorImp::init
1485953 1485861  20   0 1037628 760188 503944 t   0.0   0.1   0:10.88 frankm   47 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+ 

Start of dd4hep::DetectorImp::endDocument
1485953 1485861  20   0 1039112 761852 504348 t   0.0   0.1   0:10.92 frankm   47 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+ 

End of dd4hep::DetectorImp::endDocument
1485953 1485861  20   0 1040084 763004 504348 t  85.0   0.1   0:26.60 frankm   47 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+ 

Before Geant4Converter:
1485219 1485112  20   0 1097056 807788 530032 t   0.0   0.2   0:28.08 frankm   53 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+ 

After Geant4Converter:
1485219 1485112  20   0 1099176 809900 530144 t   0.0   0.2   0:44.64 frankm   53 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+ 

Before Geant4VolumeManager:
1485219 1485112  20   0 1099176 809900 530140 t   0.0   0.2   0:44.64 frankm   53 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+ 

After Geant4VolumeManager:
1485219 1485112  20   0 1505076   1.2g 530136 t   0.0   0.2   1:45.48 frankm   53 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+ 

After dd4hep::sim::Geant4Exec::initialize 
1485219 1485112  20   0 1511056   1.2g 530240 t   0.0   0.2   1:46.10 frankm   53 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+

Start of dd4hep::sim::Geant4Exec::run
1485219 1485112  20   0 1511928   1.2g 530428 t   0.0   0.2   1:46.11 frankm   53 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+ 

After first event:
1485219 1485112  20   0 1520236   1.2g 532180 t   0.0   0.2   1:48.15 frankm   53 /cvmfs/sft.cern.ch/lcg/views/LCG_106/x86_64-el9-gcc13-dbg/bin/python /cvmfs/sft.cern.ch/lcg/v+ 

Hence:

Hence a possible strategy would be:

This is all not impossible, but it requires significant work and is not done in an afternoon. We can develop this as a common effort, provided several people work together.

s6anloes commented 2 weeks ago

How do you know Geant4 is 240MB resident memory? The jump between calling TGeo alone and after Geant4Converter may be 240MB, but it is already close to this at the before Geant4Converter stage, no?
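The differences in question can be worked out directly from the top output earlier in this thread. The short sketch below does only that arithmetic. It follows the thread's convention of reading 568888 KiB as 569 MB, and it uses the VIRT column for the Geant4VolumeManager step because RES is only shown as 1.2g at that point. The TGeo-only figure comes from a different executable (geoPluginRun) than the ddsim runs, so the first difference is only indicative.

```python
# RES and VIRT values copied from the top output in this thread (KiB).
res_kib = {
    "tgeo_alone": 568888,
    "before_converter": 807788,
    "after_converter": 809900,
}
virt_kib = {
    "before_volume_manager": 1099176,
    "after_volume_manager": 1505076,
}


def delta_mb(after_kib, before_kib):
    """Difference between two readings in MB, taking 1 MB = 1000 KiB as in the thread."""
    return (after_kib - before_kib) / 1000.0


before_vs_tgeo = delta_mb(res_kib["before_converter"], res_kib["tgeo_alone"])
after_vs_tgeo = delta_mb(res_kib["after_converter"], res_kib["tgeo_alone"])
conversion = delta_mb(res_kib["after_converter"], res_kib["before_converter"])
volume_manager = delta_mb(
    virt_kib["after_volume_manager"], virt_kib["before_volume_manager"]
)

print(f"Before Geant4Converter vs TGeo alone: {before_vs_tgeo:.1f} MB")
print(f"After Geant4Converter vs TGeo alone:  {after_vs_tgeo:.1f} MB")
print(f"Geant4Converter step itself:          {conversion:.1f} MB")
print(f"Geant4VolumeManager step (VIRT):      {volume_manager:.1f} MB")
```

So almost all of the roughly 240 MB is already present before the converter runs, the conversion itself adds about 2 MB, and the Geant4VolumeManager step adds about 400 MB of virtual memory.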

How did you get this output? I would be interested to see how this scales with the full (or at least more complete) geometry. But I guess it does track with what we have seen, that the main culprit is the Geant4VolumeManager given the difference when running with and without sensitive detectors.

@MarkusFrankATcernch In this comment your last point confuses me. It is kind of the opposite of what I was trying to communicate, except for this one geometry, which is not the one you have been testing.

MarkusFrankATcernch commented 2 weeks ago

@s6anloes So what? This is the cost of loading G4. Loading all these libraries is far from free even if nothing is done with them (yet). The volume conversion in this case is apparently not very expensive.

BrieucF commented 2 weeks ago

Regarding @s6anloes's description here: https://github.com/AIDASoft/DD4hep/issues/1285#issuecomment-2197471652 . Shall we first try to understand why Scenario 1 leads to no difference with/without sensitive volumes while Scenario 2 leads to significant differences with/without sensitive volumes? Can someone explain that to me?