more performant build of Geant4 by default

tomeichlersmith commented 5 months ago

The defaults of a few Geant4 configuration parameters are helpful for initial development of a Geant4 application but unhelpful when scaling up to larger production systems. I think, especially with the introduction of #83 enabling local custom G4 builds, we could update the configuration of G4 to focus on performance only and guide people to rebuild locally with a different configuration if they need more information from the G4 side for debugging or development purposes.

I used cmake -LAH to list all of the available options and noticed a few that have non-performant defaults.

Option	Default	Description
CMAKE_BUILD_TYPE	RelWithDebInfo	type of build, this reduces the optimization and enables storing of the debug flags
GEANT4_BUILD_VERBOSE_CODE	ON	include extra comments and checks during the simulation
GEANT4_BUILD_STORE_TRAJECTORY	ON	build full trajectories during the simulation for visualization purposes

I think the last option could be contributing significantly to simulation time, since Geant4 is storing a particle's full trajectory during its lifetime. That could amount to many hundreds of steps that it needs to allocate and de-allocate on each track.
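A minimal sketch of a configure step with the three options above switched to their performance-oriented values. The source and build paths are hypothetical placeholders, and the script only prints the command so it can be inspected before being run by hand.

```shell
#!/bin/sh
# Hypothetical locations; point these at the local Geant4 source and build trees.
G4_SOURCE_DIR="${G4_SOURCE_DIR:-$HOME/geant4}"
G4_BUILD_DIR="${G4_BUILD_DIR:-$G4_SOURCE_DIR/build}"

# Print (not run) a configure command with the three options from the table
# set to their performance-oriented values.
g4_configure_cmd() {
  printf 'cmake -S %s -B %s' "$G4_SOURCE_DIR" "$G4_BUILD_DIR"
  printf ' %s' \
    -DCMAKE_BUILD_TYPE=Release \
    -DGEANT4_BUILD_VERBOSE_CODE=OFF \
    -DGEANT4_BUILD_STORE_TRAJECTORY=OFF
  printf '\n'
}

g4_configure_cmd
```

Anyone who needs verbose output or trajectories for debugging could flip the relevant option back to ON in a local build.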

Context

Discovered while working on https://github.com/LDMX-Software/geant4/issues/13

tomeichlersmith commented 5 months ago

I think patching this should go along with some notes on the doc page about how we go about building Geant4. This can hopefully help in the future when updating the version of Geant4 so that we can apply the same optimizations and have the same capabilities enabled.

bryngemark commented 5 months ago

This sounds like a good idea. @cmantill @AnmolS1Z are we using the G4 trajectories at all in your recent event display work? In that case that's one good use case for typing up documentation of how to build with other make flags.

EinarElen commented 5 months ago

AFAIK, in most projects RelWithDebInfo doesn't affect optimization levels; it just stores the debug symbols. Both settings should also disable asserts (NDEBUG). Is Geant4 using different optimization levels for the two (-O3 vs -O2)? If it is, I wouldn't immediately assume O3 is faster than O2 because... optimization is complicated, and the same optimization on one machine might not produce the same results on a different one (and if you have bad luck, different results depending on your username) (no, I am not kidding about the last one).

In other words, I don't think that the choice between Release and RelWithDebInfo for production containers is clear; it would need testing. For the development container, I would definitely keep debug symbols. The potential perf improvements from O2/O3 (if Geant4 switches between them here) are probably going to be minor, and when you need them, having debug symbols is useful. One example is if a student has a strange problem with their setup (e.g. a core dump). I would be more comfortable telling them to run ldmx gdb fire ... r and send the crash details than having them go through the process of building a dedicated debug setup.

The other settings seem like more straightforward improvements. I wouldn't be surprised if the verbose code part also mattered; a lot of Geant4 code has if (verbose) checks that would get removed by it.

cmantill commented 5 months ago

As far as I can tell, we are not using G4 trajectories in the event display. I am not yet familiar enough with tracking, though; I imagine it does not use them either?

tomeichlersmith commented 5 months ago

Optimizations

Thank you for your comments, @EinarElen. I am less learned about this than you, so it is good to hear that O2 and O3 are "close" and will need to be tested. I am getting the compiler flags from cmake -LAH in the build directory and I see

[ldmx] eichl008@spa-cms017 ~/ldmx/geant4/biaspatch/build> ldmx cmake -LAH . |\
    egrep '^CMAKE_C(XX|)_FLAGS_(RELEASE|RELWITHDEBINFO)'
CMAKE_CXX_FLAGS_RELEASE:STRING=-O2 -DNDEBUG
CMAKE_CXX_FLAGS_RELWITHDEBINFO:STRING=-O2 -g
CMAKE_C_FLAGS_RELEASE:STRING=-O3 -DNDEBUG
CMAKE_C_FLAGS_RELWITHDEBINFO:STRING=-O2 -g -DNDEBUG

So, for the C++ flags, RelWithDebInfo builds do not disable asserts (there is no -DNDEBUG), but they do use the same optimization level as Release (-O2). The C flags do follow your assumptions and that must be what I was looking at earlier when I made the original comment above.
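To make the NDEBUG comparison less error-prone to do by eye, here is a small sketch of a filter that reads flag lines in the cmake -LAH format shown in the listing and reports whether each one defines NDEBUG. The function name report_ndebug is hypothetical, and the sample input is copied from the listing.

```shell
#!/bin/sh
# Read "NAME:TYPE=flags" lines on stdin and report whether each flag set
# defines NDEBUG, which is what disables assert().
report_ndebug() {
  while IFS= read -r line; do
    name=${line%%:*}
    case "$line" in
      *-DNDEBUG*) echo "$name: asserts disabled" ;;
      *)          echo "$name: asserts enabled" ;;
    esac
  done
}

# Example input copied from the listing.
report_ndebug <<'EOF'
CMAKE_CXX_FLAGS_RELEASE:STRING=-O2 -DNDEBUG
CMAKE_CXX_FLAGS_RELWITHDEBINFO:STRING=-O2 -g
EOF
```

In a real build directory the input could instead come from something like cmake -LAH . | grep -E '^CMAKE_C(XX)?_FLAGS_' | report_ndebug.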

Visualization

The trajectories created during the simulation contain a stack of every G4Step that the G4Track undergoes. These trajectories are not stored in the output file unless special care is taken, but they are used by Geant4's visualization engine if you are visualizing events while you are generating them. I think if we ever want to persist the full trajectory of a track, we should do so with a custom UserAction so that it is purely on an opt-in basis. This method would not have the custom (and presumably optimized) allocators, but I think disabling trajectories for the majority of tracks will save time. I hope to test this more directly in the coming weeks with some inclusive samples.

EinarElen commented 5 months ago

If there are any pure build-time changes that I would consider trying to add to the production container environment, they would be:

Link Geant4 statically instead of as a shared library

It is not uncommon to see a ~10% perf difference there. This should not be done for the development container, since that would prevent you from switching G4 versions.

Try enabling link time/interprocedural optimizations (two names for the same thing). In some cases, and especially with a static build, this can give a decent bump. The major downside is that compilation while building the container image will be significantly slower.

The last one I think we should at some point consider adding to our CI together with sanitizers, since LTO can improve the sanitizer results and catch UB from ODR violations.
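As a rough sketch, the static-linking and LTO suggestions might map onto configure options like the ones below. This assumes Geant4 honors BUILD_SHARED_LIBS and BUILD_STATIC_LIBS for the library type; CMAKE_INTERPROCEDURAL_OPTIMIZATION is CMake's generic LTO/IPO switch. Whether Geant4 builds cleanly with all three is untested here, and PROD_G4_FLAGS and the source path are hypothetical names.

```shell
#!/bin/sh
# Hypothetical variable collecting extra options for a production-only build.
# Assumption: Geant4 honors BUILD_SHARED_LIBS/BUILD_STATIC_LIBS for library type;
# CMAKE_INTERPROCEDURAL_OPTIMIZATION is CMake's generic switch for LTO/IPO.
PROD_G4_FLAGS="-DBUILD_SHARED_LIBS=OFF -DBUILD_STATIC_LIBS=ON -DCMAKE_INTERPROCEDURAL_OPTIMIZATION=ON"

# Print the resulting configure command; the source path is a placeholder.
echo "cmake $PROD_G4_FLAGS /path/to/geant4/source"
```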

EinarElen commented 5 months ago

If you want to get real spicy there is also a pretty decent performance gain that can be had by compiling with native architecture flags. Basically, by default the compilers will assume that your CPU is some form of lowest common denominator and only use instructions that are valid for these and won't use any knowledge about the specific CPU family that the code is supposed to run on when optimizing. One big perf issue here comes from the compiler being unable to make use of so-called SIMD instructions (instruction level vectorization).

The issue with using these flags is twofold:

Since we enable vectorized operations on floating point values (whose arithmetic is not associative), you will get slightly different results on different machines. The physics represented is the same though.
Using them requires knowledge of what platform (or at least a lowest common denominator for it) we are actually going to run on. If we ask for a more recent platform than the hardware actually provides, we risk pessimizing or crashing.
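The non-associativity mentioned in the first point can be seen with a tiny example: summing the same three doubles in two different orders. This only illustrates the arithmetic property; it does not show what any particular compiler or flag actually does. The function name fp_order_check is hypothetical.

```shell
#!/bin/sh
# Sum the same three double-precision values in two different groupings
# and report whether the results compare equal.
fp_order_check() {
  awk 'BEGIN {
    a = (0.1 + 0.2) + 0.3
    b = 0.1 + (0.2 + 0.3)
    result = (a == b) ? "same" : "different"
    printf "%.17g vs %.17g -> %s\n", a, b, result
  }'
}

fp_order_check
```

The two groupings differ in the last bit, which is the kind of machine-to-machine variation to expect if the order of operations changes with the instruction set.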

tomeichlersmith commented 5 months ago

adding to the production container environment

The difficulty here is that, right now, we are building the production image on top of the development image. This has been extremely beneficial to us since we can avoid any concern about differences between our local testing setup and the remote production setup. If we wanted to have a production-specific build (e.g. with extra optimizations), we would need to evolve this repo so that it could have two build "modes" that only differ slightly. I am unsure how to do this well and stably, although I'm sure we could hack something together with ARG.

Don't get me wrong, I think the next evolution of this ecosystem would separate the production image from the development image so that it could be more focused on performance. This would also mean we could be more willing to introduce dev tooling into the development image, knowing that we wouldn't need to carry it around uselessly within a production image. I'm just not sure how to implement this next evolution in a way that is maintainable and understandable.

EinarElen commented 5 months ago

Oh wow, I didn't actually know that the two were linked.

In my mind, having a somewhat separate production and dev environment isn't the worst idea in the world. The CI we are currently using is robust enough for the kinds of issues that a split dev/production environment would cause. I would be even more confident in that if we got around to including some more static/runtime analysis in the CI setup.

Would love to hear @bryngemark's thoughts about it

tomeichlersmith commented 4 months ago

Using a Geant4 build with -DCMAKE_BUILD_TYPE=Release -DGEANT4_BUILD_VERBOSE_CODE=OFF -DGEANT4_BUILD_STORE_TRAJECTORY=OFF gives a ~5% speedup for both of our main beam energies.

[image: timing]

EinarElen commented 4 months ago

Can you try with/without the release flag? I'm curious if it contributes. Also, before actually implementing this we need to run the same checks on the different clusters that we have, different hardware can have really different perf properties

tomeichlersmith commented 4 months ago

Omitting the -DCMAKE_BUILD_TYPE=Release configuration has little effect on the performance. As expected, the increase is due to the verbose and trajectory storage code.

[image: timing]

LDMX-Software / docker