celeritas-project / celeritas

Celeritas is a new Monte Carlo transport code designed to accelerate scientific discovery in high energy physics by improving detector simulation throughput and energy efficiency using GPUs.
https://celeritas-project.github.io/celeritas/user/index.html
Other
63 stars 34 forks source link

Improve ORANGE memory usage and "background" performance #1334

Open sethrj opened 3 months ago

sethrj commented 3 months ago

On frontier, something catastrophic has happened to the performance of ORANGE in a couple of problems. Here's the relative throughput of 0.5.0-dev.202+141cd4928 compared to v0.4.4: throughput

This is not due to a change in the number of steps (so the amount of work is the same): work

As a reminder, the problem complexity and abbreviations: problem-complexity names and 'M̃' means MSC is disabled.

sethrj commented 3 months ago

@esseivaju 's runs on perlmutter confirm this is an ORANGE issue: throughput

sethrj commented 3 months ago

OK, the difference was from changing from the pregenerated .org.json to the built-in GDML generator... not clear why

sethrj commented 3 months ago

@esseivaju has been doing some profiling of the along-step kernels. Long story short: the T- cases ("EM flat") are doing the same number of operations but the memory access are much more expensive.

It seems the main difference is the empty "calorimeter" volume, which is a box with all the inner boxes subtracted out: the manually constructed version doesn't have this. Even though the volume is never actually entered, it results in a logic expression for the volume that's huge, leading to a major change in "max faces" and "max intersections" which are used by the ORANGE state data:

 "orange": {
 "scalars": {
 "max_depth": 1,
-"max_faces": 12,
-"max_intersections": 12,
-"max_logic_depth": 3,
+"max_faces": 111,
+"max_intersections": 111,
+"max_logic_depth": 2,
 "tol": {
-"abs": 1e-08,
-"rel": 1e-08
+"abs": 1.5e-09,
+"rel": 1.5e-08
 }
 },
 "sizes": {
 "bih": {
 "bboxes": 102,
-"inner_nodes": 0,
-"leaf_nodes": 1,
+"inner_nodes": 99,
+"leaf_nodes": 100,
 "local_volume_ids": 102
 },
 "connectivity_records": 111,
 "daughters": 0,
-"local_surface_ids": 618,
-"local_volume_ids": 305,
-"logic_ints": 87,
+"local_surface_ids": 717,
+"local_volume_ids": 301,
+"logic_ints": 45,
 "real_ids": 111,
-"reals": 103,
+"reals": 104,
 "rect_arrays": 0,
 "simple_units": 1,
 "surface_types": 111,

I'm not sure whether our priority should be (1) patch this up somehow, (2) implementing different striding patterns and padding for GPU/CPU so that we can coalesce memory accesses in SimpleUnitTracker::simple_intersect/SimpleUnitTracker::intersect_impl; or (3) speed up our plans for short circuiting and evaluate the boundary distances on the fly rather than storing them to effectively "local" memory.

sethrj commented 3 months ago

BTW I found this using git bisect: very glad (@pcanal ) that we use a squash merge strategy 😉

$ git bisect log
# bad: [21940769b1bfba623289ad1ebb3fe30b81d1b5e4] Update frontier toolchain and backward compatibility (#1330)
# good: [b1e52f2001c73fd887ab26dfa100480b5b146167] Add Cerenkov distribution and generator (#1080)
git bisect start 'HEAD' 'v0.5.0-dev'
# bad: [89336e59b66fad695833cba4c65fac976918e60e] Add Windows/Linux no-dependency builds (#1196)
git bisect bad 89336e59b66fad695833cba4c65fac976918e60e
# good: [18f4ae60832c06e70d24b915f0781df38474e038] Fix missing Werror in build-fast workflow (#1141)
git bisect good 18f4ae60832c06e70d24b915f0781df38474e038
# good: [4dd382ff2bcd4f1a5c4ca9a0605c8e28474b0f27] Add sense evaluator for testing (#1168)
git bisect good 4dd382ff2bcd4f1a5c4ca9a0605c8e28474b0f27
# good: [e1c7aedce274dece374ffeb8e093c1793d5b2e77] Pin sphinx at 7.2 to fix user doc build (#1188)
git bisect good e1c7aedce274dece374ffeb8e093c1793d5b2e77
# bad: [57ef806c1424e11f03299f0443fe60396e28e77d] Update esseivaj user presets (#1195)
git bisect bad 57ef806c1424e11f03299f0443fe60396e28e77d
# bad: [2be4ebb86ff1b764a04d7e931f25b78d56c963ae] Switch ORANGE unit tests to use GDML files (#1181)
git bisect bad 2be4ebb86ff1b764a04d7e931f25b78d56c963ae
# skip: [4e9676e42f064a40e9fdc4b0e949e128c2b99d1a] Define geometry traits (#1190)
git bisect skip 4e9676e42f064a40e9fdc4b0e949e128c2b99d1a
# skip: [fe2611587b4b4a1e4f6a71e33d3ebcb6e4332041] Complete GDML-to-ORANGE geometry converter (#1180)
git bisect skip fe2611587b4b4a1e4f6a71e33d3ebcb6e4332041
# only skipped commits left to test
# possible first bad commit: [2be4ebb86ff1b764a04d7e931f25b78d56c963ae] Switch ORANGE unit tests to use GDML files (#1181)
# possible first bad commit: [4e9676e42f064a40e9fdc4b0e949e128c2b99d1a] Define geometry traits (#1190)
# possible first bad commit: [fe2611587b4b4a1e4f6a71e33d3ebcb6e4332041] Complete GDML-to-ORANGE geometry converter (#1180)
sethrj commented 3 months ago

Yeah the ugly length-111 cell is the "background" volume (world).

sethrj commented 3 months ago

It seems that the only real slowdown is because of the less compact data layout for the temporary states: I don't think the background cell is even being reached in practice. So we really need a more optimal way (BIH/BVH traversal) for tracking across these.

sethrj commented 1 month ago

The plan discussed with @elliottbiondo:

  1. Refactor the background intersect from "all surfaces then volumes" to "all volumes then surfaces", using the BIH bounding boxes as a first pass for testing intersection. If it intersects the AABB, then we have to test all external surfaces for the volume, sort, bump past each surface (except for the last maybe?), test whether we end up inside the volume.
  2. Instead of using a bbox test, use an OBB test.
  3. Instead of looping over all volumes, track through a bounding tree.