acts-project / traccc

Demonstrator tracking chain on accelerators

Cell EDM Rewrite, main branch (2024.09.18.) #712

Closed: krasznaa closed this 1 month ago

krasznaa commented 2 months ago

This is the next monster PR... It replaces traccc::cell_collection_types and traccc::cluster_container_types with SoA versions.
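
For context, the gist of the AoS vs. SoA layout change, with illustrative member names rather than the actual traccc declarations (which are built on top of vecmem):

```cpp
#include <vector>

// Array-of-Structs (AoS): one contiguous array of small cell structs.
struct cell {
    unsigned int channel0 = 0, channel1 = 0;  // pixel coordinates
    float activation = 0.f;                   // deposited charge
    unsigned int module_index = 0;            // owning detector module
};
using cell_collection_aos = std::vector<cell>;

// Struct-of-Arrays (SoA): one contiguous array per member, so a kernel
// that only touches, say, the activations reads a single dense array.
struct cell_collection_soa {
    std::vector<unsigned int> channel0, channel1;
    std::vector<float> activation;
    std::vector<unsigned int> module_index;
};
```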

To cut right to the chase: it doesn't bring any performance improvement. :frowning: This EDM change of course only affects clusterization, which is already one of the fastest steps that we run. Still, if anything, I see an O(1%) performance drop in the TML $\mu = 200$ throughput measurements with this update applied. :thinking:

On my RTX3080 I get the following with the current main branch:

[bash][Legolas]:traccc > ./build-orig/bin/traccc_throughput_st_cuda --input-directory=tml_full/ttbar_mu200/ --input-events=10 --cold-run-events=100 --processed-events=1000
...
Using CUDA device: NVIDIA GeForce RTX 3080 [id: 0, bus: 1, device: 0]
Reconstructed track parameters: 0
Time totals:
                  File reading  5475 ms
            Warm-up processing  359 ms
              Event processing  2814 ms
Throughput:
            Warm-up processing  3.59883 ms/event, 277.868 events/s
              Event processing  2.81497 ms/event, 355.244 events/s
[bash][Legolas]:traccc >

While this PR produces the following:

[bash][Legolas]:traccc > ./out/build/cuda/bin/traccc_throughput_st_cuda --input-directory=tml_full/ttbar_mu200/ --input-events=10 --cold-run-events=100 --processed-events=1000
...
Using CUDA device: NVIDIA GeForce RTX 3080 [id: 0, bus: 1, device: 0]
Reconstructed track parameters: 0
Time totals:
                  File reading  5471 ms
            Warm-up processing  305 ms
              Event processing  2830 ms
Throughput:
            Warm-up processing  3.0542 ms/event, 327.418 events/s
              Event processing  2.83068 ms/event, 353.272 events/s
[bash][Legolas]:traccc >

(There is some variation in these numbers, but the "new" code is always just a little slower. :frowning:)

About the code:

We'll have to do some profiling, but I suspect that the small performance drop comes from the PR's code always reading the cell data from global memory, whenever it needs it. Just loading some of that info into local registers in a couple of places will hopefully take us back to the previous performance; I just didn't want to complicate the code even further in this PR. :thinking:
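
A sketch of the kind of fix meant here, assuming a hypothetical SoA view type rather than the real traccc code:

```cpp
#include <cstddef>

// Hypothetical SoA view: one raw pointer per cell member.
struct cell_view {
    const unsigned int* channel0;
    const unsigned int* channel1;
    const float* activation;
};

// Every "cells.activation[i]" is a fresh (global) memory load. Reading the
// value once into a local variable lets the compiler keep it in a register
// for all subsequent uses within the loop body.
float sum_of_squares(cell_view cells, std::size_t n) {
    float sum = 0.f;
    for (std::size_t i = 0; i < n; ++i) {
        const float act = cells.activation[i];  // load once...
        sum += act * act;                       // ...reuse from a register
    }
    return sum;
}
```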

This PR also closes #691.

stephenswat commented 2 months ago

As mentioned, please increase the minimum vecmem version to 1.8.0.

beomki-yeo commented 2 months ago

Hmm, are you going to go through this for every EDM class?

beomki-yeo commented 2 months ago

This will also have lots of conflicts with #692 :crying_cat_face:

stephenswat commented 2 months ago

> This will also have lots of conflicts with #692 😿

The good news is that when the proxy objects are implemented, most of the code will remain unchanged.
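
Roughly, the proxy idea in miniature (hypothetical names; the eventual vecmem::edm proxies are more involved):

```cpp
#include <cstddef>
#include <vector>

// SoA storage with a lightweight proxy, so that element access still
// "feels" like handling a single cell object at the call site.
struct cell_store {
    std::vector<unsigned int> channel0;
    std::vector<float> activation;

    struct proxy {
        cell_store& store;
        std::size_t index;
        unsigned int& channel0() const { return store.channel0[index]; }
        float& activation() const { return store.activation[index]; }
    };
    proxy at(std::size_t i) { return proxy{*this, i}; }
};

// Call sites written against the proxy interface would not need to change:
//   auto cell = cells.at(42);
//   cell.activation() *= 2.f;
```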

krasznaa commented 2 months ago

> > This will also have lots of conflicts with #692 😿

> The good news is that when the proxy objects are implemented, most of the code will remain unchanged.

Some further developments are indeed underway...

krasznaa commented 2 months ago

All of you, hold onto your hats. :smile: If/once we settle on https://github.com/acts-project/vecmem/pull/296, these are the types of updates that we will need to make to switch from the current AoS to a new SoA EDM:

(screenshot of the example code updates)
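
(The screenshot itself is not reproduced here; the flavour of the change, with illustrative names rather than the actual diff, is roughly the following.)

```cpp
#include <cstddef>
#include <vector>

struct cell { float activation = 0.f; };
struct cells_soa { std::vector<float> activation; };

// AoS style: index the collection, then access a member of one object.
float get_aos(const std::vector<cell>& cells, std::size_t i) {
    return cells[i].activation;
}

// SoA style: pick the member (column) first, then index into it.
float get_soa(const cells_soa& cells, std::size_t i) {
    return cells.activation[i];
}
```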

At the same time, I also took a look at profiles of the throughput application. This was very educational. As it turns out, the small slowdown is not due to the kernels; it seems to come from the code spending a little more time on memory copies. :thinking:

That's not great news, as apparently the vecmem::edm code is not quite as efficient in terms of CPU usage as I had hoped. But at least the SoA layout doesn't seem to have much of an impact on clusterization after all. (Remember: even with the current AoS layout, since traccc::cell is tiny, the memory access pattern of clusterization is already pretty efficient.)
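
To illustrate the suspicion about the copies: with a hypothetical two-member cell SoA (the real code goes through vecmem's copy helpers), the transfers look something like this:

```cpp
#include <cuda_runtime.h>
#include <vector>

// With AoS, an event's cells move in one host-to-device transfer. With SoA,
// each member array is its own transfer, so the host does more per-copy
// bookkeeping even when the total byte count is the same. That extra
// host-side work is one plausible source of the small slowdown seen here.
void copy_cells_soa(const std::vector<unsigned int>& channel0,
                    const std::vector<float>& activation,
                    unsigned int* dev_channel0, float* dev_activation) {
    cudaMemcpy(dev_channel0, channel0.data(),
               channel0.size() * sizeof(unsigned int),
               cudaMemcpyHostToDevice);
    cudaMemcpy(dev_activation, activation.data(),
               activation.size() * sizeof(float),
               cudaMemcpyHostToDevice);
}
```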

krasznaa commented 2 months ago

The good news is that, once the code starts working on all platforms with all compilers, this latest version finally delivers on the performance front. :smile:

[bash][Legolas]:traccc > ./out/build/cuda/bin/traccc_throughput_st_cuda --input-directory=tml_full/ttbar_mu200/ --input-events=10 --cold-run-events=100 --processed-events=1000

Running Single-threaded CUDA GPU throughput tests

>>> Detector Options <<<
  Detector file       : tml_detector/trackml-detector.csv
  Material file       : 
  Surface grid file   : 
  Use detray::detector: no
  Digitization file   : tml_detector/default-geometric-config-generic.json
>>> Input Data Options <<<
  Input data format             : csv
  Input directory               : tml_full/ttbar_mu200/
  Number of input events        : 10
  Number of input events to skip: 0
>>> Clusterization Options <<<
  Threads per partition:      256
  Target cells per thread:    8
  Max cells per thread:       16
  Scratch space size mult.:   256
>>> Track Seeding Options <<<
  None
>>> Track Finding Options <<<
  Max number of branches per seed: 10
  Max number of branches per surface: 10
  Track candidates range   : 3:100
  Minimum step length for the next surface: 0.5 [mm] 
  Maximum step counts for the next surface: 100
  Maximum Chi2             : 30
  Maximum branches per step: 10
  Maximum number of skipped steps per candidates: 3
  PDG Number: 13
>>> Track Propagation Options <<<
Navigation
----------------------------
  Min. mask tolerance   : 1e-05 [mm]
  Max. mask tolerance   : 1 [mm]
  Mask tolerance scalor : 0.05
  Path tolerance        : 1 [um]
  Overstep tolerance    : -100 [um]
  Search window         : 0 x 0

Parameter Transport
----------------------------
  Min. Stepsize         : 0.0001 [mm]
  Runge-Kutta tolerance : 0.0001 [mm]
  Max. step updates     : 10000
  Stepsize  constraint  : 3.40282e+38 [mm]
  Path limit            : 5 [m]
  Use Bethe energy loss : true
  Do cov. transport     : true
  Use eloss gradient    : false
  Use B-field gradient  : false

>>> Throughput Measurement Options <<<
  Cold run event(s) : 100
  Processed event(s): 1000
  Log file          : 

WARNING: @traccc::io::csv::read_cells: 251 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000000-cells.csv
WARNING: @traccc::io::csv::read_cells: 305 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000001-cells.csv
WARNING: @traccc::io::csv::read_cells: 176 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000002-cells.csv
WARNING: @traccc::io::csv::read_cells: 200 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000003-cells.csv
WARNING: @traccc::io::csv::read_cells: 224 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000004-cells.csv
WARNING: @traccc::io::csv::read_cells: 170 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000005-cells.csv
WARNING: @traccc::io::csv::read_cells: 321 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000006-cells.csv
WARNING: @traccc::io::csv::read_cells: 322 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000007-cells.csv
WARNING: @traccc::io::csv::read_cells: 222 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000008-cells.csv
WARNING: @traccc::io::csv::read_cells: 118 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/tml_full/ttbar_mu200/event000000009-cells.csv
Using CUDA device: NVIDIA GeForce RTX 3080 [id: 0, bus: 1, device: 0]
Reconstructed track parameters: 0
Time totals:
                  File reading  4968 ms
            Warm-up processing  358 ms
              Event processing  2715 ms
Throughput:
            Warm-up processing  3.58551 ms/event, 278.9 events/s
              Event processing  2.71537 ms/event, 368.274 events/s
[bash][Legolas]:traccc >

Though I am a little afraid that this may be artificial, since the previous result was on x86_64-ubuntu2204-gcc11-opt, while these latest numbers are on x86_64-ubuntu2404-gcc13-opt. (I upgraded my home PC over the weekend... :stuck_out_tongue:) Still, at least the hardware is the same... :thinking:

krasznaa commented 1 month ago

> Quality Gate failed
>
> Failed conditions:
>
> - 2 New Bugs (required ≤ 0)
> - C Reliability Rating on New Code (required ≥ A)
>
> See analysis details on SonarCloud


Huhh... :thinking: What's your take on these errors, @stephenswat?

stephenswat commented 1 month ago

> Huhh... 🤔 What's your take on these errors, @stephenswat?

SonarCloud actually makes a really valid point here about constraining universal references; I'd suggest we go ahead and implement the constraints.
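
The kind of constraint in question looks roughly like this (a C++20 sketch with a stand-in type, not the actual traccc code):

```cpp
#include <concepts>
#include <type_traits>
#include <utility>

struct cell_container {};  // stand-in for the real type

// Unconstrained, "T&&" accepts *anything*, which is what SonarCloud flags:
//   template <typename T> void set_data(T&& data);

// Constrained, the forwarding reference only accepts (cv/ref variations of)
// the intended type, while still forwarding perfectly:
template <typename T>
    requires std::same_as<std::remove_cvref_t<T>, cell_container>
void set_data(T&& data) {
    cell_container local = std::forward<T>(data);  // moves or copies as appropriate
    (void)local;
}
```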

krasznaa commented 1 month ago

> > Huhh... 🤔 What's your take on these errors, @stephenswat?

> SonarCloud actually makes a really valid point here about constraining universal references; I'd suggest we go ahead and implement the constraints.

As long as you have a concrete idea of how to go about it, I'm happy to let you propose the improvement. :wink:

stephenswat commented 1 month ago

Okay, I guess we need to get vecmem 1.10.0 and then we can put this in, right?

sonarcloud[bot] commented 1 month ago

Quality Gate failed

Failed conditions:

- C Reliability Rating on New Code (required ≥ A)
- 2 New Bugs (required ≤ 0)

See analysis details on SonarCloud
