Closed krasznaa closed 9 months ago
Well, it looks fairly sensible at a glance, but I'm curious what your opinion is on how this compares to https://github.com/acts-project/traccc/pull/348; does this allow easy switching between SoA, AoS, SoAoS, etc. layouts like the code in traccc does? We don't know what the optimal layout is for a lot of the EDM that we have, and it would certainly be nice to be able to do some design space exploration in that respect.
:thinking: I was not going for flexibility with this, since in practice I don't see an advantage in being able to change the layout of some data "super easily". I know that's a fun problem to explore, and I'm doing it myself right now as well, but specifically for ATLAS this is the thing that we will need: ATLAS's offline EDM has, and will continue to have, an SoA layout.
It's very good that you reminded me about your code though. I'm absolutely open to taking/stealing design ideas from it. :stuck_out_tongue:
But keep in mind that I want to support not just 1D arrays with this, but containers with a mixture of scalars, vectors and jagged vectors, which I believe even LLAMA cannot do in a generic way. Though I might be mistaken on that... :thinking:
Yeah, I am not saying flexibility should be a design goal per se, but I am slightly worried about us converting our EDM code to this and finding out that SoA is actually slower (which it certainly can be in some applications), and then it would be a pain in the rear to convert back to an AoS EDM.
That's not to say this code doesn't have value of course, it looks very useful (especially the single-allocation thing!) but I wonder if we could somehow adapt this to allow us to explore the performance impact of layouts.
My current plan for that is very simple... We could convert the traccc container/collection types one by one, and see at every change what happens to our throughput numbers. :wink: We'd need to do this one-by-one anyway, since this will be a fairly big undertaking.
Well, I think it's important to consider that the entire point of SoA vs. AoS is to improve cache efficiency, and due to pollution effects the performance of a kernel will always depend on the interplay between the layouts of all the data it uses. That gives a fairly large design space that cannot be explored by iteratively changing layouts. The effect is unlikely to be massive, but still.
I'm perhaps more worried about being unable to benchmark the performance effects of changes to the code under different layouts!
:confused: But didn't you say that super flexible layout changes are not the goal? Unless we make the EDM's layout a compile-time or runtime parameter, we'll always have to put some effort into the testing. And I'm not keen on writing an EDM that can provide this flexibility during the testing, if in the long run it will be harder to maintain than a "single layout". (Hence my worries about adopting something like LLAMA as well.)
We have students/contributors for exactly these types of tasks! I'm really not that worried about the testing aspect of this.
Alright, I'll trust your decision-making on this :+1:
I anyway plan to make the very first performance measurements in traccc while this still remains in the PR. :thinking: And if I get promising numbers, we'd add this specific SoA code in.
(I was considering putting soa into the namespace names, but then again, all of the other code we have is clearly AoS anyway. :stuck_out_tongue: So we wouldn't add separate AoS code in this project on top of what we already have.)
@stephenswat, we should warm up the discussion here. :thinking:
By now I have a semi-functional rewrite of the "cell EDM" in: https://github.com/krasznaa/traccc/tree/CellEDMRedesign-main-20231109 But it's not behaving nearly as well as I hoped. :frowning:
A very preliminary profile, on just a few events, tells me this with the current code:
```
[bash][atspot01]:traccc > nsys stats --report gpusum ./old_mu20.nsys-rep
Processing [./old_mu20.sqlite] with [/software/cuda/12.0.1/x86_64/nsight-systems-2022.4.2/host-linux-x64/reports/gpusum.py]...
** GPU Summary (Kernels/MemOps) (gpusum):
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Category Operation
-------- --------------- --------- --------- --------- -------- -------- ----------- ----------- ----------------------------------------------------------------------------------------------------
29.8 223,837,577 1,100 203,488.7 202,027.5 116,871 325,395 42,892.2 CUDA_KERNEL traccc::cuda::kernels::ccl_kernel(vecmem::data::vector_view<const traccc::cell>, vecmem::data::vect…
20.3 153,051,123 4,400 34,784.3 26,529.5 608 144,744 35,078.7 MEMORY_OPER [CUDA memcpy HtoD]
12.8 95,948,962 1,100 87,226.3 83,685.0 44,962 167,274 21,830.9 CUDA_KERNEL traccc::cuda::kernels::find_doublets(traccc::seedfinder_config, detray::const_grid2_view<detray::gr…
6.9 51,722,823 1,100 47,020.7 46,162.5 27,458 86,597 10,395.9 CUDA_KERNEL traccc::cuda::kernels::count_doublets(traccc::seedfinder_config, detray::const_grid2_view<detray::g…
5.8 43,374,873 1,100 39,431.7 34,193.5 20,481 148,137 16,283.6 CUDA_KERNEL traccc::cuda::kernels::form_spacepoints(vecmem::data::vector_view<const traccc::measurement>, vecme…
5.5 41,279,544 9,900 4,169.7 1,184.0 384 54,115 7,801.3 MEMORY_OPER [CUDA memcpy DtoH]
3.8 28,453,277 1,100 25,866.6 25,585.5 8,801 53,443 7,501.6 CUDA_KERNEL traccc::cuda::kernels::find_triplets(traccc::seedfinder_config, traccc::seedfilter_config, detray::…
3.5 26,380,498 1,100 23,982.3 22,545.5 12,321 93,734 7,557.1 CUDA_KERNEL traccc::cuda::kernels::select_seeds(traccc::seedfilter_config, vecmem::data::vector_view<const trac…
3.3 24,695,085 1,100 22,450.1 21,089.0 9,217 56,227 7,464.1 CUDA_KERNEL traccc::cuda::kernels::count_triplets(traccc::seedfinder_config, detray::const_grid2_view<detray::g…
1.9 13,990,744 8,800 1,589.9 928.0 543 14,849 1,422.3 MEMORY_OPER [CUDA memset]
1.8 13,470,929 1,100 12,246.3 11,713.0 4,960 31,778 3,785.2 CUDA_KERNEL traccc::cuda::kernels::estimate_track_params(vecmem::data::vector_view<const traccc::spacepoint>, v…
1.4 10,484,134 1,100 9,531.0 9,056.0 4,064 26,977 3,063.0 CUDA_KERNEL traccc::cuda::kernels::populate_grid(traccc::seedfinder_config, vecmem::data::vector_view<const tra…
1.1 8,186,332 1,100 7,442.1 6,944.0 3,840 19,809 2,391.8 CUDA_KERNEL traccc::cuda::kernels::count_grid_capacities(traccc::seedfinder_config, detray::axis::circular<detr…
0.9 7,097,575 1,100 6,452.3 5,953.0 3,040 26,945 2,115.1 CUDA_KERNEL traccc::cuda::kernels::update_triplet_weights(traccc::seedfilter_config, detray::const_grid2_view<d…
0.7 5,556,318 1,100 5,051.2 4,513.0 2,144 16,096 2,130.8 CUDA_KERNEL traccc::cuda::kernels::fill_prefix_sum(vecmem::data::vector_view<const unsigned int>, vecmem::data:…
0.6 4,622,185 1,100 4,202.0 3,744.0 1,920 15,489 1,826.2 CUDA_KERNEL traccc::cuda::kernels::reduce_triplet_counts(vecmem::data::vector_view<const traccc::device::double…
```
And this with the "new code":
```
[bash][atspot01]:traccc > nsys stats --report gpusum ./new_mu20.nsys-rep
Processing [./new_mu20.sqlite] with [/software/cuda/12.0.1/x86_64/nsight-systems-2022.4.2/host-linux-x64/reports/gpusum.py]...
** GPU Summary (Kernels/MemOps) (gpusum):
Time (%) Total Time (ns) Instances Avg (ns) Med (ns) Min (ns) Max (ns) StdDev (ns) Category Operation
-------- --------------- --------- --------- --------- -------- -------- ----------- ----------- ----------------------------------------------------------------------------------------------------
28.8 212,110,658 1,100 192,827.9 189,851.0 113,223 361,237 47,105.8 CUDA_KERNEL traccc::cuda::kernels::ccl_kernel(vecmem::edm::view<vecmem::edm::schema<vecmem::edm::type::vector<c…
22.5 166,304,876 4,400 37,796.6 22,465.5 608 229,261 39,197.0 MEMORY_OPER [CUDA memcpy HtoD]
12.3 90,742,312 1,100 82,493.0 78,772.5 44,067 166,506 20,973.3 CUDA_KERNEL traccc::cuda::kernels::find_doublets(traccc::seedfinder_config, detray::const_grid2_view<detray::gr…
6.7 49,536,374 1,100 45,033.1 43,490.0 27,330 81,605 10,164.4 CUDA_KERNEL traccc::cuda::kernels::count_doublets(traccc::seedfinder_config, detray::const_grid2_view<detray::g…
6.0 43,892,150 9,900 4,433.6 1,152.0 448 106,630 8,881.8 MEMORY_OPER [CUDA memcpy DtoH]
5.6 41,164,343 1,100 37,422.1 33,698.0 20,929 111,430 13,243.0 CUDA_KERNEL traccc::cuda::kernels::form_spacepoints(vecmem::data::vector_view<const traccc::measurement>, vecme…
3.5 26,107,864 1,100 23,734.4 22,977.0 9,568 46,659 7,174.7 CUDA_KERNEL traccc::cuda::kernels::find_triplets(traccc::seedfinder_config, traccc::seedfilter_config, detray::…
3.3 24,435,899 1,100 22,214.5 20,817.0 11,873 94,822 6,886.4 CUDA_KERNEL traccc::cuda::kernels::select_seeds(traccc::seedfilter_config, vecmem::data::vector_view<const trac…
3.1 22,962,055 1,100 20,874.6 19,553.0 8,257 55,459 7,124.6 CUDA_KERNEL traccc::cuda::kernels::count_triplets(traccc::seedfinder_config, detray::const_grid2_view<detray::g…
1.9 13,995,231 8,800 1,590.4 896.0 543 16,545 1,437.0 MEMORY_OPER [CUDA memset]
1.7 12,590,191 1,100 11,445.6 10,657.0 4,864 30,242 3,610.9 CUDA_KERNEL traccc::cuda::kernels::estimate_track_params(vecmem::data::vector_view<const traccc::spacepoint>, v…
1.3 9,251,329 1,100 8,410.3 7,936.0 3,393 27,233 2,887.2 CUDA_KERNEL traccc::cuda::kernels::populate_grid(traccc::seedfinder_config, vecmem::data::vector_view<const tra…
1.1 8,051,474 1,100 7,319.5 6,688.0 3,552 24,289 2,514.2 CUDA_KERNEL traccc::cuda::kernels::count_grid_capacities(traccc::seedfinder_config, detray::axis::circular<detr…
0.9 6,875,930 1,100 6,250.8 5,856.5 3,264 21,761 2,114.1 CUDA_KERNEL traccc::cuda::kernels::update_triplet_weights(traccc::seedfilter_config, detray::const_grid2_view<d…
0.7 5,155,936 1,100 4,687.2 4,256.0 1,920 21,090 1,974.7 CUDA_KERNEL traccc::cuda::kernels::fill_prefix_sum(vecmem::data::vector_view<const unsigned int>, vecmem::data:…
0.6 4,366,034 1,100 3,969.1 3,456.0 1,696 21,153 1,814.8 CUDA_KERNEL traccc::cuda::kernels::reduce_triplet_counts(vecmem::data::vector_view<const traccc::device::double…
```
Interestingly enough (I didn't even know this...), at μ=20 the job is dominated by the time taken by clusterization. :thinking: The clusterization kernel itself does become a little faster with the new EDM. But the overheads from memory copies become worse with the new EDM. :frowning: Even though the cell data is copied with the same number of operations. (One copy per SoA container, just like there is one copy per AoS container right now.)
The only thing I can think of right now is that the current code in this PR is "over-allocating" the buffers. But even with that, the amount of memory actually being copied should not be over-estimated. :confused:
So, I'm very open to ideas on why the copy of SoA buffers is apparently too slow in the current version of this PR. (To be clear, I'm hoping that just writing the problem down explicitly may help organize my own thoughts as well...)
Hmm... No, the amount of memory copied is not any different...
```
[bash][atspot01]:traccc > nsys stats --report gpumemsizesum ./old_mu20.nsys-rep
Processing [./old_mu20.sqlite] with [/software/cuda/12.0.1/x86_64/nsight-systems-2022.4.2/host-linux-x64/reports/gpumemsizesum.py]...
** GPU MemOps Summary (by Size) (gpumemsizesum):
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
---------- ----- -------- -------- -------- -------- ----------- ------------------
1,853.491 4,400 0.421 0.254 0.002 1.210 0.437 [CUDA memcpy HtoD]
322.024 9,900 0.033 0.000 0.000 0.467 0.096 [CUDA memcpy DtoH]
39.576 8,800 0.004 0.000 0.000 0.050 0.011 [CUDA memset]
[bash][atspot01]:traccc > nsys stats --report gpumemsizesum ./new_mu20.nsys-rep
Processing [./new_mu20.sqlite] with [/software/cuda/12.0.1/x86_64/nsight-systems-2022.4.2/host-linux-x64/reports/gpumemsizesum.py]...
** GPU MemOps Summary (by Size) (gpumemsizesum):
Total (MB) Count Avg (MB) Med (MB) Min (MB) Max (MB) StdDev (MB) Operation
---------- ----- -------- -------- -------- -------- ----------- ------------------
1,834.251 4,400 0.417 0.254 0.002 1.210 0.432 [CUDA memcpy HtoD]
324.344 9,900 0.033 0.000 0.000 0.467 0.097 [CUDA memcpy DtoH]
39.721 8,800 0.005 0.000 0.000 0.050 0.011 [CUDA memset]
[bash][atspot01]:traccc >
```
I'm rather getting to a very unhappy explanation... :frowning: You see, in order to copy N (primitive) arrays with a single copy operation, the copy happens in 2 steps: the variables are first assembled into a single "host buffer" in pinned host memory, and that buffer is then moved to the device with one copy operation.
Apparently creating these "host buffers" in pinned host memory is a bit too expensive, since it happens anew event after event. In our current results, in contrast, we take the same AoS data from pinned host memory event after event. (We fill the AoS data into pinned host memory only once, at the beginning of the job.)
So the SoA code does seem to be faster for the kernels. But at the same time, the current throughput measurements are a bit overly optimistic, as they don't account for the fact that we'll need to allocate pinned host memory over and over again if we want to make asynchronous copies happen. :thinking:
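A minimal, self-contained sketch of that two-step copy, assuming a hypothetical `pack()` helper, with `std::vector<char>` standing in for the pinned host allocation and a comment marking where the single device transfer would go:

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Sketch of the two-step copy: first pack N separate host arrays into
// one contiguous staging buffer, then move that buffer to the device
// with a single copy. In the real code the staging buffer would be a
// pinned host allocation, and it is the repeated creation of that
// allocation, event after event, that turns out to be expensive.
std::vector<char> pack(const std::vector<std::vector<float>>& arrays) {
    std::size_t total = 0;
    for (const auto& a : arrays) {
        total += a.size() * sizeof(float);
    }
    std::vector<char> staging(total);  // stand-in for pinned host memory
    std::size_t offset = 0;
    for (const auto& a : arrays) {
        std::memcpy(staging.data() + offset, a.data(),
                    a.size() * sizeof(float));
        offset += a.size() * sizeof(float);
    }
    // Step 2 would be a single asynchronous host-to-device copy of
    // 'total' bytes from staging.data() here.
    return staging;
}
```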
I guess time to bring this up on Mattermost as well... :thinking:
Everything useful from this PR is now in main.
This is still quite early in the development process, but I wanted to open a draft at this point to gather some feedback. This is the "big EDM development" that I've been teasing for traccc for a while.
A "container" is meant to be declared something like the following in this setup:
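As a purely hypothetical, self-contained sketch of the idea (not the actual interface proposed in this PR), deriving a family of related types from a single declaration of the container's variables could look something like:

```cpp
#include <cstddef>
#include <tuple>
#include <utility>
#include <vector>

// Hypothetical sketch: declare the variables of a container once, and
// derive the related types from that single declaration.
template <typename... VARS>
struct container {
    // Owning host-side type: one std::vector per declared variable.
    struct host {
        std::tuple<std::vector<VARS>...> vars;
    };
    // Non-owning view: a (pointer, size) pair per variable. Being
    // trivially copyable, such a type could be passed by value into a
    // GPU kernel.
    struct view {
        std::tuple<std::pair<VARS*, std::size_t>...> vars;
    };
    // A view can be made from a host object without copying the data.
    static view get_view(host& h) {
        return view{std::apply(
            [](auto&... v) {
                return std::make_tuple(std::make_pair(v.data(), v.size())...);
            },
            h.vars)};
    }
};

// "Declaring" a container with two variables:
using cell_container = container<float, unsigned int>;
```

The real code additionally distinguishes `device`/`const_device` and `buffer` types, and handles scalars and jagged vectors, but the shape of the formalism is the same: one declaration, many derived types.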
This would then make the following types available to the code:
- some_container::host is the type to use in host code, which uses vecmem::vector and vecmem::jagged_vector types underneath;
- some_container::device and some_container::const_device are the device types to interact with the container in device code;
- some_container::buffer is the buffer type that can manage the memory of device-only containers;
- some_container::view and some_container::const_view are views that could be passed to device code.

This formalism has 2 major benefits in my mind over writing types explicitly that use N vecmem::vector, vecmem::device_vector, etc. variables themselves. Most notably, a vecmem::edm::buffer is set up with a single memory allocation on the device! So even containers with many variables should be possible to copy between the host and the device with a single copy operation. :wink:

The code currently cannot handle jagged vectors yet, and I fear that I might need to introduce a "data type" on top of what we have here already. (In fact I'm pretty sure that I'll still need to do that...)
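A minimal sketch of the single-allocation layout, assuming a hypothetical `layout_offsets()` helper and ignoring the alignment padding that a real implementation would have to add between variables:

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Compute the byte offset of each variable inside one contiguous
// allocation, given the variables' sizes in bytes. A single device
// allocation of sum(sizes) bytes then backs all N variables, so one
// copy operation can move the whole container. (A real implementation
// would also pad each offset up to a suitable alignment boundary.)
std::vector<std::size_t> layout_offsets(const std::vector<std::size_t>& sizes) {
    if (sizes.empty()) {
        return {};
    }
    std::vector<std::size_t> offsets(sizes.size(), 0);
    // offsets[i] = sizes[0] + ... + sizes[i-1]
    std::partial_sum(sizes.begin(), sizes.end() - 1, offsets.begin() + 1);
    return offsets;
}
```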
I'm open to criticism. Though I have to say, besides some minor things, I'm fairly pleased with how this development is going. :wink: