SoA EDM Base Classes, main branch (2023.11.02.)

krasznaa commented 10 months ago

This is still quite early in the development process, but I wanted to open a draft at this point to gather some feedback. This is the "big EDM development" that I've been teasing for traccc since a while.

A "container" is meant to be declared something like the following in this setup:

struct some_container
    : public vecmem::edm::container<
          vecmem::edm::type::scalar<int>,
          vecmem::edm::type::vector<float>,
          vecmem::edm::type::scalar<float>,
          vecmem::edm::type::vector<int> > {

    /// @name Accessors to the individual variables in the collection
    /// @{

    /// Global "count" of something
    using count = vecmem::edm::accessor<0, schema>;
    /// "Measurement" of something
    using measurement = vecmem::edm::accessor<1, schema>;
    /// Global "average" of something
    using average = vecmem::edm::accessor<2, schema>;
    /// "Index" of something
    using index = vecmem::edm::accessor<3, schema>;

    /// @}
};

This would then make the following types available to the code:

some_container::host the type to use in host code, which uses vecmem::vector and vecmem::jagged_vector types underneath;
some_container::device and some_container::const_device are the device types to interact with the container in device code;
some_container::buffer the buffer type that can manage the memory of device-only containers;
some_container::view and some_container::const_view are views that could be passed to device code.

This formalism has 2 major benefits in my mind over writing types explicitly that use N vecmem::vector, vecmem::device_vector, etc. variables themselves:

It requires much less code writing for setting up the EDM. As we've seen in earlier experiments in traccc, that just requires a lot of error-prone, repetitive code writing. And I just couldn't find a good way to do that code generation using macros. :frowning:
This setup allows us to implement vecmem::edm::buffer with a single memory allocation on the device! So even containers with many variables, should be possible to copy between the host and the device with a single copy operation. :wink:

The code currently cannot handle jagged vectors yet, and I fear that I might need to introduce a "data type" on top of what we have here already. (In fact I'm pretty sure that I'll still need to do that...)

I'm open for criticism. Though I have to say, beside some minor things, I'm fairly pleased with how this development is going. :wink:

stephenswat commented 10 months ago

Well it looks fairly sensible at a glance, but I'm curious what your opinions is on how this compares to https://github.com/acts-project/traccc/pull/348; does this allow easy switching between SoA and AoS and SoAoS, etc. layouts like the code that is in traccc does? We don't know what the optimal layout is for a lot of the EDM that we have and it would certainly be nice to be able to do some design space exploration in that aspect.

krasznaa commented 10 months ago

:thinking: I was not going for flexibility with this. As in practice I don't see an advantage in changing the layout of some data "super easily". I know that that is a fun problem to explore, I'm doing it myself right now as well, but specifically for ATLAS this is the thing that we will need. As ATLAS's offline EDM has, and will continue to have an SoA layout.

It's very good that you reminded me about your code though. I'm absolutely open to taking/stealing design ideas from it. :stuck_out_tongue:

But keep in mind that I want to support not just 1D arrays with this, but containers with a mixture of scalars, vectors and jagged vectors. Which I believe even LLAMA cannot do in a generic way. Though I might be mistaken on that... :thinking:

stephenswat commented 10 months ago

Yeah I am not saying flexibility should be a design goal per sé, but I am slightly worried about us converting our EDM code to this and finding out that SoA is actually slower (which it certainly can be in some applications) and then it would be a pain in the rear to convert back to an AoS EDM.

That's not to say this code doesn't have value of course, it looks very useful (especially the single-allocation thing!) but I wonder if we could somehow adapt this to allow us to explore the performance impact of layouts.

krasznaa commented 10 months ago

My current plan is very simple with that... We could convert the traccc container/collection types one by one, and see at every change what happens with our throughput numbers. :wink: We'd anyway need to do this one-by-one, since this will be a fairly big undertaking.

stephenswat commented 10 months ago

Well I think it's important to consider that the entire point of SoA vs AoS is to improve cache efficiency and due to pollution effects the performance of a kernel will always be dependent on the interplay between the layouts of all the data it uses, which gives a fairly large design space that cannot be explored by iterative changing of layouts. The effect is unlikely to be massive, but still.

I'm perhaps more worried about being unable to benchmark the performance effects of changes to the code under different layouts!

krasznaa commented 10 months ago

:confused: But didn't you say that super flexible layout changes are not the goal? Unless we make the EDM's layout a compile-time or runtime parameter, we'll always have to put some effort into the testing. And I'm not keen on writing an EDM that can provide this flexibility during the testing, if in the long run it will be harder to maintain than a "single layout". (Hence my worries about adopting something like LLAMA as well.)

We have students/contributors for exactly these types of tasks! I'm really not that worried about the testing aspect of this.

stephenswat commented 10 months ago

Alright, I'll trust your decision-making on this :+1:

krasznaa commented 10 months ago

I anyway plan to make the very first performance measurements in traccc while this still remains in the PR. :thinking: And if I get promising numbers, we'd add this specific SoA code in.

(I was considering putting soa into the namespace names, but then again, all of the other code we have is clearly AoS anyway. :stuck_out_tongue: So we wouldn't add separate AoS code in this project on top of what we already have.)

krasznaa commented 9 months ago

@stephenswat, we should warm up the discussion here. :thinking:

By now I have a semi-functional re-write of the "cell EDM" in: https://github.com/krasznaa/traccc/tree/CellEDMRedesign-main-20231109 But it's not behaving nearly as well as I hoped. :frowning:

High-μ throughput results are not affected in any meaningful way by the EDM change. This is to be expected.
Low-μ throughput numbers are a little lower with the SoA EDM. :frowning:

A very preliminary profile, on just a few events, tells me this with the current code:

[bash][atspot01]:traccc > nsys stats --report gpusum ./old_mu20.nsys-rep 
Processing [./old_mu20.sqlite] with [/software/cuda/12.0.1/x86_64/nsight-systems-2022.4.2/host-linux-x64/reports/gpusum.py]... 

 ** GPU Summary (Kernels/MemOps) (gpusum):

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)   Category                                                 Operation                                              
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  -----------  ----------------------------------------------------------------------------------------------------
     29.8      223,837,577      1,100  203,488.7  202,027.5   116,871   325,395     42,892.2  CUDA_KERNEL  traccc::cuda::kernels::ccl_kernel(vecmem::data::vector_view<const traccc::cell>, vecmem::data::vect…
     20.3      153,051,123      4,400   34,784.3   26,529.5       608   144,744     35,078.7  MEMORY_OPER  [CUDA memcpy HtoD]                                                                                  
     12.8       95,948,962      1,100   87,226.3   83,685.0    44,962   167,274     21,830.9  CUDA_KERNEL  traccc::cuda::kernels::find_doublets(traccc::seedfinder_config, detray::const_grid2_view<detray::gr…
      6.9       51,722,823      1,100   47,020.7   46,162.5    27,458    86,597     10,395.9  CUDA_KERNEL  traccc::cuda::kernels::count_doublets(traccc::seedfinder_config, detray::const_grid2_view<detray::g…
      5.8       43,374,873      1,100   39,431.7   34,193.5    20,481   148,137     16,283.6  CUDA_KERNEL  traccc::cuda::kernels::form_spacepoints(vecmem::data::vector_view<const traccc::measurement>, vecme…
      5.5       41,279,544      9,900    4,169.7    1,184.0       384    54,115      7,801.3  MEMORY_OPER  [CUDA memcpy DtoH]                                                                                  
      3.8       28,453,277      1,100   25,866.6   25,585.5     8,801    53,443      7,501.6  CUDA_KERNEL  traccc::cuda::kernels::find_triplets(traccc::seedfinder_config, traccc::seedfilter_config, detray::…
      3.5       26,380,498      1,100   23,982.3   22,545.5    12,321    93,734      7,557.1  CUDA_KERNEL  traccc::cuda::kernels::select_seeds(traccc::seedfilter_config, vecmem::data::vector_view<const trac…
      3.3       24,695,085      1,100   22,450.1   21,089.0     9,217    56,227      7,464.1  CUDA_KERNEL  traccc::cuda::kernels::count_triplets(traccc::seedfinder_config, detray::const_grid2_view<detray::g…
      1.9       13,990,744      8,800    1,589.9      928.0       543    14,849      1,422.3  MEMORY_OPER  [CUDA memset]                                                                                       
      1.8       13,470,929      1,100   12,246.3   11,713.0     4,960    31,778      3,785.2  CUDA_KERNEL  traccc::cuda::kernels::estimate_track_params(vecmem::data::vector_view<const traccc::spacepoint>, v…
      1.4       10,484,134      1,100    9,531.0    9,056.0     4,064    26,977      3,063.0  CUDA_KERNEL  traccc::cuda::kernels::populate_grid(traccc::seedfinder_config, vecmem::data::vector_view<const tra…
      1.1        8,186,332      1,100    7,442.1    6,944.0     3,840    19,809      2,391.8  CUDA_KERNEL  traccc::cuda::kernels::count_grid_capacities(traccc::seedfinder_config, detray::axis::circular<detr…
      0.9        7,097,575      1,100    6,452.3    5,953.0     3,040    26,945      2,115.1  CUDA_KERNEL  traccc::cuda::kernels::update_triplet_weights(traccc::seedfilter_config, detray::const_grid2_view<d…
      0.7        5,556,318      1,100    5,051.2    4,513.0     2,144    16,096      2,130.8  CUDA_KERNEL  traccc::cuda::kernels::fill_prefix_sum(vecmem::data::vector_view<const unsigned int>, vecmem::data:…
      0.6        4,622,185      1,100    4,202.0    3,744.0     1,920    15,489      1,826.2  CUDA_KERNEL  traccc::cuda::kernels::reduce_triplet_counts(vecmem::data::vector_view<const traccc::device::double…

And this with the "new code":

[bash][atspot01]:traccc > nsys stats --report gpusum ./new_mu20.nsys-rep 
Processing [./new_mu20.sqlite] with [/software/cuda/12.0.1/x86_64/nsight-systems-2022.4.2/host-linux-x64/reports/gpusum.py]... 

 ** GPU Summary (Kernels/MemOps) (gpusum):

 Time (%)  Total Time (ns)  Instances  Avg (ns)   Med (ns)   Min (ns)  Max (ns)  StdDev (ns)   Category                                                 Operation                                              
 --------  ---------------  ---------  ---------  ---------  --------  --------  -----------  -----------  ----------------------------------------------------------------------------------------------------
     28.8      212,110,658      1,100  192,827.9  189,851.0   113,223   361,237     47,105.8  CUDA_KERNEL  traccc::cuda::kernels::ccl_kernel(vecmem::edm::view<vecmem::edm::schema<vecmem::edm::type::vector<c…
     22.5      166,304,876      4,400   37,796.6   22,465.5       608   229,261     39,197.0  MEMORY_OPER  [CUDA memcpy HtoD]                                                                                  
     12.3       90,742,312      1,100   82,493.0   78,772.5    44,067   166,506     20,973.3  CUDA_KERNEL  traccc::cuda::kernels::find_doublets(traccc::seedfinder_config, detray::const_grid2_view<detray::gr…
      6.7       49,536,374      1,100   45,033.1   43,490.0    27,330    81,605     10,164.4  CUDA_KERNEL  traccc::cuda::kernels::count_doublets(traccc::seedfinder_config, detray::const_grid2_view<detray::g…
      6.0       43,892,150      9,900    4,433.6    1,152.0       448   106,630      8,881.8  MEMORY_OPER  [CUDA memcpy DtoH]                                                                                  
      5.6       41,164,343      1,100   37,422.1   33,698.0    20,929   111,430     13,243.0  CUDA_KERNEL  traccc::cuda::kernels::form_spacepoints(vecmem::data::vector_view<const traccc::measurement>, vecme…
      3.5       26,107,864      1,100   23,734.4   22,977.0     9,568    46,659      7,174.7  CUDA_KERNEL  traccc::cuda::kernels::find_triplets(traccc::seedfinder_config, traccc::seedfilter_config, detray::…
      3.3       24,435,899      1,100   22,214.5   20,817.0    11,873    94,822      6,886.4  CUDA_KERNEL  traccc::cuda::kernels::select_seeds(traccc::seedfilter_config, vecmem::data::vector_view<const trac…
      3.1       22,962,055      1,100   20,874.6   19,553.0     8,257    55,459      7,124.6  CUDA_KERNEL  traccc::cuda::kernels::count_triplets(traccc::seedfinder_config, detray::const_grid2_view<detray::g…
      1.9       13,995,231      8,800    1,590.4      896.0       543    16,545      1,437.0  MEMORY_OPER  [CUDA memset]                                                                                       
      1.7       12,590,191      1,100   11,445.6   10,657.0     4,864    30,242      3,610.9  CUDA_KERNEL  traccc::cuda::kernels::estimate_track_params(vecmem::data::vector_view<const traccc::spacepoint>, v…
      1.3        9,251,329      1,100    8,410.3    7,936.0     3,393    27,233      2,887.2  CUDA_KERNEL  traccc::cuda::kernels::populate_grid(traccc::seedfinder_config, vecmem::data::vector_view<const tra…
      1.1        8,051,474      1,100    7,319.5    6,688.0     3,552    24,289      2,514.2  CUDA_KERNEL  traccc::cuda::kernels::count_grid_capacities(traccc::seedfinder_config, detray::axis::circular<detr…
      0.9        6,875,930      1,100    6,250.8    5,856.5     3,264    21,761      2,114.1  CUDA_KERNEL  traccc::cuda::kernels::update_triplet_weights(traccc::seedfilter_config, detray::const_grid2_view<d…
      0.7        5,155,936      1,100    4,687.2    4,256.0     1,920    21,090      1,974.7  CUDA_KERNEL  traccc::cuda::kernels::fill_prefix_sum(vecmem::data::vector_view<const unsigned int>, vecmem::data:…
      0.6        4,366,034      1,100    3,969.1    3,456.0     1,696    21,153      1,814.8  CUDA_KERNEL  traccc::cuda::kernels::reduce_triplet_counts(vecmem::data::vector_view<const traccc::device::double…

Interestingly enough (I didn't even know this...) at μ=20 the job is dominated by the time taken by clusterization. :thinking: Which, the kernel I mean, does become a little faster with the new EDM. But the overheads from memory copies become worse with the new EDM. :frowning: Even though the cell data is copied with the same number of operations. (One copy per SoA container. Just like how there is one copy per AoS containers right now.)

The only thing I can think of right now, is that the current code in this PR is "over allocating" the buffers. But even with that, the amount of memory being copied, should not be over-estimated, even now. :confused:

So, I'm very open to ideas on why the copy of SoA buffers is apparently too slow in the current version of this PR. (To be clear, I'm hoping that just writing the problem down explicitly may help organize my own thoughts as well...)

krasznaa commented 9 months ago

Hmm... No, the amount of memory copied is not any different...

[bash][atspot01]:traccc > nsys stats --report gpumemsizesum ./old_mu20.nsys-rep 
Processing [./old_mu20.sqlite] with [/software/cuda/12.0.1/x86_64/nsight-systems-2022.4.2/host-linux-x64/reports/gpumemsizesum.py]... 

 ** GPU MemOps Summary (by Size) (gpumemsizesum):

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)      Operation     
 ----------  -----  --------  --------  --------  --------  -----------  ------------------
  1,853.491  4,400     0.421     0.254     0.002     1.210        0.437  [CUDA memcpy HtoD]
    322.024  9,900     0.033     0.000     0.000     0.467        0.096  [CUDA memcpy DtoH]
     39.576  8,800     0.004     0.000     0.000     0.050        0.011  [CUDA memset]     

[bash][atspot01]:traccc > nsys stats --report gpumemsizesum ./new_mu20.nsys-rep 
Processing [./new_mu20.sqlite] with [/software/cuda/12.0.1/x86_64/nsight-systems-2022.4.2/host-linux-x64/reports/gpumemsizesum.py]... 

 ** GPU MemOps Summary (by Size) (gpumemsizesum):

 Total (MB)  Count  Avg (MB)  Med (MB)  Min (MB)  Max (MB)  StdDev (MB)      Operation     
 ----------  -----  --------  --------  --------  --------  -----------  ------------------
  1,834.251  4,400     0.417     0.254     0.002     1.210        0.432  [CUDA memcpy HtoD]
    324.344  9,900     0.033     0.000     0.000     0.467        0.097  [CUDA memcpy DtoH]
     39.721  8,800     0.005     0.000     0.000     0.050        0.011  [CUDA memset]     

[bash][atspot01]:traccc >

I'm rather getting to a very unhappy explanation... :frowning: You see, in order to copy N (primitive) arrays with a single copy operation, the copy happens in 2 steps:

A "host buffer" is created, which is contiguous in memory, and H->H copies fill it with the cell data;
A "device buffer" is also created, and a single H->D copy fills it from the "host buffer".

Apparently creating the "host buffers" in pinned host memory is a bit too expensive. Since this is something that happens anew event after event. While in our current results we take the same AoS data from pinned host memory event after event. (We fill the AoS data into pinned host memory only once, at the beginning of the job.)

So the SoA code seems to be faster for the kernels. And at the same time the current throughput measurements are a bit overly optimistic. Not accounting for the fact that we'll need to allocate pinned host memory over-and-over again if we want to make asynchronous copies happen. :thinking:

I guess time to bring this up on Mattermost as well... :thinking:

krasznaa commented 9 months ago

Everything useful from this PR is now in main.

acts-project / vecmem

SoA EDM Base Classes, main branch (2023.11.02.) #246