apt-sim / AdePT

Accelerated demonstrator of electromagnetic Particle Transport
Apache License 2.0

Deal with memory limitations on the device #65

Open · hahnjo opened this issue 3 years ago

GPUs have limited global memory (compared to current host systems, which can also be extended with swap), and it's best to avoid dynamic memory allocations while executing on the device. For that reason, we allocate buffers of a fixed capacity upfront and reuse them as much as possible during the simulation. This can lead to a situation where there isn't enough space to produce a secondary particle. The current approach is to terminate the simulation in such cases (see #64), but it would be better to handle this more gracefully.
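For context, a minimal sketch of what such a fixed-capacity scheme can look like. All names here (`Track`, `TrackBuffer`, `nextSlot`) are illustrative assumptions, not AdePT's actual API:

```cuda
// Hypothetical fixed-capacity track buffer: the whole buffer is allocated
// once on the host, and device code only hands out pre-allocated slots.
struct Track {
  double energy;
  // position, direction, etc. would live here
};

struct TrackBuffer {
  Track *tracks; // storage for `capacity` tracks, allocated upfront
  int *numUsed;  // atomically incremented counter of used slots
  int capacity;

  // Try to claim a slot for a secondary; returns nullptr when the buffer
  // is full -- which is exactly the situation this issue is about.
  __device__ Track *nextSlot() {
    int slot = atomicAdd(numUsed, 1);
    if (slot >= capacity) {
      atomicSub(numUsed, 1); // roll back; no space left
      return nullptr;
    }
    return &tracks[slot];
  }
};
```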

hahnjo commented 3 years ago

Today we briefly discussed not producing the secondary and instead depositing its energy directly, perhaps also counting how often this happens.

A more elaborate scheme could avoid this situation before launching a kernel that could run into it, by making sure that there is enough space for all processes to produce their maximum number of secondaries. For example, if every process produces at most one secondary (assuming we use the same buffers for electrons and gammas; to be checked), it is sufficient for the buffer to have at least twice the capacity of the currently used slots. If we have separate buffers per particle type, we have to check the available slots in the right buffer.
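A minimal sketch of such a pre-launch check, assuming at most one secondary per in-flight track; the counters and function names are hypothetical:

```cuda
// Hypothetical pre-launch capacity check: only launch the interaction
// kernel if every in-flight track could produce one secondary without
// overflowing the shared buffer.
bool canLaunchInteractions(int numInFlight, int numUsed, int capacity) {
  // Worst case: each in-flight track claims one extra slot, so we need
  // numUsed + numInFlight <= capacity. If all in-flight tracks live in
  // the same buffer (numInFlight <= numUsed), this is already implied by
  // 2 * numUsed <= capacity, i.e. "at least twice the used slots".
  return numUsed + numInFlight <= capacity;
}

// With separate buffers per particle type, the check moves to the buffer
// the secondaries would go into, e.g. gammas undergoing pair production
// must check the electron/positron buffer:
bool canLaunchPairProduction(int numGammas, int usedLeptons, int leptonCapacity) {
  return usedLeptons + 2 * numGammas <= leptonCapacity; // e+ and e-
}
```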

To keep going as long as possible, we might schedule processes first that do not produce secondaries, or that even lead to particles being killed. Another option might be to prioritize particles with lower energy once a certain threshold of the buffer(s) is used, at the expense of reducing the amount of parallel work (so we need to make sure that the buffers are large enough and that the threshold still allows decent efficiency).
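One possible shape for that threshold-based prioritization, purely as an illustration; the occupancy threshold and energy cut are made-up tuning knobs:

```cuda
// Hypothetical scheduling policy: under memory pressure, prefer tracks
// that are likely to die soon (low energy) over ones that will shower.
bool shouldScheduleNow(double trackEnergy, int numUsed, int capacity) {
  const double pressureThreshold = 0.8; // start throttling at 80% occupancy (made-up value)
  const double lowEnergyCut      = 1.0; // MeV; "cheap" tracks (made-up value)
  double occupancy = double(numUsed) / capacity;
  if (occupancy < pressureThreshold)
    return true;                        // enough headroom: schedule everything
  return trackEnergy < lowEnergyCut;    // under pressure: drain low-energy tracks first
}
```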

[ brain dump off ]

agheata commented 3 years ago

About preempting the space needed per process: I think this is a good approach. A possible way to proceed is to partition the available track storage into smaller blocks (32K/64K tracks?), both for the input of processes and for the secondary tracks. For input, this is needed in case we discover that splitting the work into multiple streams increases occupancy; for output, we can hand out the current block if the remaining space is deemed sufficient, or a fresh block if not. We could even provide two output blocks for an input block of gammas, to store the e+ and the e- from the pair production process. In any case, we will likely need some TrackManager to manage the partitioning into blocks and hand the appropriate blocks to the process manager; a rough sketch follows below.
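A rough sketch of what such a TrackManager could look like, handing out fixed-size blocks from a single upfront allocation. The block layout, the free-list, and all names are assumptions for illustration, not a proposed final design:

```cuda
// Hypothetical TrackManager that partitions one big upfront allocation
// into fixed-size blocks of tracks and hands them to process kernels.
#include <vector>

struct Track { double energy; /* ... */ };

struct TrackBlock {
  Track *tracks; // blockSize contiguous slots inside the big allocation
  int numUsed;   // slots filled so far in this block
};

class TrackManager {
  Track *storage_ = nullptr;            // single allocation, made upfront
  std::vector<TrackBlock> freeBlocks_;  // blocks not currently owned by a process
  int blockSize_;

public:
  TrackManager(Track *storage, int numBlocks, int blockSize)
      : storage_(storage), blockSize_(blockSize) {
    for (int i = 0; i < numBlocks; i++)
      freeBlocks_.push_back({storage_ + i * blockSize, 0});
  }

  // Hand a process a fresh output block, e.g. two of these for a gamma
  // input block (one for e+, one for e-). Returns false when the pool
  // is exhausted -- the caller must then throttle or flush.
  bool acquire(TrackBlock &out) {
    if (freeBlocks_.empty()) return false;
    out = freeBlocks_.back();
    freeBlocks_.pop_back();
    out.numUsed = 0;
    return true;
  }

  // Return a block once all its tracks have been processed.
  void release(TrackBlock block) { freeBlocks_.push_back(block); }
};
```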

hahnjo commented 2 years ago

Still relevant, or even more relevant now that we're not reusing memory slots. However, I think it doesn't make sense to work on this until we know that GPU simulation is faster than Geant4 and beneficial; any management overhead can only decrease performance.