cholla-hydro / cholla

A GPU-based hydro code
https://github.com/cholla-hydro/cholla/wiki
MIT License

Rethinking the Particle Data Structures #413

Open mabruzzo opened 1 month ago

mabruzzo commented 1 month ago

While thinking about how to modify the particle system to better support star-particle creation, I had some thoughts on how we could potentially improve the particle system in general.

These are mostly thoughts for an "ideal world." I'm not really suggesting that anybody should spend time doing this.

  1. support different particle "types": so that we could support star particles and sink particles. (Ideally with the option for them to have different particle attributes -- but that's obviously far less important)

  2. rather than organizing particles in a giant array, it would be worth reorganizing them into "batches" (or "chunks")

    • For some datatype T, rather than storing the values in 1 giant array, we would instead store the values in a series of chunks.
    • For the sake of concreteness, a naive implementation of this might look like:

      template<typename T>
      struct Batch {
        int size;
        T values[CHUNK_CAPACITY];
      };

      template<typename T>
      struct ChunkedStorage {
        int num_batches;
        Batch<T>* batches;
      };

      In the above snippet, CHUNK_CAPACITY is some compile-time constant (e.g. 8, 16, 32, etc.).

      The control-flow would change from

      for (int i = 0; i < num_particles; i++) {
        // do work
      }

      to

      for (int batch_ind = 0; batch_ind < obj.num_batches; batch_ind++) {
        for (int i = 0; i < obj.batches[batch_ind].size; i++) {
          // do work on obj.batches[batch_ind].values[i]
        }
      }
    • The above snippets make a number of simplifications; there are several optimizations we could layer on top[^1]
    • The primary benefit is that removing values from this data structure is much faster: the worst-case cost scales with the capacity of a single chunk, whereas in the existing approach it scales with the total number of stored values. This comes up now whenever particles move between processors.
    • The obvious drawback: we are introducing complexity and we slightly increase the cost to access an arbitrary element.
    • It's not entirely obvious to me whether the extra complexity is worthwhile on GPUs.
    • I imagine that most of the semantics for adding/removing values could be modeled after an unrolled linked list.
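To make the removal cost concrete, here is a hypothetical sketch of the swap-and-pop removal that unrolled-linked-list semantics would give us. The names (`Batch`, `batch_remove`) are illustrative only, not from the Cholla codebase, and it assumes the naive layout from the snippet above:

```cpp
#include <cassert>

// Compile-time chunk capacity, as in the snippet above (value is illustrative).
constexpr int CHUNK_CAPACITY = 32;

template<typename T>
struct Batch {
  int size = 0;
  T values[CHUNK_CAPACITY];
};

// Remove element i from a batch by overwriting it with the batch's last
// element and shrinking the size ("swap-and-pop"). This is O(1) per
// removal; a single giant array would instead shift every later element.
template<typename T>
void batch_remove(Batch<T>& b, int i) {
  assert(i >= 0 && i < b.size);
  b.values[i] = b.values[b.size - 1];
  b.size--;
}
```

Note that swap-and-pop does not preserve element order, which is presumably fine for particles but worth stating.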

[^1]: For example, it may be better to store each batch's size separately from its data (you could do away with the Batch<T> class template and store the values directly). If we also pre-allocate the memory for the maximum allowed number of particles (which would probably be beneficial), you could allocate the memory for all batches in a single array; rather than storing a pointer to each batch, you could then store its index.
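The footnote's variant (sizes stored separately, all batches pre-allocated in one array, batches referenced by index) might look roughly like this. Again a hypothetical sketch with illustrative names, not code from the repository:

```cpp
#include <cstddef>
#include <vector>

// Compile-time chunk capacity, as in the earlier snippet (value is illustrative).
constexpr int CHUNK_CAPACITY = 32;

template<typename T>
struct FlatChunkedStorage {
  int num_batches;
  std::vector<int> sizes;  // sizes[b] = number of live elements in batch b
  std::vector<T> values;   // one allocation: num_batches * CHUNK_CAPACITY slots

  explicit FlatChunkedStorage(int n)
      : num_batches(n),
        sizes(n, 0),
        values(static_cast<std::size_t>(n) * CHUNK_CAPACITY) {}

  // Batch b occupies a fixed, contiguous slice of `values`, so a batch is
  // identified by an index instead of a pointer.
  T& at(int b, int i) {
    return values[static_cast<std::size_t>(b) * CHUNK_CAPACITY + i];
  }
};
```

One upshot of the single allocation is that the whole structure can be copied to the GPU in one transfer, and batch indices remain valid across host/device copies where raw pointers would not.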

bvillasen commented 1 month ago

My thoughts:

I like idea number one, although it might get complicated to keep particles of different types in the same data structure. If someone needs particles of different types in the same run, it would probably be better to have multiple particle data structures, one for each type.

I don't see the benefit of the second idea. You mention that the advantage would be looping over a batch when removing particles, instead of looping over the entire particle array. But you would have to do that for all batches, so I don't see the gain: regardless of how you do it, you still have to check all particles to see which ones need to be moved to another GPU. Also, dividing the particles into small groups like that would require a lot of housekeeping that would complicate things. Just to start, you would need to track how many particles are in each batch; as particles move to other GPUs, and as a GPU receives particles, you would have to fit them into the empty spaces of the different batches (otherwise you would be wasting memory on all the empty slots), and you would have to track which spaces are empty. You also don't want to launch many tiny kernels (one for each batch): this would introduce a lot of overhead from launch latency, and the GPU would be very underutilized. So this really would complicate things a lot and would probably make them slower.

That's my opinion, I'm happy to discuss further if you want.