cholla-hydro / cholla

A GPU-based hydro code
https://github.com/cholla-hydro/cholla/wiki
MIT License

Reconstruction Kernel Fusion 4: Fusing PLMC with the Riemann Solvers #382

Closed: bcaddy closed this 2 months ago

bcaddy commented 3 months ago

This PR fuses the PLMC reconstruction into the Riemann solvers. Unfortunately, it significantly increases register usage (especially on AMD) and as a result is slower than the unfused version. Because of this I'm pointing this PR at a new branch instead of dev, so the work stays available for anyone who wants to look at or build on it without it being in dev.

It's worth noting that the register pressure issue is much larger on AMD GPUs than on NVIDIA ones. As an example, on the V100 and A100 the fused HLLD+PLMC kernel uses ~180 registers, whereas on the MI250X it uses 218. I think this is probably a compiler optimization issue, but I'm not sure. Register usage and timing information can be found in these files: run_timing_A100.log, run_timing_MI250X.log, run_timing_V100.log

I think register usage could be significantly reduced in two ways: using shared memory and switching to primitive reconstruction.

Shared memory: Currently the threads within a block always increase in x, i.e. thread 1 is at location x+1 from thread 0. If instead the thread locations increased along the direction of the solve/reconstruction (i.e. for the y-direction solve, thread 1 is at location y+1 from thread 0), then the primitive variables could be stored in shared memory and shared between the threads in a block, hopefully reducing global memory traffic and register pressure in the process. A rough sketch of the idea follows.
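As a very rough illustration of what I mean (hypothetical kernel and variable names, not the current Cholla kernels), the direction-dependent index mapping and shared-memory staging could look something like this:

```cpp
// Sketch only: consecutive threads in a block advance along the solve
// direction `dir`, so one shared-memory tile per block can hold the
// stencil values every thread in the block needs.
__global__ void DirectionMappedKernel(double const *dev_density, double *dev_out,
                                      int nx, int ny, int nz, int dir)
{
  int const tid     = threadIdx.x + blockIdx.x * blockDim.x;
  int const n_cells = nx * ny * nz;
  bool const active = (tid < n_cells);

  // Decompose the linear id so the fastest-varying index follows `dir`
  int xid, yid, zid;
  if (dir == 0) {        // x-solve: thread tid+1 is at x+1 (the current layout)
    xid = tid % nx;  yid = (tid / nx) % ny;  zid = tid / (nx * ny);
  } else if (dir == 1) { // y-solve: thread tid+1 is at y+1
    yid = tid % ny;  zid = (tid / ny) % nz;  xid = tid / (ny * nz);
  } else {               // z-solve: thread tid+1 is at z+1
    zid = tid % nz;  xid = (tid / nz) % nx;  yid = tid / (nz * nx);
  }
  int const cell_id = xid + yid * nx + zid * nx * ny;

  // Because neighbors along `dir` live in neighboring threads, one global
  // load per thread covers the whole block's stencil.
  extern __shared__ double density_tile[];  // blockDim.x entries
  density_tile[threadIdx.x] = active ? dev_density[cell_id] : 0.0;
  __syncthreads();

  // Interior threads now read their stencil from shared memory instead of
  // re-reading global memory (block-edge threads would need ghost entries).
  if (active && threadIdx.x > 0 && threadIdx.x < blockDim.x - 1) {
    dev_out[cell_id] = 0.5 * (density_tile[threadIdx.x + 1] - density_tile[threadIdx.x - 1]);
  }
}
```

The main point is that after the remap, each thread only loads its own cell from global memory and picks up its neighbors from the block's shared tile, apart from the cells at the block edges.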

Primitive reconstruction: Characteristic reconstruction requires the full state of all cells in the stencil, their slopes, and the characteristic slopes to all be in registers at once. With primitive reconstruction, though, we can go field by field with a method that looks something like this (a sketch follows the list):

  1. Load all primitive variables into shared memory, then __syncthreads()
  2. Load all densities in the stencil
  3. Compute the density interface states
  4. Write the density interface states to shared memory (possibly in the same buffer as the primitive variables; would need to check for data races)
  5. Repeat steps 2-4 for velocity, pressure, dual energy, passive scalars, and magnetic field
  6. Load the interface states from shared memory and return them to the Riemann solver
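A minimal sketch of that loop, assuming a field-major layout for the primitives; the names here (MonotonizedSlope, PlmPrimitiveFieldByField, etc.) are illustrative, not actual Cholla functions:

```cpp
// Each pass through the loop stages one field in shared memory, so only that
// field's stencil values and slope are ever in registers at the same time.
__device__ double MonotonizedSlope(double qm1, double q0, double qp1)
{
  // van Leer limited slope, as in PLM-type reconstruction
  double const dL = q0 - qm1;
  double const dR = qp1 - q0;
  return (dL * dR > 0.0) ? 2.0 * dL * dR / (dL + dR) : 0.0;
}

__global__ void PlmPrimitiveFieldByField(double const *dev_prim, double *dev_left,
                                         double *dev_right, int n_cells, int n_fields)
{
  // One shared buffer, reused for every field; sized to blockDim.x at launch
  extern __shared__ double field_shared[];

  int const tid    = threadIdx.x + blockIdx.x * blockDim.x;
  int const lid    = threadIdx.x;
  bool const valid = (tid < n_cells);

  for (int field = 0; field < n_fields; field++) {
    // Steps 1-2: load this field's values for the block into shared memory
    field_shared[lid] = valid ? dev_prim[field * n_cells + tid] : 0.0;
    __syncthreads();

    // Step 3: compute the limited slope and interface states (interior threads
    // only here; real code would load ghost cells at the block edges)
    if (valid && lid > 0 && lid < blockDim.x - 1 && tid + 1 < n_cells) {
      double const slope = MonotonizedSlope(field_shared[lid - 1], field_shared[lid],
                                            field_shared[lid + 1]);
      // Step 4: this sketch writes the interfaces straight to global memory;
      // writing back into field_shared would need the data-race check noted above
      dev_right[field * n_cells + tid] = field_shared[lid] - 0.5 * slope;
      dev_left[field * n_cells + tid]  = field_shared[lid] + 0.5 * slope;
    }
    // Step 5: don't start the next field until everyone is done with this one
    __syncthreads();
  }
  // Step 6: a fused Riemann solver would then read these interface states back
}
```

Only one field's values and slope are live in registers at any point, which is where the register savings should come from.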

I think this method would significantly reduce register pressure and, hopefully, improve memory access times.

bcaddy commented 2 months ago

I'm going to close this. I'm planning on integrating most of the changes (except the fusion part) into a PR that will merge the PLMC and PLMP reconstructors into one. The actual kernel fusion part of this is pretty simple after the other changes are done.

bcaddy commented 2 months ago

If anyone comes back to this: the kernel-fusion-related changes are in commits ece7f2b through d7e2527.