CEMeNT-PSAAP / MCDC

MC/DC: Monte Carlo Dynamic Code
https://mcdc.readthedocs.io/en/latest/
BSD 3-Clause "New" or "Revised" License

GPU Interop #195

Closed: braxtoncuneo closed this 4 months ago

braxtoncuneo commented 4 months ago

Incorporates GPU interop via Harmonize, achieved through the following changes:

braxtoncuneo commented 4 months ago

There are a couple of issues that I would like to bring up before this is merged:

braxtoncuneo commented 4 months ago

Out of curiosity, regarding data alignment in type_.py, what are the types that particularly need the alignment?

All types need to be aligned, but whether anything needs to be done to align them is context-dependent. For the sake of alignment, structs are laid out assuming that the base address is divisible by the largest alignment size we care about. From there, fields are laid out in sequence, in the order they appear in the list, with sub-structs laid out recursively. By default, Numba packs all fields next to each other, with no additional alignment considerations.

An example of a case where padding is needed is an 8-byte field (A), followed by a 1-byte field (B), followed by an 8-byte field (C).

This is how Numba would lay it out in memory (each letter representing a byte): AAAAAAAABCCCCCCCC

This seems sensible, but then you notice that A and C cannot both be aligned to a base address divisible by 8. To ensure both are aligned, some padding must be provided: AAAAAAAAB.......CCCCCCCC

Padding like this (though of differing amounts) would be necessary for any combination of A and C with sizes greater than 1 byte.

Technically speaking, 1-byte types could be considered types that "don't care about alignment", but it would be more accurate to say that it is impossible to make them unaligned, since every address is divisible by 1.
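
As a minimal illustration (not MC/DC code), NumPy structured dtypes show the same packed-versus-padded layouts for the A/B/C example above; Numba record types can be built from such dtypes via numba.from_dtype:

```python
import numpy as np

# Packed layout: fields placed back to back, as Numba does by default.
packed = np.dtype([("A", np.float64), ("B", np.uint8), ("C", np.float64)])
# Aligned layout: NumPy inserts padding so each field sits at a suitable offset.
aligned = np.dtype([("A", np.float64), ("B", np.uint8), ("C", np.float64)], align=True)

print(packed.itemsize, [packed.fields[f][1] for f in packed.names])
# 17 [0, 8, 9]   -> C starts at offset 9, which is not divisible by 8
print(aligned.itemsize, [aligned.fields[f][1] for f in aligned.names])
# 24 [0, 8, 16]  -> 7 bytes of padding after B keep C 8-byte aligned
```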

braxtoncuneo commented 4 months ago

No rush. Just wanted to unblock it from my end, since Kayla gave the go-ahead and nobody in the Slack seemed opposed.

ilhamv commented 4 months ago

It looks like some of the components of the global state struct have dimensions that don't match the input deck's data. This is not a GPU issue, but something pre-existing. Still, I wanted to bring it up, since I've added code that checks for and reports some of these mismatches.

That is because some information, including how it is presented/structured, is relevant only to the input interface, while other information is relevant only to the simulation's global state, and vice versa. The reconciliation happens primarily in prepare() in main.py.
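
As a rough sketch of the kind of mismatch check being described (the field names and deck layout here are purely illustrative, not MC/DC's actual structures):

```python
import numpy as np

def report_shape_mismatches(deck, state):
    """Compare array-valued entries of an input-deck dict against same-named
    fields of a global-state structured array and report dimension mismatches."""
    for name in state.dtype.names:
        if name not in deck:
            continue  # field exists only in the global state
        deck_shape = np.shape(deck[name])
        field_shape = state.dtype[name].shape  # sub-array shape, () if scalar
        if deck_shape != field_shape:
            print(f"{name}: deck shape {deck_shape} vs state shape {field_shape}")
```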

ilhamv commented 4 months ago

Do we set up a GitHub workflow to run the GPU regression test in this PR? If not, or if that's not possible, what is the plan? @braxtoncuneo @jpmorgan98

jpmorgan98 commented 4 months ago

I am setting up a GitHub self-hosted runner on the CEMeNT dev machine we have at OSU. I might need admin privileges to get the host installed, which will slow me down a bit, but I don't think OSU COE IT should have too much of a problem helping me out. From there, I think we can run whatever we want (CPU and NVIDIA GPU runs) directly from the GitHub page.

I was thinking we could do some light performance testing per PR to make sure that a given PR won't slow down the code for GPUs or CPUs too much.
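
A lightweight per-PR timing check could look something like the sketch below; the input script path, baseline file, and 10% tolerance are illustrative assumptions, not an agreed-upon setup:

```python
import json
import subprocess
import sys
import time

BASELINE_FILE = "timing_baseline.json"  # hypothetical recorded baseline timing
INPUT_SCRIPT = "examples/fixed_source/slab_absorbium/input.py"  # placeholder problem
TOLERANCE = 1.10  # fail if more than 10% slower than the baseline

# Time one representative problem end to end.
start = time.perf_counter()
subprocess.run([sys.executable, INPUT_SCRIPT], check=True)
elapsed = time.perf_counter() - start

with open(BASELINE_FILE) as f:
    baseline = json.load(f)["elapsed"]

if elapsed > TOLERANCE * baseline:
    sys.exit(f"Performance regression: {elapsed:.1f}s vs baseline {baseline:.1f}s")
print(f"OK: {elapsed:.1f}s (baseline {baseline:.1f}s)")
```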

jpmorgan98 commented 4 months ago

OK, I got the runner up and going. I am going to try to get Harmonize to auto-configure with MC/DC via the install script, add the proper runner, then add a commit to this PR.

braxtoncuneo commented 4 months ago

Strangely, Ilham's latest commit is failing in the CEMeNT repo but passing in the fork. I'm going to run the regression tests locally to try to figure out the cause.

jpmorgan98 commented 4 months ago

GPU regression testing is waiting on #196 to be resolved on the OSU CI machine. We should be able to run regression tests manually and locally for upcoming PRs, @ilhamv.