There's still one unit test failing, but that should (hopefully) resolve quickly. I think this PR is at least far enough along to open up discussion on the changes.
One fix that will be included in future commits to this PR is the naming conventions we use for accessors and types. I currently have it set up so that accessing `mcdc` on GPU via the program handle is `mcdc_constant`. It was brought up at the most recent meeting that this wasn't the most applicable name, so let me know what name you prefer for the accessor.
Also, I believe @ilhamv suggested `tally` as the name of the record type that defines the `data` variable that is passed around. That change is also on the docket for inclusion in this PR.
This latest set of commits adds MPI+GPU operability. Currently, this only works by using one GPU per node, but this can be fixed later, after we have discussed how we'd like to organize our ranks.
This latest commit has revised the accessor for `mcdc` to `mcdc_global` and revised the name of the type of `data` to `tally`, as previously suggested.
Thanks @Braxton!
For the non-reproducible population control on the eigenvalue problem, we can follow the suggestion from this paper, where we essentially keep a record of each particle's parent ID and progeny number, followed by an efficient sorting algorithm at the end of each batch/cycle.
I made some cosmetic edits: `particle_arr` --> `particle_container` (feels more physical).

I removed:

* the old local array creation (`local.py` and `code_factory.py`)
* the predefined maximum RPN buffer (it is now allocated as needed, that is, to the respective size of the region tokens)

@braxtoncuneo, can you please remind me why the GPU mode can't `print()`?
The AMD fork of Numba does not support `print`, and the process of implementing a `print` comparable to that available on CUDA looks non-trivial. If the solution is anything like the CUDA Numba implementation, it will require making an intrinsic that generates a custom call to HIP's version of the CUDA `vprintf` function for every combination of input types used in `print` calls in the program. It's on the to-do list, but it will take time.
I did add in a limited form of print in harmonize*, but it resulted in a linking error in one out of maybe 100 print statements, and the conditions that cause those errors are still an open question. Even very subtle changes in seemingly unrelated functions could cause the bug to occur.
*If you would like to use it, it's accessible as `harmonize.print_formatted`. It accepts a single float/integer value and prints it in parentheses.
Didn't know that the creation of a local array needs to be literally sized...
The last force push removes the immediately previous commit.
This pull request adds GPU execution via ROCm through Harmonize. It also contains adjustments to match changes in the Harmonize interface.
Significant changes
Rethinking Arrays
As mentioned a few weeks back, Numba arrays are a bit funky. Their lifetime (the span of time they have an exclusive claim over their memory) does not persist into function calls they haven't been passed into, even if references to their content are passed in.
This is an artifact of how Numba manages arrays. Whether or not an array can be deallocated is decided through reference counting: an `incref` function is applied to the array whenever a reference to it is copied, and a `decref` function is applied whenever a reference is dropped. If the number of `decref` calls comes to match the number of `incref` calls, then (at least as far as Numba is concerned) there are no more references to that array and it can be deallocated.

This, of course, ignores any references to things contained by that array, leading to a lot of hazards and uncertainty about which data we can and cannot trust.
This has necessitated quite a few changes to how we work with our variables.
Passing Records as single-element arrays
Essentially any non-primitive variable (Records, Arrays, etc.) can only be created by creating arrays, and so the only way to reliably use their storage is to pass along references to their containing array and then extract the element in each function body.
I've attempted to do this in the least obnoxious way possible, but there are only so many options available that fall within these constraints. While these lifetime issues are more likely to be observed on GPU, this is a problem on every platform and is baked into Numba itself.
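As a rough sketch of the container-passing pattern (the record type and names here are illustrative, not the actual MC/DC definitions):

```python
import numpy as np
from numba import njit

# Illustrative record type standing in for an MC/DC particle
# (not the actual field layout used by the code).
particle_type = np.dtype([("x", np.float64), ("ux", np.float64)])


@njit
def move(P_arr, distance):
    # Extract the record from its single-element container inside the
    # function body; the container itself is what gets passed around.
    P = P_arr[0]
    P["x"] += P["ux"] * distance


P_arr = np.zeros(1, dtype=particle_type)  # single-element container
P_arr[0]["ux"] = 1.0
move(P_arr, 0.5)  # pass the container, not the bare record
```

Passing the bare record into `move` would also compile, but nothing would then tie the lifetime of the containing array to that call, which is exactly the hazard described above.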
Until Numba can account for element lifetimes, or until `StructRef` is both stabilized and available on GPU, this may be the best we can do.

Please note that there are a few "gotchas" to look out for. Namely, if you need to overwrite a particle variable so that it references different storage, you must update both the array and the element variable.
For example, this would result in incorrect behavior:
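A minimal sketch of the failure mode, reusing the illustrative `P_arr`/`P_new_arr` names rather than the actual MC/DC variables:

```python
@njit
def retarget_bad(P_arr, P_new_arr):
    # Only the element variable is rebound; P_arr still refers to the
    # old storage, so any later use of P_arr (including passing it to
    # another function) no longer agrees with P.
    P = P_new_arr[0]
    P["x"] = 0.0
```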
Instead, you would need to do something like this:
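Again a sketch with the same illustrative names, now updating both variables together:

```python
@njit
def retarget_good(P_arr, P_new_arr):
    # Rebind the containing-array variable and the element variable
    # together, so they always refer to the same storage.
    P_arr = P_new_arr
    P = P_arr[0]
    P["x"] = 0.0
```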
The `local_array` intrinsic

Harmonize now provides a `local_array` intrinsic which switches between the implementations of `np.empty`, `cuda.local.array`, and `hip.local.array`, depending upon the compilation context. This patches over a feature deficit in Numba, which prevents us from being able to rely upon the validity of array content without having those arrays declared in that function or passed in as an argument.

To keep Harmonize an optional dependency, the CPU array creation function (`np.empty`) is implemented in the `adapt` module as a `local_array` intrinsic, and the full implementation is substituted if Harmonize is available.

An example of its usage:
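A sketch of what a call site can look like; the import path and the exact spelling of the dtype argument are assumptions here and should be checked against `adapt` (or Harmonize, if installed):

```python
from numba import njit, float64
from mcdc.adapt import local_array  # assumed import path


@njit
def midpoint_norm(a, b):
    # The shape must be a literal (a literal tuple also works, for
    # N-D arrays), and the contents start uninitialized, so write
    # every element before reading it.
    buf = local_array(3, float64)
    for i in range(3):
        buf[i] = 0.5 * (a[i] + b[i])
    return buf[0] + buf[1] + buf[2]
```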
To make sure this function works consistently across platforms, `local_array` has all the limitations of all implementations. Hence, only literals may be supplied as the array shape, and none of the array's content is actually initialized. This means that elements need to be explicitly zeroed after the fact if subsequent code relies upon them. Luckily, all implementations support tuple literals as well, so N-D arrays should still be available.

Removal of `local_`... functions in `adapt`
With the advent of `local_array`, and with the revelation of how Numba manages lifetimes, the current `local_`... functions are both unnecessary and a potential avenue for bugs. This PR gets rid of them.

Adding `local_array` to Numba

I'm setting up a local clone of Numba to put together a PR to add `local_array` to both the main repo and the AMD fork. This would remove the burden of maintaining the intrinsic in MCDC and Harmonize.

The `leak` intrinsic

To allow us to reliably use `mcdc` and `data` without passing them around everywhere as single-element arrays, we need to prevent Numba from thinking we are no longer using the arrays containing those records.

The `leak` intrinsic essentially calls `incref`, but supplies no corresponding `decref`. With any luck (assuming Numba doesn't do anything that strips extra `incref` calls), calling `leak` on the arrays containing `mcdc` and `data` should guarantee that those arrays will never be deallocated. This sort of trick would not work for other variables, which we don't want to last for the entire program lifetime, but it should work for anything like `mcdc` or `data`.
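For context, a minimal sketch of how an intrinsic like this can be written with `numba.extending.intrinsic` on CPU; this is an illustration under those assumptions, not necessarily the exact implementation shipped with this PR:

```python
from numba import types
from numba.extending import intrinsic


@intrinsic
def leak(typingctx, arr):
    # Incref the array with no matching decref, so Numba's reference
    # counting never reclaims its storage.
    sig = types.void(arr)

    def codegen(context, builder, signature, args):
        context.nrt.incref(builder, signature.args[0], args[0])
        return context.get_dummy_value()

    return sig, codegen
```

Called once on the containers holding `mcdc` and `data` right after they are created, this pins them for the remainder of the program.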
Using Harmonize's `array_atomic_add` and `array_atomic_max` intrinsics

Since the AMD fork of Numba does not support atomics, we had to implement our own. Without getting into the gory details of how that's accomplished in Harmonize, we now have a set of magic functions that do that for us.
Fixes for Parallel Additions to Tallies
Recent pull requests to dev added tally accumulations that used `+=` instead of `global_add`. While this works on CPU, the `global_add` function is provided to switch from a normal `+=` to an atomic addition, which is necessary for tallies from GPUs to add up correctly. Without it, the additions made by one thread can be overwritten by another thread, leading to difficult-to-track-down bugs.

This is partly on me, since I should have reviewed that PR to double-check how tallies were handled, but that tiny discrepancy took three days to diagnose. For future reference, any tally accumulations performed through `global_add` should remain that way when refactoring.
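As a reminder of what that looks like in practice, here is a hedged sketch; the import path and the exact `global_add(array, index, value)` call shape are assumptions and should be checked against `adapt`:

```python
from numba import njit
from mcdc.adapt import global_add  # assumed import path


@njit
def score(tally_bin, idx, flux, weight):
    # Don't: tally_bin[idx] += flux * weight
    #   (a plain += is not atomic, so concurrent GPU threads can
    #    overwrite each other's contributions)
    # Do: route the accumulation through global_add, which becomes an
    #     atomic add on GPU and an ordinary += on CPU.
    global_add(tally_bin, idx, flux * weight)
```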
Updates to .gitignore

Since Harmonize no longer simply caches `ptx`, the corresponding cache directory name has been updated to simply `__harmonize_cache__`. The `.gitignore` was updated to ignore folders with this name.

Additional command-line options
To allow the users of MCDC to better control execution on GPUs, the following options have been added:
* `--gpu_strat`: the execution strategy used during GPU execution, with `async` and `event` as possible values
* `--gpu_arena_size`: the size of the buffer used to store intermediate data, in terms of how many promises can be stored. This costs proportionally more memory for the `event` strategy, since it uses two buffers for each event type, whereas `async` uses only one buffer.
* `--gpu_block_count`: the number of blocks (aka work groups) that each kernel launch uses
* `--gpu_rocm_path`: path to the ROCm installation to use
* `--gpu_cuda_path`: path to the CUDA installation to use

Platform and Path Auto-detection
Since CUDA execution requires normal Numba and AMD execution requires the HIP fork, Harmonize checks which is available as `numba` and automatically switches between the two without any input required.

If execution is on AMD, Harmonize will try to find all of the paths on its own. This still requires you to run `module load <version of rocm here>`, but Harmonize will look at the `hipcc` introduced by that load and trace back where the rest of the installation is. If all of the requisite programs are where they should be, no other input should be required; otherwise, the path to the ROCm installation should be set (e.g., via `--gpu_rocm_path`).

On Regression Tests
All regression tests pass for the mean results, but not the standard deviation, for all non-eigenvalue problems. This makes sense, given that GPUs perform sampling all in one batch, whereas CPUs do it in a set of smaller batches. The close matching in the mean should mean that the simulated events correspond very closely on CPU and GPU.
Non-reproducible Population Control
Eigenvalue problems currently do not reproduce because population control is not yet deterministic on GPU. This is because the order in which particles are encountered there is not deterministic, even though the set of particles produced is.
Evidence
You can verify this by checking the results of eigenvalue problems after only one iteration and comparing them with multi-iteration output. The output matches after one iteration but not after the second; even then, the outputs match very closely, indicating that the simulation is still "correct" in most senses.
Fixing Population Control
To make population control deterministic again, we'll need to add on a deterministic particle binning process. By that, I mean we need a way to deterministically select exactly N particles from a set, where the set of particles is deterministic, but the order we encounter those particles is not.
I sketched out an algorithm for this a few years ago (you may recall it, @ilhamv), but I think this is something that would be better to put in a later PR.
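For the curious, the bookkeeping-plus-sort idea mentioned above could look roughly like the following sketch; the field names and the final selection rule are placeholders, not the algorithm that will actually land:

```python
import numpy as np


def deterministic_select(parent_id, progeny, n_select):
    # parent_id and progeny are per-particle bookkeeping arrays whose
    # values depend only on the particles themselves, not on the order
    # in which GPU threads happened to emit them.
    order = np.lexsort((progeny, parent_id))  # sort by parent_id, then progeny
    # Any deterministic rule applied to the sorted bank now picks the
    # same N particles on every run; taking the first n_select is just
    # the simplest placeholder for the real population-control rule.
    return order[:n_select]
```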
Updates to Examples
Some example problems were broken by the switch to CSG. This PR includes some fixes.
New Minimum Version Requirement (Harmonize)
This latest PR will require a fresh version of Harmonize. This version is currently accessible through the `amd_event_interop_revamp` branch. Once this PR is accepted, that branch will be merged into `main`, and the old version of `main` will be split off into a `legacy` branch for convenience when working with older versions.