There's still one unit test failing, but that should (hopefully) resolve quickly. I think this PR is at least far enough along to open up discussion on the changes.
One fix that will be included in future commits to this PR is the naming conventions we use for accessors and types. I currently have it set up so that accessing `mcdc` on GPU via the program handle is `mcdc_constant`. It was brought up at the most recent meeting that this wasn't the most applicable name, so let me know what name you prefer for the accessor.
Also, I believe @ilhamv suggested `tally` as the name of the record type that defines the `data` variable that is passed around. That change is also on the docket for inclusion in this PR.
This latest set of commits adds MPI+GPU operability. Currently, this only works by using one GPU per node, but this can be fixed later, after we have discussed how we'd like to organize our ranks.
This latest commit has revised the accessor for `mcdc` to `mcdc_global` and revised the name of the type of `data` to `tally`, as previously suggested.
Thanks @Braxton!
For the non-reproducible population control on the eigenvalue problem, we can follow the suggestion from this paper, where we essentially keep a record of each particle's parent ID and progeny number, followed by an efficient sorting algorithm at the end of each batch/cycle.
I made some cosmetic edits: `particle_arr` --> `particle_container` (feels more physical).

I removed:

* the old local array creation (`local.py` and `code_factory.py`)
* the predefined maximum RPN buffer (it is now allocated as needed, that is, to the respective size of the region tokens)

@braxtoncuneo, can you please remind me why the GPU mode can't `print()`?
The AMD fork of Numba does not support `print`, and the process of implementing a `print` comparable to that available on CUDA looks non-trivial. If the solution is anything like the CUDA Numba implementation, it will require making an intrinsic that generates a custom call to HIP's version of the CUDA `vprintf` function for every combination of input types used in `print` calls in the program. It's on the to-do list, but it will take time.
I did add in a limited form of print in harmonize*, but it resulted in a linking error in one out of maybe 100 print statements, and the conditions that cause those errors are still an open question. Even very subtle changes in seemingly unrelated functions could cause the bug to occur.
*If you would like to use it, it's accessible as `harmonize.print_formatted`. It accepts a single float/integer value and prints it in parentheses.
Didn't know that the creation of a local array needs to be literally sized...
The last force push removes the immediately previous commit.
This pull request adds GPU execution via ROCm through Harmonize. It also contains adjustments to match changes in the Harmonize interface.
Significant changes
Rethinking Arrays
As mentioned a few weeks back, Numba arrays are a bit funky. Their lifetime (the span of time they have an exclusive claim over their memory) does not persist into function calls they haven't been passed into, even if references to their content are passed in.
This is an artifact of how Numba manages arrays. Whether or not an array can be deallocated is decided through reference counting: an `incref` function is applied to the array whenever a reference to it is copied, and a `decref` function is applied whenever a reference is dropped. If the number of `decref` calls comes to match the number of `incref` calls, then (at least as far as Numba is concerned) there are no more references to that array and it can be deallocated.

This, of course, ignores any references to things contained by that array, leading to a lot of hazards and uncertainty about which data we can and cannot trust.
This has necessitated quite a few changes to how we work with our variables.
Passing Records as single-element arrays
Essentially any non-primitive variable (Records, Arrays, etc.) can only be created by creating arrays, and so the only way to reliably use their storage is to pass along references to their containing array and then extract the element in each function body.
I've attempted to do this in the least obnoxious way possible, but there are only so many options available that fall within these constraints. While these lifetime issues are more likely to be observed on GPU, this is a problem on every platform and is baked into Numba itself.
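As a rough sketch of the container-passing pattern (the record type and names here are illustrative, not the actual MC/DC definitions):

```python
import numpy as np
from numba import njit

# Illustrative record type standing in for an MC/DC particle
# (not the actual field layout used by the code).
particle_type = np.dtype([("x", np.float64), ("ux", np.float64)])


@njit
def move(P_arr, distance):
    # Extract the record from its single-element container inside the
    # function body; the container itself is what gets passed around.
    P = P_arr[0]
    P["x"] += P["ux"] * distance


P_arr = np.zeros(1, dtype=particle_type)  # single-element container
P_arr[0]["ux"] = 1.0
move(P_arr, 0.5)  # pass the container, not the bare record
```

Passing the bare record into `move` would also compile, but nothing would then tie the lifetime of the containing array to that call, which is exactly the hazard described above.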
Until Numba can account for element lifetimes, or until `StructRef` is both stabilized and available on GPU, this may be the best we can do.

Please note that there are a few "gotchas" to look out for. Namely, if you need to overwrite a particle variable so that it references different storage, you must update both the array and the element variable.
For example, this would result in incorrect behavior:
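A minimal sketch of the failure mode, reusing the illustrative `P_arr`/`P_new_arr` names rather than the actual MC/DC variables:

```python
@njit
def retarget_bad(P_arr, P_new_arr):
    # Only the element variable is rebound; P_arr still refers to the
    # old storage, so any later use of P_arr (including passing it to
    # another function) no longer agrees with P.
    P = P_new_arr[0]
    P["x"] = 0.0
```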
Instead, you would need to do something like this:
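Again a sketch with the same illustrative names, now updating both variables together:

```python
@njit
def retarget_good(P_arr, P_new_arr):
    # Rebind the containing-array variable and the element variable
    # together, so they always refer to the same storage.
    P_arr = P_new_arr
    P = P_arr[0]
    P["x"] = 0.0
```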
The `local_array` intrinsic

Harmonize now provides a `local_array` intrinsic which switches between the implementations of `np.empty`, `cuda.local.array`, and `hip.local.array`, depending upon the compilation context. This patches over a feature deficit in Numba, which prevents us from being able to rely upon the validity of array content without having those arrays declared in that function or passed in as an argument.

To keep Harmonize an optional dependency, the CPU array creation function (`np.empty`) is implemented in the `adapt` module as a `local_array` intrinsic, and the full implementation is substituted if Harmonize is available.

An example of its usage:
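A sketch of what a call site can look like; the import path and the exact spelling of the dtype argument are assumptions here and should be checked against `adapt` (or Harmonize, if installed):

```python
from numba import njit, float64
from mcdc.adapt import local_array  # assumed import path


@njit
def midpoint_norm(a, b):
    # The shape must be a literal (a literal tuple also works, for
    # N-D arrays), and the contents start uninitialized, so write
    # every element before reading it.
    buf = local_array(3, float64)
    for i in range(3):
        buf[i] = 0.5 * (a[i] + b[i])
    return buf[0] + buf[1] + buf[2]
```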
To make sure this function works consistently across platforms, `local_array` has all the limitations of all implementations. Hence, only literals may be supplied as the array shape, and none of the array's content is actually initialized. This means that elements need to be explicitly zeroed after the fact if subsequent code relies upon them. Luckily, all implementations support tuple literals as well, so N-D arrays should still be available.

Removal of `local_`... functions in `adapt`
With the advent of `local_array`, and with the revelation of how Numba manages lifetimes, the current `local_`... functions are both unnecessary and a potential avenue for bugs. This PR gets rid of them.

Adding `local_array` to Numba

I'm setting up a local clone of Numba to put together a PR to add `local_array` to both the main repo and the AMD fork. This would remove the burden of maintaining the intrinsic in MCDC and Harmonize.

The `leak` intrinsic

To allow us to reliably use `mcdc` and `data` without passing them around everywhere as single-element arrays, we need to prevent Numba from thinking we are no longer using the arrays containing those records.

The `leak` intrinsic essentially calls `incref`, but supplies no corresponding `decref`. With any luck (assuming Numba doesn't do anything that strips extra `incref` calls), calling `leak` on the arrays containing `mcdc` and `data` should guarantee that those arrays will never be deallocated. This sort of trick would not work for other variables, which we don't want to last for the entire program lifetime, but it should work for anything like `mcdc` or `data`.
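For context, a minimal sketch of how an intrinsic like this can be written with `numba.extending.intrinsic` on CPU; this is an illustration under those assumptions, not necessarily the exact implementation shipped with this PR:

```python
from numba import types
from numba.extending import intrinsic


@intrinsic
def leak(typingctx, arr):
    # Incref the array with no matching decref, so Numba's reference
    # counting never reclaims its storage.
    sig = types.void(arr)

    def codegen(context, builder, signature, args):
        context.nrt.incref(builder, signature.args[0], args[0])
        return context.get_dummy_value()

    return sig, codegen
```

Called once on the containers holding `mcdc` and `data` right after they are created, this pins them for the remainder of the program.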
Using Harmonize's `array_atomic_add` and `array_atomic_max` intrinsics

Since the AMD fork of Numba does not support atomics, we had to implement our own. Without getting into the gory details of how that's accomplished in Harmonize, we now have a set of magic functions that do that for us.
Fixes for Parallel Additions to Tallies
Recent pull requests to dev added tally accumulations that used `+=` instead of `global_add`. While this works on CPU, the `global_add` function is provided to switch from a normal `+=` to an atomic addition, which is necessary for tallies from GPUs to add up correctly. Without it, the additions made by one thread can be overwritten by another thread, leading to difficult-to-track-down bugs.

This is partly on me, since I should have reviewed that PR to double-check how tallies were handled, but that tiny discrepancy took three days to diagnose. For future reference, any tally accumulations performed through `global_add` should remain that way when refactoring.
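As a reminder of what that looks like in practice, here is a hedged sketch; the import path and the exact `global_add(array, index, value)` call shape are assumptions and should be checked against `adapt`:

```python
from numba import njit
from mcdc.adapt import global_add  # assumed import path


@njit
def score(tally_bin, idx, flux, weight):
    # Don't: tally_bin[idx] += flux * weight
    #   (a plain += is not atomic, so concurrent GPU threads can
    #    overwrite each other's contributions)
    # Do: route the accumulation through global_add, which becomes an
    #     atomic add on GPU and an ordinary += on CPU.
    global_add(tally_bin, idx, flux * weight)
```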
Updates to .gitignore

Since Harmonize no longer simply caches `ptx`, the corresponding cache directory name has been updated to simply `__harmonize_cache__`. The `.gitignore` was updated to ignore folders with this name.

Additional command-line options
To allow the users of MCDC to better control execution on GPUs, the following options have been added:
* `--gpu_strat`: the execution strategy used during GPU execution, with `async` and `event` as possible values
* `--gpu_arena_size`: the size of the buffer used to store intermediate data, in terms of how many promises can be stored. This costs proportionally more memory for the `event` strategy, since it uses two buffers for each event type, whereas `async` uses only one buffer.
* `--gpu_block_count`: the number of blocks (aka work groups) that each kernel launch uses
* `--gpu_rocm_path`: path to the ROCm installation to use
* `--gpu_cuda_path`: path to the CUDA installation to use

Platform and Path Auto-detection
Since CUDA execution requires normal Numba and AMD execution requires the HIP fork, Harmonize checks which is available as `numba` and automatically switches between the two without any input required.

If execution is on AMD, Harmonize will try to find all of the paths on its own. This still requires you to run `module load <version of rocm here>`, but Harmonize will look at the `hipcc` introduced by that load and trace back where the rest of the installation is. If all of the requisite programs are where they should be, no other input should be required; otherwise, the path to the ROCm installation should be set (e.g., via `--gpu_rocm_path`).

On Regression Tests
All regression tests pass for the mean results, but not the standard deviation, for all non-eigenvalue problems. This makes sense, given that GPUs perform sampling all in one batch, whereas CPUs do it in a set of smaller batches. The close matching in the mean should mean that the simulated events correspond very closely on CPU and GPU.
Non-reproducible Population Control
Eigenvalue problems currently do not reproduce because population control is not yet deterministic on GPU. This is because the order in which particles are encountered there is not deterministic, even though the set of particles produced is.
Evidence
You can verify this by checking the results of eigenvalue problems after only one iteration and comparing them with multi-iteration output. The output matches after one iteration but not after the second; even then, the outputs match very closely, indicating that the simulation is still "correct" in most senses.
Fixing Population Control
To make population control deterministic again, we'll need to add on a deterministic particle binning process. By that, I mean we need a way to deterministically select exactly N particles from a set, where the set of particles is deterministic, but the order we encounter those particles is not.
I sketched out an algorithm for this a few years ago (you may recall it, @ilhamv), but I think this is something that would be better to put in a later PR.
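For the curious, the bookkeeping-plus-sort idea mentioned above could look roughly like the following sketch; the field names and the final selection rule are placeholders, not the algorithm that will actually land:

```python
import numpy as np


def deterministic_select(parent_id, progeny, n_select):
    # parent_id and progeny are per-particle bookkeeping arrays whose
    # values depend only on the particles themselves, not on the order
    # in which GPU threads happened to emit them.
    order = np.lexsort((progeny, parent_id))  # sort by parent_id, then progeny
    # Any deterministic rule applied to the sorted bank now picks the
    # same N particles on every run; taking the first n_select is just
    # the simplest placeholder for the real population-control rule.
    return order[:n_select]
```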
Updates to Examples
Some example problems were broken by the switch to CSG. This PR includes some fixes.
New Minimum Version Requirement (Harmonize)
This latest PR will require a fresh version of Harmonize. This version is currently accessible through the `amd_event_interop_revamp` branch. Once this PR is accepted, that branch will be merged into `main`, and the old version of `main` will be split off into a `legacy` branch for convenience when working with older versions.