JBlaschke opened this issue 3 years ago
I think this would be a good basis for more complex AMReX types. Since torch and Python don't have a standardized framework for expressing AMR, this is (in my opinion) the lowest common denominator.
We should also keep in mind how we deal with boxes whose indices don't start at 0. @ax3l's box type already has what we need, I think. So we might need to implement a thin wrapper around numpy and torch that maps AMReX-style indexing to Python indices.
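As a minimal sketch of the offset arithmetic such a wrapper would need (assuming the usual amrex::Box / amrex::IntVect API; the function name is made up for illustration):

```cpp
#include <AMReX_Box.H>
#include <AMReX_IntVect.H>

// Translate an AMReX-style (i,j,k) index, which may start at a negative or
// otherwise nonzero lower bound, into the zero-based index that a numpy or
// torch view of the same data would use.
amrex::IntVect to_zero_based (amrex::Box const & bx, amrex::IntVect const & iv)
{
    // numpy/torch have no notion of AMReX's index-space offset, so the
    // wrapper subtracts the box's lower corner.
    return iv - bx.smallEnd();
}
```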
Also tagging @sayerhs
Thanks for starting a sticky thread so we can collect the approaches. Let me start with what I am using so far:
General arrays (incl. numpy):
- either we code against the Python buffer protocol (scipy/PEP3118, Python.org) - a pybind11 sketch follows after this list,
- or we understand the new __array_ufunc__ protocol and implement that (NEP-13) - maybe that is unrelated,
- a large stack of software (cupy, numba, PyTorch, etc., see below) standardizes on the __cuda_array_interface__ convention (CUDA Array Interface v3),
- or we code against xtensor[-python] for extra C++ niceness (cost: an extra C++ dependency that we don't directly use here).
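For the buffer-protocol route, a hedged sketch of what a host-side binding could look like with pybind11. Accessor names such as dataPtr(), box(), and nComp() follow the usual AMReX API; this is only an illustration, not the code that later landed in #19:

```cpp
#include <pybind11/pybind11.h>
#include <AMReX_FArrayBox.H>

namespace py = pybind11;

void make_farraybox_binding (py::module_ & m)
{
    py::class_<amrex::FArrayBox>(m, "FArrayBox", py::buffer_protocol())
        .def_buffer([](amrex::FArrayBox & fab) -> py::buffer_info {
            auto const & bx = fab.box();
            auto len = bx.length();  // cells per dimension
            // AMReX stores data Fortran-ordered (i fastest), with the
            // component index slowest.
            return py::buffer_info(
                fab.dataPtr(),                               // pointer to the data
                sizeof(amrex::Real),                         // size of one element
                py::format_descriptor<amrex::Real>::format(),
                4,                                           // ndim: (x, y, z, comp)
                { len[0], len[1], len[2], fab.nComp() },     // shape
                { sizeof(amrex::Real),                       // strides in bytes
                  sizeof(amrex::Real) * len[0],
                  sizeof(amrex::Real) * len[0] * len[1],
                  sizeof(amrex::Real) * len[0] * len[1] * len[2] });
        });
}
```

With something like this in place, `np.array(fab, copy=False)` on the Python side gives a zero-copy view of the host data.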
Device memory:
- cupy
  - issue to ask how to do it right: https://github.com/cupy/cupy/issues/4644 - they also recommend standardizing on __cuda_array_interface__
- going directly to the emerging DLPack APIs

Compatibility:
- cupy/numba: https://docs.cupy.dev/en/stable/reference/interoperability.html
- numba compatibility details for __cuda_array_interface__ v3
Thanks @ax3l, that list is a good starting point. I would vote for the Python buffer protocol strategy first. This seems to work well with PyCUDA also. We could then also implement some of the alternatives, depending on how much demand there is from applications, what benefits each one offers, and how much bandwidth we all have.
I'll do some reading to see if there is a benefit that would entice me to change my vote. (thanks for the references)
Agreed. I think, after going through all the material again:
- __cuda_array_interface__ v3 (C-example) for transporting device-side memory w/o host-device copies, to start with. This will give us exposure to exactly the libraries and communities we want to interface with (see the sketch after this list).
- Starting support for AMD GPUs (and Intel) in DLPack (__dlpack__).
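A hedged sketch of what a __cuda_array_interface__ v3 property could look like on the binding side, again with pybind11. The dict keys follow the public CUDA Array Interface spec; the accessors and the float64 assumption mirror the buffer-protocol sketch above and are not taken from this repo:

```cpp
#include <cstdint>
#include <pybind11/pybind11.h>
#include <AMReX_FArrayBox.H>

namespace py = pybind11;

void add_cuda_array_interface (py::class_<amrex::FArrayBox> & cls)
{
    cls.def_property_readonly("__cuda_array_interface__",
        [](amrex::FArrayBox const & fab) {
            auto len = fab.box().length();
            py::dict d;
            // Report the shape reversed as (comp, z, y, x): the Fortran-ordered
            // AMReX data is then C-contiguous, so 'strides' can be omitted.
            d["shape"]   = py::make_tuple(fab.nComp(), len[2], len[1], len[0]);
            d["typestr"] = "<f8";  // little-endian float64, assuming amrex::Real == double
            d["data"]    = py::make_tuple(
                reinterpret_cast<std::uintptr_t>(fab.dataPtr()),  // device pointer as int
                false);                                           // not read-only
            d["strides"] = py::none();  // none => C-contiguous in the shape above
            d["version"] = 3;
            return d;
        });
}
```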
FArrayBox for CPU via the array interface is now implemented via #19. Next is either the __cuda_array_interface__ or DLPack. Should not be too hard to add both.
CUDA bindings for MultiFabs, including cupy, numba, and PyTorch, are coming in via #30.
Did some more DLPack deep diving with @scothalverson.
What we want to implement here is primarily the producer, __dlpack__. This one creates a PyCapsule, essentially a transport of a void*. The data behind this pointer is laid out in the spec of DLPack (C/Python).
Relatively easy to read implementations are:
More involved or less documented are:
The DLManagedTensor is essentially:
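(quoted from the public dlpack.h header, with descriptive comments added)

```cpp
typedef struct DLManagedTensor {
  DLTensor dl_tensor;   // the tensor description: data pointer, device, dtype,
                        // ndim, shape, strides, byte_offset
  void * manager_ctx;   // opaque handle the producer uses to manage the memory
  void (*deleter)(struct DLManagedTensor * self);  // called by the consumer when done
} DLManagedTensor;
```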
This object is referred to in the capsule we produce.
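Putting that together, a minimal sketch of a host-side producer, assuming dlpack.h is available and the same (comp, z, y, x) float64 layout as in the sketches above. The ownership model (the capsule does not free the FArrayBox data) and all names are illustrative, not this repo's implementation:

```cpp
#include <array>
#include <cstdint>
#include <pybind11/pybind11.h>
#include <dlpack/dlpack.h>
#include <AMReX_FArrayBox.H>

namespace py = pybind11;

py::capsule make_dlpack_capsule (amrex::FArrayBox & fab)
{
    auto len = fab.box().length();

    // Context object that owns the shape array and the DLManagedTensor itself.
    struct Ctx {
        DLManagedTensor tensor;
        std::array<int64_t, 4> shape;
    };
    auto * ctx = new Ctx{};
    ctx->shape = { fab.nComp(), int64_t(len[2]), int64_t(len[1]), int64_t(len[0]) };

    DLTensor & t = ctx->tensor.dl_tensor;
    t.data        = fab.dataPtr();
    t.device      = {kDLCPU, 0};        // kDLCUDA for device-resident data
    t.ndim        = 4;
    t.dtype       = {kDLFloat, 64, 1};  // float64, 1 lane
    t.shape       = ctx->shape.data();
    t.strides     = nullptr;            // compact, row-major in the reported shape
    t.byte_offset = 0;

    ctx->tensor.manager_ctx = ctx;
    ctx->tensor.deleter = [](DLManagedTensor * self) {
        delete static_cast<Ctx*>(self->manager_ctx);  // we do not own the data itself
    };

    // The consumer renames the capsule to "used_dltensor" and calls the deleter.
    return py::capsule(&ctx->tensor, "dltensor");
}
```

A complete implementation would also handle the optional stream argument of __dlpack__, device-resident data (kDLCUDA), and give the capsule a destructor so an unconsumed tensor does not leak.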
Hey, this is not so much an issue as a place to solicit public feedback.
I think we should implement type conversion from the AMReX FArrayBox (or more precisely the Array4) data type to numpy.ndarray and torch.tensor, as well as suitable Python CUDA variants.
I also think that this type conversion should have a copying and a referencing variant.
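A hedged sketch of how both variants could be expressed with pybind11's py::array_t: passing an owner handle makes the result a reference into the FArrayBox, while an empty handle makes pybind11 allocate new memory and copy. The helper name and the float64/(comp, z, y, x) layout are illustrative only:

```cpp
#include <vector>
#include <pybind11/pybind11.h>
#include <pybind11/numpy.h>
#include <AMReX_FArrayBox.H>

namespace py = pybind11;

// Describe the FArrayBox as a 4D (comp, z, y, x) array of amrex::Real.
py::array_t<amrex::Real> as_numpy (amrex::FArrayBox & fab, py::handle owner)
{
    auto len = fab.box().length();
    std::vector<py::ssize_t> shape   = { fab.nComp(), len[2], len[1], len[0] };
    std::vector<py::ssize_t> strides = {
        py::ssize_t(sizeof(amrex::Real)) * len[0] * len[1] * len[2],
        py::ssize_t(sizeof(amrex::Real)) * len[0] * len[1],
        py::ssize_t(sizeof(amrex::Real)) * len[0],
        py::ssize_t(sizeof(amrex::Real)) };
    // With a non-empty owner: zero-copy view that keeps the producer alive.
    // With py::handle{}: pybind11 copies the data into a fresh numpy array.
    return py::array_t<amrex::Real>(shape, strides, fab.dataPtr(), owner);
}
```

In the binding, the referencing variant would pass the Python-side FArrayBox object itself as owner (for example by taking py::object as the self argument of the bound method), and the copying variant an empty handle.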
This shouldn't be hard to implement (NO! This won't support Python 2... I have a life you know), and I volunteer my time. But first I want to run this past all y'all to see if anyone is already working on it and what you think.
Tagging @ax3l @maxpkatz @drummerdoc