gevtushenko opened this issue 1 month ago
Keeping track of a link I got from Georgii: https://github.com/NVIDIA/cccl/pull/2335
Tracking for easy future reference:
https://numba.pydata.org/numba-doc/dev/user/jitclass.html (Limitations)
I'm currently doing work on this branch:
https://github.com/rwgk/cccl/tree/python_random_access_iterators
Last commit https://github.com/rwgk/cccl/commit/d1c4816f8f3391c97e6fd32a89d45785615f6ea1 — Use TransformRAI to implement constant, counting, arbitrary RAIs.
Current thinking:

`op` here is processed: https://github.com/NVIDIA/cccl/blob/e5229f2c7509ced5a830be4ae884d9e1639e8951/python/cuda_parallel/cuda/parallel/experimental/__init__.py#L259C30-L259C32

Some points for the upcoming meeting, to capture the state of my understanding:
We have a Python side and a C++ side, each with their conventions, standard approaches, and standard terminology.
We want to bridge between them.
On the C++ side we want to feed into `cub::DeviceReduce`, which requires a C++ random access iterator (input). — Actual calls in Georgii's existing code: `cub::DeviceReduceSingleTileKernel`, `cub::DeviceReduceKernel`
What Python API we want exactly is TBD. But we can figure out the core requirement, which is simply:

`scalar_with_certain_dtype = RAI_object[random_access_array_index]`
The `RAI_object` is limited in the same way as any Python function compiled with `numba.cuda.jit`, with `nopython`. — I.e. we don't want to run Python code on the GPU.
`RAI_object[random_access_array_index]` needs to run in constant time: https://en.cppreference.com/w/cpp/iterator/random_access_iterator
That code formats C++ code to be compiled with `nvrtc` a little further down.
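A minimal sketch of the core requirement above (hypothetical and host-only; the real object additionally has to satisfy numba's `nopython` restrictions so it can run on the device): indexing must be a pure, constant-time function of the index.

```python
# Hypothetical host-side example of an object satisfying the core requirement:
# RAI_object[random_access_array_index] is O(1) and depends only on the index.
class CountingRAI:
    def __init__(self, start):
        self.start = start

    def __getitem__(self, distance):
        return self.start + distance


rai = CountingRAI(42)
print(rai[0], rai[3])  # 42 45
```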
At this stage I'm very conservative: I'm aiming for minimal changes to Georgii's code to get the desired behavior for:

- `repeat()` / C++ `thrust::constant_iterator`
- `count()` / C++ `thrust::counting_iterator`
- `thrust::transform_iterator` in C++

Non-goal at this stage: cache-modified C++ input iterator. — I still need to learn what exactly is needed. In a follow-on step (to the work above), I want to solve this also in a conservative fashion.
When that is done I want to take a step back to review:
- User-facing Python API. — Ideally with customers involved.
- Do we want to continue to work with `ctypes`?
- Do we want to build the `nvrtc` input strings in C++, or could we use `pynvrtc`?
Tracking progress:
I just created this Draft PR: https://github.com/NVIDIA/cccl/pull/2595, currently @ commit 5ba7a0f413123cc05e6eb9f3690e8b571659c670
Copy-pasting the current PR description:
Goal: Any `unary_op(distance)` that can be compiled by numba can be passed as `d_in` to `reduce_into(d_in)`.
Current status:
- `test_reduce.py` `input_generator`: `["constant", "counting", "arbitrary", "nested_inner"]` tests PASS.
- `input_generator`: `"nested_global"` test FAILS because `numba.cuda.compile()` fails for this code:
```python
def other_unary_op(distance):
    permutation = (4, 2, 0, 3, 1)
    return permutation[distance % len(permutation)]


def input_unary_op(distance):
    return 2 * other_unary_op(distance)
```
The error is:
```
E   numba.core.errors.TypingError: Failed in cuda mode pipeline (step: nopython frontend)
E   Untyped global name 'other_unary_op': Cannot determine Numba type of <class 'function'>
E
E   File "tests/test_reduce.py", line 117:
E   def input_unary_op(distance):
E       return 2 * other_unary_op(distance)
E       ^
```
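As noted further down in this thread, decorating the nested helper with `@register_jitable` (a tiny change) resolves exactly this kind of typing error. A minimal sketch, assuming the helpers live at module scope in `tests/test_reduce.py`:

```python
from numba.extending import register_jitable


@register_jitable
def other_unary_op(distance):
    permutation = (4, 2, 0, 3, 1)
    return permutation[distance % len(permutation)]


def input_unary_op(distance):
    # The call to other_unary_op can now be typed when this function is
    # compiled for the device.
    return 2 * other_unary_op(distance)
```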
Let's take a look at some of the API that we need in `cuda.parallel.itertools`.
Below you can find a proof-of-concept implementation for count, repeat, map, and cache.
```python
import numpy as np
import numba.cuda

# repeat, count, map, pointer, cache, and parallel_algorithm are defined by
# this proof-of-concept, not by an existing package.
num_items = 4
output_array = numba.cuda.device_array(num_items, dtype=np.int32)

# Returns 42 N times
r = repeat(42)
parallel_algorithm(r, num_items, output_array)
print("expect: 42 42 42 42; get: ", " ".join([str(x) for x in output_array.copy_to_host()]))

# Returns an integer sequence starting at 42
c = count(42)
parallel_algorithm(c, num_items, output_array)
print("expect: 42 43 44 45; get: ", " ".join([str(x) for x in output_array.copy_to_host()]))

# Multiplies 42 (coming from repeat) by 2
def mult(x):
    return x * 2

mult_42_by_2 = map(r, mult)
parallel_algorithm(mult_42_by_2, num_items, output_array)
print("expect: 84 84 84 84; get: ", " ".join([str(x) for x in output_array.copy_to_host()]))

# Adds 10 to the result of multiplying repeat by 2
def add(x):
    return x + 10

mult_42_by_2_plus10 = map(mult_42_by_2, add)
parallel_algorithm(mult_42_by_2_plus10, num_items, output_array)
print("expect: 94 94 94 94; get: ", " ".join([str(x) for x in output_array.copy_to_host()]))

# Same as above, but for count
mult_count_by_2 = map(c, mult)
parallel_algorithm(mult_count_by_2, num_items, output_array)
print("expect: 84 86 88 90; get: ", " ".join([str(x) for x in output_array.copy_to_host()]))

mult_count_by_2_and_add_10 = map(mult_count_by_2, add)
parallel_algorithm(mult_count_by_2_and_add_10, num_items, output_array)
print("expect: 94 96 98 100; get:", " ".join([str(x) for x in output_array.copy_to_host()]))

# Example of how combinational iterators can wrap a pointer in a generic way
input_array = numba.cuda.to_device(np.array([4, 3, 2, 1], dtype=np.int32))
ptr = pointer(input_array)  # TODO this transformation should be hidden on the transform implementation side
parallel_algorithm(ptr, num_items, output_array)
print("expect: 4 3 2 1 ; get:", " ".join([str(x) for x in output_array.copy_to_host()]))

input_array = numba.cuda.to_device(np.array([4, 3, 2, 1], dtype=np.int32))
ptr = pointer(input_array)  # TODO this transformation should be hidden on the transform implementation side
tptr = map(ptr, mult)
parallel_algorithm(tptr, num_items, output_array)
print("expect: 8 6 4 2 ; get:", " ".join([str(x) for x in output_array.copy_to_host()]))

# Example of caching iterator
streamed_input = cache(input_array, 'stream')
parallel_algorithm(streamed_input, num_items, output_array)
print("expect: 4 3 2 1 ; get:", " ".join([str(x) for x in output_array.copy_to_host()]))
```
Before this proof-of-concept is merged, we have to address the following:

- Add a `dtype` parameter to each of the APIs above, something along the lines of `repeat(42, value_type=numba.int32)` (maybe call it `dtype` for consistency). With this, we'll be able to extend the proof of concept to all primitive types supported by numba. This will potentially unblock implementation of the `zip` iterator.
- Mangle symbol names based on the iterator `dtype`. Without this, we'll have a problem with algorithms accepting more than one input iterator (like reduce by key or merge sort pairs). The issue will happen with, say, `merge_pairs(map(count(0, value_type=int32)), map(count(0, value_type=float32)))`. Without mangling symbol names, the dereference member functions of count(int32) and count(float32) would have the same symbol name `count_dereference`. This would lead to incorrect results or linkage failure. (A small illustration follows after these lists.)

Future work that can be addressed after initial support of primitive types:

- `map` has to support stateful operators. This means that we could capture a runtime value, say an int or a pointer, and use it inside the unary operator. After https://github.com/NVIDIA/cccl/issues/2538 is addressed, the API might look like `map(count(0), lambda i, d_input=d_input: d_input[i])`. In this example, `d_input` is a runtime parameter. This would require passing an array of state pointers and forming a single buffer, like we do in device for today (https://github.com/NVIDIA/cccl/blob/084cd536f1e42eb5078733458951a7b957aa1487/c/parallel/src/for/for_op_helper.cpp#L201-L242).
- Allow the `dtype` of iterators to be anything that Numba understands, as opposed to limiting this set to what ctypes knows about.
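As a small illustration of the symbol-mangling point above (a hypothetical scheme, not the actual c.parallel implementation): the dereference symbol could simply embed the iterator's value type so that differently typed iterators never collide at link time.

```python
import numba

# Hypothetical mangling scheme: embed the value type in the symbol name so
# that count(int32) and count(float32) get distinct dereference symbols.
def mangle(base, value_type):
    return f"{base}_{value_type}"

print(mangle("count_dereference", numba.types.int32))    # count_dereference_int32
print(mangle("count_dereference", numba.types.float32))  # count_dereference_float32
```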
Discussion items for sync when Georgii returns from PTO:

- https://github.com/rwgk/cccl/tree/georgii_poc_2479: Notes with commands for how to build and run `assert_expected_output_array("84 84 84 84")`
- https://github.com/rwgk/numba/tree/misc_doc_fixes — By-product of systematically working through Numba documentation top-to-bottom.
- `georgii_poc_2479` `main.py`:
  - `@intrinsic` is used in `ldcs()` and `map()`: `ldcs = ir.InlineAsm(ldcs_type, "ld.global.cs.b32 $0, [$1];", "=r, l")` (see the sketch after this list)
  - `map(it, op)` is the reverse of Python's `map(func, *iterables)`
- `@register_jitable` fixes the kind of error I asked about on October 17 (PR #2595):
  - `@register_jitable` (tiny change)
  - `map(func, rai)` implementation: works with plain Python and as input for `reduce_into(d_in)`
- Unanswered question on slack cccl-tm: `@numba.cuda.reduce` vs `cub::DeviceReduce`
- Why is the cuda.cooperative implementation so different (is it?) from cuda.parallel? (`.so`: `libcccl.c.parallel.so`)
- How does `@intrinsic` work? What is a good starting point?
- Is there anything we can use or learn from cupyx.jit.cub, cupyx.jit.thrust?
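For reference, here is a hedged sketch of the `@intrinsic` + `ir.InlineAsm` pattern referenced in the list above (names and types are illustrative, not copied from the POC's `main.py`): a small intrinsic that emits a cache-streaming load via `ld.global.cs`, callable from a `numba.cuda.jit` device function.

```python
from llvmlite import ir
from numba import types
from numba.extending import intrinsic


@intrinsic
def ldcs(typingctx, ptr):
    # int32 = ldcs(int32*): load through the .cs (cache-streaming) path.
    sig = types.int32(types.CPointer(types.int32))

    def codegen(context, builder, signature, args):
        int32_t = ir.IntType(32)
        ldcs_type = ir.FunctionType(int32_t, [ir.PointerType(int32_t)])
        ldcs_asm = ir.InlineAsm(ldcs_type, "ld.global.cs.b32 $0, [$1];", "=r, l")
        return builder.call(ldcs_asm, args)

    return sig, codegen
```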
I managed to combine Georgii's POC `ConstantIterator` and `CountingIterator` with the code I had under #2595:

It's pretty simple, and I got most of the way there pretty quickly, too, but then it took me several hours to get my head around the handling of the `cccl_iterator_t::state`.

(Next I'll try to plug in the POC `map()` function.)
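To illustrate what makes the `state` handling subtle (a minimal sketch with hypothetical names, not the actual binding code): on the Python side an iterator's state, e.g. the current value of a `CountingIterator`, has to be laid out as a raw blob whose pointer, size, and alignment are handed to the C library through `cccl_iterator_t::state`.

```python
import ctypes

# Hypothetical layout of a counting iterator's state: just its current value.
class CountingIteratorState(ctypes.Structure):
    _fields_ = [("value", ctypes.c_int32)]


state = CountingIteratorState(42)
state_ptr = ctypes.cast(ctypes.pointer(state), ctypes.c_void_p)

# The C layer sees only an opaque pointer plus size/alignment of this blob;
# the advance/dereference operations compiled on the Python side interpret it.
print(ctypes.sizeof(state), ctypes.alignment(state), hex(state_ptr.value))
```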
Georgii's POC `map()` function is now also integrated into the `fancy_iterators` branch (under the name `cu_map()`, to not shadow a Python built-in function). All existing tests and the new `cu_map()` test pass at this commit:
I have a quick comment regarding `map`.

I share concerns about shadowing Python builtins, but I think `map` is singularly special in a functional library. It is likely to be heavily used in code, examples, and documentation. Hence, it is possible to reinforce the conventions for how to use `map`.

I would prefer to keep the name `map` if we can establish two conventions such as:

```python
from cuda.parallel.iterators import map as cumap
```

and

```python
import cuda.parallel.iterators as cuit

cuit.map(...)
```

Also, I prefer `cumap` to `cu_map`. We have `cuml`, `cudf`, `cugraph`, `cupy`, `cucim`, etc. libraries. I don't see other functions with a `cu_` prefix.
Is this a duplicate?
Area
General CCCL
Is your feature request related to a problem? Please describe.
Compared to existing solutions in PyTorch and CuPy, one of the distinguishing features of cuda.parallel is flexibility. Part of that flexibility comes from support for user-defined data types and operators. But compared to the CUDA C++ solution, the cuda.parallel API is still limited. We are still missing fancy iterators and cache-modified iterators in cuda.parallel.
Describe the solution you'd like
Given that fancy iterator support might require rather invasive changes to cuda.parallel and the CCCL/c libraries, we should design fancy iterators before introducing more algorithms to cuda.parallel.
Describe alternatives you've considered
No response
Additional context
No response