NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/

[REFACTOR] cuda.parallel: Don't require passing input/output arrays to `reduce_into` and similar algorithms #3008

Open shwina opened 1 day ago

shwina commented 1 day ago

Currently, reduce_into usage looks like:

# construct the reducer:
reducer = cudax.reduce_into(d_in, d_out, op, h_init)

# allocate temp storage
temp_storage_bytes = reducer.reduce_into(None, d_in, d_out, op, h_init)
d_temp = cuda.device_array(temp_storage_bytes)
result = reducer.reduce_into(d_temp, d_in, d_out, op, h_init)

Note that the initial construction of reducer shouldn't strictly need the arguments d_in, d_out, and h_init. In fact, passing placeholder arrays of the same data types would serve the same purpose.

We should refactor such that only the required information is passed into the constructor.
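A minimal sketch of what the proposed refactor could look like, using a pure-Python stand-in (the class name, constructor signature, and reduction loop here are hypothetical, not the actual cuda.parallel API): the constructor takes only type information and the operator, and the arrays are supplied at call time.

```python
import numpy as np

class Reducer:
    """Hypothetical reducer built from dtypes + operator only."""

    def __init__(self, in_dtype, out_dtype, op, h_init):
        # Only the type information and the operator are needed up front;
        # a real implementation would do its codegen here.
        self.in_dtype = np.dtype(in_dtype)
        self.out_dtype = np.dtype(out_dtype)
        self.op = op
        self.h_init = h_init

    def __call__(self, d_in, d_out):
        # Arrays are bound per call; a real implementation would also
        # handle temp-storage sizing and allocation at this point.
        acc = self.h_init
        for x in d_in.astype(self.in_dtype):
            acc = self.op(acc, x)
        d_out[0] = self.out_dtype.type(acc)
        return d_out

reducer = Reducer(np.int32, np.int32, lambda a, b: a + b, 0)
d_out = np.zeros(1, dtype=np.int32)
reducer(np.arange(10, dtype=np.int32), d_out)  # d_out[0] == 45
```

The point of the sketch is only the split: everything needed for compilation goes into the constructor, and everything that varies per invocation (the actual buffers) goes into the call.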

Additional Context

https://github.com/NVIDIA/cccl/pull/3001/#discussion_r1866733390

leofang commented 15 hours ago

I was under the impression that d_in/d_out could be iterators too, in which case reduce_into would need to know them (as part of the problem definition) for later codegen?
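To illustrate the concern with a pure-Python stand-in (the class and method names here are hypothetical, not the cuda.parallel iterator API): a counting iterator has no backing array, so its definition is part of the problem itself, and the reducer cannot be built without it.

```python
import numpy as np

class CountingIterator:
    """Hypothetical stand-in for a device-side counting iterator."""

    def __init__(self, start, dtype=np.int32):
        self.start = start
        self.dtype = np.dtype(dtype)

    def get(self, i):
        # Element i is computed on the fly, not loaded from memory,
        # so generated code must embed this expression.
        return self.dtype.type(self.start + i)

it = CountingIterator(5)
# Reducing the first 4 elements: 5 + 6 + 7 + 8 == 26
total = sum(int(it.get(i)) for i in range(4))
```

For plain device arrays, only the dtype matters at construction time; for iterators like this one, the codegen depends on the iterator's element-producing logic, which is why the constructor may still need to see it.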

rwgk commented 15 hours ago

> I was under the impression that d_in/d_out could be iterators too,

Yes. The API I have right now (in #2788) is hack-ish, just enough for full testing. I want to discuss with @shwina (today) what the iterator API should look like.