seanlaw opened this issue 2 months ago
@seanlaw I'd be happy to give this one a shot, no promises as it looks a bit complex, but would love to try!
@joehiggi1758 Do you have access to an NVIDIA GPU for testing? Otherwise, it might be very painful to assess the performance of any code changes. If you do, then please proceed and let me know if you have any questions, or we can reach out to our collaborator at NVIDIA for help as well (I'm sure there are new features that we may be able to leverage).
Alternatively, you may be interested in this new issue #1031 and attempting to reproduce the work. It has less baggage than this current issue.
Hey @seanlaw hope you're having a great Saturday!
Unfortunately, I don't have access to a GPU, other than maybe a free subscription to Azure. I think starting with the NVIDIA contact is a better plan of attack! If I can help there in any way lmk, I'd love to learn more about GPUs!
I'll focus on #1031 for now; you're right that it does look like a better next issue for me!
Several years ago, we considered (see #266) adding a variant of GPU-STUMP that utilized cooperative groups and that would allow us to push the multiple kernel launches onto the device. Earlier work was concerned about:

- limited `cudatoolkit` support for cooperative groups
- older GPUs that lack cooperative group support
However, `cudatoolkit` support is much better now, and older GPUs that lack cooperative group support are likely end-of-life (and so the above concerns are likely a thing of the past). Additionally, `numba` has moved ahead many, many versions since our last attempt. Thus, we should reconsider adding this to STUMPY. PR #266 provides some clear code for how to proceed and had demonstrated a 12% speedup, which is great! See also the `numba` docs on cooperative groups.
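For context on what grid-wide synchronization buys us: in `numba`, a kernel can obtain `cuda.cg.this_grid()` and call `grid.sync()` so that every block waits at a device-side barrier. A multi-step computation that previously needed one kernel launch per step (with the host acting as the synchronization point) can then be fused into a single launch. Since not everyone in this thread has an NVIDIA GPU handy, here is a rough CPU-only analogy, with Python threads standing in for thread blocks and `threading.Barrier` standing in for `grid.sync()` (the names `fused_kernel`, `N_BLOCKS`, etc. are made up for illustration and are not STUMPY code):

```python
import threading

N_BLOCKS = 4   # one thread stands in for one CUDA thread block
N_STEPS = 3    # steps that would otherwise be separate kernel launches

data = [0, 1, 2, 3]                    # shared "global memory"
barrier = threading.Barrier(N_BLOCKS)  # stand-in for grid.sync()

def fused_kernel(block_id):
    # All N_STEPS run inside one "launch"; the barrier replaces the
    # host-side synchronization between successive kernel launches.
    for _ in range(N_STEPS):
        neighbor = data[(block_id + 1) % N_BLOCKS]  # read phase
        barrier.wait()  # all reads finish before any block writes
        data[block_id] += neighbor                  # write phase
        barrier.wait()  # all writes finish before next step's reads

threads = [threading.Thread(target=fused_kernel, args=(b,))
           for b in range(N_BLOCKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(data)  # deterministic thanks to the barriers
```

Without the two barriers, a fast "block" could overwrite `data[block_id]` before its neighbor has read the old value — exactly the hazard that forces separate kernel launches when grid-wide sync is unavailable on the device.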