NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/

[FEA]: `shared_ptr` that can be used in kernel code #1841

Open mrakgr opened 5 months ago

mrakgr commented 5 months ago

Is this a duplicate?

Area

libcu++

Is your feature request related to a problem? Please describe.

I asked around, but got a negative answer. My own search of this repo only turned up shared_ptr being used in host code.

Basically, my problem is that I am working on a reference counting CUDA backend for Spiral, and I am reconsidering whether the ref counting work should be done by the Spiral compiler itself. If I had a shared_ptr class usable in kernel code, I could compile recursive union types and various other data types to use it. Right now it would be very easy to break the ref counting passes using macros, whereas shared_ptr would mesh well with them.

The intended purpose of this class would be specifically for data that is not shared between threads. In other words, for single threaded code.

One other motivation behind having this is to reduce the compilation times of the CUDA compiler. Previously, I created the NL Hold'em game directly on the GPU, and I suspect that making use of too many value types is causing compilation times to increase exponentially.

Describe the solution you'd like

shared_ptr in device code seems like a good solution.

Describe alternatives you've considered

Currently I have my own ref counting pass in Spiral that was made for a C backend. Something like that would be the only choice in C, and it makes sense there; it is also designed to play well with tail recursion. But even though the CUDA compiler does support tail recursion, I had to rewrite the inner loop of the Leduc game into an imperative one because the tail recursive version kept overflowing the stack, so that advantage doesn't apply here. Another issue with a built-in ref counting pass is that the heap allocated types wouldn't be interoperable with C++ libraries. Again, this wouldn't matter in C, as the language is too inexpressive to have libraries worth using, but C++ is different.

Additional context

No response

mrakgr commented 5 months ago

As an aside, if you (whoever wrote the issue template) want discriminated union types and pattern matching, do check out Spiral. There is no need to bother with std::variant; what Spiral has is much better, and it compiles directly to CUDA C++ with full interop with its libraries.

mrakgr commented 5 months ago

Probably, by the time you get to this, I'll have implemented all the classes for the C++ backend manually, so I won't need this then, but opening this issue is more a way of resolving my feelings about what I want. Right now, the new backend is a weird mix of C and C++. I'll try going full C++ and see where that gets me. I hope CUDA supports virtual functions.

pauleonix commented 5 months ago

@mrakgr CUDA does support virtual functions, but under certain restrictions. Notably, an object of a class with virtual functions cannot be passed as an argument to a `__global__` function: the vtable pointer set up on one side is not valid on the other, so the object must be constructed in the same code space where its virtual functions are called.

miscco commented 4 months ago

I have some experimental code that exposes more of <memory> within libcu++ like std::unique_ptr

However, that currently does not include std::shared_ptr, and I am also highly skeptical that std::unique_ptr as-is is the right thing.

The reason is that memory safety is not trivial across heterogeneous boundaries, and we really want to make sure that we get the design right. As an example, neither std::shared_ptr nor std::unique_ptr takes an allocator or memory resource that specifies where the memory is allocated. Is it on the host? Is it in shared memory?

That may be appropriate for the standard which assumes homogeneous memory systems, but it is a bad default for our use case.

We are currently in the process of designing a cuda::vector that addresses these problems, and I believe that once we are happy with the design we should be able to easily adapt it for all the smart pointers.

mrakgr commented 4 months ago

Link: https://github.com/mrakgr/The-Spiral-Language/blob/master/The%20Spiral%20Language%202/reference_counting.cuh

I did them like this in Spiral. That said, I haven't yet run into a use case for them apart from implementing a ref counting backend.

mrakgr commented 4 months ago

They are intended for single threaded use, and since Spiral compiles to Python on the host, where they are allocated isn't something I needed to think about.