NVIDIA / cccl

CUDA Core Compute Libraries
https://nvidia.github.io/cccl/

Should `any_resource` be copyable? #2379

Closed: harrism closed this issue 2 months ago

harrism commented 3 months ago

I don't think any_resource should support copy construction or copy assignment.

What does it mean to copy a memory resource? A memory resource may own outstanding allocations amounting to a LOT of memory. In some cases (pool memory resources), it may have allocated nearly all of the memory on a device, or in the system. What does it mean to copy that pool memory resource?

What if the copy is used to deallocate? Presumably the implementation of the pool doesn't know how to update the free lists of the original.

What is the intended use of copying any_resource?

CC @wence-

jrhemstad commented 2 months ago

I recall discussing this with @ericniebler and the TL;DR is that an any_resource is copyable only if the resource it holds is also copyable.

I believe attempting to copy an any_resource that holds a non-copyable upstream will cause an exception to be thrown.

ericniebler commented 2 months ago

not quite. we want our containers to own their memory resources just as STL containers own their allocators. also, we want our containers to be movable and copyable, and so the memory resource must also be movable and copyable.

which brings us to the question of: what does it mean to copy a memory resource? it doesn't mean somehow trying to clone its state. from the type- and value-theoretic point of view, a memory resource doesn't have any observable state; to wit, the resource concept doesn't have any accessors that let you inspect its state. from that perspective, memory resources are immutable objects. when you have immutable objects, you get referential transparency and value semantics for free.

this is a round-about way of saying that we can have copyable memory resources by simply ref-counting them. we should provide such a ref-counting resource wrapper to make this easy. that is what you would pass to the container's constructor.
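roughly, such a wrapper could look like the following. this is only a sketch under my own naming; the real cccl wrapper may differ, and `Resource` is assumed to satisfy the resource concept (allocate/deallocate members):

#include <cstddef>
#include <memory>
#include <utility>

// sketch of a ref-counting resource wrapper: the wrapped resource is
// heap-allocated once and shared, so copies only bump a reference count,
// and every copy deallocates into the same underlying resource.
template <class Resource>
class shared_resource_sketch {
  std::shared_ptr<Resource> impl_;

public:
  template <class... Args>
  explicit shared_resource_sketch(Args&&... args)
      : impl_(std::make_shared<Resource>(std::forward<Args>(args)...)) {}

  void* allocate(std::size_t bytes, std::size_t alignment) {
    return impl_->allocate(bytes, alignment);
  }
  void deallocate(void* ptr, std::size_t bytes, std::size_t alignment) {
    impl_->deallocate(ptr, bytes, alignment);
  }
};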

but note, you don't always need to dynamically allocate and ref-count your memory resource. if you understand the lifetimes of the container and of the memory resource you'd like it to use, you can construct the container with a resource_ref that points to the resource. referential transparency makes this use sound as well.

besides, our most commonly used memory resources are actually empty, trivially-copyable, "flyweight" objects that delegate to the CUDA runtime.

all of this presupposes that the memory resource is thread-safe. if it's not, you'll have to wrap it in something that makes it thread-safe, or else be damn sure that it's not getting accessed concurrently.

EDIT: this is making me think that we should require that allocate and deallocate are callable on const-qualified memory resources. attn: @pciolkosz @miscco
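for illustration, that requirement could be expressed as a C++20 concept along these lines (a sketch, not the actual cccl definition):

#include <concepts>
#include <cstddef>

// requires allocate/deallocate to be callable on a const-qualified
// resource, which is what makes the "immutable object" view sound
template <class R>
concept const_callable_resource =
    requires(R const& r, void* p, std::size_t bytes, std::size_t align) {
      { r.allocate(bytes, align) } -> std::same_as<void*>;
      { r.deallocate(p, bytes, align) };
    };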

jrhemstad commented 2 months ago

@ericniebler I agree with everything you said, but want to clarify this point:

our most commonly used memory resources are actually empty, trivially-copyable, "flyweight" objects that delegate to the CUDA runtime.

This shouldn't be the use case we optimize for. The resource we want most people to use is a memory pool, which means it has state and won't be copyable.

harrism commented 2 months ago

we want our containers to own their memory resources just as STL containers own their allocators.

Who is "we", and why do they want this? Maybe I don't understand what "we" mean by "own". In RMM, our containers do not own their MRs. They hold a reference to their MRs.

Agree with @jrhemstad , in RAPIDS, our most commonly used memory resources are definitely not flyweight objects.

ericniebler commented 2 months ago

if the containers store the memory resource by reference, that's the only option you have. if you store it by value, then you can pass a value-ish memory resource, or a reference-ish resource, or a ref-counted resource. it gives the user more control. rmm would pass a resource_ref when constructing a container to store the resource by reference.
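schematically, storing the resource by value lets the caller pick the semantics. a fragment only; the constructor signatures and the resource_ref/make_shared_resource spellings here are assumptions, not the actual APIs:

some_pool_resource pool;  // stateful resource with a known lifetime

// reference semantics: caller must keep `pool` alive as long as `v1` lives
auto v1 = cudax::vector<int>(resource_ref{pool}, n);

// ref-counted semantics: the handle itself keeps the resource alive
auto v2 = cudax::vector<int>(make_shared_resource<some_pool_resource>(), n);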

harrism commented 2 months ago

So what actually does happen when one creates an any_resource from an instance of a non-copyable, non-movable memory resource class?

miscco commented 2 months ago

I would like to reiterate the goals for the new CUDA APIs:

  1. easy to use by default
  2. easy to get optimal performance

Point 1 means that for any container we need to bind the lifetime of the underlying memory resource to the container. Otherwise users will frequently fail to account for it and will run into segfaults.

That is the reason we require cudax::uninitialized_buffer and cudax::vector to own their memory resources through any_resource. At the same time, we want to be able to copy vectors, which again falls under "easy to use by default".

That means that an any_resource must be able to copy its underlying resource. There is simply no generic reason to disallow that.

Point 2 means that we give the user the ability to make the safe default as performant as possible. You can have a look at my tests for cudax::vector, where I use a caching resource to avoid frequent small allocations on device.

It's not an especially large resource, but copying it would be rather stupid. For that reason the resource is not passed by value, but through a resource_ref. As we discussed two weeks ago, this allows the user to safely share a resource: the container copies the resource_ref, which is free, rather than the resource, which might be expensive.
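Schematically, the pattern looks like this (names are illustrative, not the exact test code):

caching_resource cache{device_resource{}};  // stateful, expensive to copy

for (int i = 0; i < iterations; ++i) {
  // the container copies only the resource_ref, never the cache itself
  auto buf = cudax::vector<int>(resource_ref{cache}, n);
  // ... use buf ...
}  // deallocations go back into the cache for reuse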

This gives us the best of both worlds, without artificially constraining us into a corner.

wence- commented 2 months ago

Perhaps I am not understanding something, but how does the requirement for copyable containers work with passing resource_refs to their constructors?

Suppose I make a vector whose allocation I want to control. I know the lifetime, and that it doesn't need to be managed with anything fancy, so I use a resource_ref:

{
  auto resource = some_internally_stateful_resource();
  auto vec = cuda::std::vector<...>(size, resource_ref{resource});
  ... do something with vec;
  // ~vec runs, ~resource runs, fine
}

Now suppose that in "do something", I make a call into a third-party library passing vec that, for whatever reason, takes a copy of vec whose lifetime outlives the block. How do I arrange that this is safe to do, or if not safe, that I (preferably) get a compile-time error?

Is the answer that any function that is going to copy the container also has to take the memory resource, so that the lifetime requirements are explicit?

Or is it that, in this scenario, my third-party library should somehow reject vectors that do not have a ref-counted memory resource inside them as their allocator?

Or some third thing I haven't thought of?

miscco commented 2 months ago

Now suppose that in "do something", I make a call into a third-party library passing vec that, for whatever reason, takes a copy of vec whose lifetime outlives the block. How do I arrange that this is safe to do, or if not safe, that I (preferably) get a compile-time error?

This is not safe to do and will result in a runtime error. What the user should have done is pass in the resource as-is:

{
  auto vec = cuda::std::vector<...>(some_internally_stateful_resource(), size);
  ... do something with vec;
  // ~vec runs, ~resource runs, fine
}

That way, the resource would at least still be alive. On the other hand, if the user really wants to pass in a resource, they need to do something more fancy: either move the construction out of the block, or (and we should add that) use a "shared" resource that is ref-counted.
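With such a ref-counted wrapper, the example from above becomes safe (the shared_resource/make_shared_resource spellings are assumed here, per the PR linked below):

{
  // the resource is ref-counted: any copy of `vec` that the third-party
  // library keeps around also keeps the resource alive past this block
  auto resource = make_shared_resource<some_internally_stateful_resource>();
  auto vec = cuda::std::vector<...>(resource, size);
  ... do something with vec;  // copies of vec are now safe
}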

miscco commented 2 months ago

I have opened a PR for such a simple shared resource https://github.com/NVIDIA/cccl/pull/2398

wence- commented 2 months ago

This is not safe to do and will result in a runtime error.

What kind of runtime error? use-after-free, or something concretely debuggable?

wence- commented 2 months ago

if the user really wants to pass in a resource they need to do something more fancy

I think often the user does not even know they wanted to do this. It requires knowing, potentially, details of every third-party API call they make, along with its transitive dependencies.

I agree that the ref-counted resource solves this problem, thanks! But I am concerned that unless the constructor of allocated objects always reaches for this version, the API will have many sharp edges.

harrism commented 2 months ago

But I am concerned that unless the constructor of allocated objects always reaches for this version, that the API will have many sharp edges.

I agree with this comment about sharp edges. It will be easy to forget to create a shared resource, especially when taking some global resource and wrapping it with a resource adaptor and then passing that to functions.

However, I think the common case is going to be something like we have in RMM:

auto vec = cuda::std::vector<...>(get_current_device_shared_resource(), size);

Where get_current_device_shared_resource() returns an instance of shared_resource, the ref-counted type-erased resource wrapper.

(In fact, RAPIDS will want a wrapper for cuda::std::vector that defaults to getting that resource.)

miscco commented 2 months ago

We are definitely tension field of security vs performance.

I really do not want every resource to contain a shared_ptr that does ref counting. At the same time, nothing here is new. If a user wants to share a resource, they need to make sure the resource is alive for as long as it is used.

We give them one better tool to achieve that, but I do not believe there is a golden solution that will satisfy all needs.

harrism commented 2 months ago

We are definitely tension field of security vs performance.

I don't know what this phrase means. :)

miscco commented 2 months ago

We are definitely tension field of security vs performance.

I don't know what this phrase means. :)

We need to balance security concerns with performance considerations. Using ref-counted resources everywhere is going to have a considerable performance impact, so I want to make sure that we only use them when needed.

harrism commented 2 months ago

I am not recommending refcounted resources everywhere. However I think shared ownership is needed whenever adaptors are used or in central resource management. Especially when we get into C++/Python interop.

jrhemstad commented 2 months ago

I believe we are all aligned on what we're trying to achieve and I believe the current design has everything needed to make everyone happy. TL;DR:

harrism commented 2 months ago

Good summary. Thank you!

One more thing: we need a way to create a shared_resource from a non-copyable, non-movable resource. E.g. make_shared_resource<Resource>(ctor_args...).
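Something like the following sketch (signature assumed, not an existing API), where the resource is constructed in place inside the shared state so it is never copied or moved:

#include <utility>

// forwards the ctor args straight to Resource's constructor in place,
// so Resource can be non-copyable and non-movable
template <class Resource, class... Args>
shared_resource<Resource> make_shared_resource(Args&&... args) {
  return shared_resource<Resource>{std::in_place, std::forward<Args>(args)...};
}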

harrism commented 2 months ago

A question. Often our functions take resource_ref arguments, because they are lightweight. But in the following, a function takes a resource_ref, adapts it, and constructs and returns a vector using the adapted MR:

cudax::vector<int> foo(..., resource_ref mr) {
  ...
  auto prefetch_mr = make_shared_resource<rmm::prefetch_resource_adaptor>(mr, args...);
  return cudax::vector<int>(size, prefetch_mr);
}

By using a shared_resource for the prefetch adaptor, we ensure that the adaptor's lifetime exceeds the returned vector's. However, because the upstream for the adaptor is mr, which is a ref, we have no guarantee that the upstream outlives the adaptor or the vector.

So does the recommendation now become to accept any_resource or shared_resource when objects allocated using that resource will outlive the function call? The guidance now starts to get complicated. Presumably passing any_resource or shared_resource by value is much more expensive than passing resource_ref.

jrhemstad commented 2 months ago

So does the recommendation now become to accept any_resource or shared_resource when objects allocated using that resource will outlive the function call?

This is a very good point, and I think you are correct. However, I don't think the pattern you described is representative of any libcudf examples, is it? I wouldn't expect we'd ever adapt an incoming memory resource on our own and return memory that was allocated with the adapted resource.

Presumably passing any_resource or shared_resource by value is much more expensive than passing resource_ref.

You may be pleasantly surprised. any_resource is just a resource_ref in disguise. It extends resource_ref's vtable and adds additional functions for the copy ctor/dtor/etc.

The size of the object is no different, it just requires initializing a few extra function pointers in the vtable.
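Schematically (not the actual cccl layout):

#include <cstddef>

// resource_ref's vtable: just the allocation entry points
struct resource_vtable {
  void* (*allocate)(void* self, std::size_t bytes, std::size_t align);
  void (*deallocate)(void* self, void* p, std::size_t bytes, std::size_t align);
};

// any_resource extends it with ownership operations
struct any_resource_vtable : resource_vtable {
  void (*copy)(void* dst, void const* src);
  void (*destroy)(void* self);
};

// both handles hold { void* object; vtable const* vtbl; }, so they are the
// same size, and passing any_resource by value costs about the same as
// passing resource_ref.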

harrism commented 2 months ago

it just requires initializing a few extra function pointers in the vtable.

Plus atomically incrementing a reference counter if it's a shared resource.

However, I don't think the pattern you described is representative of any libcudf examples, is it?

No, but I feel like I've seen something like it in RAFT? I am probably wrong.

Prefetching specifically is something we're doing in libcudf now, but I don't think we adapt the MR in the middle of a function like that.

ericniebler commented 2 months ago

i think we have resolved this issue. i'm closing it "not planned". @harrism feel free to reopen if you still feel that changes are needed here.