I'll take a stab at this.
My current plan is to add the trait AsyncCopyDestination, which would have async_copy_from and async_copy_to. My other thought was implementing a version of CopyDestination that is async, but that precludes doing both sync and async copying.
@bheisler Thoughts?
Additionally, device.rs is getting rather large (~1200 lines), so I'd like your thoughts on splitting it into sync.rs for synchronous memory transfer functions and tests, async.rs for asynchronous functions and tests, and then device.rs for the rest.
Actually, spinning off DeviceBox, DeviceSlice, DeviceChunks, and DeviceBuffer into their own files, if possible, would probably be cleaner.
I would split device.rs into three files, for DeviceBox, DeviceSlice/DeviceChunks, and DeviceBuffer.
As for CopyDestination - I'm still not entirely sold on the current design for that trait. Having both copy_from and copy_to on the same trait seems unnecessary. Anyway, fixing that (if I do fix that) would be another issue. Yeah, an AsyncCopyDestination seems like a good way to go.
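For concreteness, here's a minimal sketch of what that trait could look like, using the method names from above; the Stream parameter, the unsafe markers, and the use of RustaCUDA's CudaResult are assumptions, not a settled design:

```rust
use rustacuda::error::CudaResult;
use rustacuda::stream::Stream;

/// Sketch only: an async counterpart to CopyDestination. The methods are unsafe
/// because the caller must keep both buffers alive (and unmodified) until the
/// stream they were queued on has been synchronized.
pub trait AsyncCopyDestination<O: ?Sized> {
    /// Queue an asynchronous copy from `source` into this destination on `stream`.
    unsafe fn async_copy_from(&mut self, source: &O, stream: &Stream) -> CudaResult<()>;

    /// Queue an asynchronous copy from this source into `dest` on `stream`.
    unsafe fn async_copy_to(&self, dest: &mut O, stream: &Stream) -> CudaResult<()>;
}
```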
Hmmm, I forgot how tricky async safety is. To make sure the arguments stay valid, maybe returning a promise bound to the lifetime of the passed references is the way to go?
Yeah, this will be tricky alright. I haven't planned out a design for this. The only time we can be sure that it's safe to drop either the host-side or the device-side half of the transfer is after a synchronize (and it has to synchronize on the same stream as well).
I've been thinking about using the Futures API to handle asynchronous stuff safely (though I'm still fuzzy on the details), so it might be necessary to hold off on this until we figure that out some more.
My current thought is something similar to this code. This would also require bookkeeping in the buffers themselves to panic if the promise is dropped and the buffers are then used.
Alternatively, we could wait longer for async/await, the futures book, and all the other async goodies, and then go for the implementation, but I think that would require the same panic bookkeeping.
Unfortunately you can't do this. Forgetting a value is safe in Rust. Therefore you could forget the promise while the buffers are still borrowed: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=26a7e7d4bb0a348ca05cc210e7c3962a
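To make the failure mode concrete, here's a minimal sketch of the problem (the promise type is made up, not RustaCUDA code): since mem::forget is safe, the destructor that would have synchronized never runs, yet the borrow of the buffer still ends.

```rust
use std::mem;

/// Hypothetical promise that borrows the host buffer for the duration of the
/// copy and would synchronize in its Drop impl.
struct CopyPromise<'a> {
    _buf: &'a mut [u8],
}

impl Drop for CopyPromise<'_> {
    fn drop(&mut self) {
        // Imagine this blocks until the async copy has finished.
    }
}

fn start_async_copy(buf: &mut [u8]) -> CopyPromise<'_> {
    // Imagine cuMemcpyAsync being queued here.
    CopyPromise { _buf: buf }
}

fn main() {
    let mut buf = vec![0u8; 1024];
    let promise = start_async_copy(&mut buf);
    // Forgetting is safe, so Drop (and the synchronize it would perform) never
    // runs, yet the mutable borrow of `buf` ends here...
    mem::forget(promise);
    // ...and the buffer can be freed while the device may still be using it.
    drop(buf);
}
```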
In rsmpi we solve this using a scope concept. See the example in the README. You can also look at the documentation. Essentially, a scope object has a shorter lifetime than all buffers used by the scope, and only a reference to the scope is given to user code, meaning it can't be forgotten. New Requests (essentially promises in MPI) are attached to a scope. When the scope lambda ends, it panics if any Requests attached to the scope were not completed.
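As a rough illustration of that idea (a generic sketch, not rsmpi's actual API): user code only ever receives a reference to the scope, so the scope itself can't be forgotten, and at the end it can verify that every attached request was completed.

```rust
use std::cell::Cell;

/// Tracks how many requests attached to this scope are still outstanding.
struct Scope {
    pending: Cell<usize>,
}

/// A request tied to the scope by lifetime; waiting on it decrements the counter.
struct Request<'s> {
    scope: &'s Scope,
}

impl<'s> Request<'s> {
    fn wait(self) {
        // Imagine this blocks until the underlying operation finishes.
        self.scope.pending.set(self.scope.pending.get() - 1);
    }
}

impl Scope {
    fn attach(&self) -> Request<'_> {
        self.pending.set(self.pending.get() + 1);
        Request { scope: self }
    }
}

/// Run `f` with a scope; panic if any attached Request was leaked or never waited on.
fn scoped<F: FnOnce(&Scope)>(f: F) {
    let scope = Scope { pending: Cell::new(0) };
    f(&scope);
    assert_eq!(scope.pending.get(), 0, "unfinished requests at end of scope");
}
```

Even if a request is forgotten, the mismatched counter is caught when the scope ends, and the buffers (which must outlive the scope) are still borrowed at that point.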
We also currently have an outstanding PR (that I still need to finish 😊) that allows you to attach a buffer wholesale to a request (i.e. Vec&lt;T&gt;, Box&lt;[T]&gt;, etc.). That's another thing you could allow.
Yeah, I didn't really explain my ideas for the bookkeeping around what happens if the promise is dropped. My bad on that.
Anyways, this scope approach looks very promising!
Yeah, that fits really well with how I was planning to handle futures.
See, it's not zero-cost to create a Future tied to a CUDA stream - you have to add a CUevent to the stream that can be polled, and probably queue up a callback to wake up the future. I don't think most users will want to poll on every asynchronous task, so I figured I'd do something like this:
```rust
let stream = ...;

// Need better names
let future = stream.submit(|executor| {
    // Submit a bunch of async work using the executor
    Ok(())
})?;
```
Then the submit function would return a Future that contained the closure (to keep the references alive). On the first poll (or maybe immediately? Not sure which is best) it would execute the closure to submit the work, submit the event and callback, and prepare to poll the event to wait until the work is complete.
If we add non-futures-based async functions, that can just be a different submit function that synchronize()'s to block after calling the closure.
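A minimal sketch of that blocking variant, with an assumed Executor type and function name (not an existing RustaCUDA API): the closure queues work on the stream, and the stream is synchronized before anything the closure borrowed is released.

```rust
use rustacuda::error::CudaResult;
use rustacuda::stream::Stream;

/// Hypothetical handle passed to the closure for queueing async work on one stream.
pub struct Executor<'s> {
    pub stream: &'s Stream,
}

/// Sketch of the non-futures variant: run the closure to queue the work, then
/// block until the stream has drained so the closure's borrows are safe to drop.
pub fn submit_sync<F>(stream: &Stream, f: F) -> CudaResult<()>
where
    F: FnOnce(&Executor) -> CudaResult<()>,
{
    let executor = Executor { stream };
    let result = f(&executor);
    stream.synchronize()?;
    result
}
```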
Now that I think about it, this would probably help solve the safety problems with Contexts as well.
Ah, I think I understand what you're saying now and think that should work.
Cool, it works. Link.
Will need to sprinkle in some unsafe black magic, so that the data can be copied back from the mutable buffer by future async_memcpy calls.
Slight problem with that: Link.
Scheduling multiple copies using the same buffer is completely safe as long as they're all on the same stream, but this implementation disallows it.
Yeah, that's what I was getting at with the second part of my comment.
My current solution is to return the references wrapped such that later async_copy calls can consume them, but they can't be dereferenced by other things.
```rust
let (_, mid_ref) = executor.copy(start_ref, mid_ref);
executor.copy(mid_ref, end_ref);
```
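A sketch of what that wrapper could look like (all names here are made up for illustration): once a buffer has been handed to an async copy, the caller only gets back an opaque token that later copies on the same stream can consume, but which does not implement Deref, so the data can't be read or written while it may still be in flight.

```rust
/// Opaque token for a buffer that has been used in an async copy. It deliberately
/// does not implement Deref/DerefMut, so the data is inaccessible until the
/// stream is synchronized and the token is unwrapped again.
pub struct InFlight<'a, T>(&'a mut [T]);

pub struct Executor;

impl Executor {
    /// Queue an async copy and return tokens for both buffers so they can be
    /// chained into later copies on the same stream.
    pub fn copy<'a, T>(
        &self,
        src: InFlight<'a, T>,
        dst: InFlight<'a, T>,
    ) -> (InFlight<'a, T>, InFlight<'a, T>) {
        // Imagine cuMemcpyAsync(dst, src, stream) being queued here.
        (src, dst)
    }
}
```

Whether the tokens are created from plain references or from the typed buffers is a detail to work out; the point is only that the wrapped form is consumable but not dereferenceable.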
I'd be very wary of unsafe black magic in this case - we could end up introducing undefined behavior while trying to hide other undefined behavior.
Anyway, this is kinda what I was thinking. If you can find an ergonomic way to make it unsafe to modify the buffers while they're used in an async copy, that's great. If not, I'd be OK with just doing this even if it is slightly vulnerable to data races.
How is pinned host memory handled right now? Is that what the DeviceCopy trait indicates?
Additionally, implementing the unsafe wrapper layer is done now, save for the test async_copy_device_to_host not actually synchronizing for some reason and failing some of the time. I think the stack is pinned, so I don't think that is the cause.
After solving that issue, next up will be trying to wrap this all safely as futures, based on our earlier discussion.
Page-locked memory is all handled by the driver. You call a certain CUDA API function to allocate and free page-locked memory. The driver tracks which memory ranges are locked and uses a fast path for copies to/from those ranges.
DeviceCopy is for structures that can safely be copied to the device (i.e. they don't manage host-side resources or contain pointers that are only valid on the host). It has nothing to do with page-locking, pinning or anything else.
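For reference, page-locked host memory comes from the driver's cuMemAllocHost/cuMemHostAlloc and is released with cuMemFreeHost; in RustaCUDA this is (as far as I understand) what LockedBuffer wraps. A hedged usage sketch, assuming a LockedBuffer::new(&amp;initial_value, length) constructor:

```rust
use rustacuda::error::CudaResult;
use rustacuda::memory::LockedBuffer;

/// Allocate a page-locked (pinned) host buffer of 1024 u32s, zero-initialized.
/// The driver records the range as page-locked, enabling the fast copy path.
fn make_pinned_staging_buffer() -> CudaResult<LockedBuffer<u32>> {
    let buffer = LockedBuffer::new(&0u32, 1024)?;
    Ok(buffer)
}
```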
Alright, so AsyncMemcpy requires pinned memory, but it also errors properly at runtime if given memory that isn't page-locked, so we don't necessarily need to mark that in the wrapper.
The error I was mentioning only appears when multiple tests are run at the same time.
EDIT: Nevermind, it appears rarely when run alone.
Alright, to sum up my current thoughts on this:
context.sync() or stream.sync() will probably be useful for proofs of concept and certain workflows, so I also want that as an option, though it might provide only runtime safety instead of compile-time safety. AsyncMemcpy cannot take T: DeviceCopy anymore.
Could you elaborate more on this? Why not?
Previously, I thought AsyncCopyDestination might be able to take T: DeviceCopy references as values for copy_from and copy_to, just like CopyDestination. However, it appears not using page-locked memory will throw errors according to the documentation. I haven't gotten these errors, only silent failures, which is strange, so I might write some more test cases to chase down what's going on there.
Hey, I'm really interested in this feature! (I'm porting my hobby raytracer to rustacuda)
I'd be completely fine with really low-tech solutions to this problem, just to get the feature out there:
1) Just make copy_{from,to}_async unsafe, and point out in the documentation: "hey, don't free this memory".
2) Make copy_{from,to}_async take an Arc&lt;LockedBuffer&gt;. Then, inside, add the Arc to a Vec that gets cleared whenever the user calls synchronize or when any (possibly user-provided) CUevent callback fires (in that case, only clear the ones added before the event was queued) - a rough sketch of this is below.
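A minimal sketch of option 2, with assumed type and method names (the actual copy call is left as a placeholder): a stream wrapper holds the Arc of every host buffer used in an async copy and only drops them once the stream has been synchronized.

```rust
use std::sync::Arc;

use rustacuda::error::CudaResult;
use rustacuda::memory::{DeviceBuffer, LockedBuffer};
use rustacuda::stream::Stream;

/// Hypothetical wrapper that keeps host buffers alive until synchronization.
pub struct TrackedStream {
    stream: Stream,
    in_flight: Vec<Arc<LockedBuffer<u8>>>,
}

impl TrackedStream {
    /// Queue an async host-to-device copy and retain the host buffer's Arc so it
    /// can't be freed while the copy may still be running.
    pub fn copy_to_device_async(
        &mut self,
        host: Arc<LockedBuffer<u8>>,
        _device: &mut DeviceBuffer<u8>,
    ) -> CudaResult<()> {
        // Imagine cuMemcpyHtoDAsync being queued on self.stream here.
        self.in_flight.push(host);
        Ok(())
    }

    /// Block until the stream drains, then release all retained host buffers.
    pub fn synchronize(&mut self) -> CudaResult<()> {
        self.stream.synchronize()?;
        self.in_flight.clear();
        Ok(())
    }
}
```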
Something that I can't seem to find any documentation on is the behavior of the driver when a buffer is freed in the middle of work. The driver may already take care of the hard parts of this - cuMemFreeHost may not actually free the buffer until work is complete. In that case, we wouldn't have to worry about any of this.
(I'd be happy to write a PR for option 1, and if you like it, a PR for option 2 given a bit of time)
Let me finish up 1. It's pretty much done with a PR up right now, I just need to rebase it and clean a bit more, but have been slow on that because of the holidays. I'll schedule some time to finish it up by tomorrow.
I'll defer to you on doing 2, since I'll be busy for a while. I think you probably want something more of the form Arc&lt;RWLock&lt;LockedBuffer&gt;&gt; though for the safe form, where the locks held are released after synchronization.
See #20 for the PR I'm writing.
> Hey, I'm really interested in this feature! (I'm porting my hobby raytracer to rustacuda)
Thanks for your interest, and thanks for trying RustaCUDA! Yeah, I'd be interested in pull requests, though rusch95 has already submitted a WIP PR to add an unsafe interface for async memcpy. We may have to iterate a few times to find a good balance of safety, ergonomics and performance for the safe interface.
> Previously, I thought AsyncCopyDestination might be able to take T: DeviceCopy references as values for copy_from and copy_to, just like CopyDestination. However, it appears not using page-locked memory will throw errors according to the documentation. I haven't gotten these errors, only silent failures, which is strange, so I might write some more test cases to chase down what's going on there.
This is outdated information. The documentation you're referencing is for CUDA 2.3.
Modern CUDA versions are able to use any type of memory (both pageable and page-locked) in cuMemcpyAsync(). The documentation makes no comment on page-locked memory anymore. In fact, I've already used pageable memory in a project before.
Please do refer to, e.g., the CUDA 8.0 documentation or later. It would be unfortunate if RustaCUDA were to enforce such outdated limitations using the Rust type system.
Previously, my test failures entirely vanished when I switched from pageable to page-locked memory, but sure, I'll look into it. It's possible the failures came from some other issue that switching to page-locked memory happened to fix.
Digging this back up now that async is mostly stabilized to note that I'll try adding in a proper async API.
Copying memory asynchronously allows the memcpy to overlap with other work, as long as that work doesn't depend on the copied data. This is important for optimal performance, so RustaCUDA should provide access to it.