bheisler / RustaCUDA

Rusty wrapper for the CUDA Driver API
Apache License 2.0

Add support for asynchronous memcpy #8

Open bheisler opened 5 years ago

bheisler commented 5 years ago

Copying memory asynchronously allows the memcpy to overlap with other work, as long as that work doesn't depend on the copied data. This is important for optimal performance, so RustaCUDA should provide access to it.

rusch95 commented 5 years ago

I'll take a stab at this.

rusch95 commented 5 years ago

My current plan is to add the trait AsyncCopyDestination which would have async_copy_from and async_copy_to. My other thought was implementing a version of CopyDestination that is async, but that precludes doing both sync and async copying.
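
Roughly something like this sketch (the exact signatures, and whether the methods take a stream parameter, are still open questions):

use rustacuda::error::CudaResult;
use rustacuda::stream::Stream;

// Mirrors CopyDestination, but the copies are merely enqueued on a stream
// and return before completion. The methods are unsafe because the caller
// must keep both buffers alive until the stream is synchronized.
pub trait AsyncCopyDestination<O: ?Sized> {
    unsafe fn async_copy_from(&mut self, source: &O, stream: &Stream) -> CudaResult<()>;
    unsafe fn async_copy_to(&self, dest: &mut O, stream: &Stream) -> CudaResult<()>;
}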

@bheisler Thoughts?

rusch95 commented 5 years ago

Additionally, device.rs is getting rather large (~1200 lines), so I'd like your thoughts on splitting it into sync.rs for synchronous memory transfer functions and tests, async.rs for asynchronous functions and tests, and then device.rs for the rest.

rusch95 commented 5 years ago

Actually, spinning off DeviceBox, DeviceSlice, DeviceChunks, and DeviceBuffer into their own files, if possible, would probably be cleaner.

bheisler commented 5 years ago

I would split device.rs into three files for DeviceBox, DeviceSlice/DeviceChunks and DeviceBuffer.

As for CopyDestination - I'm still not entirely sold on the current design for that trait. Having both copy_from and copy_to on the same trait seems unnecessary. Anyway, fixing that (if I do fix that) would be another issue. Yeah, an AsyncCopyDestination seems like a good way to go.

rusch95 commented 5 years ago

Hmmm, I forgot how tricky async safety is. To make sure the arguments stay valid, maybe returning a promise bound to the lifetime of the passed references is the way to go?

bheisler commented 5 years ago

Yeah, this will be tricky alright. I haven't planned out a design for this. The only time we can be sure that it's safe to drop either the host-side or the device-side half of the transfer is after a synchronize (and it has to synchronize on the same stream as well).

I've been thinking about using the Futures API to handle asynchronous work safely (though I'm still fuzzy on the details), so it might be necessary to hold off on this until we figure that out some more.

rusch95 commented 5 years ago

My current thought is something similar to this code.

This would also require bookkeeping in the buffers themselves, so we can panic if the promise is dropped and the buffers are used afterwards.

Alternatively, we could wait longer for async/await, the futures book, and all the other async goodies, and then go for the implementation, but I think that would require the same panic bookkeeping.

AndrewGaspar commented 5 years ago

Unfortunately you can't do this. Forgetting a value is safe in Rust. Therefore you could forget the promise while the buffers are still borrowed: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=26a7e7d4bb0a348ca05cc210e7c3962a
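
In miniature, the hole looks like this (a self-contained sketch; Promise stands in for the proposed lifetime-bound guard that synchronizes on drop):

use std::mem;

// A guard that borrows the buffer and waits for the copy in its destructor.
struct Promise<'a> {
    _buf: &'a mut [u8],
}

impl Drop for Promise<'_> {
    fn drop(&mut self) {
        // stream.synchronize() would go here
    }
}

fn main() {
    let mut buf = vec![0u8; 16];
    let promise = Promise { _buf: &mut buf };
    // mem::forget is safe Rust: the destructor never runs, so nothing waits
    // for the copy, yet the borrow ends here as far as rustc is concerned.
    mem::forget(promise);
    // The buffer can now be freed while the (hypothetical) async copy is
    // still writing into it.
    drop(buf);
}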

In rsmpi we solve this using a scope concept. See the example in the README. You can also look at the documentation. Essentially, a scope object has a shorter lifetime than all buffers used by the scope, and only a reference to the scope is given to user code, meaning it can't be forgotten. New Requests (essentially promises in MPI) are attached to a scope. When the scope lambda ends, it panics if any Requests attached to the scope were not completed.
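
A stripped-down sketch of the mechanism (placeholder names, not rsmpi's actual API):

use std::cell::Cell;

// User code only ever receives &Scope, so the scope itself cannot be
// mem::forget-ten, and the completion check at the end always runs.
struct Scope {
    incomplete: Cell<usize>,
}

impl Scope {
    fn attach_request(&self) {
        self.incomplete.set(self.incomplete.get() + 1);
    }

    fn complete_request(&self) {
        self.incomplete.set(self.incomplete.get() - 1);
    }
}

fn scope<F, R>(f: F) -> R
where
    F: FnOnce(&Scope) -> R,
{
    let s = Scope { incomplete: Cell::new(0) };
    let result = f(&s);
    // The lambda has returned; any request still attached was leaked.
    assert_eq!(s.incomplete.get(), 0, "scope ended with incomplete requests");
    result
}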

We also currently have an outstanding PR (that I still need to finish 😊) that allows you to attach a buffer wholesale to a request (e.g. Vec<T>, Box<[T]>, etc.). That's another thing you could allow.

rusch95 commented 5 years ago

Yeah, I didn't really explain my ideas for the bookkeeping around what happens if the promise is dropped. My bad on that.

Anyways, this scope approach looks very promising!

bheisler commented 5 years ago

Yeah, that fits really well with how I was planning to handle futures.

See, it's not zero-cost to create a Future tied to a CUDA stream - you have to add a CUevent to the stream that can be polled and probably queue up a callback to wake up the future. I don't think most users will want to poll on every asynchronous task, so I figured I'd do something like this:

let stream = ...;
// Need better names
let future = stream.submit(|executor| {
    // Submit a bunch of async work using the executor
    Ok(())
})?;

Then the submit function would return a Future that contains the closure (to keep the references alive). On the first poll (or maybe immediately? Not sure which is best) it would execute the closure to submit the work, record the event, queue the callback, and prepare to poll the event until the work is complete.
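
Very roughly, the poll logic could look like this (Event here is a stand-in for a CUevent wrapper; none of these names exist in RustaCUDA yet):

use std::future::Future;
use std::pin::Pin;
use std::task::{Context, Poll};

// Placeholder for a CUevent wrapper; query() would ask the driver whether
// the stream has reached the event yet.
struct Event;

impl Event {
    fn query(&self) -> bool {
        true
    }
}

// The future owns the closure, so any buffers it borrows stay alive at
// least until the event reports completion.
struct SubmitFuture<F> {
    submit: Option<F>,
    event: Option<Event>,
}

impl<F> Future for SubmitFuture<F>
where
    F: FnOnce() -> Event + Unpin,
{
    type Output = ();

    fn poll(self: Pin<&mut Self>, cx: &mut Context<'_>) -> Poll<()> {
        let this = self.get_mut();
        // First poll: run the closure to queue the async work, then record
        // an event behind it on the stream.
        if let Some(submit) = this.submit.take() {
            this.event = Some(submit());
        }
        match &this.event {
            Some(event) if event.query() => Poll::Ready(()),
            _ => {
                // A real implementation would register a stream callback
                // that wakes the task instead of busy-waking like this.
                cx.waker().wake_by_ref();
                Poll::Pending
            }
        }
    }
}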

If we add non-futures-based async functions, that can just be a different submit function that synchronize()'s to block after calling the closure.

Now that I think about it, this would probably help solve the safety problems with Contexts as well.

rusch95 commented 5 years ago

Ah, I think I understand what you're saying now, and I think that should work.

rusch95 commented 5 years ago

Cool, it works. Link.

Will need to sprinkle in some unsafe black magic, so that the data can be copied back from the mutable buffer by future async_memcpy calls.

bheisler commented 5 years ago

Slight problem with that: Link.

Scheduling multiple copies using the same buffer is completely safe as long as they're all on the same stream, but this implementation disallows it.

rusch95 commented 5 years ago

Yeah, that's what I was getting at with the second part of my comment.

My current solution is to return the references wrapped such that later async_copy calls can consume them, but they can't be dereferenced by anything else.

let (_, mid_ref) = executor.copy(start_ref, mid_ref);
executor.copy(mid_ref, end_ref);
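
A sketch of what those wrappers could look like (all names hypothetical):

// A token that can be threaded into later copies on the same stream, but
// that ordinary code cannot dereference.
struct InFlight<'a, T>(&'a mut [T]);

struct Executor;

impl Executor {
    // Consumes both tokens and returns them, so the destination of one copy
    // can become the source of the next while the data stays untouchable.
    fn copy<'a, T>(
        &self,
        src: InFlight<'a, T>,
        dst: InFlight<'a, T>,
    ) -> (InFlight<'a, T>, InFlight<'a, T>) {
        // ... cuMemcpyAsync would be enqueued here ...
        (src, dst)
    }
}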

bheisler commented 5 years ago

I'd be very wary of unsafe black magic in this case - we could end up introducing undefined behavior while trying to hide other undefined behavior.

Anyway, this is kinda what I was thinking. If you can find an ergonomic way to make it unsafe to modify the buffers while they're used in an async copy, that's great. If not, I'd be OK with just doing this even if it is slightly vulnerable to data races.

rusch95 commented 5 years ago

How is pinned host memory done right now? Is that what the DeviceCopy trait indicates?

rusch95 commented 5 years ago

Additionally, the unsafe wrapper layer is now implemented, save for the test async_copy_device_to_host not actually synchronizing for some reason and failing some of the time. I think the stack is pinned, so I don't think that is the cause.

After solving that issue, next up will be trying to wrap this all safely as futures, based on our earlier discussion.

bheisler commented 5 years ago

Page-locked memory is handled entirely by the driver. You call a dedicated CUDA API function (cuMemAllocHost / cuMemFreeHost) to allocate and free page-locked memory. The driver tracks which memory ranges are locked and uses a fast path for copies to/from those ranges.

bheisler commented 5 years ago

DeviceCopy is for structures that can safely be copied to the device (i.e. they don't manage host-side resources or contain pointers that are only valid on the host). It has nothing to do with page-locking, pinning, or anything else.

rusch95 commented 5 years ago

Alright, so AsyncMemcpy requires pinned memory, but it also raises a proper runtime error if given memory that isn't page-locked, so we don't necessarily need to mark that in the wrapper.

rusch95 commented 5 years ago

The error I was mentioning only appears when multiple tests are run at the same time.

EDIT: Nevermind, it appears rarely when run alone.

rusch95 commented 5 years ago

Alright, to sum up my current thoughts on this:

  1. AsyncMemcpy requires locked memory, so we can async copy to and from devices using only DeviceFoo, UnifiedFoo, and LockedFoo.
  2. AsyncMemcpy cannot take T: DeviceCopy anymore. Thus, providing safety by forcing the arguments to be of a certain form, specifically a form protected by a R/W lock, isn't too horrible a choice now, especially considering that most workloads will be large arrays. However, we still might be able to use scoping and the borrow checker to get rid of that overhead, in exchange for slightly more verbose syntax.
  3. For seeing when the AsyncMemcpy's are done, I think Futures are still a nice thing to provide. Calling context.sync() or stream.sync() will probably be useful for proofs of concept and certain workflows, so I also want that as an option, though it might provide only runtime safety instead of compile-time safety.

bheisler commented 5 years ago

AsyncMemcpy cannot take T: DeviceCopy anymore.

Could you elaborate more on this? Why not?

rusch95 commented 5 years ago

Previously, I thought AsyncCopyDestination might be able to take T: DeviceCopy references as values for copy_from and copy_to, just like CopyDestination. However, according to the documentation, it appears that using memory which isn't page-locked will throw errors. I haven't gotten these errors, only silent failures, which is strange, so I might write some more test cases to chase down what's going on there.

khyperia commented 5 years ago

Hey, I'm really interested in this feature! (I'm porting my hobby raytracer to rustacuda)

I'd be completely fine with really low-tech solutions to this problem, just to get the feature out there:

1) Just make copy_{from,to}_async unsafe, and point out in the documentation: "hey, don't free this memory".

2) Make copy_{from,to}_async take an Arc<LockedBuffer>. Then, inside, add the Arc to a Vec that gets cleared whenever the user calls synchronize, or when any (possibly user-provided) CUevent callback fires (in that case, only clear the entries queued before the event).
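
A rough sketch of option 2's bookkeeping (Vec<u8> stands in for LockedBuffer; all names are placeholders):

use std::sync::{Arc, Mutex};

// The stream holds a refcount on every buffer involved in a pending copy,
// so the allocation cannot be freed out from under the driver.
struct AsyncStream {
    keep_alive: Mutex<Vec<Arc<Vec<u8>>>>,
}

impl AsyncStream {
    fn copy_from_async(&self, src: &Arc<Vec<u8>>) {
        // ... the real cuMemcpyAsync would be enqueued here ...
        self.keep_alive.lock().unwrap().push(Arc::clone(src));
    }

    fn synchronize(&self) {
        // ... the real stream synchronize would go here ...
        // Every pending copy has finished, so the refcounts can be dropped.
        self.keep_alive.lock().unwrap().clear();
    }
}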

Something that I can't seem to find any documentation on is the behavior of the driver when a buffer is freed in the middle of work. The driver may already take care of the hard parts of this - cuMemFreeHost may not actually free the buffer until work is complete. In that case, we wouldn't have to worry about any of this.

(I'd be happy to write a PR for option 1, and if you like it, a PR for option 2 given a bit of time)

rusch95 commented 5 years ago

Let me finish up 1. It's pretty much done, with a PR up right now; I just need to rebase it and clean it up a bit more, but I've been slow on that because of the holidays. I'll schedule some time to finish it up by tomorrow.

I'll defer to you on doing 2, since I'll be busy for a while. I think you probably want something more of the form Arc<RwLock<LockedBuffer>> for the safe version, though, where the held locks are released after synchronization.

rusch95 commented 5 years ago

See #20 for the PR I'm writing.

bheisler commented 5 years ago

Hey, I'm really interested in this feature! (I'm porting my hobby raytracer to rustacuda)

Thanks for your interest, and thanks for trying RustaCUDA! Yeah, I'd be interested in pull requests, though rusch95 has already submitted a WIP PR to add an unsafe interface for async memcpy. We may have to iterate a few times to find a good balance of safety, ergonomics and performance for the safe interface.

LutzCle commented 5 years ago

Previously, I thought AsyncCopyDestination might be able to take T: DeviceCopy references as values for copy_from and copy_to, just like CopyDestination. However, according to the documentation, it appears that using memory which isn't page-locked will throw errors. I haven't gotten these errors, only silent failures, which is strange, so I might write some more test cases to chase down what's going on there.

This is outdated information. The documentation you're referencing is for CUDA 2.3.

Modern CUDA versions are able to use any type of memory (both pageable and page-locked) in cuMemcpyAsync(). The documentation makes no comment on page-locked memory anymore. In fact, I've already used pageable memory in a project before.

Please do refer to, e.g., the CUDA 8.0 documentation or later. It would be unfortunate if RustaCUDA were to enforce such outdated limitations using the Rust type system.

rusch95 commented 5 years ago

Previously, my test failures vanished entirely when I switched from pageable to page-locked memory, but sure, I'll look into it. It's possible they resulted from some other issue that the switch to page-locked memory happened to fix.

rusch95 commented 4 years ago

Digging this back up, now that async/await is mostly stabilized, to note that I'll try adding a proper async API.