josephrocca closed this issue 4 years ago
Looks to be a stream issue. If you change the stream from `StreamFlags::NON_BLOCKING` to `StreamFlags::DEFAULT`, the race condition disappears.
My hunch is that the slice copies aren't running synchronously; they run in the default stream, while the kernel runs in your separate stream. So when you move the kernel to the default stream, everything executes in the correct order.
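To make that concrete, here's a minimal host-side sketch (the PTX path and `sum` kernel name are hypothetical, adapted from the RustaCUDA README pattern) showing the other fix implied above: keep the `NON_BLOCKING` stream but synchronize it explicitly before copying results back, so the copy can't race the kernel.

```rust
#[macro_use]
extern crate rustacuda;

use rustacuda::prelude::*;
use std::error::Error;
use std::ffi::CString;

fn main() -> Result<(), Box<dyn Error>> {
    rustacuda::init(CudaFlags::empty())?;
    let device = Device::get_device(0)?;
    let _ctx = Context::create_and_push(ContextFlags::SCHED_AUTO, device)?;

    // Hypothetical PTX module containing a `sum(a, b, out, n)` kernel.
    let ptx = CString::new(include_str!("../resources/sum.ptx"))?;
    let module = Module::load_from_string(&ptx)?;

    // A NON_BLOCKING stream is not implicitly ordered with the default
    // stream, which is where the slice copies run.
    let stream = Stream::new(StreamFlags::NON_BLOCKING, None)?;

    let mut a = DeviceBuffer::from_slice(&[1.0f32; 100_000])?;
    let mut b = DeviceBuffer::from_slice(&[2.0f32; 100_000])?;
    let mut out = DeviceBuffer::from_slice(&[0.0f32; 100_000])?;

    unsafe {
        launch!(module.sum<<<391, 256, 0, stream>>>(
            a.as_device_ptr(),
            b.as_device_ptr(),
            out.as_device_ptr(),
            out.len()
        ))?;
    }

    // The key line: wait for the kernel to finish before reading back.
    // Without this (or StreamFlags::DEFAULT), the copy below can observe
    // a partially written buffer.
    stream.synchronize()?;

    let mut host = vec![0.0f32; 100_000];
    out.copy_to(&mut host[..])?;
    Ok(())
}
```

This needs a CUDA-capable GPU and a compiled PTX file to actually run, so treat it as a sketch of the synchronization pattern rather than a drop-in program.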
Thanks! I'm not sure if this is expected behavior or not so I'll leave it to you or bheisler to close this issue if that's the case.
Yeah, I think this is just how CUDA works. Synchronizing different streams can get a bit tricky.
I've just started playing with RustaCUDA, and I'm relatively new to Rust, so apologies if there's something obvious I'm missing here, but it seems like `DeviceBuffer` allocations are happening asynchronously in my code. Here's a reduced test case (adapted from this blog post). The kernel just takes two vectors and pairwise-sums them into an output vector:
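For reference, a CPU implementation of what the kernel computes (a hypothetical helper, not part of the original test case) that could be used to check the GPU output:

```rust
/// Pairwise sum of two equal-length slices; mirrors the GPU kernel's result.
fn vector_add(a: &[f32], b: &[f32]) -> Vec<f32> {
    assert_eq!(a.len(), b.len(), "input vectors must have the same length");
    a.iter().zip(b.iter()).map(|(x, y)| x + y).collect()
}

fn main() {
    let a = vec![1.0f32; 5];
    let b = vec![2.0f32; 5];
    assert_eq!(vector_add(&a, &b), vec![3.0f32; 5]);
    println!("ok");
}
```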
Version info:
When I compile and run with this command:
There's about a 50% chance that I get this as the output:
And a 50% chance that I get this:
And as you can see in that last example, the kernel additions are not reaching the end of the vector. When it doesn't reach the end it tends to get to around 70,000 out of 100,000 elements, give or take 10,000 (it's seemingly random within that range).
But if you uncomment the `std::thread::sleep` line, then everything works fine 100% of the time. So it seems like there's some sort of race condition here?
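The symptom described above can be sketched in plain Rust (an illustrative analogy, not RustaCUDA code): the "kernel" runs on another thread, and joining it plays the role of stream synchronization. Skipping the join is like copying device memory back while the kernel may still be writing it, which is why a `sleep` papers over the problem.

```rust
use std::thread;

/// Fill a vector with 1..=n on a worker thread, then wait for it to finish.
/// The join is the analogue of `stream.synchronize()`: without it, the
/// caller has no guarantee the writes completed before reading.
fn fill_on_worker(n: usize) -> Vec<u32> {
    let handle = thread::spawn(move || {
        let mut v = vec![0u32; n];
        for (i, slot) in v.iter_mut().enumerate() {
            *slot = i as u32 + 1; // the "kernel" writes each element
        }
        v
    });
    // Remove this join and rely on a sleep instead, and you'd have the
    // same kind of timing-dependent bug: it usually works, until it doesn't.
    handle.join().unwrap()
}

fn main() {
    let v = fill_on_worker(100_000);
    assert!(v.iter().enumerate().all(|(i, &x)| x == i as u32 + 1));
    println!("all {} elements written", v.len());
}
```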