bheisler / RustaCUDA

Rusty wrapper for the CUDA Driver API
Apache License 2.0

DeviceBuffer allocation doesn't complete before kernel launch? #41

Closed: josephrocca closed this issue 4 years ago

josephrocca commented 4 years ago

I've just started playing with RustaCUDA, and am relatively new to Rust, so apologies if there's something obvious I'm missing here, but it seems like DeviceBuffer allocations are happening asynchronously in my code. Here's a reduced test case (adapted from this blog post):

// main.rs
use rustacuda::launch;
use rustacuda::prelude::*;
use std::error::Error;
use std::ffi::CString;

fn main() -> Result<(), Box<dyn Error>> {

    let _context = rustacuda::quick_init()?;
    let module = Module::load_from_string( &CString::new(include_str!("add.ptx"))? )?;
    let stream = Stream::new(StreamFlags::NON_BLOCKING, None)?;

    let n = 100000;
    let x_host: Vec<f32> = (0..n).map(|n| n as f32).collect();
    let y_host: Vec<f32> = (0..n).map(|n| n as f32).collect();
    let mut result_host: Vec<f32> = (0..n).map(|_| 0.0).collect();

    let mut x_device = DeviceBuffer::from_slice(&x_host)?;
    let mut y_device = DeviceBuffer::from_slice(&y_host)?;
    let mut result_device = DeviceBuffer::from_slice(&result_host)?;

    //std::thread::sleep( std::time::Duration::from_millis(100) );

    unsafe {
        launch!(module.add<<<(14, 14, 1), 512, 0, stream>>>(
            n,
            x_device.as_device_ptr(),
            y_device.as_device_ptr(),
            result_device.as_device_ptr()
        ))?;
    }

    stream.synchronize()?;
    result_device.copy_to(&mut result_host)?;

    println!("x end: {:?}", &x_host[n as usize - 5..]);
    println!("y end: {:?}", &y_host[n as usize - 5..]);
    println!("result end: {:?}", &result_host[n as usize - 5..]);

    Ok(())
}

The kernel just takes two vectors and pair-wise sums them into an output vector:

// add.cu
extern "C" __global__ void add(int n, float *x, float *y, float *out) {
    long blockId = blockIdx.z  *  gridDim.x*gridDim.y
                 + blockIdx.y  *  gridDim.x
                 + blockIdx.x;
    long threadsPerBlock = blockDim.x;
    long i = blockId * threadsPerBlock + threadIdx.x;

    if (i < n) {
        out[i] = x[i] + y[i];
    }
}

Version info:

rustacuda = "0.1"
rustacuda_derive = "0.1"
rustacuda_core = "0.1"

$ nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Sun_Jul_28_19:07:16_PDT_2019
Cuda compilation tools, release 10.1, V10.1.243

Graphics Card: RTX 2070 (mobile/laptop version)

When I compile and run with these commands:

nvcc src/add.cu --ptx -o src/add.ptx --gpu-architecture=compute_75
env LIBRARY_PATH=/usr/local/cuda/lib64 cargo +nightly run --release

There's about a 50% chance that I get this as the output:

x end: [99995.0, 99996.0, 99997.0, 99998.0, 99999.0]
y end: [99995.0, 99996.0, 99997.0, 99998.0, 99999.0]
result end: [199990.0, 199992.0, 199994.0, 199996.0, 199998.0]

And a 50% chance that I get this:

x end: [99995.0, 99996.0, 99997.0, 99998.0, 99999.0]
y end: [99995.0, 99996.0, 99997.0, 99998.0, 99999.0]
result end: [0.0, 0.0, 0.0, 0.0, 0.0]

As you can see in that second output, the kernel's additions don't reach the end of the vector. When that happens, the results typically stop somewhere around element 70,000 of 100,000, give or take 10,000 (seemingly at random within that range).

But if you uncomment the std::thread::sleep line, then everything works fine 100% of the time. So it seems like there's some sort of race condition here?

rusch95 commented 4 years ago

Looks to be a stream issue. If you change the stream from StreamFlags::NON_BLOCKING to StreamFlags::DEFAULT, the race condition disappears.
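
For reference, that's just this one line in your example (everything else unchanged):

// With DEFAULT flags the new stream synchronizes with work issued on the
// legacy default (NULL) stream, which (I think) is where the from_slice copies end up.
let stream = Stream::new(StreamFlags::DEFAULT, None)?;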

rusch95 commented 4 years ago

My hunch is that the slice copies aren't running synchronously but are instead being issued on the default stream. Thus, when you move the kernel from stream 1 to the default stream, it executes in the correct order.
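
If that's right, another workaround (rough, untested sketch, and I'm going from memory on the API) would be to keep the NON_BLOCKING stream but drain everything already queued on the device right before the launch:

// Blocks until all previously issued work on this device, including anything
// on the default stream, has completed (cuCtxSynchronize).
rustacuda::context::CurrentContext::synchronize()?;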

josephrocca commented 4 years ago

Thanks! I'm not sure whether this is expected behavior, so I'll leave it to you or bheisler to close this issue if it is.

bheisler commented 4 years ago

Yeah, I think this is just how CUDA works. Synchronizing different streams can get a bit tricky.
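
For anyone who finds this later: if you want to keep a NON_BLOCKING stream, one option is to queue the copies on that same stream with the async copy methods, which need page-locked host memory. Rough, untested sketch (showing the x buffer only):

use rustacuda::memory::{AsyncCopyDestination, DeviceBuffer, LockedBuffer};

// Page-locked (pinned) host buffer; required for async copies.
let mut x_host = LockedBuffer::new(&0.0f32, n as usize)?;
for (i, v) in x_host.iter_mut().enumerate() {
    *v = i as f32;
}

// Uninitialized device buffer, filled by the async copy below.
let mut x_device = unsafe { DeviceBuffer::uninitialized(n as usize)? };

// The copy is queued on `stream`, so a later launch on the same stream is
// ordered after it. `async_copy_from` is unsafe because the host buffer must
// stay alive and untouched until the copy actually completes.
unsafe { x_device.async_copy_from(&x_host, &stream)? };

// ... launch on `stream` as before, then stream.synchronize() before reading results.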