bheisler / RustaCUDA

Rusty wrapper for the CUDA Driver API
Apache License 2.0

Mixing cublas and rustacuda ? #28

Closed zeroexcuses closed 5 years ago

zeroexcuses commented 5 years ago

Can we please have sample code that

  1. allocates some memory

  2. calls A = B * C

  3. calls some kernel on A

  4. calls sgemm D = E * A

? I have some tensor code that runs great in CPU mode but fails in GPU mode (so the algorithm is correct). All CPU vs GPU unit tests pass -- so it seems I am running into a synchronization issue.

I am calling stream.synchronize after every kernel call -- so it seems the remaining culprit is that the kernels run on streamA while cublas runs on streamB ... and it's not clear to me how to synchronize the two.

rusch95 commented 5 years ago

Could you post a code snippet showing off this failure? I can hack on that.

zeroexcuses commented 5 years ago

I think I got it working via the following changes:

  1. I made Stream's 'inner CUstream' pub:

        #[derive(Debug)]
        pub struct Stream {
            pub inner: CUstream,
        }

  2. I initialize the cublas handle by calling

        unsafe {
            cublas::cublasSetStream_v2(gblas_handle.handle, stream.inner
                as *mut cuda_sys::cudart::CUstream_st );
        }

This appears to cause the blas to run on the same stream as the kernels.

However, I'm a bit uneasy, as I'm brute-force casting a sys::cuda::CUstream to a sys::cudart::CUstream.

I'm not sure about the difference between the two.
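
On the cast itself: in the CUDA headers, the driver API's CUstream and the runtime API's cudaStream_t are both typedefs for a pointer to the same opaque struct CUstream_st, so the cast should only change the pointee type as Rust sees it, not the handle value. Below is a minimal sketch of that idea. It does not touch CUDA at all: the `driver` and `runtime` modules are hypothetical stand-ins for the two generated binding modules (which presumably each declare their own opaque CUstream_st), and the handle is a made-up address.

```rust
// Hypothetical stand-in for the driver-API bindings (not the real cuda-sys module).
#[allow(non_camel_case_types, dead_code)]
mod driver {
    // Opaque, zero-sized struct: only ever used behind a pointer.
    #[repr(C)]
    pub struct CUstream_st {
        _private: [u8; 0],
    }
    pub type CUstream = *mut CUstream_st;
}

// Hypothetical stand-in for the runtime-API bindings, with its own
// separately declared opaque struct of the same name.
#[allow(non_camel_case_types, dead_code)]
mod runtime {
    #[repr(C)]
    pub struct CUstream_st {
        _private: [u8; 0],
    }
    pub type cudaStream_t = *mut CUstream_st;
}

/// Reinterpret a driver-API stream handle as a runtime-API stream handle.
/// Only the pointee type changes; the address is left untouched.
fn as_runtime_stream(s: driver::CUstream) -> runtime::cudaStream_t {
    s as runtime::cudaStream_t
}

fn main() {
    // A made-up handle value; a real one would come from stream creation.
    let fake: driver::CUstream = 0x1000usize as driver::CUstream;
    let converted = as_runtime_stream(fake);

    // The cast preserves the address, which is all the C side ever sees.
    assert_eq!(fake as usize, converted as usize);
    println!("handle after cast: {:#x}", converted as usize);
}
```

If that assumption about the bindings holds, wrapping the cast in one small helper like `as_runtime_stream` at least keeps it in a single place instead of scattering `as` casts around every cublas call.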