fff-rs / juice

The Hacker's Machine Learning Engine

cuda-memcheck: "Address ... is out of bounds" #169

Closed hweom closed 2 years ago

hweom commented 2 years ago

Describe the bug

cuda-memcheck reports a continuous stream of errors on `example-mnist-classification`, like this:

========= Invalid __global__ write of size 4
=========     at 0x00001780 in void copy_kernel<float>(cublasCopyParams<float>)
=========     by thread (191,0,0) in block (0,0,0)
=========     Address 0x7fd319043efc is out of bounds

To Reproduce

Steps to reproduce the behavior:

  1. cargo build
  2. cuda-memcheck target/debug/example-mnist-classification mnist linear

Expected behavior

No errors.

Please complete the following information:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1936      G   /usr/lib/Xorg                       4MiB |
+-----------------------------------------------------------------------------+



Additional context

Note that running `example-mnist-classification` _without_ `cuda-memcheck` works just fine and is able to converge. I only discovered this while working on #159 where doing training with CUDA does crash with `CUDA_ERROR_ILLEGAL_ADDRESS` when trying to copy from GPU to host. Not sure it's the same issue, but seems related.
hweom commented 2 years ago

Looks like the `copy()` function for CUDA doesn't check that the destination is large enough (here).

By adding a check like this:

macro_rules! iblas_copy_for_cuda {
    ($t:ident) => {
        fn copy(
            &self,
            x: &SharedTensor<$t>,
            y: &mut SharedTensor<$t>,
        ) -> Result<(), ::coaster::error::Error> {
            // New check: source and destination must hold the same number of elements.
            assert_eq!(x.desc().size(), y.desc().size());
            // ... the existing cuBLAS copy call follows unchanged ...
        }
    };
}
We now get a panic:

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `300`,
 right: `10`', coaster-blas/src/frameworks/cuda/mod.rs:23:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:142:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181:5
   4: coaster_blas::frameworks::cuda::<impl coaster_blas::plugin::Copy<f32> for coaster::backend::Backend<coaster::frameworks::cuda::Cuda>>::copy
             at ./coaster-blas/src/frameworks/cuda/helper.rs:109:13
   5: <juice::layers::common::linear::Linear as juice::layer::ComputeParametersGradient<f32,B>>::compute_parameters_gradient
             at ./juice/src/layers/common/linear.rs:220:9

So we're trying to copy 300 floats into a tensor of size 10, and it happens in the Linear layer's `compute_parameters_gradient` here.
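
For illustration, here is a minimal, self-contained sketch of the same guard written as a recoverable error instead of a panic. `DeviceBuffer` and `unchecked_device_copy` are hypothetical stand-ins, not coaster or cuBLAS APIs; only the length comparison mirrors the check added above.

// Hypothetical stand-in for a device buffer; the real type in coaster is SharedTensor.
struct DeviceBuffer {
    len: usize,
}

// Stand-in for the raw cuBLAS copy, which trusts the caller on sizes and will
// write past the end of a too-small destination (the memcheck report above).
fn unchecked_device_copy(_src: &DeviceBuffer, _dst: &mut DeviceBuffer) {}

// The guard: reject the copy on the host when the element counts differ,
// instead of letting the kernel write out of bounds.
fn checked_copy(src: &DeviceBuffer, dst: &mut DeviceBuffer) -> Result<(), String> {
    if src.len != dst.len {
        return Err(format!(
            "copy size mismatch: src has {} elements, dst has {}",
            src.len, dst.len
        ));
    }
    unchecked_device_copy(src, dst);
    Ok(())
}

fn main() {
    // Mirrors the numbers from the backtrace: 300 source floats, 10 destination floats.
    let src = DeviceBuffer { len: 300 };
    let mut dst = DeviceBuffer { len: 10 };
    assert!(checked_copy(&src, &mut dst).is_err());
}

Whether the plugin should panic (as the `assert_eq!` above does) or return an error is a separate design question; the point is that the size mismatch gets caught on the host before any kernel launch.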