fff-rs / juice

The Hacker's Machine Learning Engine

cuda-memcheck: "Address ... is out of bounds" #169

Closed hweom closed 2 years ago

hweom commented 2 years ago

Describe the bug

cuda-memcheck reports a continuous stream of errors on `example-mnist-classification`, like this:

========= Invalid __global__ write of size 4
=========     at 0x00001780 in void copy_kernel<float>(cublasCopyParams<float>)
=========     by thread (191,0,0) in block (0,0,0)
=========     Address 0x7fd319043efc is out of bounds

To Reproduce

Steps to reproduce the behavior:

  1. cargo build
  2. cuda-memcheck target/debug/example-mnist-classification mnist linear

Expected behavior

No errors.

Please complete the following information:

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1936      G   /usr/lib/Xorg                       4MiB |
+-----------------------------------------------------------------------------+



Additional context

Note that running `example-mnist-classification` _without_ `cuda-memcheck` works just fine and is able to converge. I only discovered this while working on #159 where doing training with CUDA does crash with `CUDA_ERROR_ILLEGAL_ADDRESS` when trying to copy from GPU to host. Not sure it's the same issue, but seems related.
hweom commented 2 years ago

Looks like the `copy()` function for CUDA doesn't check that the destination is large enough (here).

By adding a check like this:

macro_rules! iblas_copy_for_cuda {
    ($t:ident) => {
        fn copy(
            &self,
            x: &SharedTensor<$t>,
            y: &mut SharedTensor<$t>,
        ) -> Result<(), ::coaster::error::Error> {
            // New check: source and destination must hold the same number of elements.
            assert_eq!(x.desc().size(), y.desc().size());
            // ... the existing cuBLAS copy call follows unchanged ...
        }
    };
}
We now get a panic:

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `300`,
 right: `10`', coaster-blas/src/frameworks/cuda/mod.rs:23:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:142:14
   2: core::panicking::assert_failed_inner
   3: core::panicking::assert_failed
             at /rustc/e092d0b6b43f2de967af0887873151bb1c0b18d3/library/core/src/panicking.rs:181:5
   4: coaster_blas::frameworks::cuda::<impl coaster_blas::plugin::Copy<f32> for coaster::backend::Backend<coaster::frameworks::cuda::Cuda>>::copy
             at ./coaster-blas/src/frameworks/cuda/helper.rs:109:13
   5: <juice::layers::common::linear::Linear as juice::layer::ComputeParametersGradient<f32,B>>::compute_parameters_gradient
             at ./juice/src/layers/common/linear.rs:220:9

So we're trying to copy 300 floats into a tensor of size 10, and it happens in the Linear layer's `compute_parameters_gradient` here.
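
For illustration, here is a minimal, self-contained sketch of the same guard written as a recoverable error instead of a panic. `DeviceBuffer` and `unchecked_device_copy` are hypothetical stand-ins, not coaster or cuBLAS APIs; only the length comparison mirrors the check added above.

// Hypothetical stand-in for a device buffer; the real type in coaster is SharedTensor.
struct DeviceBuffer {
    len: usize,
}

// Stand-in for the raw cuBLAS copy, which trusts the caller on sizes and will
// write past the end of a too-small destination (the memcheck report above).
fn unchecked_device_copy(_src: &DeviceBuffer, _dst: &mut DeviceBuffer) {}

// The guard: reject the copy on the host when the element counts differ,
// instead of letting the kernel write out of bounds.
fn checked_copy(src: &DeviceBuffer, dst: &mut DeviceBuffer) -> Result<(), String> {
    if src.len != dst.len {
        return Err(format!(
            "copy size mismatch: src has {} elements, dst has {}",
            src.len, dst.len
        ));
    }
    unchecked_device_copy(src, dst);
    Ok(())
}

fn main() {
    // Mirrors the numbers from the backtrace: 300 source floats, 10 destination floats.
    let src = DeviceBuffer { len: 300 };
    let mut dst = DeviceBuffer { len: 10 };
    assert!(checked_copy(&src, &mut dst).is_err());
}

Whether the plugin should panic (as the `assert_eq!` above does) or return an error is a separate design question; the point is that the size mismatch gets caught on the host before any kernel launch.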