Training fails on device 1

Hi community,

I have 2 GPUs in one node. Training succeeds on GPU 0 but fails on GPU 1 with the following error:

    /opt/product/DINO/src/dino/models/dino/ops/modules/ms_deform_attn.py:119 in forward

      116             output = MSDeformAttnFunction.apply(
      117             value.to(torch.float32), input_spatial_shapes, input_level_start_index, samp
      118             output = output.to(torch.float16)
    ❱ 119             output = self.output_proj(output)
      120             return output
      121

      111             init.uniform_(self.bias, -bound, bound)
      112
      113     def forward(self, input: Tensor) -> Tensor:
    ❱ 114         return F.linear(input, self.weight, self.bias)
      115
      116     def extra_repr(self) -> str:
      117         return 'in_features={}, out_features={}, bias={}'.format(

    locals:
      input = <repr-error 'CUDA error: an illegal memory access was encountered\nCUDA kernel errors might be asynchronously reported at some other API call,so the stacktrace below might be incorrect.\nFor debugging consider passing CUDA_LAUNCH_BLOCKING=1.'>
      self = Linear(in_features=256, out_features=256, bias=True)

    RuntimeError: CUDA error: CUBLAS_STATUS_EXECUTION_FAILED when calling cublasLtMatmul( ltHandle, computeDesc.descriptor(), &alpha_val, mat1_ptr, Adesc.descriptor(), mat2_ptr, Bdesc.descriptor(), &beta_val, result_ptr, Cdesc.descriptor(), result_ptr, Cdesc.descriptor(), &heuristicResult.algo, workspace.data_ptr(), workspaceSize, at::cuda::getCurrentCUDAStream())

The error seems to come from the C++ code behind `ms_deform_attn`. Please help.
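
The error message itself suggests rerunning with `CUDA_LAUNCH_BLOCKING=1` so that the stack trace points at the kernel that actually faulted. Below is a minimal sketch of one way to prepare such a run while exposing only the second GPU. The helper name `build_debug_env` is hypothetical and not part of DINO; it only sets two standard CUDA environment variables, and whether this changes the outcome has not been verified here.

```python
import os


def build_debug_env(gpu_index, base_env=None):
    """Return a copy of the environment that exposes a single GPU
    and makes CUDA kernel launches synchronous (hypothetical helper)."""
    env = dict(os.environ if base_env is None else base_env)
    # Only the selected physical GPU is visible to the child process.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    # Synchronous launches, as recommended by the CUDA error message.
    env["CUDA_LAUNCH_BLOCKING"] = "1"
    return env


env = build_debug_env(1)
print(env["CUDA_VISIBLE_DEVICES"], env["CUDA_LAUNCH_BLOCKING"])
```

The resulting `env` can be passed to `subprocess.run(..., env=env)` together with the usual training command. Note that with `CUDA_VISIBLE_DEVICES=1`, physical GPU 1 appears as `cuda:0` inside the process, so the training script would have to address device 0 in that run.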

IDEA-Research/DINO, issue #215