JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

Mixed eltype contraction failing with CuTensor #2349

Open kmp5VT opened 2 months ago

kmp5VT commented 2 months ago

Describe the bug

Some mixed element-type contractions fail when contracting with cuTENSOR.

To reproduce

The Minimal Working Example (MWE) for this bug:

using CUDA, cuTENSOR

a = CuArray{Float32}(undef, (10,10))
b = CuArray{ComplexF32}(undef, (10, 5))
c = a * b ## works fine regardless of mixed type

CT = CuTensor(a, [1,2]) * CuTensor(b, [2,3])
ERROR: KeyError: key (Float32, ComplexF32, ComplexF32) not found

CT = CuTensor(b, [2,3]) * CuTensor(a, [1,2])
ERROR: KeyError: key (ComplexF32, Float32, ComplexF32) not found 

a = CuArray{Float64}(undef, (10,10))
b = CuArray{ComplexF32}(undef, (10, 5))

CT = CuTensor(a, [1,2]) * CuTensor(b, [2,3])
ERROR: KeyError: key (Float64, ComplexF32, ComplexF64) not found

b = CuArray{ComplexF64}(undef, (10, 5))
CT = CuTensor(a, [1,2]) * CuTensor(b, [2,3]) ## works with no issue

All of these contractions work without issue when the CuArrays are not wrapped in CuTensors.
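For comparison, a minimal check of the unwrapped case (illustrative sketch only; the result eltype should be the promoted complex type, since the plain `*` path handles the promotion itself):

```julia
using CUDA

a = CuArray{Float32}(undef, (10, 10))
b = CuArray{ComplexF32}(undef, (10, 5))

c = a * b    # no CuTensor wrapping; works fine
eltype(c)    # expected: ComplexF32, the promoted element type
```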

Manifest.toml

```
[052768ef] CUDA v5.3.1
[46192b85] GPUArraysCore v0.1.6
[011b41b2] cuTENSOR v2.1.0
```

Version info

Details on Julia:

Julia Version 1.10.2
Commit bd47eca2c8a (2024-03-01 10:14 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 32 × Intel(R) Xeon(R) Gold 6244 CPU @ 3.60GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, cascadelake)
Threads: 1 default, 0 interactive, 1 GC (on 32 virtual cores)

Details on CUDA:

julia> CUDA.versioninfo()
CUDA runtime 12.4, artifact installation
CUDA driver 12.4
NVIDIA driver 535.154.5, originally for CUDA 12.2

CUDA libraries: 
- CUBLAS: 12.4.5
- CURAND: 10.3.5
- CUFFT: 11.2.1
- CUSOLVER: 11.6.1
- CUSPARSE: 12.3.1
- CUPTI: 22.0.0
- NVML: 12.0.0+535.154.5

Julia packages: 
- CUDA: 5.3.0
- CUDA_Driver_jll: 0.8.1+0
- CUDA_Runtime_jll: 0.12.1+0

Toolchain:
- Julia: 1.10.2
- LLVM: 15.0.7

1 device:
  0: NVIDIA RTX A6000 (sm_86, 45.532 GiB / 47.988 GiB available)
tgymnich commented 2 months ago

The contractions in your example won't work out of the box with cuTENSOR; see the table of supported type combinations: https://docs.nvidia.com/cuda/cutensor/latest/api/cutensor.html#contraction-operations

Should we automatically convert Float32 to ComplexF32?
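Until something like that exists, a possible caller-side workaround (a sketch, not part of cuTENSOR.jl's API) is to promote both operands to a common element type from the supported table before wrapping them in CuTensor:

```julia
using CUDA, cuTENSOR

a = CuArray{Float32}(undef, (10, 10))
b = CuArray{ComplexF32}(undef, (10, 5))

# Promote both operands to a common element type that cuTENSOR
# supports for contractions (ComplexF32 here), then wrap and contract.
T = promote_type(eltype(a), eltype(b))

CT = CuTensor(T.(a), [1, 2]) * CuTensor(T.(b), [2, 3])
```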