JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

Exception output from many threads is not helpful #1780

Closed chengchingwen closed 6 months ago

chengchingwen commented 1 year ago

Describe the bug

The mapreduce call below throws an exception during kernel execution, and it crashes Julia if the array is large. This happened on both 1.8.5 and 1.9-beta4.

To reproduce

The Minimal Working Example (MWE) for this bug:

julia> using CUDA

julia> f(x) = (Int32(1), x)
f (generic function with 1 method)

julia> g(a, b) = (a[1] + b[1], b[2] * a[1] + b[1] / a[1])
g (generic function with 1 method)

julia> mapreduce(f, g, CUDA.randn(10, 10, 10); dims=1, init=(one(Int32), zero(Float32)))
ERROR: a exception was thrown during kernel execution.
Stacktrace:
 [1] CuDynamicSharedArray at /home/peter/.julia/packages/CUDA/ZdCxS/src/device/intrinsics/memory_shared.jl:52
 [2] CuDynamicSharedArray at /home/peter/.julia/packages/CUDA/ZdCxS/src/device/intrinsics/memory_shared.jl:61
 [3] reduce_block at /home/peter/.julia/packages/CUDA/ZdCxS/src/mapreduce.jl:57
 [4] partial_mapreduce_grid at /home/peter/.julia/packages/CUDA/ZdCxS/src/mapreduce.jl:126

Manifest.toml

```
[052768ef] CUDA v4.0.1
[1af6417a] CUDA_Runtime_Discovery v0.1.1
[0c68f7d7] GPUArrays v8.6.3
[46192b85] GPUArraysCore v0.1.4
[61eb1bfa] GPUCompiler v0.17.2
[929cbde3] LLVM v4.16.0
⌅ [4ee394cb] CUDA_Driver_jll v0.2.0+0
⌅ [76a88914] CUDA_Runtime_jll v0.2.3+2
[62b44479] CUDNN_jll v8.6.0+3
```

Version info

Details on Julia:

# please post the output of:
versioninfo()
Julia Version 1.9.0-beta4
Commit b75ddb787f (2023-02-07 21:53 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 12 × Intel(R) Core(TM) i7-7800X CPU @ 3.50GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake-avx512)
  Threads: 4 on 12 virtual cores

Details on CUDA:

# please post the output of:
CUDA.versioninfo()
CUDA runtime 11.8, artifact installation
CUDA driver 11.6
NVIDIA driver 510.73.5

Libraries: 
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 11.0.0+510.73.5

Toolchain:
- Julia: 1.9.0-beta4
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

maleadt commented 1 year ago

I don't see a segfault here, just the expected exception reporting.

chengchingwen commented 1 year ago

The segfault happens if you replace CUDA.randn(10, 10, 10) with a larger array, such as CUDA.randn(512, 128, 16).

maleadt commented 1 year ago

In that case you just get a lot of output on your terminal. I assume you hit CTRL-C, which might have killed Julia or CUDA then.

In any case, there isn't much we can do about this, as I/O is currently handled by CUDA. Maybe we could limit output by keeping track of written bytes and capping it, but that doesn't sound very satisfying.
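
The capping idea mentioned above can be sketched on the host side as follows. This is only an illustration of "track written bytes and stop at a limit". `CappedOutput` and `emit!` are hypothetical names that do not exist in CUDA.jl, and the sketch does not touch the device-side printing that CUDA handles.

```julia
# Hypothetical helper, not part of CUDA.jl: forwards text to `io` until
# `limit` bytes have been written, then drops everything that follows.
mutable struct CappedOutput{T<:IO}
    io::T
    limit::Int
    written::Int
end

CappedOutput(io::IO, limit::Integer) = CappedOutput(io, Int(limit), 0)

# Write `msg` if it still fits in the byte budget. Returns `true` when the
# message was written and `false` when it was dropped.
function emit!(out::CappedOutput, msg::AbstractString)
    remaining = out.limit - out.written
    remaining <= 0 && return false
    n = ncodeunits(msg)
    if n <= remaining
        write(out.io, msg)
        out.written += n
        return true
    end
    # The message does not fit: print a one-time notice (which is not counted
    # against the budget) and mark the budget as used up.
    write(out.io, "[further output suppressed]\n")
    out.written = out.limit
    return false
end

# Simulate many threads reporting the same exception.
buf = IOBuffer()
out = CappedOutput(buf, 64)
for thread in 1:1000
    emit!(out, "ERROR: exception in thread $thread\n")
end
s = String(take!(buf))
print(s)
```

With a cap like this, only the first few reports reach the terminal, at the cost of hiding any later report that might differ from the first ones.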

chengchingwen commented 1 year ago

I didn't hit CTRL-C; I waited for the output to stop. It still resulted in a segfault. Another issue is that this error is not always captured. On the machine (I reported above) the error is not shown unless I start julia with -g2.

maleadt commented 1 year ago

Another issue is that this error is not always captured. On the machine (I reported above) the error is not shown unless I start julia with -g2.

That's intentional. If you run without -g2 you get:

ERROR: a exception was thrown during kernel execution.
       Run Julia on debug level 2 for device stack traces.

chengchingwen commented 1 year ago

That's intentional.

Oh, I thought it would give a compile error, but it seems to run successfully and generate the correct result on that machine.

maleadt commented 1 year ago

Oh, I thought it would give a compile error, but it seems to run successfully and generate the correct result on that machine.

We can't generate a compile error because of Julia's dynamic semantics. You should still see a run-time exception though, albeit without a stack trace (you need -g2 for that).

chengchingwen commented 1 year ago

You should still see a run-time exception though

I didn't get any exception on that machine, and on another machine it failed randomly.
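
One thing that can be checked on the CPU, without a GPU: the reduction function g from the MWE does not return the same type as init, because dividing two Int32 values yields a Float64. The thread does not confirm that this is what triggers the kernel exception, so treat it as an assumption worth testing rather than a diagnosis. The variable name `acc` is introduced here for illustration only.

```julia
# Same definitions as in the MWE; no GPU or CUDA.jl required.
f(x) = (Int32(1), x)
g(a, b) = (a[1] + b[1], b[2] * a[1] + b[1] / a[1])

init = (one(Int32), zero(Float32))

# One reduction step: combine the initial value with one mapped element.
acc = g(init, f(1.0f0))

println(typeof(init))  # type of the accumulator going in
println(typeof(acc))   # type of the accumulator coming out
```

If the two printed types differ, the accumulator changes type during the reduction. Converting the result of the division to Float32 inside g would keep the accumulator type stable, which is a cheap thing to try before re-running the GPU version.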