JuliaML / MLUtils.jl

Utilities and abstractions for Machine Learning tasks
MIT License

`eachobsparallel` CUDA `Error while freeing...` #161

Closed nikopj closed 10 months ago

nikopj commented 1 year ago

There seems to be a bug when using a parallel dataloader and transferring to the GPU. ~~It's a bit difficult to reproduce / not consistent every run (because of multithreading, I suppose). It seems to involve heavy FileIO + CUDA in a for loop.~~ I've narrowed it down to `eachobsparallel`: whether it fails is a function of the batch size and the number of threads. If the batch size is not sizeably larger than the number of threads (roughly 2x), the CUDA free error pops up within 1-3 data loops.

In my tests, the MWE (`dl_test.jl`, below) produces an error according to this table:

| nthreads | batchsize | executor | result |
| --- | --- | --- | --- |
| 2 | 1 | ThreadedEx | works |
| 2 | 2 | ThreadedEx | works |
| 2 | 4 | ThreadedEx | works |
| 2 | 8 | ThreadedEx | works |
| 4 | 1 | ThreadedEx | works |
| 4 | 2 | ThreadedEx | works |
| 4 | 4 | ThreadedEx | FAILS |
| 4 | 8 | ThreadedEx | FAILS |
| 4 | 16 | ThreadedEx | works |
| 8 | 8 | ThreadedEx | FAILS |
| 8 | 16 | ThreadedEx | works |
| 2 | 1 | TaskPoolEx | works |
| 2 | 2 | TaskPoolEx | works |
| 2 | 4 | TaskPoolEx | works |
| 2 | 8 | TaskPoolEx | works |
| 4 | 1 | TaskPoolEx | works |
| 4 | 2 | TaskPoolEx | works |
| 4 | 4 | TaskPoolEx | works |
| 4 | 8 | TaskPoolEx | works |
| 4 | 16 | TaskPoolEx | works |
| 8 | 8 | TaskPoolEx | works |
| 8 | 16 | TaskPoolEx | works |

This is on a 16-core CPU with 64 GB of memory.
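
One mitigation suggested by the table is keeping the batch size well above the thread count when using `ThreadedEx`. A minimal sketch of that rule of thumb (placeholder array data standing in for the real dataset):

```julia
# Sketch of the "batchsize at least ~2x nthreads" pattern from the table above.
# Placeholder data: 128x128x3 observations along the last dimension.
using MLUtils
using FLoops
using FLoops.Transducers: ThreadedEx

data = randn(Float32, 128, 128, 3, 512)
batchsize = max(2 * Threads.nthreads(), 2)
ds = MLUtils.BatchView(data; batchsize, partial=false, collate=true)
dl = MLUtils.eachobsparallel(ds; executor=ThreadedEx())
```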

(`dl_test.jl`)

```julia
using MLUtils, CUDA

using FLoops
using FLoops.Transducers: ThreadedEx
using FoldsThreads: TaskPoolEx
import Base: length, getindex

BATCHSIZE = parse(Int, ARGS[1])

# Dummy Dataset
struct DummyDS
    num
end
function getindex(data::DummyDS, idx::Int)
    return randn(Float32, 128, 128, 3)
end
length(data::DummyDS) = data.num

ds = MLUtils.BatchView(DummyDS(5000); batchsize=BATCHSIZE, partial=false, collate=true)
dl = MLUtils.eachobsparallel(ds; executor=ThreadedEx())

function dummyloss(x)
    y = randn_like(x)
    return sum(abs, x - y)
end

function data_loop(loader)
    loss = 0
    for x in loader
        x = cu(x)
        loss += dummyloss(x)
        CUDA.unsafe_free!(x)
    end
    return nothing
end

for i = 1:20
    @time data_loop(dl)
end
```

Here's the accompanying error, for example when I run `julia --project -t 8 dl_test.jl 8`. The same error repeats many, many times.

```
WARNING: Error while freeing DeviceBuffer(1.500 MiB at 0x000014ca65000000):
UndefRefError()

Stacktrace:
  [1] current_device
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/lib/cudadrv/devices.jl:24 [inlined]
  [2] #_free#998
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/src/pool.jl:485 [inlined]
  [3] _free
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/src/pool.jl:479 [inlined]
  [4] macro expansion
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/src/pool.jl:464 [inlined]
  [5] macro expansion
    @ ./timing.jl:393 [inlined]
  [6] #free#997
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/src/pool.jl:463 [inlined]
  [7] free
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/src/pool.jl:452 [inlined]
  [8] (::CUDA.var"#1004#1005"{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CuStream})()
    @ CUDA /scratch/npj226/.julia/packages/CUDA/tVtYo/src/array.jl:130
  [9] #context!#887
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/lib/cudadrv/state.jl:170 [inlined]
 [10] context!
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/lib/cudadrv/state.jl:165 [inlined]
 [11] unsafe_free!(xs::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, stream::CuStream)
    @ CUDA /scratch/npj226/.julia/packages/CUDA/tVtYo/src/array.jl:129
 [12] unsafe_finalize!(xs::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
    @ CUDA /scratch/npj226/.julia/packages/CUDA/tVtYo/src/array.jl:150
 [13] Array
    @ ./boot.jl:477 [inlined]
 [14] getindex
    @ ./array.jl:400 [inlined]
 [15] show_datatype
    @ ./show.jl:1058 [inlined]
 [16] _show_type(io::IOContext{IOBuffer}, x::Type)
    @ Base ./show.jl:958
 [17] show(io::IOContext{IOBuffer}, x::Type)
    @ Base ./show.jl:950
 [18] show_typeparams(io::IOContext{IOBuffer}, env::Core.SimpleVector, orig::Core.SimpleVector, wheres::Vector{TypeVar})
    @ Base ./show.jl:707
 [19] show_datatype(io::IOContext{IOBuffer}, x::DataType, wheres::Vector{TypeVar})
    @ Base ./show.jl:1092
--- the last 5 lines are repeated 4 more times ---
 [40] show_datatype
    @ ./show.jl:1058 [inlined]
 [41] _show_type(io::IOContext{IOBuffer}, x::Type)
    @ Base ./show.jl:958
 [42] show(io::IOContext{IOBuffer}, x::Type)
    @ Base ./show.jl:950
 [43] print(io::IOContext{IOBuffer}, x::Type)
    @ Base ./strings/io.jl:35
 [44] print(::IOContext{IOBuffer}, ::String, ::Type, ::Vararg{Any})
    @ Base ./strings/io.jl:46
 [45] #with_output_color#962
    @ ./util.jl:76
 [46] printstyled(::IOContext{Core.CoreSTDOUT}, ::String, ::Vararg{Any}; bold::Bool, underline::Bool, blink::Bool, reverse::Bool, hidden::Bool, color::Symbol)
    @ Base ./util.jl:130
 [47] #print_within_stacktrace#538
    @ ./show.jl:2435
 [48] print_within_stacktrace
    @ ./show.jl:2433 [inlined]
 [49] show_signature_function
    @ ./show.jl:2427
 [50] #show_tuple_as_call#539
    @ ./show.jl:2459
 [51] show_tuple_as_call
    @ ./show.jl:2441 [inlined]
 [52] show_spec_linfo
    @ ./stacktraces.jl:244
 [53] print_stackframe
    @ ./errorshow.jl:730
 [54] print_stackframe
    @ ./errorshow.jl:695
 [55] #show_full_backtrace#921
    @ ./errorshow.jl:594
 [56] show_full_backtrace
    @ ./errorshow.jl:587 [inlined]
 [57] show_backtrace
    @ ./errorshow.jl:791
 [58] #free#997
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/src/pool.jl:473 [inlined]
 [59] free
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/src/pool.jl:452 [inlined]
 [60] (::CUDA.var"#1004#1005"{CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, CuStream})()
    @ CUDA /scratch/npj226/.julia/packages/CUDA/tVtYo/src/array.jl:130
 [61] #context!#887
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/lib/cudadrv/state.jl:170 [inlined]
 [62] context!
    @ /scratch/npj226/.julia/packages/CUDA/tVtYo/lib/cudadrv/state.jl:165 [inlined]
 [63] unsafe_free!(xs::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer}, stream::CuStream)
    @ CUDA /scratch/npj226/.julia/packages/CUDA/tVtYo/src/array.jl:129
 [64] unsafe_finalize!(xs::CuArray{Float32, 4, CUDA.Mem.DeviceBuffer})
    @ CUDA /scratch/npj226/.julia/packages/CUDA/tVtYo/src/array.jl:150
 [65] Array
    @ ./boot.jl:489 [inlined]
 [66] similar
    @ ./array.jl:374 [inlined]
 [67] similar
    @ ./abstractarray.jl:838 [inlined]
 [68] _typed_stack(::Colon, ::Type{Float32}, ::Type{Array{Float32, 3}}, A::Vector{Array{Float32, 3}}, Aax::Tuple{Base.OneTo{Int64}})
    @ Base ./abstractarray.jl:2797
 [69] _typed_stack
    @ ./abstractarray.jl:2793 [inlined]
 [70] _stack
    @ ./abstractarray.jl:2783 [inlined]
 [71] _stack
    @ ./abstractarray.jl:2775 [inlined]
 [72] #stack#178
    @ ./abstractarray.jl:2743 [inlined]
 [73] stack
    @ ./abstractarray.jl:2743 [inlined]
 [74] batch
    @ /scratch/npj226/.julia/dev/MLUtils/src/utils.jl:367 [inlined]
 [75] _getbatch(A::BatchView{Array{Float32, 4}, DummyDS, Val{true}}, obsindices::UnitRange{Int64})
    @ MLUtils /scratch/npj226/.julia/dev/MLUtils/src/batchview.jl:138
 [76] getindex
    @ /scratch/npj226/.julia/dev/MLUtils/src/batchview.jl:129 [inlined]
 [77] getobs(::Type{SimpleTraits.Not{MLUtils.IsTable{BatchView{Array{Float32, 4}, DummyDS, Val{true}}}}}, data::BatchView{Array{Float32, 4}, DummyDS, Val{true}}, idx::Int64)
    @ MLUtils /scratch/npj226/.julia/dev/MLUtils/src/observation.jl:110
 [78] getobs
    @ /scratch/npj226/.julia/packages/SimpleTraits/l1ZsK/src/SimpleTraits.jl:331 [inlined]
 [79] (::MLUtils.var"#58#59"{BatchView{Array{Float32, 4}, DummyDS, Val{true}}})(ch::Channel{Any}, i::Int64)
    @ MLUtils /scratch/npj226/.julia/dev/MLUtils/src/parallel.jl:66
 [80] macro expansion
    @ /scratch/npj226/.julia/dev/MLUtils/src/parallel.jl:124 [inlined]
 [81] ##reducing_function#293#68
    @ /scratch/npj226/.julia/packages/FLoops/6PVny/src/reduce.jl:817 [inlined]
 [82] (::InitialValues.AdjoinIdentity{MLUtils.var"##reducing_function#293#68"{MLUtils.Loader, Channel{Any}}})(x::Tuple{}, y::Int64)
    @ InitialValues /scratch/npj226/.julia/packages/InitialValues/OWP8V/src/InitialValues.jl:306
 [83] next
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/combinators.jl:290 [inlined]
 [84] next
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/core.jl:289 [inlined]
 [85] macro expansion
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/core.jl:181 [inlined]
 [86] macro expansion
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/processes.jl:199 [inlined]
 [87] macro expansion
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/simd.jl:41 [inlined]
 [88] _foldl_linear_bulk
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/processes.jl:198 [inlined]
 [89] macro expansion
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/processes.jl:192 [inlined]
 [90] macro expansion
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/basics.jl:115 [inlined]
 [91] _foldl_array
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/processes.jl:188 [inlined]
 [92] __foldl__
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/processes.jl:182 [inlined]
 [93] foldl_basecase
    @ /scratch/npj226/.julia/packages/Transducers/yTXrD/src/processes.jl:365 [inlined]
 [94] _reduce_basecase(rf::Transducers.BottomRF{Transducers.AdHocRF{MLUtils.var"##oninit_function#292#67", typeof(identity), InitialValues.AdjoinIdentity{MLUtils.var"##reducing_function#293#68"{MLUtils.Loader, Channel{Any}}}, typeof(identity), typeof(identity), MLUtils.var"##combine_function#294#69"}}, init::Transducers.InitOf{Transducers.DefaultInitOf}, reducible::Transducers.SizedReducible{UnitRange{Int64}, Int64})
    @ Transducers /scratch/npj226/.julia/packages/Transducers/yTXrD/src/threading_utils.jl:58
 [95] _reduce(ctx::Transducers.NoopDACContext, rf::Transducers.BottomRF{Transducers.AdHocRF{MLUtils.var"##oninit_function#292#67", typeof(identity), InitialValues.AdjoinIdentity{MLUtils.var"##reducing_function#293#68"{MLUtils.Loader, Channel{Any}}}, typeof(identity), typeof(identity), MLUtils.var"##combine_function#294#69"}}, init::Transducers.InitOf{Transducers.DefaultInitOf}, reducible::Transducers.SizedReducible{UnitRange{Int64}, Int64})
    @ Transducers /scratch/npj226/.julia/packages/Transducers/yTXrD/src/reduce.jl:150
```

Here's the output of `CUDA.versioninfo()` for reference:

```
CUDA runtime 11.8, artifact installation
CUDA driver 11.8
NVIDIA driver 520.61.5

CUDA libraries:
- CUBLAS: 11.11.3
- CURAND: 10.3.0
- CUFFT: 10.9.0
- CUSOLVER: 11.4.1
- CUSPARSE: 11.7.5
- CUPTI: 18.0.0
- NVML: 11.0.0+520.61.5

Julia packages:
- CUDA: 4.4.0
- CUDA_Driver_jll: 0.5.0+1
- CUDA_Runtime_jll: 0.6.0+0

Toolchain:
- Julia: 1.9.2
- LLVM: 14.0.6
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2, 7.3, 7.4, 7.5
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86

Environment:
- JULIA_CUDA_MEMORY_POOL: none

1 device:
  0: Quadro RTX 8000 (sm_75, 44.485 GiB / 45.000 GiB available)
```

And the package versions I'm using (`] status`):

```
  [052768ef] CUDA v4.4.0
  [cc61a311] FLoops v0.2.1
  [9c68100b] FoldsThreads v0.1.2
  [f1d291b0] MLUtils v0.4.3
```

nikopj commented 1 year ago

Note that the hanging issue described in #142 is still present with TaskPoolEx, but at least it runs!
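
For reference, the working configuration is just an executor swap relative to the MWE; a sketch reusing the `DummyDS` / `BATCHSIZE` definitions from `dl_test.jl` above:

```julia
using MLUtils
using FoldsThreads: TaskPoolEx

# Assumes DummyDS and BATCHSIZE are defined as in dl_test.jl; only the
# executor changes, which avoids the free error in the table above.
ds = MLUtils.BatchView(DummyDS(5000); batchsize=BATCHSIZE, partial=false, collate=true)
dl = MLUtils.eachobsparallel(ds; executor=TaskPoolEx())
```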

ToucheSir commented 1 year ago

I was very confused at first, but it appears the actual error is masked by the catch-block handling in https://github.com/JuliaGPU/CUDA.jl/blob/v4.4.0/src/pool.jl#L472-L474, which itself errors when trying to print the stacktrace. Can you change that to `rethrow()`, or remove the catch block entirely, to see what the root error is?
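
(Purely to illustrate the masking pattern being described, not the actual CUDA.jl code: if the reporting inside a catch block itself throws, the warning reflects that secondary failure and the root cause is lost, whereas `rethrow()` surfaces the original exception.)

```julia
# Illustrative sketch only; the real logic lives in CUDA.jl's pool.jl linked above.
function free_with_reporting(f)
    try
        f()
    catch err
        # If building or printing this report throws (e.g. while showing a type
        # in the backtrace), the original error `err` is never displayed.
        @warn "Error while freeing" exception = (err, catch_backtrace())
        # Suggested diagnostic: rethrow() here so the root cause surfaces instead.
        # rethrow()
    end
end
```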

ToucheSir commented 1 year ago

I did a good deal more digging on this, and after asking around it seems to be an issue on the CUDA.jl side. Will update this issue with more details as I get them.

nikopj commented 10 months ago

This appears to be fixed on my end now with the upgraded CUDA version!