JuliaGPU / CUDA.jl

CUDA programming in Julia.
https://juliagpu.org/cuda/

Support for JLD2 #1833

Closed denizyuret closed 10 months ago

denizyuret commented 1 year ago

Here is what I do to be able to save/load CuArrays with JLD2 files:

using CUDA
import JLD2, FileIO
struct JLD2CuArray{T,N}; array::Array{T,N}; end                                                                                              
JLD2.writeas(::Type{CuArray{T,N,D}}) where {T,N,D} = JLD2CuArray{T,N}                                                                        
JLD2.wconvert(::Type{JLD2CuArray{T,N}}, x::CuArray{T,N,D}) where {T,N,D} = JLD2CuArray(Array(x))                                             
JLD2.rconvert(::Type{CuArray{T,N,D}}, x::JLD2CuArray{T,N}) where {T,N,D} = CuArray(x.array)                                                  

This used to work with CuArray{T,N} but no longer works with CuArray{T,N,D}. Here is the error I get:

julia> a = CUDA.rand(3,5)
julia> FileIO.save("foo.jld2", "a", a)
julia> d = FileIO.load("foo.jld2")
Dict{String, Any} with 1 entry:
Error showing value of type Dict{String, Any}:
ERROR: CUDA error: invalid argument (code 1, ERROR_INVALID_VALUE)
Stacktrace:                                                                                                                                  
  [1] throw_api_error(res::CUDA.cudaError_enum)                                                                                              
    @ CUDA /userfiles/dyuret/.julia/packages/CUDA/BbliS/lib/cudadrv/error.jl:89                                                              
  [2] macro expansion                                                                                                                        
    @ /userfiles/dyuret/.julia/packages/CUDA/BbliS/lib/cudadrv/error.jl:97 [inlined]                                                         
  [3] cuMemcpyDtoHAsync_v2                                                                                                                   
    @ /userfiles/dyuret/.julia/packages/CUDA/BbliS/lib/utils/call.jl:26 [inlined]                                                            
  [4] #unsafe_copyto!#8                                                                                                                      
    @ /userfiles/dyuret/.julia/packages/CUDA/BbliS/lib/cudadrv/memory.jl:397 [inlined]                                                       
  [5] (::CUDA.var"#189#190"{Float32, Matrix{Float32}, Int64, CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, Int64, Int64})()                    
    @ CUDA /userfiles/dyuret/.julia/packages/CUDA/BbliS/src/array.jl:413                                                                     
  [6] #context!#63                                                                                                                           
    @ /userfiles/dyuret/.julia/packages/CUDA/BbliS/lib/cudadrv/state.jl:164 [inlined]                                                        
  [7] context!                                                                                                                               
    @ /userfiles/dyuret/.julia/packages/CUDA/BbliS/lib/cudadrv/state.jl:159 [inlined]                                                        
  [8] unsafe_copyto!(dest::Matrix{Float32}, doffs::Int64, src::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}, soffs::Int64, n::Int64)           
    @ CUDA /userfiles/dyuret/.julia/packages/CUDA/BbliS/src/array.jl:406                                                                     
  [9] copyto!                                                                                                                                
    @ /userfiles/dyuret/.julia/packages/CUDA/BbliS/src/array.jl:360 [inlined]                                                                
 [10] copyto!                                                                                                                                
    @ /userfiles/dyuret/.julia/packages/CUDA/BbliS/src/array.jl:364 [inlined]                                                                
 [11] copyto_axcheck!(dest::Matrix{Float32}, src::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})                                                
    @ Base ./abstractarray.jl:1127                                                                                                           
 [12] Array                                                                                                                                  
    @ ./array.jl:626 [inlined]                                                                                                               
 [13] Array                                                                                                                                  
    @ ./boot.jl:483 [inlined]                                                                                                                
 [14] convert                                                                                                                                
    @ ./array.jl:617 [inlined]                                                                                                               
 [15] adapt_storage                                                                                                                          
    @ /userfiles/dyuret/.julia/packages/GPUArrays/XR4WO/src/host/abstractarray.jl:23 [inlined]                                               
 [16] adapt_structure                                                                                                                        
    @ /userfiles/dyuret/.julia/packages/Adapt/xviDc/src/Adapt.jl:57 [inlined]                                                                
 [17] adapt                                                                                                                                  
    @ /userfiles/dyuret/.julia/packages/Adapt/xviDc/src/Adapt.jl:40 [inlined]                                                                
 [18] _show_nonempty                                                                                                                         
    @ /userfiles/dyuret/.julia/packages/GPUArrays/XR4WO/src/host/abstractarray.jl:30 [inlined]                                               
 [19] show(io::IOContext{IOBuffer}, X::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer})                                                           
    @ Base ./arrayshow.jl:489                                                                                                                
 [20] sprint(f::Function, args::CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}; context::IOContext{Base.TTY}, sizehint::Int64)                   
    @ Base ./strings/io.jl:112                                                                                                               
 [21] show(io::IOContext{Base.TTY}, #unused#::MIME{Symbol("text/plain")}, t::Dict{String, Any})                                              
    @ Base ./show.jl:112                                                                                                                     

When I compare the original array with the loaded version they seem similar except for the refcount:

julia> dump(a)                                                                                                                               
CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}                                                                                                   
  storage: CUDA.ArrayStorage{CUDA.Mem.DeviceBuffer}                                                                                          
    buffer: CUDA.Mem.DeviceBuffer                                                                                                            
      ctx: CuContext                                                                                                                         
        handle: Ptr{Nothing} @0x0000000002bbbe80                                                                                             
        valid: Bool true                                                                                                                     
      ptr: CuPtr{Nothing} CuPtr{Nothing}(0x0000000200e00000)                                                                                 
      bytesize: Int64 60                                                                                                                     
      async: Bool true                                                                                                                       
    refcount: Base.Threads.Atomic{Int64}                                                                                                     
      value: Int64 1                                                                                                                         
  maxsize: Int64 60                                                                                                                          
  offset: Int64 0                                                                                                                            
  dims: Tuple{Int64, Int64}                                                                                                                  
    1: Int64 3                                                                                                                               
    2: Int64 5                                                                                                                               
julia> dump(d["a"])                                                                                                                          
CuArray{Float32, 2, CUDA.Mem.DeviceBuffer}                                                                                                   
  storage: CUDA.ArrayStorage{CUDA.Mem.DeviceBuffer}                                                                                          
    buffer: CUDA.Mem.DeviceBuffer                                                                                                            
      ctx: CuContext                                                                                                                         
        handle: Ptr{Nothing} @0x0000000002bbbe80                                                                                             
        valid: Bool true                                                                                                                     
      ptr: CuPtr{Nothing} CuPtr{Nothing}(0x0000000200e00200)                                                                                 
      bytesize: Int64 60                                                                                                                     
      async: Bool true                                                                                                                       
    refcount: Base.Threads.Atomic{Int64}                                                                                                     
      value: Int64 0                                                                                                                         
  maxsize: Int64 60                                                                                                                          
  offset: Int64 0                                                                                                                            
  dims: Tuple{Int64, Int64}                                                                                                                  
    1: Int64 3                                                                                                                               
    2: Int64 5                                                                                                                               

Finally, if I assign the value read to a global variable in rconvert, it works without any errors:

julia> JLD2.rconvert(::Type{CuArray{T,N,D}}, x::JLD2CuArray{T,N}) where {T,N,D} = (y=CuArray(x.array); global dbg=y; y)
julia> d = FileIO.load("foo.jld2")
julia> d["a"] # works with no problems
maleadt commented 1 year ago

JLD2 has never really been supported; I guess the fact that it worked was just sheer luck. In any case, I'm not familiar with JLD2, so I'll defer to anybody who is to take a look 🙂

JonasIsensee commented 1 year ago

Hi @denizyuret,

from the perspective of JLD2 your code looks absolutely ok. What versions are you on? I can't reproduce the problem.

denizyuret commented 1 year ago
[052768ef] CUDA v3.13.1 # (haven't upgraded to 4.x yet, but if it solves the JLD2 issue I will)
[5789e2e9] FileIO v1.16.0
[033835bb] JLD2 v0.4.31
julia> CUDA.versioninfo()
CUDA toolkit 11.7, artifact installation
NVIDIA driver 470.57.2, for CUDA 11.4
CUDA driver 11.7

Libraries: 
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 11.0.0+470.57.2
- CUDNN: 8.30.2 (for CUDA 11.5.0)
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86
denizyuret commented 1 year ago

Alas, my hope was short-lived :( I get the same error with CUDA v4.1.2.

JonasIsensee commented 1 year ago

I still can't reproduce your error. (I tried julia 1.8.5 and 1.9.0-rc1 with CUDA 3.13.1 and JLD2 v0.4.31)

denizyuret commented 1 year ago

Can you send your CUDA.versioninfo so I can see what the difference may be? (library/driver version, gpu type etc could be a factor?)

JonasIsensee commented 1 year ago
julia> CUDA.versioninfo()
CUDA toolkit 11.7, artifact installation
NVIDIA driver 515.86.1, for CUDA 11.7
CUDA driver 11.7

Libraries: 
- CUBLAS: 11.10.1
- CURAND: 10.2.10
- CUFFT: 10.7.2
- CUSOLVER: 11.3.5
- CUSPARSE: 11.7.3
- CUPTI: 17.0.0
- NVML: 11.0.0+515.86.1
  Downloaded artifact: CUDNN
- CUDNN: 8.30.2 (for CUDA 11.5.0)
  Downloaded artifact: CUTENSOR
- CUTENSOR: 1.4.0 (for CUDA 11.5.0)

Toolchain:
- Julia: 1.8.5
- LLVM: 13.0.1
- PTX ISA support: 3.2, 4.0, 4.1, 4.2, 4.3, 5.0, 6.0, 6.1, 6.3, 6.4, 6.5, 7.0, 7.1, 7.2
- Device capability support: sm_35, sm_37, sm_50, sm_52, sm_53, sm_60, sm_61, sm_62, sm_70, sm_72, sm_75, sm_80, sm_86
noyongkyoon commented 1 year ago

I tried JLD2.writeas(), JLD2.wconvert(), and JLD2.rconvert() as you suggested. Now I get the following error message:

AssertionError: refcount != 0

Stacktrace:
 [1] _derived_array
   @ ~/.julia/packages/CUDA/BbliS/src/array.jl:729 [inlined]
 [2] reshape(a::CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}, dims::Tuple{Int64})
   @ CUDA ~/.julia/packages/CUDA/BbliS/src/array.jl:723
 [3] reshape
   @ ./reshapedarray.jl:117 [inlined]
 [4] vec(a::CuArray{Float32, 3, CUDA.Mem.DeviceBuffer})
   @ Base ./abstractarraymath.jl:41
 [5] (::RNN)(x::CuArray{Float32, 3, CUDA.Mem.DeviceBuffer}; batchSizes::Nothing)
   @ Knet.Ops20 ~/.julia/packages/Knet/YIFWC/src/ops20/rnn.jl:332
 [6] (::RNN)(x::CuArray{Float32, 3, CUDA.Mem.DeviceBuffer})
   @ Knet.Ops20 ~/.julia/packages/Knet/YIFWC/src/ops20/rnn.jl:329
 [7] (::Chain)(x::Matrix{UInt16})
   @ Main ./In[5]:6
 [8] tag(tagger::Chain, s::String)
   @ Main ./In[29]:6
 [9] top-level scope
   @ In[30]:1

What is "refcount"? What purpose does it serve? How can one alter its value, if altering it is necessary? You do say above: "they seem similar except for the refcount." Can you elaborate on it?

JonasIsensee commented 1 year ago

Finally, if I assign the value read to a global variable in rconvert it works without any errors: julia> JLD2.rconvert(::Type{CuArray{T,N,D}}, x::JLD2CuArray{T,N}) where {T,N,D} = (y=CuArray(x.array); global dbg=y; y) julia> d = FileIO.load("foo.jld2") julia> d["a"] # works with no problems

This (and also the refcount) makes me think that this is a problem with memory management when creating the CuArray. JLD2 allocates the underlying array, passes it to the CuArray(data) constructor, and then ceases to keep track of it (leading to refcount = 0). That would explain why the global-variable workaround could fix it. @denizyuret Could you try a few functions of this type?

function f()
    data = rand(10,10)
    CuArray(data)
end
denizyuret commented 1 year ago

@denizyuret Could you try a few functions of this type?

The f() function you suggested works without problems. refcount of the resulting array is 1.

JLD2 allocates the underlying array and passes it to the CuArray(data) constructor and then ceases to keep track of it. (leading to refcount = 0).

CuArray copies the contents of data (stored in RAM) to GPU memory, and once the GPU array is constructed I don't think it cares what happens to the RAM array. But I am not sure what refcount is for or how it is set, so I may be talking nonsense. For example, if I manually change the value of refcount to 0, things don't break.

@maleadt any idea how refcount=0 may appear and whether it may be the source of our problems?

maleadt commented 1 year ago

But I am not sure what refcount is for and how it is set, so I may be talking nonsense.

The refcount field is to keep track of the underlying buffer, so that multiple CuArrays can share the same memory (e.g., when you take a view, or reinterpret an array, or reshape it).

refcount=0 may happen when you're serializing a freed array.
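
To make the sharing concrete, here is a small sketch (untested here; it assumes the CUDA.jl 3.x internals visible in the dump above, i.e. a `storage::ArrayStorage` field holding an atomic `refcount`, and it requires a GPU):

```julia
using CUDA

a = CUDA.rand(3, 5)
b = reshape(a, 15)       # a derived array: no copy, shares a's buffer

# Both CuArrays point at the same ArrayStorage, whose refcount tracks
# how many arrays still reference the underlying DeviceBuffer.
a.storage === b.storage  # expected: true
a.storage.refcount[]     # expected: 2 while both arrays are alive

# CUDA.unsafe_free! drops a reference; once the count reaches 0 the
# buffer is released and any further use of the data is invalid.
CUDA.unsafe_free!(b)
CUDA.unsafe_free!(a)
```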

JonasIsensee commented 1 year ago

The refcount field is to keep track of the underlying buffer, so that multiple CuArrays can share the same memory (e.g., when you take a view, or reinterpret an array, or reshape it).

refcount=0 may happen when you're serializing a freed array.

Thank you for this info. It is a bit odd, though: the problem here is most certainly during deserialization (otherwise the workarounds above couldn't work).

maleadt commented 1 year ago

Hmm, I was misunderstanding how JLD2 serializes objects. If we're really just calling Array(...) and CuArray(...) (i.e., not serializing CuArray objects directly), I fail to see how we would ever run into refcount=0. FWIW, I also can't reproduce this issue.

JonasIsensee commented 1 year ago

Yeah, that's the curious bit. Let me summarize it quickly:

We define a struct JLD2CuArray that contains data JLD2 can safely store, along with conversion methods for both directions (rconvert and wconvert; Base.convert also works, but that is risky with invalidations...).

When you give JLD2 any object, it always asks JLD2.writeas what type to store it as (the default is writeas(::Type{T}) where {T} = T) and it will then call the conversion methods as necessary.

Therefore, with this code, we store the data in Array form AND the full CuArray{T,N,D} type signature (not shown) to call the correct rconvert method upon loading.
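
For readers unfamiliar with the mechanism, here is the same writeas/wconvert/rconvert pattern applied to a hypothetical CPU-only type (all names below are made up for illustration; only the JLD2 API calls are real):

```julia
import JLD2

# A hypothetical handle type we don't want JLD2 to store directly.
struct Handle
    ptr::Ptr{Cvoid}   # not meaningful on disk
    id::Int
end

# The on-disk surrogate: only the serializable part.
struct HandleOnDisk
    id::Int
end

# Tell JLD2 to store Handle as HandleOnDisk, and how to convert both ways.
JLD2.writeas(::Type{Handle}) = HandleOnDisk
JLD2.wconvert(::Type{HandleOnDisk}, h::Handle) = HandleOnDisk(h.id)
JLD2.rconvert(::Type{Handle}, h::HandleOnDisk) = Handle(C_NULL, h.id)

JLD2.jldsave("handle.jld2"; h = Handle(C_NULL, 42))
h = JLD2.load("handle.jld2", "h")   # rconvert reconstructs a Handle
```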

maleadt commented 1 year ago

The fact that the deserialized object contains a different buffer pointer indicates that the rconvert function has run. This seems to point to a GC-related issue, but if JLD2 is just storing the deserialized object in a regular dictionary the finalizer shouldn't ever run.

@denizyuret since only you seem to be able to reproduce this, I'd add some logging to the CuArray finalizer that decrements the refcount, to see when and from where it gets run (e.g. by adding sprint(Base.show_backtrace, backtrace()) or so to your log messages).
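
A sketch of what such logging could look like (hypothetical: the real decrement happens inside CUDA.jl's own array finalizer, which one would patch directly; here we merely attach an extra finalizer to the deserialized array to see when the GC gets to it):

```julia
using CUDA

# Hypothetical helper: attach an extra finalizer to a CuArray so we can
# see when the GC collects it. Ordinary IO is not safe inside finalizers,
# so Core.println is used for the log message.
function watch_finalizer(a::CuArray, label::AbstractString)
    finalizer(a) do _
        Core.println("CuArray '", label, "' is being finalized")
    end
    return a
end

# Usage sketch inside the custom rconvert from above:
# JLD2.rconvert(::Type{CuArray{T,N,D}}, x::JLD2CuArray{T,N}) where {T,N,D} =
#     watch_finalizer(CuArray(x.array), "deserialized array")
```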