denizyuret / Knet.jl

Koç University deep learning framework.
https://denizyuret.github.io/Knet.jl/latest

error in gcnode (KeyError: key ... not found) #624

Closed Alexander-Barth closed 3 years ago

Alexander-Barth commented 3 years ago

I am seeing the error below in Knet (presumably in AutoGrad's garbage collector) about 1 in 4 times. In my case I use the Python module skopt (called with PyCall) to optimize some hyperparameters, so I run multiple training loops sequentially in the same Julia program (with different numbers of layers, L2 regularization, ...). It works for 3 iterations, but at the 4th iteration I get this error during the 1st epoch.

In a different case, where I use the same code but with slightly more input features, this issue does not show up.

The error appears after about 6 to 7 hours of computing (which does not help with creating a minimal reproducer).

I call the following command after every time a neural network is trained and validated:

GC.gc()
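For context, the outer loop looks roughly like the sketch below (the names objective and train_and_validate and the search space are illustrative placeholders, not my actual code): skopt drives repeated training runs in the same Julia process via PyCall, with GC.gc() between evaluations.

using PyCall
skopt = pyimport("skopt")

function objective(x)
    nlayers, l2 = Int(x[1]), Float64(x[2])
    loss = train_and_validate(nlayers, l2)  # placeholder for one training/validation run
    GC.gc()  # release GPU buffers held by dead arrays before the next run
    return loss
end

# search space: number of layers and L2 regularization weight
res = skopt.gp_minimize(objective, [(1, 6), (1e-6, 1e-2)], n_calls=20)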

Are there any ideas about what I could try to make the training more stable?

Thank you for your hard work on Knet, which is very helpful in my work! On a different computer (with only 8 GB of GPU RAM), Knet has been running rock solid for more than 10 days (and is still running)!

ERROR: LoadError: (in a Julia function called from Python)
JULIA: KeyError: key 0x1818ad5e70b5b27c not found
Stacktrace:
 [1] getindex at ./dict.jl:467 [inlined]
 [2] gcnode(::AutoGrad.Node, ::AutoGrad.Tape) at /home/ulg/gher/abarth/.julia/packages/Knet/rgT4R/src/autograd_gpu/gcnode.jl:79
 [3] differentiate(::Function; o::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/ulg/gher/abarth/.julia/packages/AutoGrad/VFrAv/src/core.jl:168
 [4] differentiate at /home/ulg/gher/abarth/.julia/packages/AutoGrad/VFrAv/src/core.jl:135 [inlined]
 [5] iterate at /home/ulg/gher/abarth/.julia/packages/Knet/rgT4R/src/train20/train.jl:26 [inlined]
 [6] iterate(::Base.Iterators.Enumerate{Knet.Train20.Minimize{DINCAE.PointCloud{KnetArray{Float32,N} where N,Float32,2}}}, ::Tuple{Int64,Int64}) at ./iterators.jl:139
 [7] macro expansion at /home/ulg/gher/abarth/.julia/dev/DINCAE/src/points.jl:574 [inlined]
 [8] macro expansion at ./timing.jl:174 [inlined]
 [9] reconstruct_points(::Type{T} where T, ::Type{T} where T, ::String, ::String, ::Tuple{StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}},StepRangeLen{Float64,Base.TwicePrecision{Float64},Base.TwicePrecision{Float64}}}, ::Array{String,1}; epochs::Int64, batch_size::Int64, truth_uncertain::Bool, enc_nfilter_internal::StepRange{Int64,Int64}, skipconnections::UnitRange{Int64}, clip_grad::Float64, regularization_L1_beta::Int64, regularization_L2_beta::Float64, save_epochs::Array{Int64,1}, upsampling_method::Symbol, probability_skip_for_training::Float64, jitter_std_pos::Tuple{Float32,Float32}, ntime_win::Int64, learning_rate::Float64, learning_rate_decay_epoch::Float64, loss_weights_refine::Tuple{Float64}, auxdata_files::Array{Any,1}, savesnapshot::Bool) at /home/ulg/gher/abarth/.julia/dev/DINCAE/src/points.jl:563
 [10] func(::Array{Any,1}) at /home/users/a/b/abarth/.julia/dev/DINCAE/examples/DINCAE_altimetry_optim.jl:142
 [11] #invokelatest#1 at ./essentials.jl:710 [inlined]
 [12] invokelatest(::Any, ::Any) at ./essentials.jl:709
 [13] _pyjlwrap_call(::Function, ::Ptr{PyCall.PyObject_struct}, ::Ptr{PyCall.PyObject_struct}) at /home/ulg/gher/abarth/.julia/packages/PyCall/BcTLp/src/callback.jl:28
 [14] pyjlwrap_call(::Ptr{PyCall.PyObject_struct}, ::Ptr{PyCall.PyObject_struct}, ::Ptr{PyCall.PyObject_struct}) at /home/ulg/gher/abarth/.julia/packages/PyCall/BcTLp/src/callback.jl:49
 [15] macro expansion at /home/ulg/gher/abarth/.julia/packages/PyCall/BcTLp/src/exception.jl:95 [inlined]
 [16] #110 at /home/ulg/gher/abarth/.julia/packages/PyCall/BcTLp/src/pyfncall.jl:43 [inlined]
 [17] disable_sigint at ./c.jl:446 [inlined]
 [18] __pycall! at /home/ulg/gher/abarth/.julia/packages/PyCall/BcTLp/src/pyfncall.jl:42 [inlined]
 [19] _pycall!(::PyObject, ::PyObject, ::Tuple{typeof(func),Array{PyObject,1}}, ::Int64, ::PyObject) at /home/ulg/gher/abarth/.julia/packages/PyCall/BcTLp/src/pyfncall.jl:29
 [20] _pycall!(::PyObject, ::PyObject, ::Tuple{typeof(func),Array{PyObject,1}}, ::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol},NamedTuple{(:n_calls, :x0),Tuple{Int64,PyVector{Any}}}}) at /home/ulg/gher/abarth/.julia/packages/PyCall/BcTLp/src/pyfncall.jl:11
 [21] (::PyObject)(::Function, ::Vararg{Any,N} where N; kwargs::Base.Iterators.Pairs{Symbol,Any,Tuple{Symbol,Symbol},NamedTuple{(:n_calls, :x0),Tuple{Int64,PyVector{Any}}}}) at /home/ulg/gher/abarth/.julia/packages/PyCall/BcTLp/src/pyfncall.jl:86
 [22] top-level scope at /home/users/a/b/abarth/.julia/dev/DINCAE/examples/DINCAE_altimetry_optim.jl:253
 [23] include(::Function, ::Module, ::String) at ./Base.jl:380
 [24] include(::Module, ::String) at ./Base.jl:368
 [25] exec_options(::Base.JLOptions) at ./client.jl:296

These are the versions that I use:

Julia: 1.5.2
Knet: v1.4.2
CUDA.jl: v1.3.3 (with CUDA 11.0)
AutoGrad: v1.2.3
GPU: Tesla V100-PCIE
OS: Linux

denizyuret commented 3 years ago

@Alexander-Barth it would be great if we could discover why this is happening. But let's find you a workaround first:

  1. The AutoGrad gc is an optional feature that saves you about 30-40% of GPU memory. You can disable it using:

    AutoGrad.set_gc_function(AutoGrad.default_gc)
  2. If you need the 30-40% to fit the model on your GPU, you can try upgrading to Knet-1.4.3 and CUDA-2.x to see if this improves stability (in case the problem is with the CUDA-1.x memory manager).

  3. You can make the following change to gcnode.jl:79, which seems to be where the problem occurs (see the sketch after this list):

    pi = get(gcnode_index, objectid(parent), 0)

    This will at worst fail to gc some nodes but will not change anything in the computation.

  4. If we can figure out the state of the tape when this happens, I can debug further. When the line above fails to find the key objectid(parent) in the dict gcnode_index and returns 0, you could dump the following relevant variables into a JLD2 file and send it to me for debugging:

    n, tape, i, parent, gcnode_queue, gcnode_dict, gcnode_index, gcnode_tape
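In context, the guarded lookup from item 3 would look roughly like this sketch (variable names follow gcnode.jl, but this is illustrative, not the actual source):

pi = get(gcnode_index, objectid(parent), 0)  # 0 instead of a KeyError when missing
if pi == 0
    # key not found: skip this parent instead of throwing; at worst its
    # arrays are freed later by the regular gc (this is also the spot to
    # dump the state listed in item 4)
else
    # normal path: record position pi as the earliest tape index that
    # still needs the arrays reachable from parent
end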
Alexander-Barth commented 3 years ago

I did not see that there was a newer release. I think the best course is to upgrade first and verify that this issue is still present. Thank you for all your insights!

Alexander-Barth commented 3 years ago

Concerning point 2: I can confirm that I also see the issue on Knet-1.4.3 and CUDA-2.0.2. I am now trying point 1 (with Knet master), and so far it has run fine (for 16 hours already). Once this run is finished, I will add the debugging code from point 4.

Thanks again for your help!

andevellicus commented 3 years ago

I seem to be getting this pretty consistently now as well.

Julia 1.5.3
Knet 1.4.3
CUDA v2.0.2 (if I use CUDA v2.3, I get the issues from #632)

EDIT: Still getting this with CUDA v2.3.
EDIT2: Point 3 didn't fix it.

andevellicus commented 3 years ago

So when I implemented Point 3, the following showed up:

 [1] getindex at ./dict.jl:467 [inlined]
 [2] gcnode(::AutoGrad.Node, ::AutoGrad.Tape) at /home/xxx/.julia/packages/Knet/OYNCT/src/autograd_gpu/gcnode.jl:71
 [3] differentiate(::Function; o::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/xxx/.julia/packages/AutoGrad/VFrAv/src/core.jl:168
 [4] differentiate at /home/xxx/.julia/packages/AutoGrad/VFrAv/src/core.jl:135 [inlined]
 [5] minimize!(::UNet3D, ::KnetArray{Float32,5}, ::KnetArray{Float32,5}) at /home/xxx/Code/julia/knet-sdh-seg/train.jl:14
 [6] main() at /home/xxx/Code/julia/knet-sdh-seg/train.jl:89
 [7] top-level scope at /home/xxx/Code/julia/knet-sdh-seg/train.jl:113
 [8] include(::Function, ::Module, ::String) at ./Base.jl:380
 [9] include(::Module, ::String) at ./Base.jl:368
 [10] exec_options(::Base.JLOptions) at ./client.jl:296
 [11] _start() at ./client.jl:506

I did the same for gcnode.jl:71, and then CUDA freaks out:

CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)                                                    

Stacktrace:                                                                                                                                                                                                                                    
 [1] throw_api_error(::CUDA.cudaError_enum) at /home/xxx/.julia/packages/CUDA/YeS8q/lib/cudadrv/error.jl:97                                                                                                                           
 [2] macro expansion at /home/xxx/.julia/packages/CUDA/YeS8q/lib/cudadrv/error.jl:104 [inlined]
 [3] cuMemFree_v2 at /home/xxx/.julia/packages/CUDA/YeS8q/lib/utils/call.jl:93 [inlined]              
 [4] free(::CUDA.Mem.DeviceBuffer) at /home/xxx/.julia/packages/CUDA/YeS8q/lib/cudadrv/memory.jl:77
 [5] macro expansion at /home/xxx/.julia/packages/CUDA/YeS8q/src/pool.jl:209 [inlined]
 [6] macro expansion at ./timing.jl:233 [inlined]                                                                      
 [7] macro expansion at /home/xxx/.julia/packages/CUDA/YeS8q/src/pool.jl:208 [inlined]
 [8] macro expansion at /home/xxx/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
 [9] macro expansion at /home/xxx/.julia/packages/CUDA/YeS8q/src/pool.jl:207 [inlined]
 [10] macro expansion at /home/xxx/.julia/packages/CUDA/YeS8q/src/state.jl:206 [inlined]
 [11] actual_free(::CuDevice, ::CUDA.Block) at /home/xxx/.julia/packages/CUDA/YeS8q/src/pool.jl:205
 [12] pool_free(::CuDevice, ::CUDA.Block) at /home/xxx/.julia/packages/CUDA/YeS8q/src/pool/binned.jl:229
 [13] macro expansion at /home/xxx/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
 [14] macro expansion at /home/xxx/.julia/packages/CUDA/YeS8q/src/pool.jl:389 [inlined]
 [15] macro expansion at ./timing.jl:233 [inlined]                                                                     
 [16] free at /home/xxx/.julia/packages/CUDA/YeS8q/src/pool.jl:388 [inlined]                                                                                                                                                          
 [17] unsafe_free!(::CuArray{UInt8,1}) at /home/xxx/.julia/packages/CUDA/YeS8q/src/array.jl:42                                                                                                                                        
 [18] freeKnetPtrCu(::Knet.KnetArrays.KnetPtr) at /home/xxx/.julia/packages/Knet/OYNCT/src/knetarrays/kptr.jl:239
 [19] show_exception_stack(::IOContext{Base.TTY}, ::Array{Any,1}) at ./errorshow.jl:737
 [20] display_error(::Base.TTY, ::Array{Any,1}) at ./client.jl:104
 [21] display_error(::Array{Any,1}) at ./client.jl:106
 [22] #invokelatest#1 at ./essentials.jl:710 [inlined]
 [23] invokelatest at ./essentials.jl:709 [inlined]
 [24] exec_options(::Base.JLOptions) at ./client.jl:298
 [25] _start() at ./client.jl:506
LoadError: WARNING: Error while freeing CuPtr{Nothing}(0x00007f08e0000000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)
denizyuret commented 3 years ago

@andevellicus :

Alexander-Barth commented 3 years ago

For your information, I am also running test 4 now (so far no error). I made these changes:

diff --git a/src/Knet.jl b/src/Knet.jl
index 8e5115a..283f9bf 100644
--- a/src/Knet.jl
+++ b/src/Knet.jl
@@ -1,5 +1,7 @@
 module Knet

+using JLD2
+
 "Construct a path relative to Knet root, e.g. Knet.dir(\"examples\") => \"~/.julia/dev/Knet/examples\""
 dir(path...)=joinpath(dirname(@__DIR__),path...)

diff --git a/src/autograd_gpu/gcnode.jl b/src/autograd_gpu/gcnode.jl
index 9f5eeb5..982d794 100644
--- a/src/autograd_gpu/gcnode.jl
+++ b/src/autograd_gpu/gcnode.jl
@@ -2,6 +2,7 @@ using CUDA: CuArray, unsafe_free!
 using Knet.CuArrays: cuarrays  # Profiling shows most of the time is spent in cuarrays
 using Knet.KnetArrays: cuallocator
 using AutoGrad: Result, Node, Tape
+using JLD2

 # During the back pass we want to make pointers available as soon as we can to save memory
 # without waiting for gc. This is risky as we have to make sure the pointers are not going to
@@ -68,6 +69,10 @@ function gcnode(n::Node, tape::Tape)
         gcnode_init(tape)
     end
     tape isa Tape || return
+
+    if !haskey(gcnode_index,objectid(n))
+        JLD2.@save "/home/ulg/gher/abarth/tape.jld2" n  tape  i  parent  gcnode_queue  gcnode_dict  gcnode_index  gcnode_tape
+    end
     ni = gcnode_index[objectid(n)]
     if n.Value isa Result
         for c in cuarrays(n.outgrad); gcnode_setindex!(c, ni); end
Alexander-Barth commented 3 years ago

Unfortunately, it seems that JLD2 cannot handle pointers:

ERROR: LoadError: cannot write a pointer to JLD file
Stacktrace:
 [1] h5type(::JLD2.JLDFile{JLD2.MmapIO}, ::Type{Ptr{Nothing}}, ::Any) at /home/ulg/gher/abarth/.julia/packages/JLD2/eU405/src/data.jl:1084
 [2] h5type(::JLD2.JLDFile{JLD2.MmapIO}, ::Ptr{Nothing}) at /home/ulg/gher/abarth/.julia/packages/JLD2/eU405/src/data.jl:170
 [3] write_dataset at /home/ulg/gher/abarth/.julia/packages/JLD2/eU405/src/datasets.jl:528 [inlined]
 [4] write_ref(::JLD2.JLDFile{JLD2.MmapIO}, ::Ptr{Nothing}, ::JLD2.JLDWriteSession{Dict{UInt64,JLD2.RelOffset}}) at /home/ulg/gher/abarth/.julia/packages/JLD2/eU405/src/datasets.jl:549
[...]
[68] gcnode(::AutoGrad.Node, ::AutoGrad.Tape) at /home/ulg/gher/abarth/.julia/dev/Knet/src/autograd_gpu/gcnode.jl:74

Would it be helpful if I tried Serialization.serialize? (I am currently using Julia version 1.5.2.)

denizyuret commented 3 years ago

Aah, we should have checked this. File load/save is kind of a mess right now (https://github.com/JuliaIO/JLD2.jl/issues/40). serialize is a good idea; I think I can recover the relevant variables for debugging from their serialized versions.
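Something along these lines should work (a sketch only; unlike JLD2, Serialization accepts values containing raw pointers, writing the pointers themselves out as C_NULL):

using Serialization
for (name, val) in (("n", n), ("tape", tape), ("i", i), ("parent", parent),
                    ("gcnode_queue", gcnode_queue), ("gcnode_dict", gcnode_dict),
                    ("gcnode_index", gcnode_index), ("gcnode_tape", gcnode_tape))
    serialize("tmp-$name", val)  # one file per variable, e.g. tmp-gcnode_dict
end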

Alexander-Barth commented 3 years ago

Here are the files: http://data-assimilation.net/upload/Alex/Knet-tape.tar.gz

-rw-rw-r-- 1 abarth abarth    2936767  3. Dez 20:29 tmp-gcnode_dict
-rw-rw-r-- 1 abarth abarth       8654  3. Dez 20:29 tmp-gcnode_index
-rw-rw-r-- 1 abarth abarth      16263  3. Dez 20:29 tmp-gcnode_queue
-rw-rw-r-- 1 abarth abarth 1685230834  3. Dez 20:29 tmp-gcnode_tape
-rw-rw-r-- 1 abarth abarth          9  3. Dez 20:29 tmp-i
-rw-rw-r-- 1 abarth abarth 1685225093  3. Dez 20:29 tmp-n
-rw-rw-r-- 1 abarth abarth 1685225093  3. Dez 20:29 tmp-parent
-rw-rw-r-- 1 abarth abarth 1685230819  3. Dez 20:29 tmp-tape

The files are pretty large, as you can see. If you have trouble reading them, I can try to convert them to another format. Thanks again for your help!

denizyuret commented 3 years ago

Thank you, I downloaded the files. However, I get the following error when trying to deserialize the tape:

ERROR: KeyError: key DINCAE [0d879ee6-e5ed-4b6f-b65c-f78122b59944] not found

I am using julia-1.5.2 and Knet-1.4.3, assuming that is what you have. However, it seems I may not be able to deserialize without your exact Manifest.toml contents and packages.

denizyuret commented 3 years ago

I created a branch that prints out some debug info instead of breaking when the "key not found" error happens. You can use it with pkg"add Knet#dy/fix624". It prints output every time it detects the error but keeps going, so you may want to stop it manually if your log gets too big.

andevellicus commented 3 years ago

Attached is my output when I run the code.

I am running:

Julia 1.5.3
Knet 1.4.4 #dy/fix624
CUDA 1.2.3

Arch Linux, RTX 3090 (I had to sell my arm to get my hands on this)

I confirmed that the bug crops up when I'm on Knet master as well, specifically:

Stacktrace:
 [1] getindex at ./dict.jl:467 [inlined]                                                                                                                                                                                                                                          
 [2] gcnode(::AutoGrad.Node, ::AutoGrad.Tape) at /home/andevellicus/.julia/packages/Knet/LdQyF/src/autograd_gpu/gcnode.jl:79                                                                                                                                                      
 [3] differentiate(::Function; o::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/andevellicus/.julia/packages/AutoGrad/VFrAv/src/core.jl:168                                                                                                      
 [4] differentiate at /home/andevellicus/.julia/packages/AutoGrad/VFrAv/src/core.jl:135 [inlined]                                                                                                                                                                                 
 [5] minimize!(::UNet3D, ::KnetArray{Float32,5}, ::KnetArray{Float32,5}) at /home/andevellicus/code/julia/knet-sdh-seg/train.jl:14                                                                                                                                                
 [6] main() at /home/andevellicus/code/julia/knet-sdh-seg/train.jl:89                                                                                                                                                                                                             
 [7] top-level scope at /home/andevellicus/code/julia/knet-sdh-seg/train.jl:113                                                                                                                                                                                                   
 [8] include(::Function, ::Module, ::String) at ./Base.jl:380                                                                                                                                                                                                                     
 [9] include(::Module, ::String) at ./Base.jl:368                                                                                                                                                                                                                                 
 [10] exec_options(::Base.JLOptions) at ./client.jl:296                                                                                                                                                                                                                           
 [11] _start() at ./client.jl:506                                                                                                                                                                                                                                                 
in expression starting at /home/andevellicus/code/julia/knet-sdh-seg/train.jl:113

out.txt

denizyuret commented 3 years ago

Thank you, this was very helpful. It seems that gcnode fails to detect the tape change on L67. For the life of me, I cannot understand how this could fail. Is there any way I can replicate the error on my machine?

andevellicus commented 3 years ago

My MWE from #610 gave me the same error. You may have to adjust the size of the rand array, but that certainly seems to do the trick on my machine.

denizyuret commented 3 years ago

Finally got it: it turns out Julia can allocate objects with the exact same objectid after a gc. I naively assumed otherwise and relied on objectid to detect when a new @diff call had started and the memory-saving gcnode infrastructure had to be reset. Anyway, it should be fixed now in the dy/fix624 branch; please test.
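To illustrate the failure mode (a contrived sketch, not the actual Knet code): objectid for a mutable object is based on its memory address, so after a gc a freshly allocated tape can end up with the id of a dead one, and a reset check keyed on objectid silently misses the change:

const last_tape_id = Ref{UInt}(0)

function maybe_reset!(tape)
    if objectid(tape) != last_tape_id[]  # unreliable: ids can repeat after gc
        last_tape_id[] = objectid(tape)
        # reset the gcnode state for the new @diff call ...
    end
end

Holding a WeakRef to the last tape and comparing with === avoids the problem, since object identity does not outlive the object.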

@maleadt: I ran a profiler on my gcnode and realized that an enormous amount of time was spent on unsafe_free!(::CuArray) - about 30ms per call. Is this expected?

maleadt commented 3 years ago

@maleadt: I ran a profiler on my gcnode and realized that an enormous amount of time was spent on unsafe_free!(::CuArray) - about 30ms per call. Is this expected?

No, there's some bookkeeping to do, but that's an enormous amount of time. Please test with CUDA.jl#master (refcounting has been simplified recently) and file an issue otherwise.
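A quick way to measure this in isolation (a sketch assuming CUDA.jl; the array sizes and count are arbitrary):

using CUDA

xs = [CUDA.rand(Float32, 1024, 1024) for _ in 1:100]
t = @elapsed foreach(CUDA.unsafe_free!, xs)
println("mean unsafe_free! time: ", round(1000t / length(xs), digits=3), " ms/call")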

andevellicus commented 3 years ago

Seems to solve the issue on my machine.