Closed Alexander-Barth closed 3 years ago
@Alexander-Barth it would be great if we could discover why this is happening. But let's find you a workaround first:
The AutoGrad gc is an optional feature that saves you about 30-40% gpu memory. You can disable it using:
AutoGrad.set_gc_function(AutoGrad.default_gc)
If you need the 30-40% to fit the model on your GPU, you can try upgrading to Knet-1.4.3 and CUDA-2.x to see if this improves stability (in case the problem is with the CUDA-1.x memory manager).
You can make the following change to gcnode.jl:79, the line which seems to cause the problem:
pi = get(gcnode_index, objectid(parent), 0)
This will at worst fail to gc some nodes but will not change anything in the computation.
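To illustrate the behavior of the suggested change, here is a minimal, self-contained sketch (the Dict and keys are invented for illustration; only the `get` vs `getindex` contrast mirrors the gcnode code):

```julia
# Minimal illustration of the suggested change: indexing with getindex
# throws a KeyError for an unknown key, while get with a default of 0
# degrades gracefully (in gcnode, 0 just means "skip gc for this node").
index = Dict{UInt,Int}(objectid(:a) => 1)

pi_strict = try index[objectid(:b)] catch e; e end  # a KeyError
pi_safe   = get(index, objectid(:b), 0)             # 0
```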
If we can figure out the status of the tape when this happens, I can debug further. When the line above fails to find the key objectid(parent) in the dict gcnode_index and returns 0, you could dump the following relevant variables into a JLD2 file and send it to me for debugging:
n, tape, i, parent, gcnode_queue, gcnode_dict, gcnode_index, gcnode_tape
I did not see that there was a newer release. I think the best is to upgrade to verify that this issue is still present. Thank you for all your insights!
Concerning point 2: I can confirm that the issue is also present on Knet-1.4.3 and CUDA-2.0.2. I am now trying point 1 (with Knet master) and so far it has run fine (for 16 hours already). Once this run is finished, I will add the debugging code from point 4.
Thanks again for your help!
I seem to be getting this pretty consistently now as well.
CUDA v2.0.2 (if I use CUDA v2.3 I get #632 issues), Knet 1.4.3, Julia 1.5.3.
EDIT: Still getting this with CUDA v2.3.
EDIT2: Point 3 didn't fix it.
So when I implemented Point 3, the following showed up:
[1] getindex at ./dict.jl:467 [inlined]
[2] gcnode(::AutoGrad.Node, ::AutoGrad.Tape) at /home/xxx/.julia/packages/Knet/OYNCT/src/autograd_gpu/gcnode.jl:71
[3] differentiate(::Function; o::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/xxx/.julia/packages/AutoGrad/VFrAv/src/core.jl:168
[4] differentiate at /home/xxx/.julia/packages/AutoGrad/VFrAv/src/core.jl:135 [inlined]
[5] minimize!(::UNet3D, ::KnetArray{Float32,5}, ::KnetArray{Float32,5}) at /home/xxx/Code/julia/knet-sdh-seg/train.jl:14
[6] main() at /home/xxx/Code/julia/knet-sdh-seg/train.jl:89
[7] top-level scope at /home/xxx/Code/julia/knet-sdh-seg/train.jl:113
[8] include(::Function, ::Module, ::String) at ./Base.jl:380
[9] include(::Module, ::String) at ./Base.jl:368
[10] exec_options(::Base.JLOptions) at ./client.jl:296
[11] _start() at ./client.jl:506
I did the same for gcnode.jl:71, and then CUDA freaks out:
CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)
Stacktrace:
[1] throw_api_error(::CUDA.cudaError_enum) at /home/xxx/.julia/packages/CUDA/YeS8q/lib/cudadrv/error.jl:97
[2] macro expansion at /home/xxx/.julia/packages/CUDA/YeS8q/lib/cudadrv/error.jl:104 [inlined]
[3] cuMemFree_v2 at /home/xxx/.julia/packages/CUDA/YeS8q/lib/utils/call.jl:93 [inlined]
[4] free(::CUDA.Mem.DeviceBuffer) at /home/xxx/.julia/packages/CUDA/YeS8q/lib/cudadrv/memory.jl:77
[5] macro expansion at /home/xxx/.julia/packages/CUDA/YeS8q/src/pool.jl:209 [inlined]
[6] macro expansion at ./timing.jl:233 [inlined]
[7] macro expansion at /home/xxx/.julia/packages/CUDA/YeS8q/src/pool.jl:208 [inlined]
[8] macro expansion at /home/xxx/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
[9] macro expansion at /home/xxx/.julia/packages/CUDA/YeS8q/src/pool.jl:207 [inlined]
[10] macro expansion at /home/xxx/.julia/packages/CUDA/YeS8q/src/state.jl:206 [inlined]
[11] actual_free(::CuDevice, ::CUDA.Block) at /home/xxx/.julia/packages/CUDA/YeS8q/src/pool.jl:205
[12] pool_free(::CuDevice, ::CUDA.Block) at /home/xxx/.julia/packages/CUDA/YeS8q/src/pool/binned.jl:229
[13] macro expansion at /home/xxx/.julia/packages/TimerOutputs/ZmKD7/src/TimerOutput.jl:206 [inlined]
[14] macro expansion at /home/xxx/.julia/packages/CUDA/YeS8q/src/pool.jl:389 [inlined]
[15] macro expansion at ./timing.jl:233 [inlined]
[16] free at /home/xxx/.julia/packages/CUDA/YeS8q/src/pool.jl:388 [inlined]
[17] unsafe_free!(::CuArray{UInt8,1}) at /home/xxx/.julia/packages/CUDA/YeS8q/src/array.jl:42
[18] freeKnetPtrCu(::Knet.KnetArrays.KnetPtr) at /home/xxx/.julia/packages/Knet/OYNCT/src/knetarrays/kptr.jl:239
[19] show_exception_stack(::IOContext{Base.TTY}, ::Array{Any,1}) at ./errorshow.jl:737
[20] display_error(::Base.TTY, ::Array{Any,1}) at ./client.jl:104
[21] display_error(::Array{Any,1}) at ./client.jl:106
[22] #invokelatest#1 at ./essentials.jl:710 [inlined]
[23] invokelatest at ./essentials.jl:709 [inlined]
[24] exec_options(::Base.JLOptions) at ./client.jl:298
[25] _start() at ./client.jl:506
LoadError: WARNING: Error while freeing CuPtr{Nothing}(0x00007f08e0000000):
CUDA.CuError(code=CUDA.cudaError_enum(0x000002bc), meta=nothing)
@andevellicus:
For your information, I am also running test 4 now (so far no error). I made these changes:
diff --git a/src/Knet.jl b/src/Knet.jl
index 8e5115a..283f9bf 100644
--- a/src/Knet.jl
+++ b/src/Knet.jl
@@ -1,5 +1,7 @@
module Knet
+using JLD2
+
"Construct a path relative to Knet root, e.g. Knet.dir(\"examples\") => \"~/.julia/dev/Knet/examples\""
dir(path...)=joinpath(dirname(@__DIR__),path...)
diff --git a/src/autograd_gpu/gcnode.jl b/src/autograd_gpu/gcnode.jl
index 9f5eeb5..982d794 100644
--- a/src/autograd_gpu/gcnode.jl
+++ b/src/autograd_gpu/gcnode.jl
@@ -2,6 +2,7 @@ using CUDA: CuArray, unsafe_free!
using Knet.CuArrays: cuarrays # Profiling shows most of the time is spent in cuarrays
using Knet.KnetArrays: cuallocator
using AutoGrad: Result, Node, Tape
+using JLD2
# During the back pass we want to make pointers available as soon as we can to save memory
# without waiting for gc. This is risky as we have to make sure the pointers are not going to
@@ -68,6 +69,10 @@ function gcnode(n::Node, tape::Tape)
gcnode_init(tape)
end
tape isa Tape || return
+
+ if !haskey(gcnode_index,objectid(n))
+ JLD2.@save "/home/ulg/gher/abarth/tape.jld2" n tape i parent gcnode_queue gcnode_dict gcnode_index gcnode_tape
+ end
ni = gcnode_index[objectid(n)]
if n.Value isa Result
for c in cuarrays(n.outgrad); gcnode_setindex!(c, ni); end
Unfortunately, it seems that JLD2 cannot handle pointers:
ERROR: LoadError: cannot write a pointer to JLD file
Stacktrace:
[1] h5type(::JLD2.JLDFile{JLD2.MmapIO}, ::Type{Ptr{Nothing}}, ::Any) at /home/ulg/gher/abarth/.julia/packages/JLD2/eU405/src/data.jl:1084
[2] h5type(::JLD2.JLDFile{JLD2.MmapIO}, ::Ptr{Nothing}) at /home/ulg/gher/abarth/.julia/packages/JLD2/eU405/src/data.jl:170
[3] write_dataset at /home/ulg/gher/abarth/.julia/packages/JLD2/eU405/src/datasets.jl:528 [inlined]
[4] write_ref(::JLD2.JLDFile{JLD2.MmapIO}, ::Ptr{Nothing}, ::JLD2.JLDWriteSession{Dict{UInt64,JLD2.RelOffset}}) at /home/ulg/gher/abarth/.julia/packages/JLD2/eU405/src/datasets.jl:549
[...]
[68] gcnode(::AutoGrad.Node, ::AutoGrad.Tape) at /home/ulg/gher/abarth/.julia/dev/Knet/src/autograd_gpu/gcnode.jl:74
Would it be helpful if I try Serialization.serialize? (I am currently using Julia version 1.5.2.)
Aah, we should have checked this. File load/save is kind of a mess right now (https://github.com/JuliaIO/JLD2.jl/issues/40). serialize is a good idea; I think I can recover the relevant variables for debugging from their serialized versions.
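As a sketch of the serialize-based approach (with toy stand-in values; the real call would dump n, tape, i, parent, and the gcnode state, one file per variable):

```julia
using Serialization

# Toy stand-ins for the debug variables (n, tape, i, parent, ...).
vars = Dict("i" => 42, "gcnode_index" => Dict{UInt,Int}())

# One "tmp-<name>" file per variable. Unlike JLD2, the stdlib serializer
# accepts values containing raw pointers (they just won't survive a restart
# meaningfully, which is fine for inspecting the tape structure).
for (name, value) in vars
    serialize("tmp-$name", value)
end

# The receiving side recovers each variable with deserialize:
deserialize("tmp-i")
```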
Here are the files: http://data-assimilation.net/upload/Alex/Knet-tape.tar.gz
-rw-rw-r-- 1 abarth abarth 2936767 3. Dez 20:29 tmp-gcnode_dict
-rw-rw-r-- 1 abarth abarth 8654 3. Dez 20:29 tmp-gcnode_index
-rw-rw-r-- 1 abarth abarth 16263 3. Dez 20:29 tmp-gcnode_queue
-rw-rw-r-- 1 abarth abarth 1685230834 3. Dez 20:29 tmp-gcnode_tape
-rw-rw-r-- 1 abarth abarth 9 3. Dez 20:29 tmp-i
-rw-rw-r-- 1 abarth abarth 1685225093 3. Dez 20:29 tmp-n
-rw-rw-r-- 1 abarth abarth 1685225093 3. Dez 20:29 tmp-parent
-rw-rw-r-- 1 abarth abarth 1685230819 3. Dez 20:29 tmp-tape
The files are pretty large as you see. If you have troubles reading them, I can try to convert them to another format. Thanks again for your help!
Thank you, I downloaded the files. However I get the following error when trying to deserialize the tape:
ERROR: KeyError: key DINCAE [0d879ee6-e5ed-4b6f-b65c-f78122b59944] not found
I am using julia-1.5.2 and Knet-1.4.3, assuming that is what you have. However, it seems I may not be able to deserialize without your exact Manifest.toml contents and packages.
I created a branch which will print out some debug info instead of breaking when the "key not found" error happens. You can use it with pkg"add Knet#dy/fix624". It prints stuff every time it detects the error but keeps going, so you may want to stop it manually if your log gets too big.
Attached is my output when I run the code.
I am running: Julia 1.5.3, Knet 1.4.4 #dy/fix624, CUDA 1.2.3, Arch Linux, RTX 3090 (I had to sell my arm to get my hands on this).
I confirmed that the bug crops up when I'm on Knet master as well, specifically:
Stacktrace:
[1] getindex at ./dict.jl:467 [inlined]
[2] gcnode(::AutoGrad.Node, ::AutoGrad.Tape) at /home/andevellicus/.julia/packages/Knet/LdQyF/src/autograd_gpu/gcnode.jl:79
[3] differentiate(::Function; o::Base.Iterators.Pairs{Union{},Union{},Tuple{},NamedTuple{(),Tuple{}}}) at /home/andevellicus/.julia/packages/AutoGrad/VFrAv/src/core.jl:168
[4] differentiate at /home/andevellicus/.julia/packages/AutoGrad/VFrAv/src/core.jl:135 [inlined]
[5] minimize!(::UNet3D, ::KnetArray{Float32,5}, ::KnetArray{Float32,5}) at /home/andevellicus/code/julia/knet-sdh-seg/train.jl:14
[6] main() at /home/andevellicus/code/julia/knet-sdh-seg/train.jl:89
[7] top-level scope at /home/andevellicus/code/julia/knet-sdh-seg/train.jl:113
[8] include(::Function, ::Module, ::String) at ./Base.jl:380
[9] include(::Module, ::String) at ./Base.jl:368
[10] exec_options(::Base.JLOptions) at ./client.jl:296
[11] _start() at ./client.jl:506
in expression starting at /home/andevellicus/code/julia/knet-sdh-seg/train.jl:113
Thank you this was very helpful. It seems like gcnode fails to detect the tape change on L67. For the life of me I cannot understand how this could fail. Is there any way I can replicate the error on my machine?
My MWE from #610 gave me the same error. You may have to adjust the size of the rand array, but that certainly seems to do the trick on my machine.
Finally got it: it turns out Julia can allocate objects with the exact same objectid after a gc. I naively assumed otherwise and relied on objectid to detect when a new @diff call started and the memory-saving gcnode infrastructure had to be reset. Anyway, it should be fixed now in the dy/fix624 branch; please test.
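A hedged sketch of the hazard (whether an id is actually reused is up to the allocator, so the collision cannot be forced deterministically; the variable names are invented):

```julia
# objectid is only guaranteed unique among *live* objects. If gcnode caches
# objectid(tape) to detect a new @diff call, a tape freed by gc and a tape
# allocated afterwards may compare equal by objectid, so the reset logic
# can silently misfire.
seen = Set{UInt}()

t1 = Any[]                    # first "tape"
push!(seen, objectid(t1))
t1 = nothing; GC.gc()         # first tape becomes garbage

t2 = Any[]                    # a new tape may reuse the same memory...
objectid(t2) in seen          # ...so this check can (rarely) be true
```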
@maleadt: I ran a profiler on my gcnode and realized that an enormous amount of time was spent on unsafe_free!(::CuArray) - about 30ms per call. Is this expected?
No, there's some bookkeeping to do, but that's an enormous amount of time. Please test with CUDA.jl#master (refcounting has been simplified recently) and file an issue otherwise.
Seems to solve the issue on my machine.
I am seeing the error below in Knet (presumably in AutoGrad's garbage collection) about 1 in 4 times. In my case I use the Python module skopt to optimize some hyperparameters (called with PyCall). I thus run multiple training loops sequentially in the same Julia program (with different numbers of layers, L2 regularization, ...). It works for 3 iterations, but at the 4th iteration I get this error during the 1st epoch. In a different case, where I use the same code but slightly more input features, this issue does not show up.
This error appears in about 6 to 7 hours of computing (which does not help to create a minimal reproducer).
I call these commands after every time a neural network is trained and validated.
Are there any ideas what I could try to make the training more stable?
Thank you for your hard work on Knet, which is very helpful in my work! On a different computer (with only 8 GB GPU RAM), Knet runs rock solid for more than 10 days (and is still running)!
These are the versions that I use: Julia 1.5.2, Knet v1.4.2, CUDA v1.3.3 (with CUDA 11.0), AutoGrad v1.2.3, GPU: Tesla V100-PCIE, OS: Linux.