lmh91 opened this issue 2 years ago
After calling the function `eval_model`, the amount of used GPU memory should be the same as before.
That's a wrong expectation. For one, memory allocations are garbage collected, so it might take a while before they get freed; but secondly, there's a caching layer in libcuda which makes device memory usage look higher while the memory is actually available for reuse. That's what the very line below the one you're pointing at explains.
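For readers running into this: the cached memory can usually be inspected and returned to the driver manually. A minimal sketch, assuming a functional GPU (`CUDA.memory_status` and `CUDA.reclaim` are CUDA.jl's public helpers; the effect on reported usage depends on what is still live):

```julia
using CUDA

CUDA.memory_status()   # prints used/cached device memory
GC.gc()                # let Julia collect dead CuArrays first
CUDA.reclaim()         # return cached pool memory to the driver
CUDA.memory_status()   # reported usage should now be lower
```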
@maleadt but why the overloading error? Shouldn't GPU memory be GC'ed automatically if necessary?
The issue didn't demonstrate an actual OOM, so I'm guessing that statement was a hypothetical? Either way, it shouldn't OOM, our allocator will forcibly free memory (by calling the GC) if some is needed.
@lmh91 could you change your example to demo the OOM errors we've encountered?
I tried to make the MWE as small as possible, and it seems I actually removed the important part that creates the OOM error in my use case: using multiple threads. I updated the bug report and MWE. Sorry about that!
When running the `for` loop single-threaded, no OOM error is produced. (Though the "effective GPU memory" reaches 99.99% very quickly even if one iteration does not require much memory, which I find a bit strange.) The OOM error is produced when running the loop over multiple threads via `Base.Threads.@threads`.
Thanks, I can reproduce using this MWE. Will have a look.
MWE:
```julia
using CUDA

function main()
    Threads.@threads for i in 1:100000
        CuArray{Float32}(undef, (1024, 100))
        nothing
    end
end

isinteractive() || main()
```
This looks like us calling into the GC being broken when using threads.
Thanks @maleadt !
Even smaller:
```julia
using CUDA

function main()
    Threads.@threads for i in 1:30
        CuArray{UInt8}(undef, (1024, 1024, 1024)) # 1 GiB
        nothing
    end
end

isinteractive() || main()
```
OOMs on `-t2`, not on `-t1`. Looks like our calls to `GC.gc` are ineffective, or at least insufficient to free another thread's dead objects:
```
t1: try alloc 1024.000 MiB
t2: try alloc 1024.000 MiB
t1: alloc CuPtr{Nothing}(0x0000000302000000)
t1: try alloc 1024.000 MiB
t2: alloc CuPtr{Nothing}(0x0000000342000000)
t2: try alloc 1024.000 MiB
t1: alloc CuPtr{Nothing}(0x0000000382000000)
t1: try alloc 1024.000 MiB
t2: alloc CuPtr{Nothing}(0x00000003c2000000)
t2: try alloc 1024.000 MiB
t1: alloc CuPtr{Nothing}(0x0000000402000000)
t1: try alloc 1024.000 MiB
t2: alloc CuPtr{Nothing}(0x0000000442000000)
t2: try alloc 1024.000 MiB
t1: alloc CuPtr{Nothing}(0x0000000482000000)
t1: try alloc 1024.000 MiB
t2: alloc CuPtr{Nothing}(0x00000004c2000000)
t2: try alloc 1024.000 MiB
t1: alloc CuPtr{Nothing}(0x0000000502000000)
t1: try alloc 1024.000 MiB
t2: alloc CuPtr{Nothing}(0x0000000542000000)
t2: try alloc 1024.000 MiB
t1: alloc CuPtr{Nothing}(0x0000000582000000)
t1: try alloc 1024.000 MiB
t2: alloc CuPtr{Nothing}(0x00000005c2000000)
t2: try alloc 1024.000 MiB
t1: alloc CuPtr{Nothing}(0x0000000602000000)
t1: try alloc 1024.000 MiB
t2: alloc CuPtr{Nothing}(0x0000000642000000)
t2: try alloc 1024.000 MiB
t1: alloc re-try 1
t2: alloc re-try 1
t1: alloc re-try 2
t2: alloc re-try 2
t1: alloc re-try 3
t2: alloc re-try 3
t2: alloc re-try 4
t2: alloc re-try 5
t2: alloc re-try 6
t2: alloc failed
t1: free CuPtr{Nothing}(0x0000000642000000)
t1: free CuPtr{Nothing}(0x00000005c2000000)
t1: free CuPtr{Nothing}(0x0000000542000000)
t1: free CuPtr{Nothing}(0x00000004c2000000)
t1: free CuPtr{Nothing}(0x0000000442000000)
t1: free CuPtr{Nothing}(0x00000003c2000000)
t1: free CuPtr{Nothing}(0x0000000342000000)
t1: free CuPtr{Nothing}(0x0000000602000000)
t1: free CuPtr{Nothing}(0x0000000582000000)
t1: free CuPtr{Nothing}(0x0000000502000000)
t1: free CuPtr{Nothing}(0x0000000482000000)
t1: free CuPtr{Nothing}(0x0000000402000000)
t1: free CuPtr{Nothing}(0x0000000382000000)
t1: free CuPtr{Nothing}(0x0000000302000000)
```
During those retries, we do incrementally more effort to free memory, including calls to the GC. But as you can see from the trace, those don't free another thread's objects in time. I'm not sure how to proceed here, so I've asked @chflood to have a look.
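One possible mitigation, sketched below: serialize the GC-and-retry slow path behind a lock, so that a thread entering it first re-tries the allocation, since another thread's collection may already have freed enough memory. This is a hypothetical design, not CUDA.jl's actual allocator; `alloc_with_retry` and `try_alloc` are illustrative names.

```julia
# Hypothetical sketch, not CUDA.jl's implementation: serialize GC-driven
# retries so concurrent failing threads benefit from each other's collections.
const gc_retry_lock = ReentrantLock()

# `try_alloc` is any zero-argument function returning a pointer-like value
# on success, or `nothing` on allocation failure.
function alloc_with_retry(try_alloc)
    ptr = try_alloc()
    ptr !== nothing && return ptr
    lock(gc_retry_lock)
    try
        ptr = try_alloc()       # re-check: another thread may have GC'd already
        if ptr === nothing
            GC.gc(false)        # quick incremental collection first
            ptr = try_alloc()
        end
        if ptr === nothing
            GC.gc(true)         # then a full collection
            ptr = try_alloc()
        end
    finally
        unlock(gc_retry_lock)
    end
    ptr === nothing && throw(OutOfMemoryError())
    return ptr
end
```

With the dummy CAS allocator from the MWE below, `try_alloc` would wrap `alloc()`. The lock does not make another thread's finalizers run any sooner, but it avoids two threads racing through all retry stages simultaneously.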
MWE without CUDA.jl:
```julia
const LIMIT = 14

# dummy atomic allocator
const memory = Threads.Atomic{Int}(0)

function alloc()
    println("thread $(Threads.threadid()): try alloc ($(memory[])/$(LIMIT) used)")
    while true
        old_memory = memory[]
        new_memory = old_memory + 1
        if new_memory > LIMIT
            printstyled("thread $(Threads.threadid()): alloc failure\n"; color=:yellow)
            return false
        end
        if Threads.atomic_cas!(memory, old_memory, new_memory) == old_memory
            println("thread $(Threads.threadid()): alloc success")
            return true
        end
    end
end

function free()
    printstyled("thread $(Threads.threadid()): free ($(memory[])/$(LIMIT) used)\n"; color=:green)
    while true
        old_memory = memory[]
        new_memory = old_memory - 1
        @assert new_memory >= 0
        if Threads.atomic_cas!(memory, old_memory, new_memory) == old_memory
            return
        end
    end
end

# dummy array
mutable struct CuArray
    function CuArray()
        success = alloc()
        if !success
            printstyled("thread $(Threads.threadid()): GC.gc(false)\n"; color=:magenta)
            GC.gc(false)
            success = alloc()
        end
        if !success
            printstyled("thread $(Threads.threadid()): GC.gc(true)\n"; color=:magenta)
            GC.gc(true)
            success = alloc()
        end
        if !success
            printstyled("thread $(Threads.threadid()): alloc really failed\n"; color=:red)
            throw(OutOfMemoryError())
        end
        obj = new()
        finalizer(obj) do _
            free()
        end
    end
end

function main()
    Threads.@threads for i in 1:30
        CuArray()
        nothing
    end
end

isinteractive() || main()
```
I am also seeing a GPU OOM while training a neural network with CUDA.jl 5.1.1 and Julia 1.9.4 and 2 threads (it works fine with 1 thread). The CUDA MWE also fails on my system (NVIDIA A100-SXM4-40GB).
Is there any known work-around? I already tried to downgrade CUDA.jl, but without success. Thank you for your time!
I am trying to add `CUDA.reclaim()`, but this leads to this error:
That is a different issue; what's described here is that our GC calls are ineffective with multiple threads, leading to an OOM. You're describing an error that shouldn't occur. Please file a new issue with an MWE so that I can take a look!
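A possible workaround while the GC interaction remains broken is to free device arrays eagerly instead of waiting for finalizers, so each thread returns its own memory immediately. A sketch adapted from the MWE above (`CUDA.unsafe_free!` is CUDA.jl's eager-free helper; whether this avoids the OOM in a full training loop is untested here):

```julia
using CUDA

function main()
    Threads.@threads for i in 1:30
        a = CuArray{UInt8}(undef, (1024, 1024, 1024))  # 1 GiB
        # ... use `a` ...
        CUDA.unsafe_free!(a)   # eagerly return the buffer to the pool
    end
end

isinteractive() || main()
```

Note that after `unsafe_free!` the array must not be accessed again; this only helps when the point where a buffer becomes dead is known explicitly.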
The bug
GPU memory is not freed (fast enough?) when performing a computation that requires little memory in parallel on multiple threads.
MWE
I came across this issue when using a multi-layer ML model via Flux.jl on a GPU in a multi-threaded optimizer. I was able to reproduce the issue with only CUDA.jl and Adapt.jl:
My output of running the script via `julia --project=. main.jl`:

Manifest.toml
``` # This file is machine-generated - editing it directly is not advised julia_version = "1.7.2" manifest_format = "2.0" [[deps.AbstractFFTs]] deps = ["ChainRulesCore", "LinearAlgebra"] git-tree-sha1 = "6f1d9bc1c08f9f4a8fa92e3ea3cb50153a1b40d4" uuid = "621f4979-c628-5d54-868e-fcf4e3e8185c" version = "1.1.0" [[deps.Adapt]] deps = ["LinearAlgebra"] git-tree-sha1 = "af92965fb30777147966f58acb05da51c5616b5f" uuid = "79e6a3ab-5dfb-504d-930d-738a2a938a0e" version = "3.3.3" [[deps.ArgTools]] uuid = "0dad84c5-d112-42e6-8d28-ef12dabb789f" [[deps.Artifacts]] uuid = "56f22d72-fd6d-98f1-02f0-08ddc0907c33" [[deps.BFloat16s]] deps = ["LinearAlgebra", "Printf", "Random", "Test"] git-tree-sha1 = "a598ecb0d717092b5539dbbe890c98bac842b072" uuid = "ab4f0b2a-ad5b-11e8-123f-65d77653426b" version = "0.2.0" [[deps.Base64]] uuid = "2a0f44e3-6c83-55bd-87e4-b1978d98bd5f" [[deps.CEnum]] git-tree-sha1 = "eb4cb44a499229b3b8426dcfb5dd85333951ff90" uuid = "fa961155-64e5-5f13-b03f-caf6b980ea82" version = "0.4.2" [[deps.CUDA]] deps = ["AbstractFFTs", "Adapt", "BFloat16s", "CEnum", "CompilerSupportLibraries_jll", "ExprTools", "GPUArrays", "GPUCompiler", "LLVM", "LazyArtifacts", "Libdl", "LinearAlgebra", "Logging", "Printf", "Random", "Random123", "RandomNumbers", "Reexport", "Requires", "SparseArrays", "SpecialFunctions", "TimerOutputs"] git-tree-sha1 = "19fb33957a5f85efb3cc10e70cf4dd4e30174ac9" uuid = "052768ef-5323-5732-b1bb-66c8b64840ba" version = "3.10.0" [[deps.ChainRulesCore]] deps = ["Compat", "LinearAlgebra", "SparseArrays"] git-tree-sha1 = "9950387274246d08af38f6eef8cb5480862a435f" uuid = "d360d2e6-b24c-11e9-a2a3-2a2ae2dbcce4" version = "1.14.0" [[deps.ChangesOfVariables]] deps = ["ChainRulesCore", "LinearAlgebra", "Test"] git-tree-sha1 = "1e315e3f4b0b7ce40feded39c73049692126cf53" uuid = "9e997f8a-9a97-42d5-a9f1-ce6bfc15e2c0" version = "0.1.3" [[deps.Compat]] deps = ["Base64", "Dates", "DelimitedFiles", "Distributed", "InteractiveUtils", "LibGit2", "Libdl", "LinearAlgebra", "Markdown", 
"Mmap", "Pkg", "Printf", "REPL", "Random", "SHA", "Serialization", "SharedArrays", "Sockets", "SparseArrays", "Statistics", "Test", "UUIDs", "Unicode"] git-tree-sha1 = "b153278a25dd42c65abbf4e62344f9d22e59191b" uuid = "34da2185-b29b-5c13-b0c7-acf172513d20" version = "3.43.0" [[deps.CompilerSupportLibraries_jll]] deps = ["Artifacts", "Libdl"] uuid = "e66e0078-7015-5450-92f7-15fbd957f2ae" [[deps.Dates]] deps = ["Printf"] uuid = "ade2ca70-3891-5945-98fb-dc099432e06a" [[deps.DelimitedFiles]] deps = ["Mmap"] uuid = "8bb1440f-4735-579b-a4ab-409b98df4dab" [[deps.Distributed]] deps = ["Random", "Serialization", "Sockets"] uuid = "8ba89e20-285c-5b6f-9357-94700520ee1b" [[deps.DocStringExtensions]] deps = ["LibGit2"] git-tree-sha1 = "b19534d1895d702889b219c382a6e18010797f0b" uuid = "ffbed154-4ef7-542d-bbb7-c09d3a79fcae" version = "0.8.6" [[deps.Downloads]] deps = ["ArgTools", "LibCURL", "NetworkOptions"] uuid = "f43a241f-c20a-4ad4-852c-f6b1247861c6" [[deps.ExprTools]] git-tree-sha1 = "56559bbef6ca5ea0c0818fa5c90320398a6fbf8d" uuid = "e2ba6199-217a-4e67-a87a-7c52f15ade04" version = "0.1.8" [[deps.GPUArrays]] deps = ["Adapt", "LLVM", "LinearAlgebra", "Printf", "Random", "Serialization", "Statistics"] git-tree-sha1 = "c783e8883028bf26fb05ed4022c450ef44edd875" uuid = "0c68f7d7-f131-5f86-a1c3-88cf8149b2d7" version = "8.3.2" [[deps.GPUCompiler]] deps = ["ExprTools", "InteractiveUtils", "LLVM", "Libdl", "Logging", "TimerOutputs", "UUIDs"] git-tree-sha1 = "d8c5999631e1dc18d767883f621639c838f8e632" uuid = "61eb1bfa-7361-4325-ad38-22787b887f55" version = "0.15.2" [[deps.InteractiveUtils]] deps = ["Markdown"] uuid = "b77e0a4c-d291-57a0-90e8-8db25a27a240" [[deps.InverseFunctions]] deps = ["Test"] git-tree-sha1 = "336cc738f03e069ef2cac55a104eb823455dca75" uuid = "3587e190-3f89-42d0-90ee-14403ec27112" version = "0.1.4" [[deps.IrrationalConstants]] git-tree-sha1 = "7fd44fd4ff43fc60815f8e764c0f352b83c49151" uuid = "92d709cd-6900-40b7-9082-c6be49f344b6" version = "0.1.1" [[deps.JLLWrappers]] 
deps = ["Preferences"] git-tree-sha1 = "abc9885a7ca2052a736a600f7fa66209f96506e1" uuid = "692b3bcd-3c85-4b1f-b108-f13ce0eb3210" version = "1.4.1" [[deps.LLVM]] deps = ["CEnum", "LLVMExtra_jll", "Libdl", "Printf", "Unicode"] git-tree-sha1 = "c8d47589611803a0f3b4813d9e267cd4e3dbcefb" uuid = "929cbde3-209d-540e-8aea-75f648917ca0" version = "4.11.1" [[deps.LLVMExtra_jll]] deps = ["Artifacts", "JLLWrappers", "LazyArtifacts", "Libdl", "Pkg", "TOML"] git-tree-sha1 = "771bfe376249626d3ca12bcd58ba243d3f961576" uuid = "dad2f222-ce93-54a1-a47d-0025e8a3acab" version = "0.0.16+0" [[deps.LazyArtifacts]] deps = ["Artifacts", "Pkg"] uuid = "4af54fe1-eca0-43a8-85a7-787d91b784e3" [[deps.LibCURL]] deps = ["LibCURL_jll", "MozillaCACerts_jll"] uuid = "b27032c2-a3e7-50c8-80cd-2d36dbcbfd21" [[deps.LibCURL_jll]] deps = ["Artifacts", "LibSSH2_jll", "Libdl", "MbedTLS_jll", "Zlib_jll", "nghttp2_jll"] uuid = "deac9b47-8bc7-5906-a0fe-35ac56dc84c0" [[deps.LibGit2]] deps = ["Base64", "NetworkOptions", "Printf", "SHA"] uuid = "76f85450-5226-5b5a-8eaa-529ad045b433" [[deps.LibSSH2_jll]] deps = ["Artifacts", "Libdl", "MbedTLS_jll"] uuid = "29816b5a-b9ab-546f-933c-edad1886dfa8" [[deps.Libdl]] uuid = "8f399da3-3557-5675-b5ff-fb832c97cbdb" [[deps.LinearAlgebra]] deps = ["Libdl", "libblastrampoline_jll"] uuid = "37e2e46d-f89d-539d-b4ee-838fcccc9c8e" [[deps.LogExpFunctions]] deps = ["ChainRulesCore", "ChangesOfVariables", "DocStringExtensions", "InverseFunctions", "IrrationalConstants", "LinearAlgebra"] git-tree-sha1 = "09e4b894ce6a976c354a69041a04748180d43637" uuid = "2ab3a3ac-af41-5b50-aa03-7779005ae688" version = "0.3.15" [[deps.Logging]] uuid = "56ddb016-857b-54e1-b83d-db4d58db5568" [[deps.Markdown]] deps = ["Base64"] uuid = "d6f4376e-aef5-505a-96c1-9c027394607a" [[deps.MbedTLS_jll]] deps = ["Artifacts", "Libdl"] uuid = "c8ffd9c3-330d-5841-b78e-0817d7145fa1" [[deps.Mmap]] uuid = "a63ad114-7e13-5084-954f-fe012c677804" [[deps.MozillaCACerts_jll]] uuid = "14a3606d-f60d-562e-9121-12d972cd8159" 
[[deps.NetworkOptions]] uuid = "ca575930-c2e3-43a9-ace4-1e988b2c1908" [[deps.OpenBLAS_jll]] deps = ["Artifacts", "CompilerSupportLibraries_jll", "Libdl"] uuid = "4536629a-c528-5b80-bd46-f80d51c5b363" [[deps.OpenLibm_jll]] deps = ["Artifacts", "Libdl"] uuid = "05823500-19ac-5b8b-9628-191a04bc5112" [[deps.OpenSpecFun_jll]] deps = ["Artifacts", "CompilerSupportLibraries_jll", "JLLWrappers", "Libdl", "Pkg"] git-tree-sha1 = "13652491f6856acfd2db29360e1bbcd4565d04f1" uuid = "efe28fd5-8261-553b-a9e1-b2916fc3738e" version = "0.5.5+0" [[deps.Pkg]] deps = ["Artifacts", "Dates", "Downloads", "LibGit2", "Libdl", "Logging", "Markdown", "Printf", "REPL", "Random", "SHA", "Serialization", "TOML", "Tar", "UUIDs", "p7zip_jll"] uuid = "44cfe95a-1eb2-52ea-b672-e2afdf69b78f" [[deps.Preferences]] deps = ["TOML"] git-tree-sha1 = "47e5f437cc0e7ef2ce8406ce1e7e24d44915f88d" uuid = "21216c6a-2e73-6563-6e65-726566657250" version = "1.3.0" [[deps.Printf]] deps = ["Unicode"] uuid = "de0858da-6303-5e67-8744-51eddeeeb8d7" [[deps.REPL]] deps = ["InteractiveUtils", "Markdown", "Sockets", "Unicode"] uuid = "3fa0cd96-eef1-5676-8a61-b3b8758bbffb" [[deps.Random]] deps = ["SHA", "Serialization"] uuid = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c" [[deps.Random123]] deps = ["Random", "RandomNumbers"] git-tree-sha1 = "afeacaecf4ed1649555a19cb2cad3c141bbc9474" uuid = "74087812-796a-5b5d-8853-05524746bad3" version = "1.5.0" [[deps.RandomNumbers]] deps = ["Random", "Requires"] git-tree-sha1 = "043da614cc7e95c703498a491e2c21f58a2b8111" uuid = "e6cf234a-135c-5ec9-84dd-332b85af5143" version = "1.5.3" [[deps.Reexport]] git-tree-sha1 = "45e428421666073eab6f2da5c9d310d99bb12f9b" uuid = "189a3867-3050-52da-a836-e630ba90ab69" version = "1.2.2" [[deps.Requires]] deps = ["UUIDs"] git-tree-sha1 = "838a3a4188e2ded87a4f9f184b4b0d78a1e91cb7" uuid = "ae029012-a4dd-5104-9daa-d747884805df" version = "1.3.0" [[deps.SHA]] uuid = "ea8e919c-243c-51af-8825-aaa63cd721ce" [[deps.Serialization]] uuid = 
"9e88b42a-f829-5b0c-bbe9-9e923198166b" [[deps.SharedArrays]] deps = ["Distributed", "Mmap", "Random", "Serialization"] uuid = "1a1011a3-84de-559e-8e89-a11a2f7dc383" [[deps.Sockets]] uuid = "6462fe0b-24de-5631-8697-dd941f90decc" [[deps.SparseArrays]] deps = ["LinearAlgebra", "Random"] uuid = "2f01184e-e22b-5df5-ae63-d93ebab69eaf" [[deps.SpecialFunctions]] deps = ["ChainRulesCore", "IrrationalConstants", "LogExpFunctions", "OpenLibm_jll", "OpenSpecFun_jll"] git-tree-sha1 = "bc40f042cfcc56230f781d92db71f0e21496dffd" uuid = "276daf66-3868-5448-9aa4-cd146d93841b" version = "2.1.5" [[deps.Statistics]] deps = ["LinearAlgebra", "SparseArrays"] uuid = "10745b16-79ce-11e8-11f9-7d13ad32a3b2" [[deps.TOML]] deps = ["Dates"] uuid = "fa267f1f-6049-4f14-aa54-33bafae1ed76" [[deps.Tar]] deps = ["ArgTools", "SHA"] uuid = "a4e569a6-e804-4fa4-b0f3-eef7a1d5b13e" [[deps.Test]] deps = ["InteractiveUtils", "Logging", "Random", "Serialization"] uuid = "8dfed614-e22c-5e08-85e1-65c5234f0b40" [[deps.TimerOutputs]] deps = ["ExprTools", "Printf"] git-tree-sha1 = "7638550aaea1c9a1e86817a231ef0faa9aca79bd" uuid = "a759f4b9-e2f1-59dc-863e-4aeb61b1ea8f" version = "0.5.19" [[deps.UUIDs]] deps = ["Random", "SHA"] uuid = "cf7118a7-6976-5b1a-9a39-7adc72f591a4" [[deps.Unicode]] uuid = "4ec0a83e-493e-50e2-b9ac-8f72acf5a8f5" [[deps.Zlib_jll]] deps = ["Libdl"] uuid = "83775a58-1f1d-513f-b197-d71354ab007a" [[deps.libblastrampoline_jll]] deps = ["Artifacts", "Libdl", "OpenBLAS_jll"] uuid = "8e850b90-86db-534c-a0d3-1478176c7d93" [[deps.nghttp2_jll]] deps = ["Artifacts", "Libdl"] uuid = "8e850ede-7688-5339-a07c-302acd2aaf8d" [[deps.p7zip_jll]] deps = ["Artifacts", "Libdl"] uuid = "3f19e933-33d8-53b3-aaab-bd5110c3b7a0" ```
Expected behavior
The "permanent" GPU memory (the layers) and the function `eval_model` use only a very small amount of the GPU memory. Even when performing the function in parallel on multiple threads, there should not be any GPU memory issue (with the size of the arrays in the above MWE). However, an OOM error is produced.
Version info
Details on Julia:
Details on CUDA: