Closed — TacHawkes closed this issue 1 year ago.
https://github.com/JuliaIO/HDF5.jl/pull/1024 might be able to help with this.

It might help to call `GC.gc()` to manually engage garbage collection. Another alternative would be to use `GC.enable(false)` or `GC.enable_finalizers(false)` before entering the tight loop.
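A minimal sketch of that workaround pattern, in plain Julia with a stand-in workload instead of real HDF5 writes (the function name `write_all` and the summation loop are illustrative, not from HDF5.jl): disable GC before the tight loop so no finalizer can run mid-call, and always re-enable it in a `finally` block.

```julia
# Sketch of the suggested workaround: suspend GC (and thus finalizers)
# around a tight loop, restoring it even if the loop throws.
function write_all(n)
    GC.enable(false)      # suspend collections for the duration of the loop
    try
        s = 0
        for i in 1:n
            s += i        # stand-in for the real work, e.g. d[i, 1] = 1
        end
        return s
    finally
        GC.enable(true)   # always re-enable GC, even on error
    end
end

write_all(1_000)
```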
What version of HDF5.jl are you using?
Also, it would be helpful if you could give a full minimal reproducible example.
This is with v0.16.13
Huh, we added locking around API calls in #1021, which should be in that version.
Do you still see the same issue if you remove the lock?
Yes, here is an MWE. Run the whole block at once for it to happen. Adding `GC.enable(false)`/`GC.enable(true)` around the loop seemingly fixes the problem...
```julia
using HDF5
fid = h5open(tempname() * ".h5", "w")
d = create_dataset(fid, "data", datatype(Int), ((1_000_000, 1), (-1, 1)), chunk=(1, 1))
t = Threads.@spawn begin
    #GC.enable(false)
    for i in 1:1_000_000
        d[i, 1] = 1
    end
    #GC.enable(true)
end
wait(t)
close(fid)
```
I am unable to replicate it on my Mac. I don't have a Windows machine to try it on, unfortunately.
Do you see it if you run it outside VSCode? (i.e. if you just run it in a script from the command line)
I'm also on a Mac at the moment. It also crashes from the command line, but with a slightly different trace:
```
error in running finalizer: ErrorException("task switch not allowed from inside gc finalizer")
ijl_error at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
ijl_switch at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
try_yieldto at ./task.jl:899
wait at ./task.jl:973
#wait#621 at ./condition.jl:130
wait at ./condition.jl:125 [inlined]
slowlock at ./lock.jl:156
lock at ./lock.jl:147 [inlined]
h5i_is_valid at /Users/user/.julia/packages/HDF5/TcavY/src/api/functions.jl:1960
isvalid at /Users/user/.julia/packages/HDF5/TcavY/src/objects.jl:2 [inlined]
close at /Users/user/.julia/packages/HDF5/TcavY/src/dataspaces.jl:25
unknown function (ip: 0x115d4005f)
ijl_apply_generic at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
run_finalizer at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
jl_gc_run_finalizers_in_list at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
run_finalizers at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
ijl_task_get_next at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
poptask at ./task.jl:963
wait at ./task.jl:972
task_done_hook at ./task.jl:672
jfptr_task_done_hook_27918 at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/sys.dylib (unknown line)
ijl_apply_generic at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
jl_finish_task at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
start_task at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
```
```
ERROR: LoadError: TaskFailedException
Stacktrace:
 [1] wait(t::Task)
   @ Base ./task.jl:349
 [2] top-level scope
   @ ~/.julia/dev/HDF5Test/test/rectest.jl:41

    nested task error: schedule: Task not runnable
    Stacktrace:
      [1] error(s::String)
        @ Base ./error.jl:35
      [2] schedule(t::Task, arg::Any; error::Bool)
        @ Base ./task.jl:829
      [3] notify(c::Base.GenericCondition{Base.Threads.SpinLock}, arg::Any, all::Bool, error::Bool)
        @ Base ./condition.jl:154
      [4] #notify#622
        @ ./condition.jl:148 [inlined]
      [5] notify (repeats 2 times)
        @ ./condition.jl:148 [inlined]
      [6] (::Base.var"#notifywaiters#625")(rl::ReentrantLock)
        @ Base ./lock.jl:187
      [7] (::Base.var"#_unlock#624")(rl::ReentrantLock)
        @ Base ./lock.jl:183
      [8] unlock
        @ ./lock.jl:177 [inlined]
      [9] h5t_get_native_type(dtype_id::HDF5.Datatype, direction::Int64)
        @ HDF5.API ~/.julia/packages/HDF5/TcavY/src/api/functions.jl:7026
     [10] h5t_get_native_type
        @ ~/.julia/packages/HDF5/TcavY/src/api/helpers.jl:891 [inlined]
     [11] _memtype(filetype::HDF5.Datatype, #unused#::Type{Int64})
        @ HDF5 ~/.julia/packages/HDF5/TcavY/src/readwrite.jl:290
     [12] setindex!(::HDF5.Dataset, ::Matrix{Int64}, ::Int64, ::Int64)
        @ HDF5 ~/.julia/packages/HDF5/TcavY/src/datasets.jl:327
     [13] setindex!(::HDF5.Dataset, ::Int64, ::Int64, ::Int64)
        @ HDF5 ~/.julia/packages/HDF5/TcavY/src/datasets.jl:357
     [14] macro expansion
        @ ~/.julia/dev/HDF5Test/test/rectest.jl:36 [inlined]
     [15] (::var"#3#4")()
        @ Main ./threadingconstructs.jl:340
```
Versioninfo for this platform:
```
Julia Version 1.9.0-beta3
Commit 24204a73447 (2023-01-18 07:20 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.4.0)
  CPU: 10 × Apple M1 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
Threads: 8 on 8 virtual cores
```
I still can't replicate it. What is your full `status -m`?

Can you also try running it with `--startup-file=no` to make sure you're not picking up anything in your startup file?
My guess is that we also need to be under low memory conditions for the GC to be attempting to finalize things in the middle of this loop.
Nonetheless, this points to a fundamental issue. We should not be locking and unlocking during finalization from a different task.
I guess I still don't understand how this can happen after #1021.
> I guess I still don't understand how this can happen after #1021.

The problem was caused by #1021: the finalizer tries to call `close`, and as of #1021 `close` tries to `lock` and `unlock`, which can cause a task switch inside the finalizer, and that is not allowed.
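To illustrate the hazard in plain Julia (this is a hypothetical sketch, not HDF5 code — `LIBLOCK`, `Handle`, and `close_handle` are made-up stand-ins for `HDF5.API.liblock` and the real `close` path): a finalizer that acquires a `ReentrantLock` must wait if the lock is contended when GC runs it, and waiting requires a task switch, which is exactly what the error message forbids.

```julia
# Hypothetical sketch of the unsafe pattern: close-on-finalize that must
# acquire a lock. If LIBLOCK is held by another task at the moment GC runs
# the finalizer, `lock` blocks, forcing the forbidden task switch.
const LIBLOCK = ReentrantLock()    # stand-in for HDF5.API.liblock

function close_handle(h)
    lock(LIBLOCK) do               # may block -> task switch if contended
        h.id = -1                  # stand-in for the C-library close call
    end
end

mutable struct Handle
    id::Int
    function Handle(id)
        h = new(id)
        finalizer(close_handle, h) # unsafe: close_handle takes a lock
        h
    end
end

h = Handle(1)
close_handle(h)                    # fine when called explicitly on a task
```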
I raised this and you responded in https://github.com/JuliaIO/HDF5.jl/pull/1021#issuecomment-1306659168. So either https://github.com/JuliaLang/julia/pull/38487 does not actually address this or there has been a regression.
> My guess is that we also need to be under low memory conditions for the GC to be attempting to finalize things in the middle of this loop.

You're right: I was able to trigger it with `--heap-size-hint=50M`.
> Yes, here is a MWE. Run the whole block at once for it to happen. Adding GC.disable/enable fixes the problem seemingly...

Does `GC.enable_finalizers(false)` / `GC.enable_finalizers(true)` also fix the issue?
> Does `GC.enable_finalizers(false)` / `GC.enable_finalizers(true)` also fix the issue?

Yes.
So far I am unable to replicate it on Windows.
I can replicate on macOS but not consistently. It seemed to occur more frequently when I first logged in.
Possibly related: https://github.com/JuliaLang/julia/issues/47612
I think we need to use `trylock` in the finalizers.
```julia
function try_close_finalizer(x)
    # Test-and-test lock; see the trylock docstring.
    # Only proceed if the finalizer can acquire the lock immediately; do not wait.
    if !islocked(API.liblock) && trylock(API.liblock)
        try
            close(x)
        finally
            unlock(API.liblock)
        end
    else
        # Try again later.
        # TODO: consider exponential backoff
        finalizer(try_close_finalizer, x)
    end
end
```
User code may want to try to acquire the API lock. Shouldn't this disable finalizers due to https://github.com/JuliaLang/julia/blob/72bebdf2715f94bad52c701cfb867744a5b070d6/base/lock.jl#L96 ?
edit: This makes the problem worse.
```julia
t = Threads.@spawn begin
    lock(HDF5.API.liblock)
    try
        for i in 1:1_000_000
            d[i, 1] = 1
        end
    finally
        unlock(HDF5.API.liblock)
    end
end
```
or maybe
```julia
t = Threads.@spawn begin
    @lock HDF5.API.liblock begin
        for i in 1:1_000_000
            d[i, 1] = 1
        end
    end
end
```
On macOS, once I find a terminal process that triggers the error, the following reliably reproduces the problem:
```
$ julia +1.9 -t auto --heap-size-hint=50M --banner=no
julia> begin
           using HDF5
           fid = h5open(tempname() * ".h5", "w")
           d = create_dataset(fid, "data", datatype(Int), ((1_000_000,1), (-1,1)), chunk=(1,1))
           t = Threads.@spawn begin
               #GC.enable(false)
               for i=1:1_000_000
                   d[i,1] = 1
               end
               #GC.enable(true)
           end
           wait(t)
           close(fid)
       end
error in running finalizer: ErrorException("val already in a list")
error at ./error.jl:35
push! at ./linked_list.jl:53 [inlined]
_wait2 at ./condition.jl:87
#wait#621 at ./condition.jl:127
wait at ./condition.jl:125 [inlined]
slowlock at ./lock.jl:156
lock at ./lock.jl:147 [inlined]
h5i_is_valid at /Users/kittisopikulm/.julia/packages/HDF5/TcavY/src/api/functions.jl:1960
isvalid at /Users/kittisopikulm/.julia/packages/HDF5/TcavY/src/properties.jl:19 [inlined]
close at /Users/kittisopikulm/.julia/packages/HDF5/TcavY/src/properties.jl:11
unknown function (ip: 0x1170d405f)
ijl_apply_generic at /Users/kittisopikulm/.julia/juliaup/julia-1.9.0-beta3+0.aarch64.apple.darwin14/lib/julia/libjulia-internal.1.9.dylib (unknown line)
run_finalizer at /Users/kittisopikulm/.julia/juliaup/julia-1.9.0-beta3+0.aarch64.apple.darwin14/lib/julia/libjulia-internal.1.9.dylib (unknown line)
jl_gc_run_finalizers_in_list at /Users/kittisopikulm/.julia/juliaup/julia-1.9.0-beta3+0.aarch64.apple.darwin14/lib/julia/libjulia-internal.1.9.dylib (unknown line)
...
```
Meanwhile, using the `do` block form of `h5open` does not create a problem:
```
$ julia +1.9 -t auto --heap-size-hint=50M --banner=no
julia> begin
           using HDF5
           h5open(tempname() * ".h5", "w") do fid
               d = create_dataset(fid, "data", datatype(Int), ((1_000_000,1), (-1,1)), chunk=(1,1))
               t = Threads.@spawn begin
                   for i=1:1_000_000
                       d[i,1] = 1
                   end
               end
               wait(t)
           end
       end
```
Hmm... acquiring the lock causes a lot of error message scrolling.
```
julia> begin
           using HDF5
           fid = h5open(tempname() * ".h5", "w")
           d = create_dataset(fid, "data", datatype(Int), ((1_000_000,1), (-1,1)), chunk=(1,1))
           t = Threads.@spawn @lock HDF5.API.liblock begin
               #GC.enable(false)
               for i=1:1_000_000
                   d[i,1] = 1
               end
               #GC.enable(true)
           end
           wait(t)
           close(fid)
       end
```
Using the `do` block does not seem to cause the problem:
```
julia> begin
           using HDF5
           h5open(tempname() * ".h5", "w") do fid
               d = create_dataset(fid, "data", datatype(Int), ((1_000_000,1), (-1,1)), chunk=(1,1))
               t = Threads.@spawn @lock HDF5.API.liblock begin
                   #GC.enable(false)
                   for i=1:1_000_000
                       d[i,1] = 1
                   end
                   #GC.enable(true)
               end
               wait(t)
               close(fid)
           end
       end
```
> Does `GC.enable_finalizers(false)` / `GC.enable_finalizers(true)` also fix the issue?
>
> Yes.

This only avoids the error if I use `GC.enable_finalizers` outside of the `Threads.@spawn`. Using it inside causes much error scrolling.
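A plain-Julia sketch of the placement that reportedly avoided the error (the summation is a stand-in for the HDF5 write loop, not HDF5 code): call `GC.enable_finalizers` on the parent task, around the spawn and wait, rather than inside the spawned block. `GC.enable_finalizers` is documented as acting on the current task, which may be relevant to why the placement matters; treat this as a sketch of what the thread reports, not a guaranteed fix.

```julia
# Sketch: finalizers disabled on the parent task before spawning, and
# restored only after the worker task has finished.
GC.enable_finalizers(false)
result = try
    t = Threads.@spawn sum(1:1_000_000)   # stand-in for the write loop
    fetch(t)
finally
    GC.enable_finalizers(true)            # always restore finalizers
end
```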
Implementing `try_close_finalizer` above appears to resolve the issue.
```diff
diff --git a/src/HDF5.jl b/src/HDF5.jl
index fb95abf..9e29340 100644
--- a/src/HDF5.jl
+++ b/src/HDF5.jl
@@ -62,8 +62,24 @@ export @read,
 # H5DataStore, Attribute, File, Group, Dataset, Datatype, Opaque,
 # Dataspace, Object, Properties, VLen, ChunkStorage, Reference
+
+
 h5doc(name) = "[`$name`](https://portal.hdfgroup.org/display/HDF5/$(name))"
+function try_close_finalizer(x)
+    if !islocked(API.liblock) && trylock(API.liblock)
+        try
+            close(x)
+        finally
+            unlock(API.liblock)
+        end
+    else
+        finalizer(try_close_finalizer, x)
+    end
+end
+#const try_close_finalizer = Base.close
+
+
 include("api/api.jl")
 include("properties.jl")
 include("context.jl")
diff --git a/src/properties.jl b/src/properties.jl
index 1ca0033..88f16f9 100644
--- a/src/properties.jl
+++ b/src/properties.jl
@@ -104,7 +104,7 @@ macro propertyclass(name, classid)
             id::API.hid_t
             function $name(id::API.hid_t)
                 obj = new(id)
-                finalizer(close, obj)
+                finalizer(try_close_finalizer, obj)
                 obj
             end
         end
diff --git a/src/types.jl b/src/types.jl
index eea17a9..980d240 100644
--- a/src/types.jl
+++ b/src/types.jl
@@ -5,6 +5,7 @@
 # Supertype of HDF5.File, HDF5.Group, JldFile, JldGroup, Matlabv5File, and MatlabHDF5File.
 abstract type H5DataStore end
+
 # Read a list of variables, read(parent, "A", "B", "x", ...)
 function Base.read(parent::H5DataStore, name::AbstractString...)
     tuple((read(parent, x) for x in name)...)
@@ -41,7 +42,7 @@ mutable struct File <: H5DataStore
     function File(id, filename, toclose::Bool=true)
         f = new(id, filename)
         if toclose
-            finalizer(close, f)
+            finalizer(try_close_finalizer, f)
         end
         f
     end
@@ -55,7 +56,7 @@ mutable struct Group <: H5DataStore
     function Group(id, file)
         g = new(id, file)
-        finalizer(close, g)
+        finalizer(try_close_finalizer, g)
         g
     end
 end
@@ -69,7 +70,7 @@ mutable struct Dataset
     function Dataset(id, file, xfer=DatasetTransferProperties())
         dset = new(id, file, xfer)
-        finalizer(close, dset)
+        finalizer(try_close_finalizer, dset)
         dset
     end
 end
@@ -84,14 +85,14 @@ mutable struct Datatype
     function Datatype(id, toclose::Bool=true)
         nt = new(id, toclose)
         if toclose
-            finalizer(close, nt)
+            finalizer(try_close_finalizer, nt)
         end
         nt
     end
     function Datatype(id, file::File, toclose::Bool=true)
         nt = new(id, toclose, file)
         if toclose
-            finalizer(close, nt)
+            finalizer(try_close_finalizer, nt)
         end
         nt
     end
@@ -106,7 +107,7 @@ mutable struct Dataspace
     function Dataspace(id)
         dspace = new(id)
-        finalizer(close, dspace)
+        finalizer(try_close_finalizer, dspace)
         dspace
     end
 end
@@ -119,7 +120,7 @@ mutable struct Attribute
     function Attribute(id, file)
         dset = new(id, file)
-        finalizer(close, dset)
+        finalizer(try_close_finalizer, dset)
         dset
     end
 end
```
https://github.com/JuliaIO/HDF5.jl/pull/1049 fixes this. Big thanks to @mkitti for the fast fix!
I've stumbled over this bug too. Any chance to land the fix and cut a quick bugfix release?
Could you confirm that #1049 fixes the issue for you? If so, I will accelerate the merge.
cc: @musm
Yes, the branch worked.
Hi,
I have the following setup:
If the writes are too fast, the task/thread crashes with the following trace:
Versioninfo: