JuliaIO / HDF5.jl

Save and load data in the HDF5 file format from Julia
https://juliaio.github.io/HDF5.jl
MIT License

Finalizer error when rapidly writing to HDF5 #1048

Closed · TacHawkes closed this issue 1 year ago

TacHawkes commented 1 year ago

Hi,

I have the following setup:

  1. A main thread which is idling
  2. A thread which writes to an HDF5 dataset with less than 1 ms between successive writes, i.e. (just a pseudo-MWE):
    Threads.@spawn begin
        for i=1:1_000_000
            lock(some_filelock) do
                hdf5_dataset[startidx:stopidx, 1] = somedata
            end
        end
    end

If the writes are too fast, the task/thread crashes with the following trace:

error in running finalizer: ErrorException("val already in a list")
error at .\error.jl:35
push! at .\linked_list.jl:53 [inlined]
_wait2 at .\condition.jl:87
#wait#621 at .\condition.jl:127
wait at .\condition.jl:125 [inlined]
slowlock at .\lock.jl:156
lock at .\lock.jl:147 [inlined]
h5i_is_valid at C:\Users\user\.julia\packages\HDF5\TcavY\src\api\functions.jl:1960
isvalid at C:\Users\user\.julia\packages\HDF5\TcavY\src\objects.jl:2 [inlined]
close at C:\Users\user\.julia\packages\HDF5\TcavY\src\dataspaces.jl:25
unknown function (ip: 000002027f7c9b96)
run_finalizer at C:/workdir/src\gc.c:298
jl_gc_run_finalizers_in_list at C:/workdir/src\gc.c:388
run_finalizers at C:/workdir/src\gc.c:434
jl_mutex_unlock at C:/workdir/src\julia_locks.h:81 [inlined]
ijl_task_get_next at C:/workdir/src\partr.c:385
poptask at .\task.jl:963
wait at .\task.jl:972
#wait#621 at .\condition.jl:130
wait at .\condition.jl:125 [inlined]
take_buffered at .\channels.jl:456
take! at .\channels.jl:450 [inlined]
iterate at .\channels.jl:592
macro expansion at c:\Users\user\.vscode\extensions\julialang.language-julia-1.41.1\scripts\packages\JSONRPC\src\core.jl:103 [inlined]
#1 at .\task.jl:514
unknown function (ip: 000002027f6ffb33)
jl_apply at C:/workdir/src\julia.h:1874 [inlined]
start_task at C:/workdir/src\task.c:1092

Versioninfo:

Julia Version 1.9.0-beta3
Commit 24204a7344 (2023-01-18 07:20 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 8 × Intel(R) Core(TM) i7-10610U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 8 on 8 virtual cores
Environment:
  JULIA_EDITOR = code
  JULIA_NUM_THREADS = 8
mkitti commented 1 year ago

https://github.com/JuliaIO/HDF5.jl/pull/1024 might be able to help with this.

It might help to call GC.gc() to manually trigger garbage collection.

Another alternative would be to use GC.enable(false) or GC.enable_finalizers(false) before entering the tight loop.
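
Concretely, the workaround could look something like this (untested sketch, reusing the placeholder names from the pseudo-MWE above):

GC.enable(false)      # or GC.enable_finalizers(false)
try
    for i=1:1_000_000
        # rapid writes; no finalizers can fire in between
        hdf5_dataset[startidx:stopidx, 1] = somedata
    end
finally
    GC.enable(true)   # or GC.enable_finalizers(true)
    GC.gc()           # optionally run a collection once the loop is done
end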

simonbyrne commented 1 year ago

What version of HDF5.jl are you using?

simonbyrne commented 1 year ago

Also, it would be helpful if you can give a full minimal reproducible example.

TacHawkes commented 1 year ago

This is with v0.16.13

simonbyrne commented 1 year ago

Huh, we added locking around API calls in #1021, which should be in that version.

Do you still see the same issue if you remove the lock?

TacHawkes commented 1 year ago

Yes, here is an MWE. Run the whole block at once for it to happen. Adding the commented-out GC.enable(false)/GC.enable(true) calls seemingly fixes the problem...

using HDF5

fid = h5open(tempname() * ".h5", "w")

d = create_dataset(fid, "data", datatype(Int), ((1_000_000,1), (-1,1)), chunk=(1,1))

t = Threads.@spawn begin
    #GC.enable(false)
    for i=1:1_000_000
        d[i,1] = 1
    end
    #GC.enable(true)
end

wait(t)

close(fid)
simonbyrne commented 1 year ago

I am unable to replicate it on my Mac. I don't have a Windows machine to try it on, unfortunately.

Do you see it if you run it outside VSCode? (i.e. if you just run it in a script from the command line)

TacHawkes commented 1 year ago

I'm also on a Mac at the moment. It also crashes from the command line, but with a slightly different trace:

error in running finalizer: ErrorException("task switch not allowed from inside gc finalizer")
ijl_error at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
ijl_switch at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
try_yieldto at ./task.jl:899
wait at ./task.jl:973
#wait#621 at ./condition.jl:130
wait at ./condition.jl:125 [inlined]
slowlock at ./lock.jl:156
lock at ./lock.jl:147 [inlined]
h5i_is_valid at /Users/user/.julia/packages/HDF5/TcavY/src/api/functions.jl:1960
isvalid at /Users/user/.julia/packages/HDF5/TcavY/src/objects.jl:2 [inlined]
close at /Users/user/.julia/packages/HDF5/TcavY/src/dataspaces.jl:25
unknown function (ip: 0x115d4005f)
ijl_apply_generic at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
run_finalizer at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
jl_gc_run_finalizers_in_list at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
run_finalizers at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
ijl_task_get_next at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
poptask at ./task.jl:963
wait at ./task.jl:972
task_done_hook at ./task.jl:672
jfptr_task_done_hook_27918 at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/sys.dylib (unknown line)
ijl_apply_generic at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
jl_finish_task at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
start_task at /Applications/Julia-1.9.app/Contents/Resources/julia/lib/julia/libjulia-internal.1.9.dylib (unknown line)
ERROR: LoadError: TaskFailedException
Stacktrace:
 [1] wait(t::Task)
   @ Base ./task.jl:349
 [2] top-level scope
   @ ~/.julia/dev/HDF5Test/test/rectest.jl:41

    nested task error: schedule: Task not runnable
    Stacktrace:
      [1] error(s::String)
        @ Base ./error.jl:35
      [2] schedule(t::Task, arg::Any; error::Bool)
        @ Base ./task.jl:829
      [3] notify(c::Base.GenericCondition{Base.Threads.SpinLock}, arg::Any, all::Bool, error::Bool)
        @ Base ./condition.jl:154
      [4] #notify#622
        @ ./condition.jl:148 [inlined]
      [5] notify (repeats 2 times)
        @ ./condition.jl:148 [inlined]
      [6] (::Base.var"#notifywaiters#625")(rl::ReentrantLock)
        @ Base ./lock.jl:187
      [7] (::Base.var"#_unlock#624")(rl::ReentrantLock)
        @ Base ./lock.jl:183
      [8] unlock
        @ ./lock.jl:177 [inlined]
      [9] h5t_get_native_type(dtype_id::HDF5.Datatype, direction::Int64)
        @ HDF5.API ~/.julia/packages/HDF5/TcavY/src/api/functions.jl:7026
     [10] h5t_get_native_type
        @ ~/.julia/packages/HDF5/TcavY/src/api/helpers.jl:891 [inlined]
     [11] _memtype(filetype::HDF5.Datatype, #unused#::Type{Int64})
        @ HDF5 ~/.julia/packages/HDF5/TcavY/src/readwrite.jl:290
     [12] setindex!(::HDF5.Dataset, ::Matrix{Int64}, ::Int64, ::Int64)
        @ HDF5 ~/.julia/packages/HDF5/TcavY/src/datasets.jl:327
     [13] setindex!(::HDF5.Dataset, ::Int64, ::Int64, ::Int64)
        @ HDF5 ~/.julia/packages/HDF5/TcavY/src/datasets.jl:357
     [14] macro expansion
        @ ~/.julia/dev/HDF5Test/test/rectest.jl:36 [inlined]
     [15] (::var"#3#4")()
        @ Main ./threadingconstructs.jl:340

Versioninfo for this platform:

Julia Version 1.9.0-beta3
Commit 24204a73447 (2023-01-18 07:20 UTC)
Platform Info:
  OS: macOS (arm64-apple-darwin21.4.0)
  CPU: 10 × Apple M1 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 8 on 8 virtual cores
simonbyrne commented 1 year ago

I still can't replicate it. What is your full status -m?

Can you also try running it with --startup-file=no to make sure you're not picking up anything in your startup file?

mkitti commented 1 year ago

My guess is that we also need to be under low memory conditions for the GC to be attempting to finalize things in the middle of this loop.

Nonetheless, this points to a fundamental issue. We should not be locking and unlocking during finalization from a different task.
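
For reference, the problematic path boils down to something like this (illustrative sketch, not the actual HDF5.jl source; do_close is a hypothetical stand-in for the real libhdf5 close call):

function finalize_close(obj)
    lock(API.liblock)   # a contended lock blocks in wait(), i.e. a task switch,
    try                 # which is exactly what is forbidden inside a finalizer
        do_close(obj)   # hypothetical stand-in for the libhdf5 close call
    finally
        unlock(API.liblock)
    end
end
finalizer(finalize_close, obj)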

simonbyrne commented 1 year ago

I guess I still don't understand how this can happen after #1021.

mkitti commented 1 year ago

I guess I still don't understand how this can happen after #1021.

The problem was caused by #1021: the finalizer tries to call close, and as of #1021 close locks and unlocks, which can cause a task switch inside the finalizer, and that is not allowed.

I raised this and you responded in https://github.com/JuliaIO/HDF5.jl/pull/1021#issuecomment-1306659168 . So either https://github.com/JuliaLang/julia/pull/38487 does not actually address this or there has been a regression.

simonbyrne commented 1 year ago

My guess is that we also need to be under low memory conditions for the GC to be attempting to finalize things in the middle of this loop.

You're right: I was able to trigger it with --heap-size-hint=50M.
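
For anyone reproducing, that is an invocation along these lines (with mwe.jl standing in for the script above; the flag needs Julia 1.9):

$ julia -t auto --heap-size-hint=50M mwe.jl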

mkitti commented 1 year ago

Yes, here is a MWE. Run the whole block at once for it to happen. Adding GC.disable/enable fixes the problem seemingly...

Does GC.enable_finalizers(false) / GC.enable_finalizers(true) also fix the issue?

simonbyrne commented 1 year ago

Does GC.enable_finalizers(false) / GC.enable_finalizers(true) also fix the issue?

Yes.

mkitti commented 1 year ago

So far I am unable to replicate on Windows.

mkitti commented 1 year ago

I can replicate on macOS but not consistently. It seemed to occur more frequently when I first logged in.

simonbyrne commented 1 year ago

Possibly related: https://github.com/JuliaLang/julia/issues/47612

mkitti commented 1 year ago

I think we need to use trylock in the finalizers.

function try_close_finalizer(x)
    # Test, then trylock (test-and-test-and-set); see the trylock docstring.
    # Only proceed if the finalizer can acquire the lock without waiting.
    if !islocked(API.liblock) && trylock(API.liblock)
        try
            close(x)
        finally
            unlock(API.liblock)
        end
    else
        # Try again at a later GC pass.
        # TODO: consider exponential backoff.
        finalizer(try_close_finalizer, x)
    end
end
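
Because trylock never blocks, the finalizer can no longer reach the forbidden task-switch path: if the lock is contested, the object simply re-registers its finalizer and closing is retried on a later GC pass.
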
mkitti commented 1 year ago

User code may want to acquire the API lock itself. Shouldn't this disable finalizers, due to https://github.com/JuliaLang/julia/blob/72bebdf2715f94bad52c701cfb867744a5b070d6/base/lock.jl#L96? Edit: this makes the problem worse.

t = Threads.@spawn begin
    lock(HDF5.API.liblock)
    try
        for i=1:1_000_000
            d[i,1] = 1
        end
    finally
        unlock(HDF5.API.liblock)
    end
end

or maybe

t = Threads.@spawn begin
    @lock HDF5.API.liblock begin
        for i=1:1_000_000
            d[i,1] = 1
        end
    end
end
mkitti commented 1 year ago

On macOS, once I find a terminal session that triggers the error, this reliably reproduces the problem:

$ julia +1.9 -t auto --heap-size-hint=50M --banner=no
julia> begin
           using HDF5

           fid = h5open(tempname() * ".h5", "w")

           d = create_dataset(fid, "data", datatype(Int), ((1_000_000,1), (-1,1)), chunk=(1,1))

           t = Threads.@spawn begin
               #GC.enable(false)
               for i=1:1_000_000
                   d[i,1] = 1
               end
               #GC.enable(true)
           end

           wait(t)

           close(fid)
       end
error in running finalizer: ErrorException("val already in a list")
error at ./error.jl:35
push! at ./linked_list.jl:53 [inlined]
_wait2 at ./condition.jl:87
#wait#621 at ./condition.jl:127
wait at ./condition.jl:125 [inlined]
slowlock at ./lock.jl:156
lock at ./lock.jl:147 [inlined]
h5i_is_valid at /Users/kittisopikulm/.julia/packages/HDF5/TcavY/src/api/functions.jl:1960
isvalid at /Users/kittisopikulm/.julia/packages/HDF5/TcavY/src/properties.jl:19 [inlined]
close at /Users/kittisopikulm/.julia/packages/HDF5/TcavY/src/properties.jl:11
unknown function (ip: 0x1170d405f)
ijl_apply_generic at /Users/kittisopikulm/.julia/juliaup/julia-1.9.0-beta3+0.aarch64.apple.darwin14/lib/julia/libjulia-internal.1.9.dylib (unknown line)
run_finalizer at /Users/kittisopikulm/.julia/juliaup/julia-1.9.0-beta3+0.aarch64.apple.darwin14/lib/julia/libjulia-internal.1.9.dylib (unknown line)
jl_gc_run_finalizers_in_list at /Users/kittisopikulm/.julia/juliaup/julia-1.9.0-beta3+0.aarch64.apple.darwin14/lib/julia/libjulia-internal.1.9.dylib (unknown line)
...

Meanwhile, using the do-block form of h5open does not create a problem:

$ julia +1.9 -t auto --heap-size-hint=50M --banner=no
julia> begin
           using HDF5

           h5open(tempname() * ".h5", "w") do fid
               d = create_dataset(fid, "data", datatype(Int), ((1_000_000,1), (-1,1)), chunk=(1,1))

               t = Threads.@spawn begin
                   for i=1:1_000_000
                       d[i,1] = 1
                   end
               end
               wait(t)
           end
       end
mkitti commented 1 year ago

Hmm... acquiring the lock causes a lot of error message scrolling.

julia> begin
           using HDF5

           fid = h5open(tempname() * ".h5", "w")

           d = create_dataset(fid, "data", datatype(Int), ((1_000_000,1), (-1,1)), chunk=(1,1))

           t = Threads.@spawn @lock HDF5.API.liblock begin
               #GC.enable(false)
               for i=1:1_000_000
                   d[i,1] = 1
               end
               #GC.enable(true)
           end

           wait(t)

           close(fid)
       end

Using the do block does not seem to cause the problem:

julia> begin
           using HDF5

           h5open(tempname() * ".h5", "w") do fid
               d = create_dataset(fid, "data", datatype(Int), ((1_000_000,1), (-1,1)), chunk=(1,1))

               t = Threads.@spawn @lock HDF5.API.liblock begin
                   #GC.enable(false)
                   for i=1:1_000_000
                       d[i,1] = 1
                   end
                   #GC.enable(true)
               end

               wait(t)

               close(fid)
           end
       end
mkitti commented 1 year ago

Does GC.enable_finalizers(false) / GC.enable_finalizers(true) also fix the issue?

Yes.

This only avoids the error if I use GC.enable_finalizers outside of the Threads.@spawn. Using the calls inside causes a lot of error scrolling.
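
In other words, the placement that was reported to avoid the error looks like this (sketch):

GC.enable_finalizers(false)   # outside the spawned task
t = Threads.@spawn begin
    for i=1:1_000_000
        d[i,1] = 1
    end
end
wait(t)
GC.enable_finalizers(true)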

mkitti commented 1 year ago

Implementing try_close_finalizer above appears to resolve the issue.

diff --git a/src/HDF5.jl b/src/HDF5.jl
index fb95abf..9e29340 100644
--- a/src/HDF5.jl
+++ b/src/HDF5.jl
@@ -62,8 +62,24 @@ export @read,
 # H5DataStore, Attribute, File, Group, Dataset, Datatype, Opaque,
 # Dataspace, Object, Properties, VLen, ChunkStorage, Reference

+
+
 h5doc(name) = "[`$name`](https://portal.hdfgroup.org/display/HDF5/$(name))"

+function try_close_finalizer(x)
+    if !islocked(API.liblock) && trylock(API.liblock)
+        try
+            close(x)
+        finally
+            unlock(API.liblock)
+        end
+    else
+        finalizer(try_close_finalizer, x)
+    end
+end
+#const try_close_finalizer = Base.close
+
+
 include("api/api.jl")
 include("properties.jl")
 include("context.jl")
diff --git a/src/properties.jl b/src/properties.jl
index 1ca0033..88f16f9 100644
--- a/src/properties.jl
+++ b/src/properties.jl
@@ -104,7 +104,7 @@ macro propertyclass(name, classid)
             id::API.hid_t
             function $name(id::API.hid_t)
                 obj = new(id)
-                finalizer(close, obj)
+                finalizer(try_close_finalizer, obj)
                 obj
             end
         end
diff --git a/src/types.jl b/src/types.jl
index eea17a9..980d240 100644
--- a/src/types.jl
+++ b/src/types.jl
@@ -5,6 +5,7 @@
 # Supertype of HDF5.File, HDF5.Group, JldFile, JldGroup, Matlabv5File, and MatlabHDF5File.
 abstract type H5DataStore end

+
 # Read a list of variables, read(parent, "A", "B", "x", ...)
 function Base.read(parent::H5DataStore, name::AbstractString...)
     tuple((read(parent, x) for x in name)...)
@@ -41,7 +42,7 @@ mutable struct File <: H5DataStore
     function File(id, filename, toclose::Bool=true)
         f = new(id, filename)
         if toclose
-            finalizer(close, f)
+            finalizer(try_close_finalizer, f)
         end
         f
     end
@@ -55,7 +56,7 @@ mutable struct Group <: H5DataStore

     function Group(id, file)
         g = new(id, file)
-        finalizer(close, g)
+        finalizer(try_close_finalizer, g)
         g
     end
 end
@@ -69,7 +70,7 @@ mutable struct Dataset

     function Dataset(id, file, xfer=DatasetTransferProperties())
         dset = new(id, file, xfer)
-        finalizer(close, dset)
+        finalizer(try_close_finalizer, dset)
         dset
     end
 end
@@ -84,14 +85,14 @@ mutable struct Datatype
     function Datatype(id, toclose::Bool=true)
         nt = new(id, toclose)
         if toclose
-            finalizer(close, nt)
+            finalizer(try_close_finalizer, nt)
         end
         nt
     end
     function Datatype(id, file::File, toclose::Bool=true)
         nt = new(id, toclose, file)
         if toclose
-            finalizer(close, nt)
+            finalizer(try_close_finalizer, nt)
         end
         nt
     end
@@ -106,7 +107,7 @@ mutable struct Dataspace

     function Dataspace(id)
         dspace = new(id)
-        finalizer(close, dspace)
+        finalizer(try_close_finalizer, dspace)
         dspace
     end
 end
@@ -119,7 +120,7 @@ mutable struct Attribute

     function Attribute(id, file)
         dset = new(id, file)
-        finalizer(close, dset)
+        finalizer(try_close_finalizer, dset)
         dset
     end
 end
TacHawkes commented 1 year ago

https://github.com/JuliaIO/HDF5.jl/pull/1049 fixes this. Big thanks to @mkitti for the fast fix!

laborg commented 1 year ago

I've stumbled over this bug too. Any chance to land the fix and cut a quick bugfix release?

mkitti commented 1 year ago

Could you confirm that #1049 fixes the issue for you? If so, I will accelerate the merge.

cc: @musm

laborg commented 1 year ago

Yes, the branch worked.