JuliaIO / HDF5.jl

Save and load data in the HDF5 file format from Julia
https://juliaio.github.io/HDF5.jl
MIT License
386 stars 140 forks source link

Remove use of `tempname()` from `setindex!` of an `AttributeDict` #1026

Closed nhz2 closed 1 year ago

nhz2 commented 1 year ago

Currently setindex! uses tempname() to get a random unique name for a temporary attribute. However, tempname() is quite slow and requires reading the file system to check if the names it generates are unique. In this PR I am replacing tempname() with "-><7;*<_NY" * bytes2hex(rand(Random.RandomDevice(), UInt8, 16))

julia> randname1() = "-><7;*<_NY" * bytes2hex(rand(Random.RandomDevice(), UInt8, 16))
randname1 (generic function with 1 method)

julia> randname2() = tempname()
randname2 (generic function with 1 method)

julia> @btime randname1()
  256.557 ns (4 allocations: 264 bytes)
"-><7;*<_NY2390f39a585aec3b0caac0f127a0a83a"

julia> @btime randname2()
  2.085 μs (9 allocations: 1.22 KiB)
"/tmp/jl_yjyjMTO1uh"
mkitti commented 1 year ago

Do we actually need a random name?

mkitti commented 1 year ago

Could we use some predictable sequence of names or perhaps some variant of the old the name?

nhz2 commented 1 year ago

In terms of needing the temperary name, from what I can tell the HDF5 C library doesn't have a way to change the datatype, dataspace, and value of an attribute at the same time. h5py is doing something similar https://github.com/h5py/h5py/blob/fcaca1d1b81d25c0d83b11d5bdf497469b5980e9/h5py/_hl/attrs.py#L191-L219 using a uuid4. I think the API.h5a_write can change the value of an attribute but cannot change the type and space, so maybe there should be a check to see if the type and space need to be changed and if not just update the value?

mkitti commented 1 year ago

uuid4 or other uuids seem like a good idea. Why don't we do that? Is that slow?

Also, why not make the name related to the old name in case there is an error and we never complete the operation?

julia> using UUIDs

julia> name = "helloworld"
"helloworld"

julia> _name = name * "_hdf5jl_" * string(uuid4())
"helloworld_hdf5jl_452f6a32-c1db-4756-bbfa-42b2fd7d0a29"
mkitti commented 1 year ago

On my rather slow computer:

julia> randname1() = "-><7;*<_NY" * bytes2hex(rand(Random.RandomDevice(), UInt8, 16))
randname1 (generic function with 1 method)

julia> randname2() = tempname()
randname2 (generic function with 1 method)

julia> randname3(name) = name * "_hdf5jl_" * string(uuid4())
randname3 (generic function with 1 method)

julia> @btime randname1()
  631.651 ns (4 allocations: 264 bytes)
"-><7;*<_NY3ac27142c20bbb1c3129af9503747718"

julia> @btime randname2()
  6.007 μs (9 allocations: 1.22 KiB)
"/tmp/jl_jEhEJkVQre"

julia> @btime randname3($name)
  608.423 ns (3 allocations: 192 bytes)
"helloworld_hdf5jl_7ca25e59-4c2e-406e-90e0-80332425c05e"
mkitti commented 1 year ago

@simonbyrne want to take a quick look?