JuliaIO / HDF5.jl

Save and load data in the HDF5 file format from Julia
https://juliaio.github.io/HDF5.jl
MIT License

Functional API with Strings / variable-size strings #1067

Closed mkitti closed 1 year ago

mkitti commented 1 year ago

This is a response to https://discourse.julialang.org/t/hdf5-jl-variable-length-string/98808. It adds support for passing String directly as the element type to create_dataset.

This also allows Base.setindex! with Strings.

Traditionally, we omitted this because variable-length strings are probably not a great idea to store in HDF5. Preferably, one would use fixed-length strings.

Todo:

mkitti commented 1 year ago

Demonstration of this pull request

julia> using HDF5

julia> h5open("jltest.h5", "w") do f
           ds = create_dataset(f, "strings", String, (4,))
           ds[1] = "Hello"
           ds[2] = "Hi"
           ds[3] = "Bonjour"
           ds[4] = "Hola"
       end
"Hola"

julia> h5f = h5open("jltest.h5")
🗂️ HDF5.File: (read-only) jltest.h5
└─ 🔢 strings

julia> h5f["strings"]
🔢 HDF5.Dataset: /strings (file: jltest.h5 xfer_mode: 0)

julia> h5f["strings"][1]
"Hello"

julia> h5f["strings"][2]
"Hi"

julia> h5f["strings"][3]
"Bonjour"

julia> h5f["strings"][4]
"Hola"
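
For comparison with the element-wise assignment above, here is a minimal sketch of writing the same four strings in one call and reading them back as a vector. It is not part of this pull request. It assumes HDF5.jl's behaviour of storing a Vector{String} as a variable-length string dataset; the temporary file path and the variable names greetings and roundtrip are made up for illustration.

```julia
using HDF5

# Hypothetical data and file location, chosen only for this sketch.
greetings = ["Hello", "Hi", "Bonjour", "Hola"]
path = tempname() * ".h5"

# Write the whole vector at once instead of assigning element by element.
h5open(path, "w") do f
    f["strings"] = greetings
end

# Read the full dataset back; the do-block form closes the file afterwards.
roundtrip = h5open(path, "r") do f
    read(f["strings"])
end

@assert roundtrip == greetings
```

Using the do-block form for reading also avoids leaving a file handle open, which the interactive h5f = h5open(...) session above does until h5f is closed.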

Before this pull request:

julia> h5open("jltest.h5", "w") do f
           ds = create_dataset(f, "strings", String, (4,))
           ds[1] = "Hello"
           ds[2] = "Hi"
           ds[3] = "Bonjour"
           ds[4] = "Hola"
       end
ERROR: Type Symbol does not have a definite size.
Stacktrace:
  [1] sizeof(x::Type)
    @ Base ./essentials.jl:559
  [2] hdf5_type_id(#unused#::Type{Symbol}, isstruct::Val{true})
    @ HDF5 ~/.julia/packages/HDF5/HtnQZ/src/typeconversions.jl:71
  [3] hdf5_type_id(#unused#::Type{Symbol})
    @ HDF5 ~/.julia/packages/HDF5/HtnQZ/src/typeconversions.jl:69
  [4] hdf5_type_id(#unused#::Type{Core.TypeName}, isstruct::Val{true})
    @ HDF5 ~/.julia/packages/HDF5/HtnQZ/src/typeconversions.jl:74
  [5] hdf5_type_id(#unused#::Type{Core.TypeName})
    @ HDF5 ~/.julia/packages/HDF5/HtnQZ/src/typeconversions.jl:69
  [6] hdf5_type_id(#unused#::Type{DataType}, isstruct::Val{true})
    @ HDF5 ~/.julia/packages/HDF5/HtnQZ/src/typeconversions.jl:74
  [7] hdf5_type_id(#unused#::Type{DataType})
    @ HDF5 ~/.julia/packages/HDF5/HtnQZ/src/typeconversions.jl:69
  [8] datatype(#unused#::Type{String})
    @ HDF5 ~/.julia/packages/HDF5/HtnQZ/src/typeconversions.jl:66
  [9] create_dataset(parent::HDF5.File, path::String, dtype::Type, dspace_dims::Tuple{Int64}; pv::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ HDF5 ~/.julia/packages/HDF5/HtnQZ/src/datasets.jl:103
 [10] create_dataset(parent::HDF5.File, path::String, dtype::Type, dspace_dims::Tuple{Int64})
    @ HDF5 ~/.julia/packages/HDF5/HtnQZ/src/datasets.jl:103
 [11] (::var"#3#4")(f::HDF5.File)
    @ Main ./REPL[4]:2
 [12] (::HDF5.var"#17#18"{HDF5.HDF5Context, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, var"#3#4", HDF5.File})()
    @ HDF5 ~/.julia/packages/HDF5/HtnQZ/src/file.jl:98
 [13] task_local_storage(body::HDF5.var"#17#18"{HDF5.HDF5Context, Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}}, var"#3#4", HDF5.File}, key::Symbol, val::HDF5.HDF5Context)
    @ Base ./task.jl:296
 [14] h5open(::var"#3#4", ::String, ::Vararg{String}; context::HDF5.HDF5Context, pv::Base.Pairs{Symbol, Union{}, Tuple{}, NamedTuple{(), Tuple{}}})
    @ HDF5 ~/.julia/packages/HDF5/HtnQZ/src/file.jl:93
 [15] h5open(::Function, ::String, ::String)
    @ HDF5 ~/.julia/packages/HDF5/HtnQZ/src/file.jl:91
 [16] top-level scope
    @ REPL[4]:1

mkitti commented 1 year ago

I'm still debating whether we should fully integrate InlineStrings.jl and/or StaticStrings.jl or whether we should use the package extension mechanism for those. It's probably easier to move those forward as a package extension.