JuliaIO / HDF5.jl

Save and load data in the HDF5 file format from Julia
https://juliaio.github.io/HDF5.jl
MIT License

Segfault when writing variable length string as attribute #1129

Closed ericphanson closed 6 months ago

ericphanson commented 8 months ago
julia> using HDF5

julia> fid = h5open("test.h5", "w")
🗂️ HDF5.File: (read-write) test.h5

julia> attr = create_attribute(fid, "attr-name", datatype(String), dataspace(String))
🏷️ HDF5.Attribute: attr-name

julia> write_attribute(attr, datatype(String), "attr-value")

[16750] signal (11.2): Segmentation fault: 11
in expression starting at REPL[4]:1
_platform_strlen at /usr/lib/system/libsystem_platform.dylib (unknown line)
Allocations: 1346244 (Pool: 1345285; Big: 959); GC: 2
zsh: segmentation fault  julia --project

using

  [f67ccb44] HDF5 v0.17.1

and

Julia Version 1.9.3
Commit bed2cd540a1 (2023-08-24 14:43 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 8 × Apple M1
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, apple-m1)
  Threads: 4 on 4 virtual cores
Environment:
  JULIA_VERSION = 1.9.3
  JULIA_NUM_THREADS = 4
  JULIA_PKG_SERVER_REGISTRY_PREFERENCE = eager

Also: the documentation around create_attribute is confusing. It says to use write_attribute to actually write the data, but if you call write_attribute(fid, "attr-key", "attr-value") instead of passing the Attribute object returned by create_attribute, it errors, saying the attribute already exists.
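A minimal sketch of the behaviour described above, assuming HDF5.jl v0.17 as in the report. It only reproduces the "already exists" error and deliberately never writes a string through the handle, since that is the call that segfaults. The temporary file and the variable name already_exists_error are illustrative.

```julia
using HDF5

path = tempname() * ".h5"   # throwaway file, illustrative only

already_exists_error = h5open(path, "w") do fid
    # Step 1: create the (still empty) attribute and keep the returned handle.
    attr = create_attribute(fid, "attr-key", datatype(String), dataspace(String))
    try
        # Passing the parent and the name makes write_attribute try to create
        # "attr-key" a second time, which is the error reported above.
        write_attribute(fid, "attr-key", "attr-value")
        false
    catch err
        @info "write_attribute(fid, name, value) failed" err
        true
    finally
        close(attr)
    end
end
```

The three-argument form creates the attribute itself, so it seems intended for names that do not exist yet; after create_attribute, the write has to go through the returned handle.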

ericphanson commented 8 months ago

I was able to work around it with the following code:

julia> function write_variable_length_string_attribute(fid, attr_key::String, attr_value::String)
           attr = create_attribute(fid, attr_key, datatype(String), dataspace(String))
           v = Vector{UInt8}(attr_value)
           GC.@preserve v begin
               p = pointer(v)
               write_attribute(attr, datatype(String), Ref(p))
           end
           return nothing
       end
write_variable_length_string_attribute (generic function with 1 method)

julia> fid = h5open("test.h5", "w")
🗂️ HDF5.File: (read-write) test.h5

julia> write_variable_length_string_attribute(fid, "attr-key", "attr-value")

julia> close(fid)

shell> h5dump test.h5
HDF5 "test.h5" {
GROUP "/" {
   ATTRIBUTE "attr-key" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "attr-value"
      }
   }
}
}

The context here is that I need to write a variable-length string as an attribute, so that some Python code using h5py will interpret the attribute as a string and not as a numpy byte array (xref https://docs.h5py.org/en/stable/strings.html).

ericphanson commented 8 months ago

I don't know if something specific should be done to ensure null termination. I added a branch here, though the output seems exactly the same:

julia> using HDF5

julia> fid = h5open("test.h5", "w")
🗂️ HDF5.File: (read-write) test.h5

julia> function write_variable_length_string_attribute(fid, attr_key::String, attr_value::String)
           attr = create_attribute(fid, attr_key, datatype(String), dataspace(String))
           v = Vector{UInt8}(attr_value)
           (!isempty(v) && v[end] == 0) || push!(v, 0) # null termination? (isempty guards the empty-string case)
           GC.@preserve v begin
               p = pointer(v)
               write_attribute(attr, datatype(String), Ref(p))
           end
           return nothing
       end
write_variable_length_string_attribute (generic function with 1 method)

julia> write_variable_length_string_attribute(fid, "attr-key", "attr-value")

julia> close(fid)

shell> h5dump test.input
h5dump error: unable to open file "test.input"

shell> h5dump test.h5
HDF5 "test.h5" {
GROUP "/" {
   ATTRIBUTE "attr-key" {
      DATATYPE  H5T_STRING {
         STRSIZE H5T_VARIABLE;
         STRPAD H5T_STR_NULLTERM;
         CSET H5T_CSET_UTF8;
         CTYPE H5T_C_S1;
      }
      DATASPACE  SCALAR
      DATA {
      (0): "attr-value"
      }
   }
}
}
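
On the null-termination question, a small sketch in plain Julia (no HDF5 involved) of why the explicit push! matters for the Vector{UInt8} route: the byte vector holds only the string's code units, whereas a Julia String is NUL-terminated in memory. The variable names are illustrative.

```julia
s = "attr-value"

# Copying into a byte vector keeps only the code units; there is no
# trailing NUL among the elements of `v`.
v = Vector{UInt8}(s)
last_byte = v[end]

# A String, by contrast, is followed by a NUL byte in memory, which is
# what lets it be handed to C as a `char *`.
terminator = GC.@preserve s unsafe_load(pointer(s), sizeof(s) + 1)
```

So without the push!, whatever happens to follow the vector's data in memory decides where the C library stops reading, which may explain why the h5dump output looks the same in both runs.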

simonbyrne commented 8 months ago

We should update the docs to recommend that everyone use attrs:

attrs(fid)["attr-key"] = "attr-value"

I don't get why this is segfaulting though?

simonbyrne commented 8 months ago

Oh, I misunderstood: by default we write them as fixed-length strings...
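
A rough way to check this, assuming HDF5.jl v0.17: write through attrs, reopen the file, and ask the low-level wrapper HDF5.API.h5t_is_variable_str about the stored datatype. The temporary path and the variable names are illustrative.

```julia
using HDF5

path = tempname() * ".h5"   # throwaway file, illustrative only

# Write the attribute through the dictionary-style interface.
h5open(path, "w") do fid
    attrs(fid)["attr-key"] = "attr-value"
end

# Reopen and inspect both the value and the on-disk string type.
is_variable, value = h5open(path, "r") do fid
    attr = attributes(fid)["attr-key"]   # Attribute handle, not the value
    (HDF5.API.h5t_is_variable_str(datatype(attr)), read(attr))
end
```

If is_variable comes back false, the attribute was stored as a fixed-length string, which is the case h5py reports as bytes rather than str.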

simonbyrne commented 8 months ago

It looks like we have to pass a pointer to a string pointer.

simonbyrne commented 8 months ago

See https://forum.hdfgroup.org/t/how-to-create-scalar-variable-length-string/10309/3

ericphanson commented 8 months ago

Is

function write_variable_length_string_attribute(fid, attr_key::String, attr_value::String)
    attr = create_attribute(fid, attr_key, datatype(String), dataspace(String))
    v = Vector{UInt8}(attr_value)
    # Append a NUL terminator unless one is already there; isempty guards the empty-string case.
    (!isempty(v) && v[end] == 0) || push!(v, 0)
    GC.@preserve v begin
        p = pointer(v)
        write_attribute(attr, datatype(String), Ref(p))
    end
    return nothing
end

safe/legit? It seems to work, but I don't really know what I am doing.

simonbyrne commented 8 months ago

The most reliable option is to use cconvert/unsafe_convert:

julia> using HDF5

julia> fid = h5open("test.h5", "w")
🗂️ HDF5.File: (read-write) test.h5

julia> attr = create_attribute(fid, "attr-name", datatype(String), dataspace(String))
🏷️ HDF5.Attribute: attr-name

julia> val = Base.cconvert(Cstring, "attr-val") # ensures string is nul-terminated
"attr-val"

julia> GC.@preserve val begin
          p = Base.unsafe_convert(Cstring, val)
          write_attribute(attr, datatype(String), Ref(p))
       end

julia> close(fid)
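
The same steps could be wrapped into a helper and checked by reading the attribute back. This is only a sketch under the assumption of HDF5.jl v0.17; write_vlen_string_attribute is a hypothetical name, not part of the package.

```julia
using HDF5

# Hypothetical helper (not an HDF5.jl function): writes `value` as a scalar
# variable-length string attribute called `name` on `obj`.
function write_vlen_string_attribute(obj, name::AbstractString, value::AbstractString)
    attr = create_attribute(obj, name, datatype(String), dataspace(String))
    try
        buf = Base.cconvert(Cstring, value)          # object that owns the bytes
        GC.@preserve buf begin
            # unsafe_convert throws an ArgumentError on embedded NULs.
            p = Base.unsafe_convert(Cstring, buf)
            # Variable-length strings are written as a pointer to a string
            # pointer, hence the Ref around `p`.
            write_attribute(attr, datatype(String), Ref(p))
        end
    finally
        close(attr)
    end
    return nothing
end

path = tempname() * ".h5"   # throwaway file, illustrative only
h5open(path, "w") do fid
    write_vlen_string_attribute(fid, "attr-name", "attr-val")
end

# Read back the value and check that the stored type is variable-length.
value, is_variable = h5open(path, "r") do fid
    attr = attributes(fid)["attr-name"]
    (read(attr), HDF5.API.h5t_is_variable_str(datatype(attr)))
end
```

Closing the attribute in a finally block keeps the handle from leaking if the write throws, for example on a string with an embedded NUL.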