JuliaIO / HDF5.jl

Save and load data in the HDF5 file format from Julia
https://juliaio.github.io/HDF5.jl
MIT License
386 stars 140 forks source link

Unexpected create_group key from readdlm string data #1055

Closed Jerboa-app closed 1 year ago

Jerboa-app commented 1 year ago

When I try to create_group using a string, taken from a csv, raw[1,1], read via readdlm, HDF5.jl seems to use the entire csv file (as one long string) as the groups key

Basic example:

using HDF5, DelimitedFiles

run(`echo -e \"a, b, c\nd, e, f\" \> t.csv`)

raw = readdlm("t.csv",',')

data = h5open("t.h5","w")

create_group(data,raw[1,1])
julia> @show keys(data) # the key is the raw csv as if read as one line
keys(data) = ["a, b, c\nd, e, f\n"]
1-element Vector{String}:
 "a, b, c\nd, e, f\n"

Even though raw[1,1] is "a" the key is the entire csv file

But if I hard code the same data it uses the single value "a"

raw2 = ["a" "b" "c"; "d" "e" "f"]

data2 = h5open("t2.h5","w")

create_group(data2,raw2[1,1])
julia> @show keys(data2) # the key is as expected, a
keys(data2) = ["a"]
1-element Vector{String}:
 "a"

If instead I convert to a string

create_group(data,String(raw[1,1]))

It works fine.

Only difference I see is raw is

raw[1,1]|>typeof
SubString{String}

and

raw2[1,1]|>typeof
String

but

julia> d = SubString{String}.(raw2)
2×3 Matrix{SubString{String}}:
 "a"  "b"  "c"
 "d"  "e"  "f"

julia> data3 = h5open("t3.h5","w")
🗂️ HDF5.File: (read-write) t3.h5

julia> create_group(data3,d[1,1])
📂 HDF5.Group: /a (file: t3.h5)

Is there a reason for this?

Found this when a group key in my dataset was a 1,214,405 line (30 MB) string!

Thanks in advance!


env

julia -v
julia version 1.8.0
(@v1.8) pkg> status
Status `~/.julia/environments/v1.8/Project.toml`
⌃ [a134a8b2] BlackBoxOptim v0.5.0
⌃ [13f3f980] CairoMakie v0.8.13
  [aaaa29a8] Clustering v0.14.3
⌅ [31c24e10] Distributions v0.23.12
  [f67ccb44] HDF5 v0.16.13
  [7073ff75] IJulia v1.24.0
  [4138dd39] JLD v0.13.3
  [b964fa9f] LaTeXStrings v1.3.0
⌅ [ee78f7c6] Makie v0.17.13
⌃ [978d7f02] MiniQhull v0.2.1
  [54e51dfa] PerceptualColourMaps v0.3.6
  [76e5bff5] PolygonTriangulation v0.1.0 `~/PolygonTriangulation.jl`
  [92933f4c] ProgressMeter v1.7.2
  [8bb1440f] DelimitedFiles
  [37e2e46d] LinearAlgebra
  [10745b16] Statistics
Info Packages marked with ⌃ and ⌅ have new versions available, but those with ⌅ cannot be upgraded. To see why use `status --outdated`
simonbyrne commented 1 year ago

The issue appears to be that it is a SubString, which aren't NUL-terminated:

julia> using HDF5

julia> data = h5open("t.h5","w")
🗂️ HDF5.File: (read-write) t.h5

julia> create_group(data,view("abcdef",1:3))
📂 HDF5.Group: /abcdef (file: t.h5)
simonbyrne commented 1 year ago

Specifically, I think the issue is that we pass the string as Ptr{UInt8} instead of Cstring: https://github.com/JuliaIO/HDF5.jl/blob/566698bfebb29d2b1af21a05153270a7d28923ff/src/api/functions.jl#L1770

Jerboa-app commented 1 year ago

Ah that might explain things, then It just continue reading?

mkitti commented 1 year ago

Specifically, I think the issue is that we pass the string as Ptr{UInt8} instead of Cstring:

Basically, if we had Cstring, then Base.cconvert would convert the SubString to a String.

julia> Base.cconvert(Cstring, view("abcdefg",1:3))
"abc"

julia> Base.cconvert(Cstring, view("abcdefg",1:3)) |> typeof
String

That said we probably should not accept an AbstractString if we are not handling this properly.

https://github.com/JuliaLang/julia/blob/a1b546ad04612272787d9ad70444ebe2dc58ac9f/base/c.jl#L195-L198

cconvert(::Type{Cstring}, s::String) = s
cconvert(::Type{Cstring}, s::AbstractString) =
    cconvert(Cstring, String(s)::String)
simonbyrne commented 1 year ago

This is better handled by using Cstring in the ccall type signature.