JuliaIO / JLD2.jl

HDF5-compatible file format in pure Julia

Variables named with modifiers (like \hat) are incorrectly saved #466

Closed wsshin closed 3 months ago

wsshin commented 1 year ago

I can save variables named with modifiers (like \hat) with no problem:

julia> â = 1;  # â is entered by a\hat[tab]

julia> jldsave("foo.jld2"; â)

However, if I try to load the saved variable, an error is generated complaining that it cannot find the variable name:

julia> load("foo.jld2", "â")  # â is entered by a\hat[tab]
Error encountered while load FileIO.File{FileIO.DataFormat{:JLD2}, String}("foo.jld2").

Fatal error:
ERROR: KeyError: key "â" not found
[...]

Strangely, if I simply load the file without specifying the variable name, the result shows that the variable is actually loaded:

julia> load("foo.jld2")
Dict{String, Any} with 1 entry:
  "â" => 1

It turns out that the loaded "â" has a different byte representation than the saved "â":

julia> codeunits("â")  # â is entered by a\hat[tab]
3-element Base.CodeUnits{UInt8, String}:
 0x61
 0xcc
 0x82

julia> codeunits("â")  # â is entered by copy-and-pasting the output of load("foo.jld2") above
2-element Base.CodeUnits{UInt8, String}:
 0xc3
 0xa2

So something inside jldsave() seems to change the byte representation of "â".
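
The two byte sequences above are the decomposed (NFD) and precomposed (NFC) spellings of the same character. A minimal sketch that reproduces them without relying on how the text was typed, using only escape sequences and the `Unicode` standard library (the variable names are illustrative):

```julia
using Unicode

decomposed  = "a\u0302"   # 'a' followed by U+0302 COMBINING CIRCUMFLEX ACCENT
precomposed = "\u00e2"    # U+00E2 LATIN SMALL LETTER A WITH CIRCUMFLEX

codeunits(decomposed)     # 3 bytes: 0x61 0xcc 0x82
codeunits(precomposed)    # 2 bytes: 0xc3 0xa2

decomposed == precomposed                            # false: the bytes differ
Unicode.normalize(decomposed, :NFC) == precomposed   # true
Unicode.normalize(precomposed, :NFD) == decomposed   # true
```

String comparison in Julia is byte-wise, so the two spellings only compare equal after both are brought to the same normalization form.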

Here is the version info:

julia> VERSION
v"1.9.0-rc1"

(@v1.9) pkg> st JLD2
Status `~/.julia/environments/v1.9/Project.toml`
  [033835bb] JLD2 v0.4.31

JonasIsensee commented 1 year ago

Hi @wsshin,

I'm afraid this problem isn't specific to JLD2; it is a more general Unicode issue. There are two different Unicode representations of the "same" visual symbol, and Julia produces one or the other depending on how you create it: as an identifier (symbol) or as a string literal.

julia> :â
:â

julia> string(:â)
"â"

julia> string(:â) == "â"
false

julia> codeunits(string(:â))
2-element Base.CodeUnits{UInt8, String}:
 0xc3
 0xa2

julia> codeunits("â")
3-element Base.CodeUnits{UInt8, String}:
 0x61
 0xcc
 0x82

julia> Symbol(string(:â)) == :â
true

julia> string(Symbol("â")) == "â"
true

julia> Symbol("â") == :â
false

julia> string(:â) == "â"
false
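
The session above is consistent with the parser normalizing identifiers to a composed form while leaving string literals untouched. A small sketch of that reading, written with escape sequences so the bytes are unambiguous (the variable names are illustrative, and the explanation is an assumption drawn from the outputs above):

```julia
typed = "a\u0302"               # the three-byte sequence that a\hat[tab] produces

sym_parsed = Meta.parse(typed)  # the same text parsed as an identifier
sym_raw    = Symbol(typed)      # Symbol() keeps the bytes exactly as given

String(sym_parsed) == "\u00e2"  # true: the parsed identifier is precomposed
sym_parsed == sym_raw           # false: the raw symbol is still decomposed
```

This would explain why `jldsave("foo.jld2"; â)` stores the two-byte name while the string literal passed to `load` carries three bytes.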

wsshin commented 1 year ago

Thanks @JonasIsensee. I reported the issue to JuliaLang/julia.

oscardssmith commented 1 year ago

The correct solution here is for JLD2 to apply normalization before doing the lookup.

JonasIsensee commented 1 year ago

The place to edit is here, I think: https://github.com/JuliaIO/JLD2.jl/blob/30dd57839159945fd3d17886891fe30b19367703/src/groups.jl#L25

The correct function for comparing strings is Unicode.isequal_normalized(), and one could consider always adding a normalization step with Unicode.normalize() prior to saving (or when loading the whole file).
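
As a rough illustration of a normalization-aware lookup: the comparison function is spelled `Unicode.isequal_normalized` in the `Unicode` standard library (Julia 1.8 or later). The helper `find_normalized` and the `Dict` standing in for the names stored in a file are hypothetical and are not part of JLD2.

```julia
using Unicode

# Hypothetical helper, not JLD2 API: return the stored name that equals
# `name` up to Unicode normalization, or `nothing` if there is none.
function find_normalized(names, name::AbstractString)
    for k in names
        Unicode.isequal_normalized(k, name) && return k
    end
    return nothing
end

stored = Dict("\u00e2" => 1)                    # stand-in for a file's contents
key = find_normalized(keys(stored), "a\u0302")  # decomposed query
stored[key]                                     # finds the precomposed entry
```

A linear scan like this is slower than a direct hash lookup; normalizing names once at write time and once per query would keep lookups cheap but adds work to every call.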

ggebbie commented 1 year ago

I hit this with a variable named G\bar on Julia 1.9.0. I will rename the variable to G for now. Thanks for documenting this issue.

wsshin commented 5 months ago

@JonasIsensee, is there any reason why your suggestion in https://github.com/JuliaIO/JLD2.jl/issues/466#issuecomment-1491471559 cannot be implemented? I am still running into this problem a year after my initial report, and it has not been resolved yet.

JonasIsensee commented 5 months ago

Oh, it should be relatively easy to fix. You are welcome to submit a PR.

JonasIsensee commented 1 month ago

Ensuring Unicode normalization at every step, as required for the behavior added in #561, caused a significant loss in performance when working with files, regardless of whether they used Unicode or not.

Because of this, the fix was reverted in v0.4.51. I recommend that you either add the normalization to your own code or avoid relying on unnormalized Unicode.
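
One way to add the normalization on the user side is to normalize the name before passing it to `load`. This is a minimal sketch: `nfc` is a hypothetical helper, and it assumes the stored name is in composed (NFC) form, which is what the `codeunits` output earlier in this thread suggests for names that came from identifiers. A plain `Dict` stands in for the file so the snippet runs without JLD2.

```julia
using Unicode

# Hypothetical helper: bring a name into NFC, the composed form.
nfc(name::AbstractString) = Unicode.normalize(name, :NFC)

# With JLD2 loaded, the failing call from the original report would become:
#   load("foo.jld2", nfc("â"))
# A plain Dict stands in for the file contents here.
contents = Dict("\u00e2" => 1)
value = contents[nfc("a\u0302")]   # lookup succeeds after normalization
```

Names that are already in NFC pass through `nfc` unchanged, so the helper can be applied to every lookup key.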