JuliaIO / JLD2.jl

HDF5-compatible file format in pure Julia

Custom serialization with duplicated instances with Vector #436

Closed dpinol closed 8 months ago

dpinol commented 1 year ago

Similar to #431, but now with Vector, which is mutable.

using JLD2

abstract type AT end
Base.@kwdef mutable struct T1 <: AT
    f
end

Base.@kwdef mutable struct T2 <: AT
    t1s::Set{T1}
    t1::T1
end

const FS = Vector{Pair{Symbol, Any}}

JLD2.writeas(::Type{T}) where {T <: AT} = FS

function JLD2.wconvert(::Type{FS}, t::AT)
    @info "saving $(typeof(t))"
    return [Pair{Symbol, Any}(f, getproperty(t, f)) for f in fieldnames(typeof(t))]
end
function JLD2.rconvert(::Type{T}, fs::FS) where {T <: AT}
    @info "loading $T"
    t = T(; fs...)
    return t
end

t1 = T1(1)
t2 = T2(Set([t1]), t1)
save_object("kk.jld2", t2)

t2 = load_object("kk.jld2")
@info "t1s" t2.t1 === only(t2.t1s)

julia> save_object("kk.jld2", t2)
[ Info: saving T2
[ Info: saving T1
[ Info: saving T1

julia> t2 = load_object("kk.jld2")
[ Info: loading T1
[ Info: loading T1
[ Info: loading T2
T2(Set(T1[T1(1)]), T1(1))

julia> @info "t1s" t2.t1 === only(t2.t1s)
┌ Info: t1s
└   t2.t1 === only(t2.t1s) = false
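
The log above shows "saving T1" and "loading T1" twice, once per reference to the same object. The following plain-Julia sketch, which does not use JLD2 and makes no claim about its internals, illustrates why two separate reconstructions can never be `===`, and how memoizing by object identity would keep a single instance. `Box`, `flatten`, `rebuild`, `flatten_once`, and `rebuild_once` are hypothetical names chosen for illustration; `flatten` and `rebuild` only loosely mirror the roles of `wconvert` and `rconvert` in the example.

```julia
# Hypothetical stand-ins; none of these names are part of JLD2.
mutable struct Box
    f
end

# Loosely mirrors wconvert: turn the object into a vector of field pairs.
flatten(b::Box) = Pair{Symbol, Any}[:f => b.f]
# Loosely mirrors rconvert: build a fresh object from the field pairs.
rebuild(fs::Vector{Pair{Symbol, Any}}) = Box(fs[1].second)

b = Box(1)

# Each reference is flattened and rebuilt separately, as in the log.
r1 = rebuild(flatten(b))
r2 = rebuild(flatten(b))
@assert r1.f == r2.f   # same contents
@assert r1 !== r2      # but two distinct mutable objects

# Memoizing both directions by object identity keeps one instance per original.
flat_cache = IdDict{Any, Any}()
obj_cache = IdDict{Any, Any}()
flatten_once(b) = get!(() -> flatten(b), flat_cache, b)
rebuild_once(fs) = get!(() -> rebuild(fs), obj_cache, fs)

s1 = rebuild_once(flatten_once(b))
s2 = rebuild_once(flatten_once(b))
@assert s1 === s2      # identity is preserved
```

Whether JLD2 can apply this kind of identity tracking to a `Vector` written through `writeas` is exactly what this issue is about; the sketch only shows the general mechanism.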

If I put the Vector within a mutable struct it works fine, but it introduces a significant overhead in terms of time and space. Or maybe there's a better way to perform generic custom serialization of structs?

thanks!

JonasIsensee commented 1 year ago

I am starting to get curious about what you are trying to achieve.

I don't think anyone has ever tried to do such weirdly nested custom serialization before, which is probably why you are now finding so many corner cases.

dpinol commented 1 year ago

I have a deeply hierarchical structure of 3 different types of structs, with callbacks from the inner to the outer layers. Creating the model takes a long time, and hence I want to persist it. Since the model is still evolving, I want to ensure that old models remain loadable by future versions of the software. So my idea is to 'take control' of how each struct is persisted by custom-serializing each field, as I do in the MWE. With this, if I also persist a model version label, I can easily patch old serializations on load to fit the new data layout.
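
A minimal sketch of the version-label idea described above, in plain Julia without JLD2. All names (`Layer`, `Fields`, `MODEL_VERSION`, `to_fields`, `upgrade`, `from_fields`) are hypothetical and chosen for illustration; in the MWE, `to_fields` and `from_fields` would roughly take the place of the bodies of `wconvert` and `rconvert`.

```julia
# Sketch only: every name below is a hypothetical illustration.
const Fields = Vector{Pair{Symbol, Any}}
const MODEL_VERSION = 2

Base.@kwdef mutable struct Layer
    name::String
    width::Int      # assume this field was added in version 2
end

# Flatten a struct into field pairs and prepend a version label.
function to_fields(x)
    fs = Pair{Symbol, Any}[:__version => MODEL_VERSION]
    for f in fieldnames(typeof(x))
        push!(fs, f => getfield(x, f))
    end
    return fs
end

# Patch field lists written by older versions to the current layout.
function upgrade(fs::Fields)
    d = Dict{Symbol, Any}(fs)
    version = pop!(d, :__version, 1)   # unlabeled data counts as version 1
    if version < 2
        d[:width] = 1                  # version 1 had no width field
    end
    return d
end

# Rebuild the struct through its keyword constructor.
from_fields(::Type{T}, fs::Fields) where {T} = T(; upgrade(fs)...)

# Field list as a version-1 file might hold it: no label, no width.
old = Pair{Symbol, Any}[:name => "conv"]
layer = from_fields(Layer, old)
@assert layer.name == "conv" && layer.width == 1

# Current data round-trips unchanged.
roundtrip = from_fields(Layer, to_fields(Layer(name = "fc", width = 8)))
@assert roundtrip.width == 8
```

Keeping the version label inside the field list means the patching logic lives in one function and the structs themselves stay free of legacy fields.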

JonasIsensee commented 1 year ago

> I have a deeply hierarchical structure of 3 different types of structs, with callbacks from the inner to the outer layers. Creating the model takes a long time, and hence I want to persist it. Since the model is still evolving, I want to ensure that old models remain loadable by future versions of the software. So my idea is to 'take control' of how each struct is persisted by custom-serializing each field, as I do in the MWE. With this, if I also persist a model version label, I can easily patch old serializations on load to fit the new data layout.

OK, that makes sense. I'm still wondering whether there might be a better way, i.e. a safe mode of JLD2, activated via a keyword argument, that reconstructs all or some custom structs into a Dict-like structure. Then one could implement an upgrade path based on that and use regular serialization (which should give better performance).

I've tried to fix the bug reported above, but haven't been able to yet.

dpinol commented 1 year ago

> I'm still wondering whether there might be a better way, i.e. a safe mode of JLD2, activated via a keyword argument, that reconstructs all or some custom structs into a Dict-like structure. Then one could implement an upgrade path based on that and use regular serialization (which should give better performance).

Yes, in the future I'd love to be able to implement a kind of preprocess(Type, Dict, Context). The Context would work like Go's context and could be passed to jldopen, which would allow implementing deadlines, cancellations, progress bars, enriching data from an outer struct...

> I've tried to fix the bug reported above, but haven't been able to yet.

thanks again! :+1:

JonasIsensee commented 1 year ago

> Yes, in the future I'd love to be able to implement a kind of preprocess(Type, Dict, Context). The Context would work like Go's context and could be passed to jldopen, which would allow implementing deadlines, cancellations, progress bars, enriching data from an outer struct...

I'd say most of the ingredients are already there on the backend. One would have to properly specify and implement an API. That could probably even be tested in a separate package.

JonasIsensee commented 1 year ago

Hi @dpinol,

take a look at #439. This could potentially make your life a bit easier. (It preserves object identity in the cases I tested.) With this you probably won't need to use custom serialization on save.

dpinol commented 1 year ago

Hi, thanks for the new upgrade feature. However, I just tested the use case from the description of this issue and the behaviour has not changed: custom serialization through a Vector duplicates the deserialized instances even though both the struct and the Vector are mutable. Should the issue be reopened, or should the limitation at least be noted in the documentation?

thanks

JonasIsensee commented 8 months ago

I've tried and failed to fix this once more, and have decided that writeas(..) = Array(...) should be unsupported. I'm switching internal usage away from it. (This also gets rid of #427.)