JuliaIO / JLD2.jl

HDF5-compatible file format in pure Julia

overwriting existing data #450

Closed babaq closed 8 months ago

babaq commented 1 year ago

I couldn't find a way to overwrite a variable that already exists in a jld2 file.

The jldopen w/w+ modes write a completely new file that does not preserve the old content, while the a/a+ modes allow adding new variables but do not allow rewriting parts of the existing content. This seems to rule out many practical uses where one only wants to update part of the data.

I would assume it is possible to implement. Is there any other reason it is not supported as of version v0.4.30?

JonasIsensee commented 1 year ago

Let me start off by explaining why this isn't an obvious feature: Files are linear streams of bytes. While JLD2 looks hierarchical to the user, the files have to somehow store everything sequentially.

Therefore it is not straightforward to overwrite a dataset. It will almost certainly be sandwiched between other data or JLD2 metadata. If the new data needs more space, then it can't fit. Moving the existing data to make enough space is also very tricky.
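To make the "sandwiched" problem concrete, here is a toy sketch in plain Julia. It is not how JLD2 lays out its files; the byte buffer, the `offsets` table and the `fits_in_place` helper are all made up for illustration. It only shows that a dataset stored between two others can be overwritten in place when the new payload is no larger than the old slot.

```julia
# Toy model of a file: three datasets stored back to back in one byte buffer.
# (Illustration only; this is not the JLD2 on-disk layout.)
file = UInt8[]
offsets = Dict{String,UnitRange{Int}}()
for (name, payload) in ["a" => UInt8[1, 2, 3], "b" => UInt8[4, 5], "c" => UInt8[6, 7, 8]]
    start = length(file) + 1
    append!(file, payload)
    offsets[name] = start:length(file)
end

# Hypothetical helper: a payload can replace a dataset in place only if it
# is no longer than the slot the dataset currently occupies.
fits_in_place(name, payload) = length(payload) <= length(offsets[name])

same_size_ok = fits_in_place("b", UInt8[9, 9])        # fits in the old slot
larger_ok = fits_in_place("b", UInt8[9, 9, 9, 9])     # would run into "c"
```

In this model a larger "b" would either overwrite the start of "c" or force everything after it to be moved, which is the tricky part described above.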

It is already possible to delete!(f, "dataset"). This will delete your reference to the data and you can write a new dataset with the same name. Note that the old data itself remains in the file, leaving a "gap". So, if you do that many times, the file will grow.
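A minimal sketch of that delete-then-rewrite pattern, assuming the mutating spelling `delete!` and a file opened in `a+` mode. The file name and the dataset names `step1` and `step2` are made up for the example.

```julia
using JLD2

path = joinpath(mktempdir(), "results.jld2")  # hypothetical file name

# Build a file with two datasets.
jldopen(path, "w") do f
    f["step1"] = collect(1:5)
    f["step2"] = "unchanged"
end

# Reopen in append mode, drop the old reference, then reuse the name.
jldopen(path, "a+") do f
    delete!(f, "step1")             # removes the link; the old bytes stay in the file
    f["step1"] = collect(10:10:50)  # new dataset stored under the same name
end

updated = load(path)  # Dict with the new "step1" and the untouched "step2"
```

Since the old bytes are not reclaimed, repeating this many times on a large dataset makes the file grow, as noted above.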

If you want to update stored arrays, then there are more potential options (not yet implemented). If that is what you need, please open a separate issue.

babaq commented 1 year ago

The practical problem is that we have a large file that has been aggregated from different sources using jldopen with a+. Each of those processes takes a long time to run, so when one of them gets updated, we want to preserve the other parts and only update the result of the updated process in the file.

Other than directly overwriting parts of the file, I can think of several workarounds, e.g. read all the old data into a Dict in memory, update the content and save it again (but the file is pretty large), or, like you suggested, delete the reference to the variable name and write new data under the same name. Either way, what I mean is that this is a justifiable feature that solves a practical problem.

I agree that the implementation would be tricky, messy or ugly given the linear serialization of the file content. It would probably have side effects, e.g. the file growing after a partial rewrite. But it would be beneficial to at least show users how to achieve this, and better still to add direct support.

I've used the matfile function in MATLAB, which does partial saving and reading, and found it very useful. Sometimes the file does get bigger; I guess it would be hard to eliminate this effect.

I guess that because arrays are saved linearly, it would be easier to update them without adding or removing space in the file. This would be a special case where the data does not need to be reorganized.
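For reference, the first workaround (load everything, update one entry, save the whole file again) could look roughly like the sketch below. The file name and the keys `source_a` and `source_b` are made up, and the whole file content has to fit in memory.

```julia
using JLD2

path = joinpath(mktempdir(), "aggregate.jld2")  # hypothetical file name
jldsave(path; source_a = [1.0, 2.0], source_b = [3.0, 4.0])

data = load(path)                 # Dict holding every dataset in memory
data["source_b"] = [30.0, 40.0]   # replace the result of the updated process
save(path, data)                  # rewrites the whole file from the Dict

reloaded = load(path)
```

This leaves no gap in the file, but every update costs a full read and a full write.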

JonasIsensee commented 1 year ago

I have had similar problems in the past myself. I want to point out another problem: Every time you edit an existing file you risk corrupting the file / data. This isn't specific to JLD2. Sometimes your program will die in the middle of a write or, far more commonly, the file system will just not write things correctly. (Both of those things have happened to me many times; network file systems are particularly error-prone.)

A much safer variant, which I use in my research, is to have each step of the process write its own file. In a separate step you can then combine the parts into joint result files and (possibly) delete the original parts after verifying that the output is correct.
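A rough sketch of that per-step layout, under made-up names: the `steps` Dict stands in for the long-running processes, and `result` and `combined.jld2` are placeholder names, not anything prescribed by JLD2.

```julia
using JLD2

workdir = mktempdir()
# Hypothetical step names and results, standing in for long-running processes.
steps = Dict("step1" => [1, 2, 3], "step2" => [4.0, 5.0])

# Each step writes its own file, so rerunning one step never touches the others.
for (name, result) in steps
    jldsave(joinpath(workdir, name * ".jld2"); result)
end

# Separate combine step: gather the parts into one joint result file.
combined = joinpath(workdir, "combined.jld2")
jldopen(combined, "w") do f
    for name in keys(steps)
        f[name] = load(joinpath(workdir, name * ".jld2"), "result")
    end
end

# Verify the joint file before (optionally) deleting the parts.
verified = all(load(combined, name) == result for (name, result) in steps)
```

When one step changes, only its own file is rewritten and the combine step is run again, so an interrupted write can never damage the results of the other steps.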