New file format (stabilization)

flxzt commented 1 month ago

Let's track what Rnote's new file format should look like here.

There are already some ideas and implementation for improvements floating around.

Ideally backwards compatibility is kept which I think is doable.

Improvements are:

* A new compression format : see #1170
* Move the file format version outside of the compressed data (first bytes of the file) to improve forward compatibility
* Look into using a serialization format other than JSON : bincode is very promising because it greatly improves speed, file size and memory consumption. Things to consider are: how to keep upgrade paths relatively easy to implement - currently JSON's untyped `Value`s are used, which won't work with other formats. I can see two possibilities: limit additions and changes to what is backwards compatible from the perspective of the serde-derived `Serialize` and `Deserialize` traits, and if that's not possible, start implementing both traits manually. This can be quite tedious in some cases though.
* Storing stroke coordinates more efficiently : the ink-serialized-format does something interesting - using derivatives to reduce the data for strokes with specific shapes (?) (@Doublonmousse knows more)
* Partial file loading : On large documents it might be preferable not to have to load the entire document into memory when viewing/editing only some pages. Is it possible to encode that into the format, similar to how we use an R-Tree for storing stroke bounding boxes?

Doublonmousse commented 1 month ago

In the ink serialized format, no floats are used to store ink strokes; there's only a minimum fixed size (like 1/1000th of a cm) and quantization is used. Hence everything uses ints. The other trick is that instead of saving the data as

a b c etc...

It's saved as

a, (b-a), c - (a + 2 (b-a)) etc...

Hence the first element, the first derivative (difference) and then only second derivatives (the third stored value above simplifies to c - 2b + a). The general idea is that the second derivatives are usually very small, so additional compression can be done on top of this (Huffman coding or bit packing, as we expect low values that can be represented with very few bits). The code is actually public : https://source.dot.net/#PresentationCore/MS/Internal/Ink/InkSerializedFormat/ under the MIT License, so maybe I'll look into having that in Rust, as a way to have a little more interoperability with other apps, including OneNote when copying/pasting between apps.
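To make the scheme above concrete, here is a minimal sketch in Rust. It is not a port of the linked .NET code: the function names (`quantize`, `encode`, `decode`) and the choice of `i64` are assumptions made for illustration, and the Huffman/bit-packing stage is left out.

```rust
/// Quantize coordinates to integer multiples of `step`
/// (e.g. `step = 0.001` for 1/1000th of a cm). Illustrative helper.
fn quantize(values: &[f64], step: f64) -> Vec<i64> {
    values.iter().map(|v| (v / step).round() as i64).collect()
}

/// Store the first value, then the first difference,
/// then only second differences: a, b - a, c - 2b + a, ...
fn encode(values: &[i64]) -> Vec<i64> {
    let mut out = Vec::with_capacity(values.len());
    for (i, &v) in values.iter().enumerate() {
        let encoded = match i {
            0 => v,
            1 => v - values[0],
            _ => v - 2 * values[i - 1] + values[i - 2],
        };
        out.push(encoded);
    }
    out
}

/// Inverse of `encode`: rebuild each value from the two previous ones.
fn decode(encoded: &[i64]) -> Vec<i64> {
    let mut out: Vec<i64> = Vec::with_capacity(encoded.len());
    for (i, &e) in encoded.iter().enumerate() {
        let v = match i {
            0 => e,
            1 => out[0] + e,
            _ => e + 2 * out[i - 1] - out[i - 2],
        };
        out.push(v);
    }
    out
}
```

For a smoothly accelerating stroke such as `[100, 103, 107, 112, 118]`, `encode` yields `[100, 3, 1, 1, 1]`: after the first two entries only small values remain, which is what makes a later entropy-coding or bit-packing step worthwhile.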

It'd be great though to still have an easy way for people to obtain a human-readable or JSON version of the file and/or the file specification. Vendor lock-in caused by undocumented binary files, or by documented ones without readily available readers that most people can use, is not something I want, having been a victim of this myself and of the sheer insanity of trying to get the data out.

For partial file loading, the natural thing would be to have something based on pages (separate files in a zipped folder seems common), but this wouldn't be enough because strokes can span more than one page. Maybe having optional files corresponding to blocks of 2^n * 2^n pages, each including only the strokes that don't fit into a smaller child block, would work. This would make it relatively easy to test for strokes that cross pages, something that would be useful for any page management functionality (in addition to having some concept of a page).
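One possible reading of the 2^n * 2^n idea, sketched in Rust: given the range of page columns and rows a stroke touches, find the smallest aligned block that contains it. The `Tile` type and `tile_for_stroke` are hypothetical names, and the sketch assumes pages are addressed by non-negative integer (column, row) indices.

```rust
/// An aligned block of 2^level * 2^level pages.
/// `col` and `row` are the block's indices at that level.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Tile {
    level: u32,
    col: u32,
    row: u32,
}

/// Find the smallest aligned block containing every page the stroke touches.
/// Arguments are inclusive page-index ranges of the stroke's bounding box.
fn tile_for_stroke(col_min: u32, col_max: u32, row_min: u32, row_max: u32) -> Tile {
    // Widen to u64 so shifting by 32 is well-defined.
    let differs =
        |a: u32, b: u32, level: u32| (u64::from(a) >> level) != (u64::from(b) >> level);

    let mut level: u32 = 0;
    // Grow the block until both ends of the range fall into the same block.
    while differs(col_min, col_max, level) || differs(row_min, row_max, level) {
        level += 1;
    }
    Tile {
        level,
        col: (u64::from(col_min) >> level) as u32,
        row: (u64::from(row_min) >> level) as u32,
    }
}
```

A stroke confined to a single page gets `level: 0`, so it can live in that page's file. Note that with aligned blocks, two adjacent pages do not always share a small parent: columns 2 and 3 meet at level 1, but columns 1 and 2 only meet at level 2.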

anesthetice commented 1 month ago

Thoughts on something like this?

[Image: Untitled-2024-08-11-2211]

anesthetice commented 4 weeks ago

[Image: Untitled-2024-08-14-1317]

flxzt commented 1 week ago

Alright, here's what I think:

* I think a magic number and the version being the first bytes of data are a good idea. Regarding the version field : Maybe we should put this into the header so we can just use the semver crate's de/serialization implementation as well? It seems to be using u64 for its version fields and contains additional fields for pre-release and build metadata (see here). It also seems to be more sophisticated and flexible in comparison to a custom solution - where I fear we'll run into issues we didn't anticipate at some point. The first bytes would then just be the magic number and the header size.
* The header itself : I am not sure about adding data about the author, creation and last modified date. I suppose if we wanted to add author data we could just add it to the Document struct instead? Creation date and last modified date sound like file properties to me, which maybe shouldn't be replicated in our metadata. Unless you have another specific reason for suggesting to add them? The compression and serialization format : is that not already encoded in the version information? I am leaning toward just having a single compression and serialization format for a single version; if we want to change either one we should just bump the file format version instead.
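As a rough sketch of the "magic number, then header size" variant described above: the magic bytes, the little-endian `u32` size field, and the names `write_file`/`split_file` are all assumptions for illustration, not what the pull request implements.

```rust
/// Hypothetical magic number; the thread does not specify the actual bytes.
const MAGIC: [u8; 8] = *b"RNOTEFMT";

#[derive(Debug, PartialEq)]
enum PreludeError {
    TooShort,
    BadMagic,
    HeaderOutOfBounds,
}

/// Lay out a file as: magic | header size (u32, little-endian) | header | body.
/// Returns `None` if the header is too large for the size field.
fn write_file(header: &[u8], body: &[u8]) -> Option<Vec<u8>> {
    if header.len() > u32::MAX as usize {
        return None;
    }
    let mut out = Vec::with_capacity(MAGIC.len() + 4 + header.len() + body.len());
    out.extend_from_slice(&MAGIC);
    out.extend_from_slice(&(header.len() as u32).to_le_bytes());
    out.extend_from_slice(header);
    out.extend_from_slice(body);
    Some(out)
}

/// Validate the first bytes and split the rest into (header, body)
/// without decompressing or deserializing anything.
fn split_file(bytes: &[u8]) -> Result<(&[u8], &[u8]), PreludeError> {
    let prelude_len = MAGIC.len() + 4;
    if bytes.len() < prelude_len {
        return Err(PreludeError::TooShort);
    }
    if bytes[..MAGIC.len()] != MAGIC[..] {
        return Err(PreludeError::BadMagic);
    }
    let mut size = [0u8; 4];
    size.copy_from_slice(&bytes[MAGIC.len()..prelude_len]);
    let header_len = u32::from_le_bytes(size) as usize;
    let header_end = prelude_len
        .checked_add(header_len)
        .ok_or(PreludeError::HeaderOutOfBounds)?;
    if header_end > bytes.len() {
        return Err(PreludeError::HeaderOutOfBounds);
    }
    Ok((&bytes[prelude_len..header_end], &bytes[header_end..]))
}
```

With this layout the header (wherever the version ends up living) can be read and inspected before touching the body, which is the forward-compatibility benefit discussed in this thread.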

I think a bigger talking point would be the format of the data itself.

Maybe for now we can stick to what we are already doing by just using what serde gives us - the strokes store without any modifications to its data layout - but I think it is worth discussing how the serialization layout can be different from the data layout of the application itself, especially with regard to partial file loading.

Doublonmousse commented 1 week ago

I think the compression being in the header was meant to accommodate saving to and loading from the raw JSON encoding (so that you can uncompress the data a little more easily if needed). It doesn't have to be inside the file format header (you could also have an option somewhere in the app and CLI to convert .rnote to .json and back).

I think it was also a way to test different compression methods more easily.

anesthetice commented 1 week ago

> * I think a magic number and the version being the first bytes of data are a good idea.
>   Regarding the version field : Maybe we should put this into the header so we can just
>   use the `semver` crate's de/serialization implementation as well? It seems to be using `u64` for its version fields and contains additional fields for pre-release and build metadata (see [here](https://github.com/dtolnay/semver/blob/master/src/lib.rs#L162)). It also seems to be more sophisticated and flexible in comparison to a custom solution - where I fear we'll run into issues we didn't anticipate at some point.
>   The first bytes would then just be the magic number and the header size.

Good point. Keeping the version separate from the header was done for flexibility (e.g. renaming fields without having to specify the old alias, changing the serialization method for the header, adding new fields that do not need to implement Default, etc.), but I could include a more complete version inside the prelude to allow the use of 'Prerelease' and 'BuildMetadata', i.e. [u64, u64, u64, Prerelease size (u8/u16), Prerelease (str), BuildMetadata size (u8/u16), BuildMetadata (str)].

edit : https://github.com/flxzt/rnote/pull/1177/commits/b8062ecaf56d3619ffffe381c39dc70012eff3b0
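A minimal sketch of how the prelude version layout listed above could be written and read back by hand. It assumes little-endian integers and `u16` for both size fields (the comment leaves u8 vs. u16 open); `PreludeVersion` and the function names are hypothetical and are not taken from the linked commit.

```rust
#[derive(Debug, Clone, PartialEq, Eq)]
struct PreludeVersion {
    major: u64,
    minor: u64,
    patch: u64,
    pre: String,   // pre-release identifier, may be empty
    build: String, // build metadata, may be empty
}

/// Layout: major | minor | patch | pre size (u16) | pre | build size (u16) | build.
/// Returns `None` if a string does not fit the u16 size field.
fn encode_version(v: &PreludeVersion) -> Option<Vec<u8>> {
    if v.pre.len() > u16::MAX as usize || v.build.len() > u16::MAX as usize {
        return None;
    }
    let mut out = Vec::new();
    for n in [v.major, v.minor, v.patch] {
        out.extend_from_slice(&n.to_le_bytes());
    }
    for s in [&v.pre, &v.build] {
        out.extend_from_slice(&(s.len() as u16).to_le_bytes());
        out.extend_from_slice(s.as_bytes());
    }
    Some(out)
}

/// Take `n` bytes from the front of `input`, advancing it.
fn take<'a>(input: &mut &'a [u8], n: usize) -> Option<&'a [u8]> {
    let slice: &'a [u8] = *input;
    if slice.len() < n {
        return None;
    }
    let (head, tail) = slice.split_at(n);
    *input = tail;
    Some(head)
}

fn read_u64(input: &mut &[u8]) -> Option<u64> {
    let mut buf = [0u8; 8];
    buf.copy_from_slice(take(input, 8)?);
    Some(u64::from_le_bytes(buf))
}

fn read_str(input: &mut &[u8]) -> Option<String> {
    let mut buf = [0u8; 2];
    buf.copy_from_slice(take(input, 2)?);
    let len = u16::from_le_bytes(buf) as usize;
    String::from_utf8(take(input, len)?.to_vec()).ok()
}

/// Returns `None` on truncated input or invalid UTF-8.
fn decode_version(mut input: &[u8]) -> Option<PreludeVersion> {
    Some(PreludeVersion {
        major: read_u64(&mut input)?,
        minor: read_u64(&mut input)?,
        patch: read_u64(&mut input)?,
        pre: read_str(&mut input)?,
        build: read_str(&mut input)?,
    })
}
```

Because every variable-length field carries its own size, a reader can skip or validate the version without knowing anything about the header's serialization format.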

> * The header itself : I am not sure about adding data about the author, creation and last modified date. I suppose if we would want to add author data we could just add it to the `Document` struct instead? Creation date and last modified date sound like file properties to me which maybe shouldn't be replicated in our metadata. Unless you have another specific reason for suggesting to add them?
>   The compression and serialization format : is that not already encoded in the version information? I am leaning toward just having a single compression and serialization format for a single version, if we want to change either one we should just bump the file format version instead.

What @Doublonmousse said is correct. Furthermore, this allows users to specify a compression level (passed to SavePrefs), and makes it trivial to add and test different compression and serialization methods.

(I can't take a screenshot of the options but they are as follows: Very High, High, Medium, Low, Very Low)

> I think a bigger talking point would be the format of the data itself.
>
> Maybe for now we can stick to what we are already doing by just using what serde gives us - the strokes store without any modifications to its data layout - but I think it is worth discussing how the serialization layout can be different from the data layout of the application itself, especially with regards to partial file loading.

I haven't worked on this aspect. However, the 'body' of the proposed file format is a vector of u8 instead of the current ijson value, which makes it very flexible.

For completeness, here's the latest Excalidraw image:

[Image: rnotefileformat]

eldipa commented 1 week ago

In addition to the version, you may want to add feature flags. The idea is to have 3 sets of flags: feature_compat, feature_ro_compat and feature_incompat. When reading a file:

* if a flag is set in feature_compat but rnote does not know about it, it is OK to keep reading and writing the file.
* if a flag is set in feature_ro_compat but rnote does not know about it, it is OK to keep reading but the file must not be modified in any way.
* if a flag is set in feature_incompat but rnote does not know about it, rnote must stop reading the file.

Feature flags allow more flexibility in communicating between versions of rnote.

For example, imagine that in version V2, rnote supports a non-backward-compatible feature that the user may or may not use. If the user does not use that feature, the file produced by rnote V2 should still be readable by V1; it is just a matter of checking that the flag associated with that non-backward-compatible feature is not set in feature_incompat.

Using versions alone, it is not possible (or at least harder) for V1 to know whether it is safe to open a file created by V2.

Of course, this idea is not mine; I borrowed it from the ext4 file system.
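The three rules above could be checked along these lines. This is a minimal sketch: the `u32` bitmask representation and the names `FeatureFlags`, `Access` and `access_for` are assumptions for illustration, not part of any rnote code.

```rust
/// Flag sets as stored in a file (one bit per feature; width is an assumption).
struct FeatureFlags {
    compat: u32,
    ro_compat: u32,
    incompat: u32,
}

#[derive(Debug, PartialEq, Eq)]
enum Access {
    ReadWrite,
    ReadOnly,
    Refuse,
}

/// Decide what a reader may do with a file, given the flag bits it understands.
/// Unknown bits in `compat` are deliberately ignored.
fn access_for(file: &FeatureFlags, known_ro_compat: u32, known_incompat: u32) -> Access {
    if (file.incompat & !known_incompat) != 0 {
        // An unknown incompatible feature is in use: stop reading.
        Access::Refuse
    } else if (file.ro_compat & !known_ro_compat) != 0 {
        // Safe to read, but writing could corrupt data we don't understand.
        Access::ReadOnly
    } else {
        Access::ReadWrite
    }
}
```

In the V1/V2 example, a V2 file that does not use the new feature leaves the corresponding `incompat` bit unset, so a V1 reader calling `access_for` still gets `Access::ReadWrite`.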

anesthetice commented 6 days ago

https://github.com/flxzt/rnote/commit/b8062ecaf56d3619ffffe381c39dc70012eff3b0

[Image: rnotefileformat]

edit: I accidentally used an older version of the drawing to update the prelude version text

flxzt / rnote

New file format (stabilization) #1173