automerge / automerge

A JSON-like data structure (a CRDT) that can be modified concurrently by different users, and merged again automatically.
https://automerge.org
MIT License
3.93k stars 172 forks source link

load_incremental() does not communicate partial success #354

Open orionz opened 2 years ago

orionz commented 2 years ago

If a load_incremental() has some valid changes and some corrupt data this is not surfaced to the user. While applying what changes are valid seems like the correct course we should communicate this information. The current return value of load_incremental() is the number of ops loaded - this isn't very useful information.

My proposal: if the data has zero valid changes - return an error otherwise return the number of valid bytes processed - if the data was totally valid this will equal the length of the buffer passed in

orionz commented 2 years ago

What should load() do in the case of a partial load?

ept commented 2 years ago

Good question. An all-or-nothing (atomic) approach might be easiest to understand: that is, if an error occurs anywhere, the function should return an error and leave the document unmodified, like a transaction abort in a database.

However, an alternative argument would be that we should try and recover as much as we can of a corrupted file. That would allow a storage model where an app simply appends changes to a file on the filesystem, and if a crash occurs halfway through writing a change, we can still read most of the file and only need to ignore the incompletely written change at the end. This is similar to what a database would do with the WAL on recovering from a crash.

Your proposal of returning the number of bytes processed would satisfy the latter use case, and it would allow the storage layer to truncate the file to keep the portion of the file that is intact, and discard everything from the corrupted record onwards. That may of course lead to data loss, but arguably it would lose less data than refusing to load the file entirely, so maybe that's better?

For load() I think you could take the same approach of returning the number of bytes successfully loaded. This would make sense especially if the data being loaded contains a mixture of a compressed document and a bunch of changes that have been applied on top of it, or even two compressed documents that have been concatenated.

orionz commented 2 years ago

My proposal would be to have load_incremental() return the number of bytes loaded (in rust and js) as its current return value isn't very useful to anyone and I dont think it'll be missed. load() on the other hand would be burdened with a more complex return type. My thought would be not to change the return type by add a public bytes_loaded field on a doc (which would be 0 for docs created instead of loaded) and allow users to access this value (rather than having to unpack a tuple every time they load a doc).