A better way to handle binary data

cinnamon-bun commented 3 years ago

What if you want to store binary data in Earthstar documents?

Storing binary data in Earthstar today

Base64-encode your binary data to turn it into UTF-8. Use the document path to infer if the document is binary or not (e.g. with a file extension: /foo/bar/cat.jpg is likely binary)

The Earthstar Spec discusses base64 encoding details and binary document content

What could be better

Base64 uses 1.3x the space of the original binary data, and converting a large document from base64 takes some unnecessary CPU & memory.

It's weird to have to guess the text/binary type from the document path. It should be explicit.

How we got here

We started off with JSON because we're trying to "use boring technology that everybody knows" and reduce dependencies.
JSON can't hold binary data, you have to base64 encode it
GraphQL and other things are built on JSON

Upgrading Earthstar

There are 4 places to consider encoding in Earthstar, and they're independently adjustable. We're not locked into a certain encoding like Scuttlebutt is.

The Earthstar Spec has a long section about this.

A. Hashing and signing: currently a simple custom serialization format which works with binary data
B. Document schema: currently content is required to be valid utf-8
C. Transport encoding: currently JSON
D. Storage encoding: currently SQLite and also JSON-in-localStorage

A and B are core details that need to be very standardized.

C and D are library or app-level details.

Ways to change the document schema

Option P: add a datatype field

    datatype: "binary" | "text",
    content: Buffer | UTF8String,  (or just buffer, to be interpreted as string)

Option Q: Add a mimetype field (image/jpeg, application/json, ...). This would also help set HTTP headers properly when serving data. On the other hand this data is not present in regular filesystem files so it would be harder to store to and load from files. I suspect this would frequently be set to the wrong value by app authors.

    mimetype: string,
    content: Buffer | UTF8String,

Option R: Just only store binary buffers, and make apps figure out when to interpret it as UTF-8. Unlike the other options which store whether a document is binary or text, this one eliminates JSON as a transport option unless we're willing to base64 encode every document, even the text ones:

    content: Buffer,

Option S: Documents could have both text and data fields??

    text: UTF8String,
    data: Buffer,

Encoding & Serialization Options

We need an encoding that can hold binary data; it would be nice if it also had a separate type for text data but that's not a hard requirement.

Bonus points for simplicity, standardization, wide support, and determinism.

bencode -- only stores buffers, no separate text type. Simple. Deterministic.
CBOR -- not simple, can be made deterministic
msgpack -- not simple
BSON -- not simple, not standardized
JSON -- no buffer type, have to base64 encode
protobuf
capnproto
flatbuffers
thrift

Library & API issues

Should the main Javascript Earthstar library return Buffers (node style) or Uint8Array/ArrayBuffer (browser style)?

cinnamon-bun commented 3 years ago

Of course there's also the option of keeping your binary data outside of Earthstar, on some other protocol (IPFS? HTTP?) and just storing the hashes or links in Earthstar. But it would be better if Earthstar could handle it.

BTW, I'm also planning to limit the size of each document to something like ... 3 megabytes? 10 megabytes? If you have larger data to store you'll need to break it into chunks.

That will let us re-use the existing sync system instead of having to write some special code to handle efficient change detection in very large documents.

RangerMauve commented 3 years ago

Regarding the document schema, one thing I've learned from looking at hyper and IPFS was that they default to buffers and provide a encoding field when you read the data to parse it as utf-8 or whatever else. I think it'd be straightforward to have something similar in earthstar.

Regarding the encodings, I think protobufs are probably the most common in the node ecosystem right now, but I personally really like bencoding because you don't need schemas and it generally does less (which in my mind means less to reimplement and learn about).

cinnamon-bun commented 3 years ago

I've been working out options in this rather impossible to understand diagram.

https://www.figma.com/file/mR3pOksIOi38gZxdIxlStj/2020-07-Earthstar-binary-vs-text-issue?node-id=0%3A1

We have to separately consider encodings for storage, network, in-memory, and signing as described here in the specification.

Luckily for signing we only use the contentHash, so as long as we have a consistent way to hash the content (e.g. always as a buffer) we can store it anyway we want without affecting signatures.

Goals:

Works with JSON somehow, which can't hold raw binary data
Can use raw binary data when the underlying context can support it, for efficiency
Feels simple

My favorite two options right now are

1. Add an encoding field, and let the `content` field hold several types

(Note none of the fields shown are used in signing the document, only the contentHash is used for that, so these fields can vary in different contexts.)

// in javascript
contentEncoding: "utf8" | "base64"
content: string | Buffer

// in JSON
contentEncoding: "utf8" | "base64"
content: a string holding either utf8 or base64 encoded buffer

Downside: the content field can hold more than one type which might be annoying in protobuf, SQL, or some programming languages.

2. Use two different content fields. Only one will exist on any given document

// in javascript
contentUtf8: "hello world"
  or
contentBuffer: a Buffer

// in JSON
contentUtf8: "hello world"
  or
contentBase64: "eEh+wdHw=="

Downside: We are trying to avoid optional fields. And they have different names in different contexts.

Thoughts

We will need to conceptually separate the document fields into two categories:

Core fields used in the hashing & signing
Fields ignored by hashing for various reasons: content, signature, and the new fields like contentType because they vary by context

We need to do this anyway because, coming soon, there's 3. local metadata about a document such as received timestamp and number of peers we've uploaded it to.

We're going to need more variations on the Document type.

RangerMauve commented 3 years ago

What about always base64 encoding the content and treating it as binary? Then when you get the content you can specify the encoding you want to use to decode it.

cinnamon-bun commented 3 years ago

@RangerMauve That's an option too, though I didn't consider it very much because of these downsides:

The JSON is no longer user-readable when it contains text, making debugging harder especially for new programmers
Everything gets 33% larger

The plus side is we don't have to know what's binary and what isn't. But the application still needs to know e.g. if it should handle something as a JPG or Markdown.

RangerMauve commented 3 years ago

Regarding jpg vs markdown, I still think that extensions in the path to the document would be a good hint for mime-type.

Like if it ends in .html, it's probably an HTML file.

Everything being 33% larger kinda sucks. :/ How do you feel about msgpack or bencoding or the such for the representation instead of JSON? JSON is nice for human readability but it really makes it hard for anything binary. 😅

cinnamon-bun commented 3 years ago

In the spirit of "use boring tech" I want to try really hard to keep supporting JSON. We might use other encodings in some settings (maybe for p2p streams, for more efficient storage, ...?) but I don't want to lose the possibility of using JSON.

JSON is "withered technology" in a good way

I think if we just include the info somewhere in the document about if the data is binary or UTF-8, everything will work out.

sgwilym commented 2 years ago

I'm now beginning the work of adding support for large blobs to Earthstar. This is a feature request that comes up nearly every time someone is introduced to Earthstar, and we've been talking about how to do it since Earthstar's beginning.

The current situation

Earthstar currently has a 4mb limit on a document's content field.

Thecontent must also be a string.

This means binary data (e.g. an image) has to be encoded to base64 — making it 33% bigger — to get into an Earthstar document.

One of the reasons we haven't done it yet is because we feel very strongly about Earthstar being simple, conceptually and technically. This simplicity makes Earthstar easy to use, and easier for us to predict the second order effects of our designs so that we can mitigate harm to users.

Finding a path towards adding blobs has been hard: how do we get from Earthstar's current model of working with small JSON-serialisable documents, to doing the same with big blobs of binary?

In this issue, we've asked ourselves:

How do we update the Earthstar document format so that it can store bytes as well as strings?
How can we signal what the type of data is that has been stored?
How do we transport these new, byte-ful documents from one peer to another?

There are also some other questions I don't think we've brought up here:

How do we conserve system memory?

When you query documents from Earthstar today, you get an array with complete documents including their content so you can begin working with them right away. On a lower level, this means the documents' contents are all loaded into system memory. This is fine when working with documents containing text, JSON, etc, but what if you're working with a gallery of hi-res imagery? Or a movie? Performance will degrade quickly.

My takeaway: Earthstar will need to have APIs for streaming data associated with a document chunk by chunk..

How do we respect peers who don't want all that data?

With lots of data going over the network it becomes important that peers cannot overwhelm each other by syncing huge amounts of data. Many peers run in environments with constrained storage, e.g. browsers. These peers need to be able to choose which blobs they're interested in just so that they can run.

My takeaway: Earthstar will need to be able to sync sparsely, i.e. transmit metadata without the bytes themselves.

How do we store all this data?

Replicas are able to persist their documents to different storage systems (e.g. Sqlite, LocalStorage) using drivers. The persistence layers we use are well-suited to storing document data, but not suited to storing blobs. I've benchmarked Earthstar query performance while an increasing amount of blobs have been inserted into a Sqlite table, and the performance begins to drop off very fast.

My takeaway: Blobs need to be stored in systems suited to object, rather than document storage. E.g. the filesystem.

My wishlist

All that considered, here's my wishlist:

[ ] The current Earthstar document format remains as it does today.
[ ] Documents remain JSON serialisable
[ ] Peers can obtain blob metadata without getting the blob itself.
[ ] Blobs are not loaded into memory when queried.
[ ] Blobs are stored appropriately.
[ ] Blobs can be streamed chunk by chunk.

Here's my plan to make that all come true.

New format: `esblob.1`

The idea is to introduce a new esblob.1 format to complement the existing es.4.

[x] The current Earthstar document format remains as it does today.

esbytes documents will look like this:

{
  path: "/music/icecream_and_booze.mp3",
  author: "@suzy.bklqpp6wuzv4t4qynjqvd2o7gaefk4776cb67fwo34xu6jfgwyaza",
  size: 35000203,
  timestamp: 1643809349409000,
  deleteAfter: null,
  contentHash: "bgngqc33vltlnywgfhkdoda4if6hmct2s7mctiwehzcs63vbmq63q",
  signature: "bcjpcgxxo7o5setztnoo4mgemlhh7ixloz6tnrdm24xf3gy62nbiafszdjpsdlildyqd6cp5gp46jxb4nummkxjmki4whnlgcjew4caa",
  format: "esblob.1",
  share: "+myshare.a123",
}

There is no content on a esblob document. The document is just a reference to the data itself.

[x] Documents remain JSON serialisable

[x] Peers can obtain blob metadata without getting the blob itself.

[x] Blobs are not loaded into memory when queried.

We still want to obtain the actual blob itself.

New replica object storage

When two peers sync, they will be able to request the blobs they're interested from one another using the esblob doc's signature as a reference.

We'll use signatures rather than content hashes so that outsiders have no way to determine whether a peer holds data they already know the hashes of.

Peers will open a separate connection to obtain those blobs, and pipe them to the replica's chosen method of storing those bytes.

Users will be able to specify how a replica should persist blobs using replica object storage drivers: they could be written to disk, or kept in memory if we know we're only downloading small things like avatars.

[x] Blobs are stored appropriately.

New byte APIs

Users will be able to access blob data using new APIs:

replica.getBytes(path: string): Uint8Array | null | undefined
replica.getStream(path: string): ReadableStream | null | undefined

null indicates the document exists but the associated data has been deleted, undefined indicates that there is no document at the given path.

These APIs will access the underlying object storage driver.

[x] Blobs can be streamed chunk by chunk.

We'll use the existing set method to write new esblob documents:

{
  format: 'es.blob',
  bytes: Uint8Array | ReadableStream,
  path: string
}

Not interested in blobs? Don't use the esblob validator when instantiating a replica.

Here's my planned path to implementing this, from a vertigo-inducingly high level:

Implement multi-format validator support for replicas, so that es.4 and esblob.1 can be used at once.
Implement the new esblob format validator.
Update replicas: add optional object storage drivers to replicas, new APIs for bytes.
Update syncing so that peers can request blobs from each other

There is a lot to do before I even touch implementing blobs themselves, so feedback is more than welcome.

sgwilym commented 2 years ago

I'm going a slightly different path. Rather than have two parallel formats (es.4 and esblob.1), I am instead opting for a new es.5 format with the capabilities of both.

The simplest reason for this is that maintaining two concurrent formats would require implementing certain features twice (such as support for ephemeral docs).

Another reason is that I would like to introduce a new convention where docs with attachments must have a . in their path (/my_song.mp3), to make docs with attachments easier to distinguish by path. es.4 has no notion of this rule, and so es.4 docs can pollute this space.

Final reason: I would like docs with attachments to have some kind of description attached to them so that you can know what some attachment is without having to download it. So es.5 has a text field (similar to es.4's content field) which can be used for this purpose (or for something else if no attachment is present).

earthstar-project / earthstar