
naive thoughts on repo storage as append-only logs #430

Closed bnewbold closed 1 year ago

bnewbold commented 1 year ago

Unsolicited, I was musing about how to handle bulk storage of repo content in a way that would allow servicing sync requests like the current com.atproto.sync.getRepo, with or without a from commit argument.

One possible repo storage format is to keep pseudo-CARv1 files per repo (aka, per DID). The specific ordering of blocks would be to sequentially store commits as a "diff" of the tree compared to the previous commit's state of the repo. Each diff would include the root/commit blocks, any new unique MST nodes, and any new/changed record blocks. Whether media blobs are stored in the same repo file is optional. New commits are appended in this "diff" format to the file (which is allowed by CARv1, if we don't care about the "root CID" in the header). A file offset index would be maintained separately, mapping from commit hash to the starting byte offset in the file; presumably the current "most recent" commit would also be indexed/stored elsewhere.
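To make the framing concrete, here is a minimal sketch of the append path in TypeScript. This is not actual atproto code; appendCommitDiff and the single-file layout are hypothetical. Blocks are framed CARv1-style: a varint length prefix, then the CID bytes, then the block data.

```ts
import { appendFileSync, existsSync, statSync } from 'node:fs'

// Unsigned LEB128 varint: the length prefix CARv1 puts before each block.
function varint(n: number): Uint8Array {
  const out: number[] = []
  while (n >= 0x80) {
    out.push((n & 0x7f) | 0x80)
    n >>>= 7
  }
  out.push(n)
  return Uint8Array.from(out)
}

// One block: raw CID bytes plus the block's data bytes.
type Block = { cid: Uint8Array; bytes: Uint8Array }

// Append one commit's diff (commit block, new MST nodes, new/changed records)
// to the per-repo log and return the byte offset where the diff begins. The
// caller records that offset in the commit-hash -> offset index.
function appendCommitDiff(logPath: string, blocks: Block[]): number {
  const startOffset = existsSync(logPath) ? statSync(logPath).size : 0
  for (const { cid, bytes } of blocks) {
    const body = new Uint8Array(cid.length + bytes.length)
    body.set(cid, 0)
    body.set(bytes, cid.length)
    appendFileSync(logPath, varint(body.length))
    appendFileSync(logPath, body)
  }
  return startOffset
}
```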

When a from repo sync request is received, the server would look up the starting byte offset and current root commit. It would write a CARv1 header (with the current root commit, presumably), then just return all the remaining contents of the file as bytes. These files could be stored in, eg, S3 buckets, or as files on local disk, or anywhere HTTP-accessible, with byte range requests used to fetch just the blocks needed. If the full history is requested, send the full sequence of diffs, which represents the full repo history with no duplication or assembly required.
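A sketch of that read path, under the same assumptions (and reusing the varint helper from the sketch above): write a fresh CARv1 header carrying the current root, then stream the raw log tail starting at the requested commit's offset. The offset and root lookups are left abstract here.

```ts
import { createReadStream } from 'node:fs'
import type { Writable } from 'node:stream'
import * as dagCbor from '@ipld/dag-cbor'
import type { CID } from 'multiformats/cid'

// CARv1 header: varint length prefix + DAG-CBOR of { version: 1, roots }.
function carHeader(root: CID): Uint8Array {
  const body = dagCbor.encode({ version: 1, roots: [root] })
  const prefix = varint(body.length) // varint helper from the earlier sketch
  const out = new Uint8Array(prefix.length + body.length)
  out.set(prefix, 0)
  out.set(body, prefix.length)
  return out
}

// Serve getRepo with a `from` commit: a new header naming the *current* root,
// followed by every byte of the log from that commit's diff onward.
function serveRepoFrom(
  logPath: string,
  fromOffset: number, // looked up in the commit-hash -> offset index
  currentRoot: CID,   // looked up wherever the "most recent" commit lives
  res: Writable,
) {
  res.write(carHeader(currentRoot))
  createReadStream(logPath, { start: fromOffset }).pipe(res)
}
```

Against S3 or any other HTTP-accessible store, the same tail read becomes a byte-range request starting at fromOffset.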

This is sort of assuming that repo sync requests are even frequent or important enough to worry about at all. Eg, if streams of updates ("firehose") are already being pushed between federated servers anyway, this would only matter in the context of re-indexing or re-synchronizing huge collections of repos. But that will probably happen! And some PDS instances are likely to end up with way more "sync" read traffic than other requests (eg, imagine a PDS for official government announcements, or news sources, or celebrities: few posts, many readers).

Advantages:

Disadvantages:

dholms commented 1 year ago

Yup, good intuition on this. @whyrusleeping put together something similar in the gosky code & I'm working on something along the same lines here soon.

on disadvantages:

bnewbold commented 1 year ago

S3: having so many tiny files (and the corresponding number of requests needed) feels like it could be a problem, even in cloud object storage. Maybe a compaction process could sweep through periodically? Or some kind of caching for writes? That increases complexity. Depends on the ratio of reads to writes, I guess. Writes are probably spike-y (user sessions with multiple follows/likes/posts).

Current state: yup. I think it might not be too expensive for a service to scan through the full repo history and build the current tree in memory. And for large repo histories, the result of this could be checkpointed/cached (eg, every 10k commits or something), and then later requests would start at that commit and continue from there.
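A minimal sketch of that checkpointing idea; all names here are hypothetical, and applyCommit/emptyState stand in for whatever folds one commit diff into the materialized tree:

```ts
const CHECKPOINT_INTERVAL = 10_000 // eg, snapshot every 10k commits

// Stand-in types: one commit's diff, and the materialized repo state.
type CommitDiff = unknown
type RepoState = unknown

// Replay commit diffs into the current tree, resuming from the newest cached
// snapshot and caching a new one every CHECKPOINT_INTERVAL commits.
function replay(
  diffs: CommitDiff[],
  checkpoints: Map<number, RepoState>, // commit index -> cached snapshot
  emptyState: () => RepoState,
  applyCommit: (state: RepoState, diff: CommitDiff) => RepoState,
): RepoState {
  // Resume from the highest checkpoint at or below the history length.
  let start = 0
  let state = emptyState()
  for (const [idx, snap] of checkpoints) {
    if (idx > start && idx <= diffs.length) {
      start = idx
      state = snap
    }
  }
  for (let i = start; i < diffs.length; i++) {
    state = applyCommit(state, diffs[i])
    if ((i + 1) % CHECKPOINT_INTERVAL === 0) checkpoints.set(i + 1, state)
  }
  return state
}
```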

An alternative way to go would be keeping just the commit metadata and records in some other datastore (not including MST nodes), and generating the MST on demand when needed (verifying it against the commit).
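A sketch of that shape, with everything atproto-specific left abstract: buildMstRoot is a stand-in for real MST construction from (key, record-CID) entries, and the check is against whichever tree-root CID the signed commit names.

```ts
import type { CID } from 'multiformats/cid'

// Regenerate the MST from stored records and check its root against the tree
// root claimed by the signed commit. Throws if the rebuilt tree doesn't match.
async function verifiedTreeRoot(
  records: Array<{ key: string; cid: CID }>, // from the record datastore
  commitTreeRoot: CID,                       // tree-root CID named by the commit
  buildMstRoot: (entries: Array<{ key: string; cid: CID }>) => Promise<CID>,
): Promise<CID> {
  const root = await buildMstRoot(records)
  if (!root.equals(commitTreeRoot)) {
    throw new Error('regenerated MST does not match the commit')
  }
  return root
}
```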