
naive thoughts on repo storage as append-only logs #430

Closed bnewbold closed 1 year ago

bnewbold commented 1 year ago

Unsolicited, I was musing about how to handle bulk storage of repo content in a way that would allow servicing sync requests like the current com.atproto.sync.getRepo, with or without a from commit argument.

One possible repo storage format is to keep pseudo-CARv1 files per repo (aka, per DID). The specific ordering of blocks would be to sequentially store commits as a "diff" of the tree compared to the previous commit's state of the repo. Each diff would include the root/commit blocks, any new unique MST nodes, and any new/changed record blocks. Whether media blobs are stored in the same repo file is optional. New commits are appended in this "diff" format to the file (which is allowed by CARv1, if we don't care about the "root CID" in the header). A file offset index would be maintained separately, mapping from commit hash to the starting byte offset in the file; presumably the current "most recent" commit would also be indexed/stored elsewhere.
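To make the framing concrete, here is a minimal sketch of the append path in TypeScript. This is not actual atproto code; appendCommitDiff and the single-file layout are hypothetical. Blocks are framed CARv1-style: a varint length prefix, then the CID bytes, then the block data.

```ts
import { appendFileSync, existsSync, statSync } from 'node:fs'

// Unsigned LEB128 varint: the length prefix CARv1 puts before each block.
function varint(n: number): Uint8Array {
  const out: number[] = []
  while (n >= 0x80) {
    out.push((n & 0x7f) | 0x80)
    n >>>= 7
  }
  out.push(n)
  return Uint8Array.from(out)
}

// One block: raw CID bytes plus the block's data bytes.
type Block = { cid: Uint8Array; bytes: Uint8Array }

// Append one commit's diff (commit block, new MST nodes, new/changed records)
// to the per-repo log and return the byte offset where the diff begins. The
// caller records that offset in the commit-hash -> offset index.
function appendCommitDiff(logPath: string, blocks: Block[]): number {
  const startOffset = existsSync(logPath) ? statSync(logPath).size : 0
  for (const { cid, bytes } of blocks) {
    const body = new Uint8Array(cid.length + bytes.length)
    body.set(cid, 0)
    body.set(bytes, cid.length)
    appendFileSync(logPath, varint(body.length))
    appendFileSync(logPath, body)
  }
  return startOffset
}
```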

When a from repo sync request is received, the server would look up the starting byte offset and current root commit. It would write a CARv1 header (with the current root commit, presumably), then just return all the remaining contents of the file as bytes. These files could be stored in, eg, S3 buckets, or as files on local disk, or anywhere HTTP-accessible, with byte range requests used to fetch just the blocks needed. If the full history is requested, send the full sequence of diffs, which represents the full repo history with no duplication or assembly required.
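A sketch of that read path, under the same assumptions (and reusing the varint helper from the sketch above): write a fresh CARv1 header carrying the current root, then stream the raw log tail starting at the requested commit's offset. The offset and root lookups are left abstract here.

```ts
import { createReadStream } from 'node:fs'
import type { Writable } from 'node:stream'
import * as dagCbor from '@ipld/dag-cbor'
import type { CID } from 'multiformats/cid'

// CARv1 header: varint length prefix + DAG-CBOR of { version: 1, roots }.
function carHeader(root: CID): Uint8Array {
  const body = dagCbor.encode({ version: 1, roots: [root] })
  const prefix = varint(body.length) // varint helper from the earlier sketch
  const out = new Uint8Array(prefix.length + body.length)
  out.set(prefix, 0)
  out.set(body, prefix.length)
  return out
}

// Serve getRepo with a `from` commit: a new header naming the *current* root,
// followed by every byte of the log from that commit's diff onward.
function serveRepoFrom(
  logPath: string,
  fromOffset: number, // looked up in the commit-hash -> offset index
  currentRoot: CID,   // looked up wherever the "most recent" commit lives
  res: Writable,
) {
  res.write(carHeader(currentRoot))
  createReadStream(logPath, { start: fromOffset }).pipe(res)
}
```

Against S3 or any other HTTP-accessible store, the same tail read becomes a byte-range request starting at fromOffset.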

This is sort of assuming that repo sync requests are even frequent or important enough to worry about at all. Eg, if streams of updates ("firehose") are already being pushed between federated servers anyway, this would only matter in the context of re-indexing or re-synchronizing huge collections of repos. But that will probably happen! And some PDS instances are likely to end up with way more "sync" read traffic than other requests (eg, imagine a PDS for official government announcements, or news sources, or celebrities: few posts, many readers).

Advantages:

Disadvantages:

dholms commented 1 year ago

Yup, good intuition on this. @whyrusleeping put together something similar in the gosky code & I'm working on something along the same lines here soon.

on disadvantages:

bnewbold commented 1 year ago

S3: having so many tiny files (and the corresponding number of requests needed) feels like it could be a problem, even in cloud object storage. Maybe a compaction process could sweep through periodically? Or some kind of caching for writes? That increases complexity. Depends on the ratio of reads to writes, I guess. Writes are probably spike-y (user sessions with multiple follows/likes/posts).

Current state: yup. I think it might not be too expensive for a service to scan through the full repo history and build the current tree in memory. And for large repo histories, the result of this could be checkpointed/cached (eg, every 10k commits or something), and then later requests would start at that commit and continue from there.
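A minimal sketch of that checkpointing idea; all names here are hypothetical, and applyCommit/emptyState stand in for whatever folds one commit diff into the materialized tree:

```ts
const CHECKPOINT_INTERVAL = 10_000 // eg, snapshot every 10k commits

// Stand-in types: one commit's diff, and the materialized repo state.
type CommitDiff = unknown
type RepoState = unknown

// Replay commit diffs into the current tree, resuming from the newest cached
// snapshot and caching a new one every CHECKPOINT_INTERVAL commits.
function replay(
  diffs: CommitDiff[],
  checkpoints: Map<number, RepoState>, // commit index -> cached snapshot
  emptyState: () => RepoState,
  applyCommit: (state: RepoState, diff: CommitDiff) => RepoState,
): RepoState {
  // Resume from the highest checkpoint at or below the history length.
  let start = 0
  let state = emptyState()
  for (const [idx, snap] of checkpoints) {
    if (idx > start && idx <= diffs.length) {
      start = idx
      state = snap
    }
  }
  for (let i = start; i < diffs.length; i++) {
    state = applyCommit(state, diffs[i])
    if ((i + 1) % CHECKPOINT_INTERVAL === 0) checkpoints.set(i + 1, state)
  }
  return state
}
```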

An alternative way to go would be keeping just the commit metadata and records in some other datastore (not including MST nodes), and generating the MST on demand when needed (verifying it against the commit).
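A sketch of that shape, with everything atproto-specific left abstract: buildMstRoot is a stand-in for real MST construction from (key, record-CID) entries, and the check is against whichever tree-root CID the signed commit names.

```ts
import type { CID } from 'multiformats/cid'

// Regenerate the MST from stored records and check its root against the tree
// root claimed by the signed commit. Throws if the rebuilt tree doesn't match.
async function verifiedTreeRoot(
  records: Array<{ key: string; cid: CID }>, // from the record datastore
  commitTreeRoot: CID,                       // tree-root CID named by the commit
  buildMstRoot: (entries: Array<{ key: string; cid: CID }>) => Promise<CID>,
): Promise<CID> {
  const root = await buildMstRoot(records)
  if (!root.equals(commitTreeRoot)) {
    throw new Error('regenerated MST does not match the commit')
  }
  return root
}
```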