bluesky-social / proposals

Bluesky proposal discussions

firehose2 #69

Open DavidBuchanan314 opened 1 week ago

DavidBuchanan314 commented 1 week ago

I'm working on a draft for how "firehose" bandwidth could be significantly reduced, without making any sacrifices in terms of authentication of data.

The gist of the changes is:

It's at a very early stage right now (I haven't yet written code to benchmark my changes), but in the interests of developing in the open and getting early feedback, I'm logging my progress here: https://github.com/DavidBuchanan314/firehose2/

devinivy commented 1 week ago

Nice, this is great. I have some initial reactions (personal reactions, not necessarily representative of the bsky team).

Not transmitting MST blocks (draft spec'd)

I am a fan of the approach. One upside that I appreciate is that it's quite tidy to be able to say producers and verifying consumers all just need to know how to write to a repo.

Adding a compression layer - likely based on zstd (TODO) (I will also investigate the likes of permessage-deflate, but I suspect doing compression at the application level will give more control and better overall results)

I'm into the idea of supporting compression. The main downside of doing something at the app level is that it could just be quite heavy on the protocol to specify. I expect it would have to include some form of negotiation, even if only for the purpose of adaptability/future-proofing.
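As a rough illustration of why the choice of compression scope matters, here is a minimal sketch comparing per-message compression with a single compression context shared across the whole stream. It uses `zlib` from the Python standard library as a stand-in for zstd, and the JSON event shape (`repo`, `seq`, `op`, `path`) is a made-up assumption for the demo, not the real firehose frame format.

```python
import json
import zlib


def make_events(n):
    # Hypothetical, simplified events; real firehose frames are not JSON.
    return [
        json.dumps({
            "repo": f"did:plc:example{i % 5}",
            "seq": i,
            "op": "create",
            "path": f"app.bsky.feed.post/{i:013d}",
        }).encode()
        for i in range(n)
    ]


def per_message_size(messages):
    # Each message is compressed independently, so no history is shared.
    return sum(len(zlib.compress(m)) for m in messages)


def shared_context_frames(messages):
    # One compressor for the whole stream. Z_SYNC_FLUSH ends each frame on a
    # byte boundary so the receiver can decode it as soon as it arrives.
    comp = zlib.compressobj()
    return [comp.compress(m) + comp.flush(zlib.Z_SYNC_FLUSH) for m in messages]


def decode_frames(frames):
    # The receiver must keep one decompressor alive for the whole stream.
    decomp = zlib.decompressobj()
    return [decomp.decompress(f) for f in frames]


events = make_events(200)
frames = shared_context_frames(events)
print("per-message bytes:   ", per_message_size(events))
print("shared-context bytes:", sum(len(f) for f in frames))
```

The shared context is where most of the savings come from on small, repetitive messages, but it also means both sides must agree on when the context starts and resets, which is part of what a negotiation step would have to cover.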

Even if we come up with something optimal outside of permessage-deflate, it would be useful to fit it into the WebSocket compression extension framework (RFC 7692). The sync protocol wouldn't strictly depend on a bespoke compression strategy: it could just say "you're welcome to use websocket compression extensions, we recommend this one". We'd inherit a system for negotiating the compression extension and its parameters (adaptability/future-proofing). The extension framework exists and I'm pretty sure it suits our problem. I don't think it's overly prescriptive in a way that would make the compression less effective, but I would be interested to learn if it is.
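For a sense of what that negotiation looks like, here is a minimal sketch of the server side picking an extension from a client's `Sec-WebSocket-Extensions` offer. The extension name `x-example-zstd` and its `level` parameter are hypothetical placeholders, not registered extensions; `permessage-deflate` and `client_max_window_bits` come from RFC 7692. The parser is simplified and assumes quoted parameter values contain no commas or semicolons.

```python
def parse_extensions(header):
    """Parse a Sec-WebSocket-Extensions value into [(name, params), ...]."""
    offers = []
    for item in header.split(","):
        parts = [p.strip() for p in item.split(";")]
        name, params = parts[0], {}
        for part in parts[1:]:
            if not part:
                continue
            key, _, value = part.partition("=")
            # Parameters without a value are treated as boolean flags.
            params[key.strip()] = value.strip().strip('"') or True
        if name:
            offers.append((name, params))
    return offers


def choose_extension(header, supported):
    """Return the first offered extension the server supports, else None."""
    for name, params in parse_extensions(header):
        if name in supported:
            return name, params
    return None


# Hypothetical client offer: a bespoke extension first, with a fallback.
offer = "x-example-zstd; level=3, permessage-deflate; client_max_window_bits"
print(choose_extension(offer, {"x-example-zstd", "permessage-deflate"}))
print(choose_extension(offer, {"permessage-deflate"}))
```

A server that does not know the bespoke extension simply falls through to `permessage-deflate` or to no compression at all, which is the adaptability the framework provides for free.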

Improving the compressibility of relevant on-wire formats

I'm fairly into this too. I believe you generally need to compute all the CIDs from block contents anyway as part of the verification process, in which case it's not all that useful to transport the CIDs themselves. This would of course mean departing from transporting CARs altogether, which I imagine could be on the table if we're talking about a proper v2. At the same time I imagine it's possible to go "too far" and come up with a weird structure by optimizing hard for compressibility, so would be interested to see what it looks like.
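To make the "compute the CIDs anyway" point concrete, here is a minimal sketch of deriving a CIDv1 string from raw block bytes using only the standard library. It assumes the blocks are DAG-CBOR hashed with SHA-256, and `index_blocks` is a hypothetical helper, not part of any existing library.

```python
import base64
import hashlib

CIDV1 = 0x01       # CID version 1
DAG_CBOR = 0x71    # multicodec code for dag-cbor
SHA2_256 = 0x12    # multihash code for sha2-256
DIGEST_LEN = 0x20  # 32-byte digest


def cid_for_block(block: bytes) -> str:
    """Derive the base32 CIDv1 string for a DAG-CBOR block."""
    digest = hashlib.sha256(block).digest()
    raw = bytes([CIDV1, DAG_CBOR, SHA2_256, DIGEST_LEN]) + digest
    # Multibase base32 (lowercase, unpadded) is marked by the "b" prefix.
    return "b" + base64.b32encode(raw).decode("ascii").lower().rstrip("=")


def index_blocks(blocks):
    """Hypothetical helper: rebuild the CID -> block map on the receiver."""
    return {cid_for_block(b): b for b in blocks}


# Two tiny CBOR values: the empty map, and the map {"a": 1}.
received = [bytes.fromhex("a0"), bytes.fromhex("a1616101")]
for cid in index_blocks(received):
    print(cid)
```

Since the receiver can rebuild the CID index from block contents alone, sending the CIDs alongside the blocks adds about 36 bytes of incompressible data per block without adding any information.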

opsCid

I'm interested to hear a little more about what opsCid is and isn't intended to help with. On today's firehose including full proofs, it would help consumers who only validate proofs against ops but don't keep repo structure locally—e.g. to protect from withheld/deletion ops. In this v2 it seems like it is superfluous in the case of full repo sync, but in the case of syncing a slice it helps verify the ops of an individual commit. If you're syncing a slice presumably you're trying not to witness every commit, though, and in those cases I suppose you don't know what you might be missing. I think it also means that you can't verify ops for firehose events that represent a coarse diff from multiple commits combined together, which is permitted today. I still feel like there could be some juice here, but could use help mapping it out.

Also, in the case of the full repo sync you can elide all the data blocks except the root, as you point out. I don't have a precise description of it, but I believe there is a generalization of that which applies to syncing repo slices. In the case of syncing a collection, I believe you can elide all the data blocks except those on the "boundary" of the collection up to the root. I haven't worked out whether you can still usually avoid transmitting data blocks if you are always writing to the right side of a collection, as we typically do since TIDs are monotonic, or whether you may need to transmit some blocks when writing to a collection adjacent to the one you want to sync. I think it could be worth mapping this all out, though, to see if there's something we can exploit there.
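As a starting point for mapping this out, here is a small sketch of the key ordering involved. It models only the sorted key space (`collection/rkey`), not the MST node layout, so it says nothing definitive about which blocks change. The TID encoding assumes the 13-character base32-sortable form (microsecond timestamp plus a 10-bit clock ID), and `make_tid` and `collection_range` are hypothetical helpers.

```python
import bisect

B32_SORTABLE = "234567abcdefghijklmnopqrstuvwxyz"


def make_tid(timestamp_us: int, clock_id: int = 0) -> str:
    """Encode a microsecond timestamp and clock ID as a 13-character TID."""
    n = (timestamp_us << 10) | (clock_id & 0x3FF)
    return "".join(B32_SORTABLE[(n >> (5 * i)) & 31] for i in range(12, -1, -1))


def collection_range(keys, collection):
    """Return the [lo, hi) slice of sorted keys belonging to one collection."""
    lo = bisect.bisect_left(keys, collection + "/")
    # "0" is the character right after "/", so this bounds the prefix range.
    hi = bisect.bisect_left(keys, collection + "0")
    return lo, hi


base = 1_700_000_000_000_000  # arbitrary example timestamp in microseconds
keys = sorted(
    f"{coll}/{make_tid(base + i)}"
    for coll in ("app.bsky.feed.like", "app.bsky.feed.post", "app.bsky.graph.follow")
    for i in range(3)
)

lo, hi = collection_range(keys, "app.bsky.feed.post")
new_key = "app.bsky.feed.post/" + make_tid(base + 1000)
insert_at = bisect.bisect_left(keys, new_key)

print("collection occupies indices", lo, "to", hi - 1)
print("new record lands at index", insert_at)
print("its right-hand neighbour is", keys[insert_at])
```

A new record with a later TID always lands at the right edge of its collection's range, directly next to the first key of the following collection. That adjacency is why a write to one collection might touch nodes shared with its neighbour, and whether it does depends on the tree layout, which this sketch does not model.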