bluesky-social / atproto

Social networking technology created by Bluesky
Other
6.48k stars 452 forks source link

BlobRefs are not unambiguously representable as JSON #1070

Closed DavidBuchanan314 closed 1 year ago

DavidBuchanan314 commented 1 year ago

According to https://atproto.com/specs/atp#repo-records

Repo records are CBOR-encoded objects (using only JSON-compatible CBOR types).

BlobRefs , which can be contained within various repo record types that use the LexBlob type, are encoded to CBOR using the native CID type (i.e. tagged bytes) in the ref field.

https://github.com/bluesky-social/atproto/blob/37af595a730ef5df67355b3d034634e7111ccef3/packages/lexicon/src/blob-refs.ts#L30

afaict it is currently unspecified how to encode a CID into JSON, but in practice the current implementation converts to a multibase string (which, by the way, is not quite what DAG-JSON does). Looking at the resultant JSON, it is impossible to tell whether the CID string you're seeing came from a CBOR string type, or a CBOR CID type (or in other words, you could have two different CBOR objects that decode into identical JSON objects).

On the other hand, StrongRefs directly encode CIDs as strings.

https://github.com/bluesky-social/atproto/blob/37af595a730ef5df67355b3d034634e7111ccef3/lexicons/com/atproto/repo/strongRef.json#L11

At the very least, there is an inconsistency with how StrongRef CIDs and BlobRef CIDs are encoded as CBOR.

IMHO, BlobRefs should be reworked to look more like StrongRefs (representing CIDs as strings), but this would be a breaking change to how records are serialized to CBOR (but the JSON representation can remain unchanged!).

AaronGoldman commented 1 year ago

Why not just use DAG-JSON and DAG-CBOR? {"/": "<CID>"}

DavidBuchanan314 commented 1 year ago

That does seem like a reasonable thing to do, although it breaks the "using only JSON-compatible CBOR types" constraint. I assume this was a deliberate choice so as not to require DAG-JSON as a dependency (which I certainly appreciated), but maybe that's open to reconsideration.

devinivy commented 1 year ago

First, here's the deal with the CID types in lexicon:

We explored using DAG-JSON to represent CID links but it's not where we landed on it. Worth noting that the JSON you find throughout lexicon and the APIs are already not valid DAG-JSON partly because we haven't adopted the constraint that JSON be serialized to some canonical bytes. We do use DAG-CBOR for that purpose, though. And the lexicon data model is self-describing in such a way that the translation from JSON to DAG-CBOR is well-defined. Any time we need canonical bytes for a record (e.g. to generate a CID) we translate from JSON to DAG-CBOR. The repository also represents records in CBOR.

DavidBuchanan314 commented 1 year ago

Ahhh I missed the $link, that makes more sense, and I now see how the self-describing aspect works there.