BlobRefs are not unambiguously representable as JSON

DavidBuchanan314 commented 1 year ago

According to https://atproto.com/specs/atp#repo-records

Repo records are CBOR-encoded objects (using only JSON-compatible CBOR types).

BlobRefs , which can be contained within various repo record types that use the LexBlob type, are encoded to CBOR using the native CID type (i.e. tagged bytes) in the ref field.

https://github.com/bluesky-social/atproto/blob/37af595a730ef5df67355b3d034634e7111ccef3/packages/lexicon/src/blob-refs.ts#L30

afaict it is currently unspecified how to encode a CID into JSON, but in practice the current implementation converts to a multibase string (which, by the way, is not quite what DAG-JSON does). Looking at the resultant JSON, it is impossible to tell whether the CID string you're seeing came from a CBOR string type, or a CBOR CID type (or in other words, you could have two different CBOR objects that decode into identical JSON objects).

On the other hand, StrongRefs directly encode CIDs as strings.

https://github.com/bluesky-social/atproto/blob/37af595a730ef5df67355b3d034634e7111ccef3/lexicons/com/atproto/repo/strongRef.json#L11

At the very least, there is an inconsistency with how StrongRef CIDs and BlobRef CIDs are encoded as CBOR.

IMHO, BlobRefs should be reworked to look more like StrongRefs (representing CIDs as strings), but this would be a breaking change to how records are serialized to CBOR (but the JSON representation can remain unchanged!).

AaronGoldman commented 1 year ago

Why not just use DAG-JSON and DAG-CBOR? {"/": "<CID>"}

DavidBuchanan314 commented 1 year ago

That does seem like a reasonable thing to do, although it breaks the "using only JSON-compatible CBOR types" constraint. I assume this was a deliberate choice so as not to require DAG-JSON as a dependency (which I certainly appreciated), but maybe that's open to reconsideration.

devinivy commented 1 year ago

First, here's the deal with the CID types in lexicon:

There's { "type": "string", "format": "cid" } which validates strings that are proper CIDs. These are common for integrity checks and tracking record versions (e.g. when you create a "like" record, it validates which version of the post you liked, in case it is changes in the future). In CBOR these are also represented as strings. This is the CID-related type developers building applications will encounter and interact with most often.
There's also { "type": "cid-link" } which is encoded in JSON as { "$link": "<cid>" } and in CBOR using DAG-CBOR's CID tag. This is in a sense a low-level type to lexicon because it also appears within the representation of { "type": "blob" }, as you've mentioned. We use CID links as strong links to content within a repository, blobs being a great example. For example, your profile picture isn't stored in your MST, but there is a strong link from the profile record in your MST to the blob for the picture, which is essentially part of your repository. This is a less common type to encounter in developers' day-to-day usage of the protocol.

We explored using DAG-JSON to represent CID links but it's not where we landed on it. Worth noting that the JSON you find throughout lexicon and the APIs are already not valid DAG-JSON partly because we haven't adopted the constraint that JSON be serialized to some canonical bytes. We do use DAG-CBOR for that purpose, though. And the lexicon data model is self-describing in such a way that the translation from JSON to DAG-CBOR is well-defined. Any time we need canonical bytes for a record (e.g. to generate a CID) we translate from JSON to DAG-CBOR. The repository also represents records in CBOR.

DavidBuchanan314 commented 1 year ago

Ahhh I missed the $link, that makes more sense, and I now see how the self-describing aspect works there.

bluesky-social / atproto

BlobRefs are not unambiguously representable as JSON #1070