MarshalX / atproto

The AT Protocol (🦋 Bluesky) SDK for Python 🐍
https://atproto.blue
MIT License

sharing resource: interop-test-files #147

Closed bnewbold closed 1 year ago

bnewbold commented 1 year ago

Hey @MarshalX (and others contributing to this repo)!

Wanted to share a resource and see if you had any feedback. We have started collecting some cross-language test vectors at: https://github.com/bluesky-social/atproto/tree/main/interop-test-files

For example, long lists of valid and invalid identifiers, trying to hit corner cases. These are intended to be easy to copy directly into other implementations. So far they are used in both the TypeScript code and the Go implementation (indigo).
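A minimal sketch of how such identifier lists could be consumed from Python. It assumes each file holds one identifier per line, with blank lines and `#` comment lines ignored. The regex is believed to follow the handle syntax in the atproto specification, but treat it, along with the helper names `load_vectors`, `is_valid_handle`, and `check`, as assumptions rather than the SDK's API.

```python
import re

# Handle syntax check. Assumption: this mirrors the regex in the atproto
# handle specification; verify against the spec before relying on it.
HANDLE_RE = re.compile(
    r"([a-zA-Z0-9]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?\.)+"
    r"[a-zA-Z]([a-zA-Z0-9-]{0,61}[a-zA-Z0-9])?"
)


def load_vectors(text):
    """Return test vectors from a file body, skipping blanks and # comments."""
    return [ln for ln in text.splitlines() if ln and not ln.startswith("#")]


def is_valid_handle(handle):
    """Hypothetical validator: overall length limit plus syntax regex."""
    return len(handle) <= 253 and HANDLE_RE.fullmatch(handle) is not None


def check(valid_text, invalid_text):
    """Return every vector the validator classifies incorrectly."""
    failures = [("valid", v) for v in load_vectors(valid_text) if not is_valid_handle(v)]
    failures += [("invalid", v) for v in load_vectors(invalid_text) if is_valid_handle(v)]
    return failures
```

In a unit test, the two arguments would be the contents of a valid and an invalid vector file, and the test passes when `check` returns an empty list.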

Would be curious what other test files would be helpful for ensuring inter-operation between implementations, particularly if there have been any sharp edges that you have run into.

We will probably add:

MarshalX commented 1 year ago

Hi @bnewbold! That's a brilliant idea! To be honest, I invented something similar for my unit tests.

I have a data collector: https://github.com/MarshalX/atproto/blob/main/tests/models/fetch_test_data.py
And the saved data: https://github.com/MarshalX/atproto/tree/main/tests/models/test_data

As you can see, the most problematic edge cases are about custom lexicons and parsing, for example the parsing of literals and Union types. It would be awesome to have a single database with all of the test data!

The test data for the CAR file will be useful for my python-libipld project.

Some parsing bugs that I ran into, which are covered by unit tests now:

Also, I am confused about this, and I don't know whether it is reproducible: any Union type could be any object with $type and fields. Code ref: https://github.com/bluesky-social/atproto/blob/b01e47b61730d05a780f7a42667b91ccaa192e8e/packages/lex-cli/src/codegen/lex-gen.ts#L325. It would be awesome to cover this case with the test data too.

bnewbold commented 1 year ago

Handling unspecified extensions to existing Lexicons is not very fleshed out, both in our implementations and in the specifications. It should all come together at the protocol level, but how to expose things well in other languages, especially strongly typed ones, isn't totally clear. And there are sharp edges around decoding/re-encoding records.

Likewise what it looks like to work with records (or other data, in JSON or DAG-CBOR) for which a Lexicon is not available. There are restrictions on what that data can look like: records are always supposed to have $type for example, and Links (CIDs) and binary data ($bytes) are encoded a particular way. But it isn't very clear if folks are expected to parse and enforce those constraints rigorously, and if so in what situations.
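A minimal sketch of one strict reading of these constraints for the JSON form, without any Lexicon. It assumes links appear as `{"$link": "<cid>"}` and bytes as `{"$bytes": "<base64>"}`, possibly without padding. `CidLink`, `decode_value`, and `decode_record` are hypothetical names, and the CID string is not validated here.

```python
import base64
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CidLink:
    """Hypothetical wrapper for a link; the CID string is kept unvalidated."""
    cid: str


def decode_value(value):
    """Recursively convert $link and $bytes objects into Python values."""
    if isinstance(value, list):
        return [decode_value(v) for v in value]
    if isinstance(value, dict):
        if set(value) == {"$link"} and isinstance(value["$link"], str):
            return CidLink(value["$link"])
        if set(value) == {"$bytes"} and isinstance(value["$bytes"], str):
            raw = value["$bytes"]
            # Restore any padding that was omitted (assumption: padding is optional).
            return base64.b64decode(raw + "=" * (-len(raw) % 4))
        return {k: decode_value(v) for k, v in value.items()}
    return value


def decode_record(json_text):
    """Decode a record and enforce that it carries a string $type."""
    record = decode_value(json.loads(json_text))
    if not isinstance(record, dict) or not isinstance(record.get("$type"), str):
        raise ValueError("record is missing a string $type")
    return record
```

Whether a decoder should reject a record without `$type`, as this sketch does, or pass it through is exactly the open question; a lenient variant would simply skip the final check.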

We do want it to be possible to parse through data with an unknown schema and identify things like blobs. Moderation tooling can use this to do things like extract and label blobs even if the application lexicon isn't known. When we added self-labels recently, we also did that in a way to ensure that the $type is always set, to make it possible to parse and extract labels, even if the lexicon isn't known. This is all somewhat unexplored territory though.
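A minimal sketch of schema-free blob discovery over already-parsed JSON. It assumes a blob is an object with `$type` equal to `"blob"` and a `ref` field; `find_blobs` and the slash-separated path format are hypothetical, chosen only for illustration.

```python
def find_blobs(value, path=""):
    """Walk parsed JSON and return (path, blob_object) pairs.

    Assumption: a blob is any object whose $type is "blob" and that has a
    "ref" field. No Lexicon is consulted.
    """
    found = []
    if isinstance(value, dict):
        if value.get("$type") == "blob" and "ref" in value:
            found.append((path, value))
            return found
        for key, child in value.items():
            found.extend(find_blobs(child, f"{path}/{key}"))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            found.extend(find_blobs(child, f"{path}/{index}"))
    return found
```

Because the walk only looks at `$type` and `ref`, it behaves the same for a record of an unknown type such as a made-up `com.example.unknown` as it does for a known one.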

For unions, yes, open unions are very flexible. Implementations are only really expected to handle the enumerated (known) types, and just "not fail" when they encounter an unknown type in that position. This is distinct from closed unions, where an unknown type would be an error. IIRC there might be ambiguity about whether literals are allowed in unions, because there is no $type field.
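A minimal sketch of the open versus closed distinction described above. `decode_union`, `UnknownUnionMember`, and the `known` mapping from type name to decoder function are hypothetical and do not reflect how any particular SDK models unions.

```python
from dataclasses import dataclass


@dataclass
class UnknownUnionMember:
    """Placeholder that preserves an unrecognized union member as raw data."""
    type: str
    raw: dict


def decode_union(value, known, closed=False):
    """Decode a union member by its $type.

    known maps $type strings to decoder callables. Unknown types are kept
    as UnknownUnionMember for open unions and rejected for closed unions.
    """
    if not isinstance(value, dict) or not isinstance(value.get("$type"), str):
        raise ValueError("union member must be an object with a string $type")
    type_id = value["$type"]
    decoder = known.get(type_id)
    if decoder is not None:
        return decoder(value)
    if closed:
        raise ValueError(f"unexpected type in closed union: {type_id}")
    return UnknownUnionMember(type_id, value)
```

Keeping the raw dictionary on the placeholder means an unknown member can be written back out unchanged, which matters when records are decoded and then re-encoded.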

bnewbold commented 1 year ago

Had not seen python-libipld, using Rust, interesting! Hope it will be possible to do a pure-python implementation of some kind, but using a hardened/safe/fast existing library for (DAG-)CBOR and CAR files makes a lot of sense.

Thanks for these notes, will review and fold this into our test vectors.

MarshalX commented 1 year ago

@bnewbold thank you for your explanation!

> Had not seen python-libipld, using Rust, interesting! Hope it will be possible to do a pure-python implementation of some kind, but using a hardened/safe/fast existing library for (DAG-)CBOR and CAR files makes a lot of sense.
>
> Thanks for these notes, will review and fold this in to our test vectors.

A pure Python implementation exists, but it is abandoned and slow as hell. That's why I moved to Rust. Pls check the recent performance boost update: https://github.com/MarshalX/atproto/releases/tag/v0.0.26