element-hq / element-meta

Shared/meta documentation and project artefacts for Element clients

Full stack E2EE Testing MEGAISSUE #2165

Open kegsay opened 10 months ago

kegsay commented 10 months ago

Full stack E2EE Tests

Historically, we have had a lot of difficult bugs around encryption. There is a lot of demand for fixing "unable to decrypt" errors and ensuring that our new Rust stack is working well. Part of this work involves testing. However, we lack a central set of end-to-end tests for our cryptography in Matrix, beyond basic happy-path test cases.

"End-to-End" in this case means:

This issue aims to be the nexus for:

This issue currently lives in the element-meta repository because it touches the entire stack. If there is a better home for this, please let me know where to move it to.

Requirements

These requirements have been formulated purely from my brain. There has been no consensus around this yet.

Any solution MUST:

These "MUST" conditions are formed around the assumptions that we only care about Rust SDK crypto and no other client matters. Similarly, we only care about Synapse and no other server matters. We also want these tests to be run on a per-commit basis, so we can spot regressions quickly.

Any solution SHOULD:

These "SHOULD" conditions are formed around the assumption that just testing the happy path isn't enough, and we need the ability to test more edge cases. E2EE in general mostly works in Matrix, so it's the edge cases where we will see the most value.

Any solution COULD:

These "COULD" conditions are generally nice-to-haves and aren't make-or-break goals. In the wild, there will be different servers and clients, so ensuring we play nicely (or at least knowing when we don't) would be useful for the public federation.

Anti-goals:

Prior Work

To my knowledge, the prior work around end-to-end tests which use at least the rust crypto crate includes:

There is also more work which is not end-to-end:

Proposal

Rationale: Existing E2E test frameworks are heavily UI based. This makes them slower, harder to run on CI boxes and less portable, as you now need to chuck in an emulator or run them on real devices. As we are only targeting the Rust SDK, we can just test it "directly" and bypass the UI layer entirely. This means the tests should run reasonably quickly on CI boxes (particularly if they make use of Complement's new dirty run mode).

In an effort to keep the tests honest and truly "end-to-end", the proposal uses the high-level crate that Element X uses and drives the Matrix JS SDK which Element R uses. This keeps the tests "high level": creating rooms, syncing and sending messages, rather than uploading OTKs, querying keys, etc. This should provide more coverage than testing the matrix_sdk_crypto crate alone, which is important as the layers above have a lot of complexity which would otherwise be untested.

Using Complement means we can set up mock federation servers which can serve up weird edge cases like reusing OTKs, exhausting OTKs, delaying updates, using unicode device list updates, etc., all of which have caused E2EE problems in the past. Complement now also supports running out-of-repo, so the tests needn't sit in the Complement repo (which wouldn't really make much sense as they mostly test the Rust SDK).
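To make the OTK edge cases above a little more concrete, here is a toy model of what a mock server could do when clients claim one-time keys. It is only an illustration: `FakeKeyServer`, `claim` and `claim_n` are hypothetical names, it is not Complement's API, and it models no real cryptography or HTTP.

```python
class FakeKeyServer:
    """Toy stand-in for a mock homeserver's one-time key (OTK) pool.

    Hypothetical helper for illustration; not part of Complement or any SDK.
    """

    def __init__(self, otks, reuse=False):
        self._otks = list(otks)
        # reuse=True simulates a misbehaving server that hands out the same OTK repeatedly.
        self._reuse = reuse

    def claim(self):
        """Return an OTK, or None once the pool is exhausted."""
        if not self._otks:
            return None
        if self._reuse:
            return self._otks[0]
        return self._otks.pop(0)


def claim_n(server, n):
    """Claim n keys in a row and return everything the server handed back."""
    return [server.claim() for _ in range(n)]


if __name__ == "__main__":
    # Well-behaved server: each key is handed out once, then the pool is empty.
    print(claim_n(FakeKeyServer(["otk1", "otk2"]), 3))
    # Misbehaving server: the same key is returned on every claim.
    print(claim_n(FakeKeyServer(["otk1", "otk2"], reuse=True), 3))
```

A real test would put a client in front of a server behaving like this and assert on what the client does next, for example whether it still manages to establish a session or surfaces an error.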

Why not:

Definition of Done

There exists a CI step in Rust SDK (and Synapse?) which runs tests which include at a minimum:

[ ] Membership ACLs:

[ ] Key backups:

[ ] One-time Keys:

[ ] Network connectivity:

All of these tests again, but with Alice on a different homeserver (testing federation).
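One way to picture the resulting test matrix is to enumerate client and homeserver combinations up front. The sketch below assumes two client types (Rust SDK and JS SDK, as in the proposal) and two homeservers. The labels `rust`, `js`, `hs1`, `hs2` and the `Scenario` type are made up for illustration and are not taken from any existing test suite.

```python
from dataclasses import dataclass
from itertools import product

# Hypothetical labels; the issue only names the Rust SDK, the JS SDK and Synapse.
CLIENT_TYPES = ["rust", "js"]
HOMESERVERS = ["hs1", "hs2"]


@dataclass(frozen=True)
class Scenario:
    alice_client: str
    bob_client: str
    alice_hs: str
    bob_hs: str

    @property
    def federated(self) -> bool:
        # Alice and Bob on different homeservers means the test exercises federation.
        return self.alice_hs != self.bob_hs


def build_matrix(clients=CLIENT_TYPES, homeservers=HOMESERVERS):
    """Bob stays on the first homeserver; Alice is tried on every homeserver."""
    bob_hs = homeservers[0]
    return [
        Scenario(alice, bob, alice_hs, bob_hs)
        for alice, bob, alice_hs in product(clients, clients, homeservers)
    ]


if __name__ == "__main__":
    for s in build_matrix():
        kind = "federated" if s.federated else "local"
        print(f"alice={s.alice_client}@{s.alice_hs} bob={s.bob_client}@{s.bob_hs} ({kind})")
```

With two client types and two homeservers this yields eight scenarios, four of them federated, so each test in the checklist above would run once per scenario.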

kegsay commented 10 months ago

xref https://github.com/vector-im/element-meta/issues/245

kegsay commented 9 months ago

Tests are now tracked at https://github.com/matrix-org/complement-crypto/blob/main/TEST_HITLIST.md

kegsay commented 9 months ago

Collection of issues found as a result of testing:

Collection of MSCs as a result of this work:

Collection of regressions which could have been caught if Complement-Crypto ran in CI: