BillCarsonFr commented 2 years ago

There are various failure modes that can lead to failures to decrypt to-device messages, which will then almost inevitably lead to UTD errors of some sort. Currently, these are not reported in Posthog, so we lack visibility into how often they happen.

See also https://github.com/element-hq/element-meta/issues/2409 which covers analytics for the sending side.

Implementation design

https://github.com/matrix-org/matrix-analytics-events/pull/56 did some groundwork here, by adding a new domain of TO_DEVICE, and a name of ToDeviceFailedToDecrypt, to the properties of the current Error event. However, ToDeviceFailedToDecrypt seems insufficiently specific, and instead we need a number of new names.

As a first pass, I suggest adding a new name for each type of OlmError in matrix-crypto-sdk (https://github.com/matrix-org/matrix-rust-sdk/blob/main/crates/matrix-sdk-crypto/src/error.rs), with a prefix of ToDeviceFailedToDecrypt_.

There is also an optional freeform string field context, though its purpose seems a bit unclear. I suggest not using it for now.

So an example event would be:

{
    "event": "Error",
    "properties": {
        "domain": "TO_DEVICE",
        "name": "ToDeviceFailedToDecrypt_SessionWedged",
        "cryptoSDK": "Rust",
    }
}

I think probably the best way to implement this is to add a method OlmMachine::to_device_decryption_failure_stream, which returns a Stream, and each time OlmMachine::decrypt_to_device_event fails, we write a new entry to the stream. The stream could then be wrapped in both (Rust) matrix-sdk and matrix-js-sdk, for turning into Posthog events.

Sub-tasks

[ ] Add more Names for ErrorEvent in matrix-analytics-events
[ ] Add OlmMachine::to_device_decryption_failure_stream to matrix-sdk-crypto
[ ] Add support in matrix-sdk to turn failures into posthog events

BillCarsonFr commented 7 months ago

Moved to qualification again following new discussions during rust meeting

richvdh commented 6 months ago

There are two parts to this, encryption and decryption.

Decryption

This is relatively easy.

https://github.com/matrix-org/matrix-analytics-events/pull/56 did some groundwork here, by adding a new domain of TO_DEVICE, and a name of ToDeviceFailedToDecrypt, to the properties of the current Error event. However, ToDeviceFailedToDecrypt seems insufficiently specific, and instead we need a number of new names.

As a first pass, I suggest adding a new name for each type of OlmError in matrix-crypto-sdk (https://github.com/matrix-org/matrix-rust-sdk/blob/main/crates/matrix-sdk-crypto/src/error.rs), with a prefix of ToDeviceFailedToDecrypt_.

There is also an optional freeform string field context, though its purpose seems a bit unclear. I suggest not using it for now.

So an example event would be:

{
    "event": "Error",
    "properties": {
        "domain": "TO_DEVICE",
        "name": "ToDeviceFailedToDecrypt_SessionWedged",
        "cryptoSDK": "Rust",
    }
}

I think probably the best way to implement this is to add a method OlmMachine::to_device_decryption_failure_stream, which returns a Stream, and each time OlmMachine::decrypt_to_device_event fails, we write a new entry to the stream. The stream could then be wrapped in both (Rust) matrix-sdk and matrix-js-sdk, for turning into Posthog events.

Encryption

This is harder because the list of things we need to report on are scattered more widely around the codebase. I think the first step here is to define an interface in matrix-sdk-crypto which emits an enum of potential error codes. Initial ideas of things that should definitely be included:

We lack a device list and the key query failed (or the user is on a server we backed off from)
We failed to /keys/claim for a given device (or have backed off for this device)
We couldn't send the to-device message

Again, we can probably expose a Stream interface from OlmMachine.

Question: a single sent message could result in hundreds or thousands of errors, depending on the number of devices in the room. Similarly, a single failing user could cause lots of different sent messages to have some sort of error. Should we report an event for each device for each user for each message? Or something more intelligent? What are we trying to achieve with this metric?

richvdh commented 5 months ago

I have split the sending side of this out to #2409, and updated the description of this issue for clarity and to include design notes.

element-hq / element-meta

Crypto | Posthog analytics for to-device decryption errors #234

Implementation design

Sub-tasks

Decryption

Encryption