element-hq / element-meta

Shared/meta documentation and project artefacts for Element clients
72 stars 12 forks source link

Crypto | Posthog analytics for to-device decryption errors #234

Open BillCarsonFr opened 2 years ago

BillCarsonFr commented 2 years ago

There are various failure modes that can lead to failures to decrypt to-device messages, which will then almost inevitably lead to UTD errors of some sort. Currently, these are not reported in Posthog, so we lack visibility into how often they happen.

See also https://github.com/element-hq/element-meta/issues/2409 which covers analytics for the sending side.


Implementation design

https://github.com/matrix-org/matrix-analytics-events/pull/56 did some groundwork here, by adding a new domain of TO_DEVICE, and a name of ToDeviceFailedToDecrypt, to the properties of the current Error event. However, ToDeviceFailedToDecrypt seems insufficiently specific, and instead we need a number of new names.

As a first pass, I suggest adding a new name for each type of OlmError in matrix-crypto-sdk (https://github.com/matrix-org/matrix-rust-sdk/blob/main/crates/matrix-sdk-crypto/src/error.rs), with a prefix of ToDeviceFailedToDecrypt_.

There is also an optional freeform string field context, though its purpose seems a bit unclear. I suggest not using it for now.

So an example event would be:

{
    "event": "Error",
    "properties": {
        "domain": "TO_DEVICE",
        "name": "ToDeviceFailedToDecrypt_SessionWedged",
        "cryptoSDK": "Rust",
    }
}

I think probably the best way to implement this is to add a method OlmMachine::to_device_decryption_failure_stream, which returns a Stream, and each time OlmMachine::decrypt_to_device_event fails, we write a new entry to the stream. The stream could then be wrapped in both (Rust) matrix-sdk and matrix-js-sdk, for turning into Posthog events.

Sub-tasks

BillCarsonFr commented 7 months ago

Moved to qualification again following new discussions during rust meeting

richvdh commented 6 months ago

There are two parts to this, encryption and decryption.

Decryption

This is relatively easy.

https://github.com/matrix-org/matrix-analytics-events/pull/56 did some groundwork here, by adding a new domain of TO_DEVICE, and a name of ToDeviceFailedToDecrypt, to the properties of the current Error event. However, ToDeviceFailedToDecrypt seems insufficiently specific, and instead we need a number of new names.

As a first pass, I suggest adding a new name for each type of OlmError in matrix-crypto-sdk (https://github.com/matrix-org/matrix-rust-sdk/blob/main/crates/matrix-sdk-crypto/src/error.rs), with a prefix of ToDeviceFailedToDecrypt_.

There is also an optional freeform string field context, though its purpose seems a bit unclear. I suggest not using it for now.

So an example event would be:

{
    "event": "Error",
    "properties": {
        "domain": "TO_DEVICE",
        "name": "ToDeviceFailedToDecrypt_SessionWedged",
        "cryptoSDK": "Rust",
    }
}

I think probably the best way to implement this is to add a method OlmMachine::to_device_decryption_failure_stream, which returns a Stream, and each time OlmMachine::decrypt_to_device_event fails, we write a new entry to the stream. The stream could then be wrapped in both (Rust) matrix-sdk and matrix-js-sdk, for turning into Posthog events.


Encryption

This is harder because the list of things we need to report on are scattered more widely around the codebase. I think the first step here is to define an interface in matrix-sdk-crypto which emits an enum of potential error codes. Initial ideas of things that should definitely be included:

Again, we can probably expose a Stream interface from OlmMachine.

Question: a single sent message could result in hundreds or thousands of errors, depending on the number of devices in the room. Similarly, a single failing user could cause lots of different sent messages to have some sort of error. Should we report an event for each device for each user for each message? Or something more intelligent? What are we trying to achieve with this metric?

richvdh commented 5 months ago

I have split the sending side of this out to #2409, and updated the description of this issue for clarity and to include design notes.