element-hq / element-meta

Shared/meta documentation and project artefacts for Element clients

Crypto: Posthog analytics for problems when sending message keys over to-device messages #2409

Open richvdh opened 2 months ago

richvdh commented 2 months ago

There are various failure modes that can lead to problems sending to-device messages containing message keys, which will in turn lead to unable-to-decrypt (UTD) errors. Currently, these are not reported in Posthog, so we lack visibility into how often they happen.

Likely root causes are the target user's homeserver being unreachable (related: https://github.com/element-hq/element-meta/issues/2154), or our own homeserver being unresponsive. More specific examples include:

See also #234 which covers the receiving side of this (and is IMHO much lower-hanging fruit).

Question

A single sent message could result in hundreds or thousands of errors, depending on the number of devices in the room. Similarly, a single failing user could cause lots of different sent messages to have some sort of error. Should we report an event for each device for each user for each message? Or something more intelligent? What exactly are we trying to achieve with these metrics?
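To make the trade-off concrete, here is a minimal sketch of one possible "more intelligent" option: collapse all per-device failures for a single sent message into one summary event. This is only an illustration of the idea, not a proposal for the actual event schema; `DeviceFailure`, `MessageFailureSummary` and `summarise` are hypothetical names that do not exist in matrix-sdk-crypto.

```rust
use std::collections::BTreeSet;

/// Hypothetical record of one failed key delivery for one sent message.
pub struct DeviceFailure {
    pub user_id: String,
    pub device_id: String,
}

/// Hypothetical aggregated event: at most one per sent message,
/// carrying only counts rather than one entry per device.
#[derive(Debug, PartialEq, Eq)]
pub struct MessageFailureSummary {
    pub failed_users: usize,
    pub failed_devices: usize,
}

/// Collapse the per-device failures for a single message into one summary.
/// Returns `None` when nothing failed, so no event would be reported.
pub fn summarise(failures: &[DeviceFailure]) -> Option<MessageFailureSummary> {
    if failures.is_empty() {
        return None;
    }
    // Count distinct users and distinct (user, device) pairs, so that
    // duplicate reports for the same device are not counted twice.
    let users: BTreeSet<&str> = failures.iter().map(|f| f.user_id.as_str()).collect();
    let devices: BTreeSet<(&str, &str)> = failures
        .iter()
        .map(|f| (f.user_id.as_str(), f.device_id.as_str()))
        .collect();
    Some(MessageFailureSummary {
        failed_users: users.len(),
        failed_devices: devices.len(),
    })
}
```

With this shape, a message that fails for thousands of devices still produces a single event. It does not address the other half of the question (one failing user affecting many messages), which would need aggregation over time instead.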


Implementation design

Slightly tricky because the list of things we need to report on is scattered around the codebase, though it is mostly within matrix-sdk-crypto. I think the first step here is to define an interface in matrix-sdk-crypto which emits an enum of potential error codes.

We can then add a method OlmMachine::share_room_keys_failure_stream, which returns a Stream, and each time something on the list above goes wrong, we write a new entry to the stream. The stream could then be wrapped in both (Rust) matrix-sdk and matrix-js-sdk, for turning into Posthog events.
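A rough sketch of the shape this could take, using only the standard library. Everything here is an assumption for illustration: the enum variants are taken from the two likely root causes named above rather than from a real list, `FailureReporter` stands in for the relevant part of `OlmMachine`, and a `std::sync::mpsc` channel stands in for the async `Stream` that the real method would return.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

/// Hypothetical error codes; illustrative only, not an agreed list.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ShareRoomKeyFailure {
    /// The target user's homeserver could not be reached.
    RecipientServerUnreachable,
    /// Our own homeserver did not respond to the to-device request.
    OwnServerUnresponsive,
}

/// One entry written to the failure stream.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct FailureReport {
    pub user_id: String,
    pub device_id: String,
    pub code: ShareRoomKeyFailure,
}

/// Stand-in for the part of `OlmMachine` that would own the sending side.
pub struct FailureReporter {
    subscribers: Vec<Sender<FailureReport>>,
}

impl FailureReporter {
    pub fn new() -> Self {
        Self { subscribers: Vec::new() }
    }

    /// Analogue of the proposed `share_room_keys_failure_stream`:
    /// hands out the receiving end of a new channel.
    pub fn failure_stream(&mut self) -> Receiver<FailureReport> {
        let (tx, rx) = channel();
        self.subscribers.push(tx);
        rx
    }

    /// Called wherever a failure is detected; forwards the report to every
    /// live subscriber and drops subscribers whose receiver has gone away.
    pub fn report(&mut self, report: FailureReport) {
        self.subscribers
            .retain(|tx| tx.send(report.clone()).is_ok());
    }
}
```

A wrapper in matrix-sdk or matrix-js-sdk would then drain the receiver and map each `FailureReport` onto a Posthog event, so the crypto crate itself would not need to know about analytics.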

BillCarsonFr commented 1 month ago

Having analytics for failure to decrypt to_device messages would be more useful now.