Open BillCarsonFr opened 2 years ago
Moved to qualification again following new discussions during rust meeting
There are two parts to this, encryption and decryption.
This is relatively easy.
https://github.com/matrix-org/matrix-analytics-events/pull/56 did some groundwork here, by adding a new domain
of TO_DEVICE
, and a name
of ToDeviceFailedToDecrypt
, to the properties of the current Error
event. However, ToDeviceFailedToDecrypt
seems insufficiently specific, and instead we need a number of new names
.
As a first pass, I suggest adding a new name
for each type of OlmError
in matrix-crypto-sdk (https://github.com/matrix-org/matrix-rust-sdk/blob/main/crates/matrix-sdk-crypto/src/error.rs), with a prefix of ToDeviceFailedToDecrypt_
.
There is also an optional freeform string field context
, though its purpose seems a bit unclear. I suggest not using it for now.
So an example event would be:
{
"event": "Error",
"properties": {
"domain": "TO_DEVICE",
"name": "ToDeviceFailedToDecrypt_SessionWedged",
"cryptoSDK": "Rust",
}
}
I think probably the best way to implement this is to add a method OlmMachine::to_device_decryption_failure_stream
, which returns a Stream
, and each time OlmMachine::decrypt_to_device_event
fails, we write a new entry to the stream. The stream could then be wrapped in both (Rust) matrix-sdk and matrix-js-sdk, for turning into Posthog events.
This is harder because the list of things we need to report on are scattered more widely around the codebase. I think the first step here is to define an interface in matrix-sdk-crypto which emits an enum
of potential error codes. Initial ideas of things that should definitely be included:
Again, we can probably expose a Stream
interface from OlmMachine
.
Question: a single sent message could result in hundreds or thousands of errors, depending on the number of devices in the room. Similarly, a single failing user could cause lots of different sent messages to have some sort of error. Should we report an event for each device for each user for each message? Or something more intelligent? What are we trying to achieve with this metric?
I have split the sending side of this out to #2409, and updated the description of this issue for clarity and to include design notes.
There are various failure modes that can lead to failures to decrypt to-device messages, which will then almost inevitably lead to UTD errors of some sort. Currently, these are not reported in Posthog, so we lack visibility into how often they happen.
See also https://github.com/element-hq/element-meta/issues/2409 which covers analytics for the sending side.
Implementation design
https://github.com/matrix-org/matrix-analytics-events/pull/56 did some groundwork here, by adding a new
domain
ofTO_DEVICE
, and aname
ofToDeviceFailedToDecrypt
, to the properties of the currentError
event. However,ToDeviceFailedToDecrypt
seems insufficiently specific, and instead we need a number of newnames
.As a first pass, I suggest adding a new
name
for each type ofOlmError
in matrix-crypto-sdk (https://github.com/matrix-org/matrix-rust-sdk/blob/main/crates/matrix-sdk-crypto/src/error.rs), with a prefix ofToDeviceFailedToDecrypt_
.There is also an optional freeform string field
context
, though its purpose seems a bit unclear. I suggest not using it for now.So an example event would be:
I think probably the best way to implement this is to add a method
OlmMachine::to_device_decryption_failure_stream
, which returns aStream
, and each timeOlmMachine::decrypt_to_device_event
fails, we write a new entry to the stream. The stream could then be wrapped in both (Rust) matrix-sdk and matrix-js-sdk, for turning into Posthog events.Sub-tasks
Name
s forErrorEvent
in matrix-analytics-eventsOlmMachine::to_device_decryption_failure_stream
tomatrix-sdk-crypto
matrix-sdk
to turn failures into posthog events