Posthog: Add more properties to unable to decrypt errors to allow more filtering options

BillCarsonFr commented 8 months ago

Currently the captured Error events for unable to decrypt errors do not contain much information. This lack of information is preventing us from getting interesting insights from the raw data.

Goal: We want to add some new properties that will allow us to better filter out Posthog insights.

List of needed properties:

eventLocalAgeAtDecryptionFailure (numeric, time in millis). An heuristic based on event origin_server_ts and the current device creation time. This would be used to get the source of the event scrollback/live/initialSync. This would be used in Posthog by creating filter to group events in buckets (e.g <0 (historical), <60s, <5min, <1h, >1h)
userTrustsOwnIdentity (boolean). Would allow to know if the current session was trusted at the time of decryption
isFederated (boolean). true if userDomain != senderDomain. Would allow to filter UTDs where federation is involved.
timeToDecryptMillis (numeric, time in millis) : Currently there is a grace period of 4s, after that UTDs are reported. We would like to improve the existing mecanism by adding a timeToDecrypt properties when an event is decrypted late. We should track decryption errors for 1mn (TBD), if events are decrypted during this period we still report an UTD with the time it took to decrypt, after 1mn we report anyhow. Would allow to filter on posthog temporary vs definitive UTD.
isMatrixDotOrg (boolean). True if the current user is using Matrix.org. Combined with isFederated would allow to create a filter where we are sure that both users are running a recent synapse.
wasVisibleToUser (boolean). True if that unable to decrypt error was visible to the user. For example, some clients can hide part of the history and only show it after verification. In that case the UTDs won't be displayed on screen and we want to filter that.

### Tasks
- [ ] https://github.com/matrix-org/matrix-analytics-events/pull/92
- [ ] https://github.com/matrix-org/matrix-analytics-events/pull/93
- [ ] https://github.com/matrix-org/matrix-analytics-events/pull/97
- [ ] https://github.com/element-hq/element-web/issues/27168
- [ ] https://github.com/element-hq/element-android/issues/8779
- [ ] https://github.com/element-hq/element-ios/issues/7767

BillCarsonFr commented 8 months ago

These will also be improvements/addition to https://github.com/element-hq/element-meta/issues/2300

richvdh commented 8 months ago

eventLocalAgeAtDecryptionFailure

Is it better to log the age relative to the session, or to log the session-creation-time and the timestamp on the event as separate fields (and then do the calculation on the posthog side)?

BillCarsonFr commented 8 months ago

eventLocalAgeAtDecryptionFailure

Is it better to log the age relative to the session, or to log the session-creation-time and the timestamp on the event as separate fields (and then do the calculation on the posthog side)?

AFAIK There is no easy way to do that on posthog side (not in the query to display graphs), maybe a posthog plugin where you can pre-process and decorate events.

richvdh commented 8 months ago

isFedarated

isFederated

richvdh commented 8 months ago

eventLocalAgeAtDecryptionFailure (numeric, time in seconds)

timeToDecrypt (numeric, time in seconds)

It looks like other "time" fields in the analytics events measure time in milliseconds. Please let's try to be consistent.

richvdh commented 8 months ago

timeToDecrypt (numeric, time in seconds) : Currently there is a grace period of 4s, after that UTDs are reported. We would like to improve the existing mecanism by adding a timeToDecrypt properties when an event is decrypted late. We should track decryption errors for 1mn (TBD), if events are decrypted during this period we still report an UTD with the time it took to decrypt, after 1mn we report anyhow. Would allow to filter on posthog temporary vs definitive UTD.

Trying to understand this a bit better. Could you explain what will happen under this plan for each of:

An event decrypted in more than 0s but less than 4s (it is not reported to posthog at all?)
An event decrypted in more than 4s but less than 60s (it is reported to posthog, but with an appropriate timeToDecrypt ? How does this relate to https://github.com/element-hq/element-meta/issues/2303?)
An event which is still not decrypted after 60s (we give up waiting, and report it to posthog with timeToDecrypt set to ... +inf ?)

BillCarsonFr commented 8 months ago

timeToDecrypt (numeric, time in seconds) : Currently there is a grace period of 4s, after that UTDs are reported. We would like to improve the existing mecanism by adding a timeToDecrypt properties when an event is decrypted late. We should track decryption errors for 1mn (TBD), if events are decrypted during this period we still report an UTD with the time it took to decrypt, after 1mn we report anyhow. Would allow to filter on posthog temporary vs definitive UTD.

Trying to understand this a bit better. Could you explain what will happen under this plan for each of:
* An event decrypted in more than 0s but less than 4s (it is not reported to posthog at all?)

* An event decrypted in more than 4s but less than 60s (it is reported to posthog, but with an appropriate `timeToDecrypt` ? How does this relate to [Posthog | Increase decryption failure grace period #2303](https://github.com/element-hq/element-meta/issues/2303)?)

* An event which is still not decrypted after 60s (we give up waiting, and report it to posthog with `timeToDecrypt` set to ... `+inf` ?)

Yes so far I think:

Decrypted under 4s not reported (mostly to be a bit backward compatible)
Decrypted before Max Wait time (60s) Report with timeToDecrypt it took
If after Max wait time it's still not decrypted we report with a timeTodecrypt > MAX (we could put some +inf value also)

The max wait period has to be defined, the main risk is that this wait time is too long and the client is closed meanwhile and the error not reported at all.

So maybe it should be 30s and not 60s

BillCarsonFr commented 8 months ago

eventLocalAgeAtDecryptionFailure (numeric, time in seconds)

timeToDecrypt (numeric, time in seconds)

It looks like other "time" fields in the analytics events measure time in milliseconds. Please let's try to be consistent.

Yes, updated

richvdh commented 8 months ago

Yes so far I think:

Decrypted under 4s not reported (mostly to be a bit backward compatible)

Decrypted before Max Wait time (60s) Report with timeToDecrypt it took

If after Max wait time it's still not decrypted we report with a timeTodecrypt > MAX (we could put some +inf value also)

The max wait period has to be defined, the main risk is that this wait time is too long and the client is closed meanwhile and the error not reported at all.

So maybe it should be 30s and not 60s

So do we do this instead of https://github.com/element-hq/element-meta/issues/2303?

BillCarsonFr commented 8 months ago

So do we do this instead of #2303?

I think so, let's confirm at the daily

BillCarsonFr commented 8 months ago

Closed https://github.com/element-hq/element-meta/issues/2303 and do this task instead

BillCarsonFr commented 8 months ago

For UTD that are not resolved after the max_time, we might want to report timeToDecypt=-1 That would allow to filter more easilly

andybalaam commented 7 months ago

@BillCarsonFr will break into separate tasks.

uhoreg commented 5 months ago

@BillCarsonFr All the tasks in this issue are marked as done. Can this issue be closed, or is there more work to do here?

BillCarsonFr commented 5 months ago

yes let's close there is a roundoff for EX

element-hq / element-meta

Posthog: Add more properties to unable to decrypt errors to allow more filtering options #2332