Add posthog tracking for UTDs in Element X

VolkerJunginger commented 7 months ago

Your use case

Add posthog tracking for UTDs.

Event: Error Crypto SDK: Rust Domain E2E

I am not a 100% sure that this what is required for the tracking. Please align also w. the CryptoTeam.

Have you considered any alternatives?

No response

Additional context

No response

### Tasks
- [ ] https://github.com/matrix-org/matrix-rust-sdk/pull/3206
- [ ] https://github.com/element-hq/element-x-android/pull/2544
- [ ] https://github.com/element-hq/element-x-ios/pull/2575

BillCarsonFr commented 7 months ago

Every unable to decrypt should be reported.

The error are mapped to poshtog like that:

{
                    eventName: "Error",
                    domain: "E2EE",
                    name: "OlmKeysNotSentError"
                    timeToDecryptMillis: number (if late utd) | -1 (if final utd),
                    cryptoSDK: "Rust"
}

The error name should be on of those, but as for now the SDK is not reporting the error type, we hardcode to OlmKeysNotSentError

OlmKeysNotSentError: unknown session id
OlmIndexError: ratcheted key (session is known but ratcheted)
OlmUnspecifiedError: Other olm error (mac / signatures...)
UnknownError: anything else (unsupported algorithm?.., replay?)

~~context allows to get the raw client error (might differ depending on client)~~ Ignored for now as the context is not available yet

Decryption error are reported per event. So if 2 messages using the same sessionId are in UTD, it should report twice.

Clients should at least have some mecanism to not report several time the same event. Though non of the existing clients are persiting it, i.e after a restart the same event would be reported.

The cryptoSDK property should be also added to the $screen events, as we are doing ration of unique UTD unique Screen.

To filter out EIX ans EAX we use a App Name (Element X) and Library (posthog-ios or posthog-android)

manuroe commented 7 months ago

@BillCarsonFr how useful and feasible would be to have the UTD reports logic in the crypto crate?

I am asking because we have nothing on EX. The timeline is managed by the SDK. It would be great to have a single logic component at least in the SDK but maybe in the crypto crate as it could serve EWR too.

BillCarsonFr commented 7 months ago

@BillCarsonFr how useful and feasible would be to have the UTD reports logic in the crypto crate?

I am asking because we have nothing on EX. The timeline is managed by the SDK. It would be great to have a single logic component at least in the SDK but maybe in the crypto crate as it could serve EWR too.

The logic will be in the rust-sdk I believe. Modulo can we know on sdk side when a message is actually visible on screen?

richvdh commented 7 months ago

Modulo can we know on sdk side when a message is actually visible on screen?

Visibility on screen isn't mentioned in the earlier requirements. Does a message have to be on screen (and still on screen after the grace period?) for it to be counted as a UTD for metrics? I think so, but that means a user has to be staring at a UTD for 30 seconds for it to count for the metrics?

VolkerJunginger commented 6 months ago

From my perspective the users experience is what counts. So if a user experiences an error because of a UTD this is what we should count. If we want to introduce a grace period at all I would go with 5 seconds max. This is no scientific number but I would consider any experience that takes longer to be broken.

BillCarsonFr commented 6 months ago

From my perspective the users experience is what counts. So if a user experiences an error because of a UTD this is what we should count. If we want to introduce a grace period at all I would go with 5 seconds max. This is no scientific number but I would consider any experience that takes longer to be broken.

Unfortunalty we can't have exactly what we want. The current grace period is 4s and we are increasing it to stabilize the metrics, as it appears that there is a good amount of late keys on other clients (maybe less common with sliding sync).

When we will have ressources we will be able to create a better system, something more like time-to-decrypt in seconds. That would allow us in the analytics tool to move the threshold.

BillCarsonFr commented 6 months ago

Modulo can we know on sdk side when a message is actually visible on screen?

Visibility on screen isn't mentioned in the earlier requirements. Does a message have to be on screen (and still on screen after the grace period?) for it to be counted as a UTD for metrics? I think so, but that means a user has to be staring at a UTD for 30 seconds for it to count for the metrics?

No it's just that it was displayed on screen once. Why would we do a stare at it for 30s thing??

BillCarsonFr commented 6 months ago

One of the solution for implementation could be to use the existing rust-tracer and create a subscriber for analytics. There is a crate for sentry https://crates.io/crates/sentry-tracing not for posthog though

VolkerJunginger commented 6 months ago

From my perspective the users experience is what counts. So if a user experiences an error because of a UTD this is what we should count. If we want to introduce a grace period at all I would go with 5 seconds max. This is no scientific number but I would consider any experience that takes longer to be broken.

Unfortunalty we can't have exactly what we want. The current grace period is 4s and we are increasing it to stabilize the metrics, as it appears that there is a good amount of late keys on other clients (maybe less common with sliding sync).

When we will have ressources we will be able to create a better system, something more like time-to-decrypt in seconds. That would allow us in the analytics tool to move the threshold.

Am I getting it right: You increase the grace period to get lower numbers? That's cheating if you ask me.

If we have the ambition to create a great user-experience and believe this comes with no user having to wait for more than 4 sec that their messages decrypt we should stay with the metric as they are.

neilisfragile commented 6 months ago

My understanding for the rationale in increasing the grace period is not to juice the numbers but to differentiate permanently failing messages from temporarily failing messages.

Both are bad, but permanently failing messages are much worse.

So by increasing the period we can filter out the temporary failures and can focus on the permanent ones, accepting that the temporary failures also need work further down the line.

The other option would be to report two windows, so a 4s (say) and 30 sec, but I think the team are trying to get the simplest thing possible to unblock them.

@BillCarsonFr can you just confirm my understanding is correct?

fkwp commented 6 months ago

Am I getting it right: You increase the grace period to get lower numbers? That's cheating if you ask me.

If we have the ambition to create a great user-experience and believe this comes with no user having to wait for more than 4 sec that their messages decrypt we should stay with the metric as they are.

tbh. this is just a pragmatic short cut. The proper "solution" would be to record the time needed (delay between message arrived and key arrived) and present it in a histogram. This way we could learn a lot about the process... However, adding the missing bits to gather this information is just to expensive.

richvdh commented 6 months ago

Modulo can we know on sdk side when a message is actually visible on screen?

Visibility on screen isn't mentioned in the earlier requirements. Does a message have to be on screen (and still on screen after the grace period?) for it to be counted as a UTD for metrics? I think so, but that means a user has to be staring at a UTD for 30 seconds for it to count for the metrics?

No it's just that it was displayed on screen once. Why would we do a stare at it for 30s thing??

It's fine, but if we go down that path, we need to make sure that if the key arrives after 20s, we attempt to decrypt the message again even if it is no longer on screen, so that we can exclude that message from the stats. Maybe that already happens, but the logs in https://github.com/element-hq/element-x-ios-rageshakes/issues/1377 suggest not. I'd like someone more familiar with the Element X codebase to explain what's going on there.

BillCarsonFr commented 6 months ago

It's fine, but if we go down that path, we need to make sure that if the key arrives after 20s, we attempt to decrypt the message again even if it is no longer on screen, so that we can exclude that message from the stats.

Yes we need to review all the existing code and ensure it's doing what we want.

I'd like someone more familiar with the Element X codebase to explain what's going on there.

AFAIK as I looked some times ago, I think this logic is in the Timeline object, there is a event_handler added that will call retry_decryption when a room_key/fowarded room_key is received. So looks like if the Timeline is dropped, it will not retry? I don't know how this work for backup though.

Also, I was surprised that it was not using the signaling in the crypto crate (there is a stream when keys are added/udpated in store).

BillCarsonFr commented 6 months ago

My understanding for the rationale in increasing the grace period is not to juice the numbers but to differentiate permanently failing messages from temporarily failing messages.

Both are bad, but permanently failing messages are much worse.

So by increasing the period we can filter out the temporary failures and can focus on the permanent ones, accepting that the temporary failures also need work further down the line.

The other option would be to report two windows, so a 4s (say) and 30 sec, but I think the team are trying to get the simplest thing possible to unblock them.

@BillCarsonFr can you just confirm my understanding is correct?

Exactly, in the perfect world we would to the perfect thing. As you said we want to filter out temporary failures for now

BillCarsonFr commented 6 months ago

Regarding Visibility on screen, we might just add a field in the captured event (visibleOnScreen or something like that). We will then be able to filter out on posthog if we want

BillCarsonFr commented 6 months ago

Quick Notes on the discussions:

The SDK API is not at the moment exposing the type of error that did occur (MissingRoomKey/Megolm), so we can't find the correct name to put in the Error event.

Decision: We don't block on that and hardcode OlmKeysNotSentError. We decided to use OlmKeysNotSentError and not UnknownError because on other platforms OlmKeysNotSentError is 90% of the reported errors, UnknownError is very low. So would be more accurate to default to OlmKeysNotSentError for now

bnjbvr commented 6 months ago

The SDK initial implementation has just been landed, so unassigning myself — it's now in the hands of our app devs!

manuroe commented 6 months ago

First UTDs analytics will be available in EXA 0.4.6 and EXI 1.5.12. We should make those versions public in the store today.

element-hq / element-meta