element-hq / riot-android

A glossy Matrix collaboration client for Android
Apache License 2.0
1.4k stars 394 forks

Stuck in a loop generating and failing to upload one-time keys #1289

Open richvdh opened 7 years ago

richvdh commented 7 years ago

https://github.com/matrix-org/riot-android-rageshakes/issues/186 includes two instances where the user receives an incoming Olm session, but the session cannot be established due to BAD_MESSAGE_KEY_ID.

This means that the user's device didn't recognise the one-time-key the sender used to establish the session. That might be due to one of several things:

I guess the latter is more likely. It's very hard to tell from the logs, though.

(At least some of these failures seem to be the first message sent from the other side, which means it isn't due to Coffee's device receiving the first message, setting up the session, deleting the one time key, then forgetting the session.)

richvdh commented 7 years ago

https://github.com/matrix-org/riot-android-rageshakes/issues/194 shows another example (from the same device)

richvdh commented 7 years ago

I was able to reproduce this when initiating a new olm session with Coffee. It appears that his device had forgotten a load of the one-time-keys that had been published to the database.

On 14 May (08:53:41 UTC) his device appears to have published at least 100 new signed_curve25519 keys to the server, with key_ids ranging from AAAAWg (0x5a) to AAAAvQ (0xbd) - I'd like to investigate what might have caused that.
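For anyone wanting to check the key-id arithmetic, here is a minimal sketch. It assumes the key ids are unpadded base64 encodings of a big-endian counter, which is consistent with the hex values quoted above but is an inference rather than something confirmed in this thread. The helper name `key_id_to_int` is made up for illustration.

```python
import base64


def key_id_to_int(key_id: str) -> int:
    """Decode an unpadded-base64 key id into the integer it encodes.

    Assumption: the id is a big-endian counter, as the hex values
    quoted in the comment above suggest.
    """
    # Restore the '=' padding that the key id omits.
    padded = key_id + "=" * (-len(key_id) % 4)
    return int.from_bytes(base64.b64decode(padded), "big")


first = key_id_to_int("AAAAWg")
last = key_id_to_int("AAAAvQ")
print(hex(first), hex(last), last - first + 1)  # prints: 0x5a 0xbd 100
```

Under that assumption the range AAAAWg to AAAAvQ covers exactly 100 ids, which matches the "at least 100" figure.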

richvdh commented 7 years ago

Inspection of the server logs provides at least a partial answer.

On 11 May (11:16:15), Coffee's device decided to upload a new one-time key to the server; it added key id AAAAWA (0x58).

At 11:52:37, it decided to upload another key, but gave the new key the same id, AAAAWA. The server rejected the request.

For the next few days, it gets stuck in a loop: every 60 seconds, it:

Eventually, on 14th May, it has generated so many new keys which it hasn't uploaded (specifically, 100) that it starts forgetting about some, starting with AAAAWA - which means that the next upload request succeeds and uploads the 100 brand-new keys. Meanwhile there are still 40 or so unused one-time keys on the server, waiting to be claimed by other users.
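The way the loop eventually unsticks itself can be illustrated with a toy model. This is only a sketch under stated assumptions: the cap of 100 unpublished keys comes from the description above, but the one-key-per-cycle rate, the oldest-first eviction, and the names `ToyServer` and `run` are all invented for illustration and do not reflect the real client's cadence or code.

```python
from collections import OrderedDict

MAX_UNPUBLISHED = 100  # cap described above; the eviction order is an assumption


class ToyServer:
    """Hypothetical stand-in for the homeserver's one-time-key store."""

    def __init__(self):
        self.keys = {}  # key id -> public key

    def upload(self, batch):
        # Reject the whole batch if any id is already taken by a different key.
        for key_id, key in batch.items():
            if key_id in self.keys and self.keys[key_id] != key:
                return False
        self.keys.update(batch)
        return True


def run(max_cycles):
    server = ToyServer()
    server.keys[0x58] = "first AAAAWA"  # the earlier, successful upload
    # The device holds a second, different key under the same id.
    unpublished = OrderedDict({0x58: "second AAAAWA"})
    next_id = 0x59
    for cycle in range(1, max_cycles + 1):
        if server.upload(dict(unpublished)):
            return cycle, len(unpublished)
        # Rejected: generate one more key (assumed rate) and retry next cycle.
        unpublished[next_id] = f"key {next_id:#x}"
        next_id += 1
        if len(unpublished) > MAX_UNPUBLISHED:
            unpublished.popitem(last=False)  # forget the oldest unpublished key
    return None


print(run(200))
```

In this model every upload fails while the conflicting AAAAWA is still in the batch, and the first upload after it is evicted goes through with a full batch of keys the server has never seen.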


The initial problem here is that the device tries to upload two (different) instances of AAAAWA. I'll look into what could have caused that, but what happened afterwards is an absolute catalogue of fail:

richvdh commented 7 years ago

I guess the double-upload of AAAAWA may have been a variant of element-hq/element-web#1209.

ghost commented 7 years ago

It's interesting that all keys fail because one key fails. Does it upload multiple keys in a single request, which is then rejected wholesale, or does it simply get stuck on the one key, and never tries to upload the other keys?

richvdh commented 7 years ago

It tries to upload all keys in one request.

ghost commented 7 years ago

Would it make sense to get a more granular response back from the server? ("These keys were accepted, these keys were rejected for reason x and these keys were rejected for reason y.")

richvdh commented 7 years ago

There wouldn't be any harm in having the server generate a more helpful response, but it's deliberately transactional currently: either all the keys are accepted or none of them are. It's probably easier for the client to throw away all the keys in the request if it gets rejected than to go through picking and choosing.
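To make the difference concrete, here is a rough sketch of the two behaviours side by side. Everything in it is hypothetical: the function names, the response shape, and the rule that re-uploading an identical key is harmless are assumptions for illustration, not the server's actual API.

```python
def upload_transactional(store, batch):
    """All-or-nothing: if any key id conflicts, nothing is stored."""
    conflicts = [k for k in batch if k in store and store[k] != batch[k]]
    if conflicts:
        return {"accepted": [], "rejected": sorted(batch)}
    store.update(batch)
    return {"accepted": sorted(batch), "rejected": []}


def upload_granular(store, batch):
    """Hypothetical alternative: store what we can, report the rest."""
    accepted, rejected = [], {}
    for key_id in sorted(batch):
        if key_id in store and store[key_id] != batch[key_id]:
            rejected[key_id] = "key id already in use"
        else:
            store[key_id] = batch[key_id]
            accepted.append(key_id)
    return {"accepted": accepted, "rejected": rejected}


existing = {"AAAAWA": "old key"}
batch = {"AAAAWA": "new key", "AAAAWQ": "another key"}

print(upload_transactional(dict(existing), batch))
print(upload_granular(dict(existing), batch))
```

With the transactional version the client only has to handle one outcome per request; with the granular one it would need to reconcile its local key store against a per-key result.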