element-hq / element-web

A glossy Matrix collaboration client for the web.
https://element.io
GNU Affero General Public License v3.0

Element-R | Resetting key backup takes a long time; and blocks the whole application #26892

Open BillCarsonFr opened 10 months ago

BillCarsonFr commented 10 months ago

Steps to reproduce

=> Starting from that point, if you type a message it won't be sent for several minutes, depending on the number of room keys in the database

What is happening

See here https://github.com/element-hq/element-web/issues/26783#issuecomment-1867413139

Basically, resetting the backup has to do a big write transaction that locks the inbound_group_sessions2 object store for several minutes. It also locks the whole database, because of the rust save_changes_lock that is taken when saving changes after a sync.

=> Sync will be blocked, sending will be blocked

ANOTHER PROBLEM: If you close the tab or refresh the page, the transaction won't be retried, so the keys won't be marked for upload to the new backup

Difference with legacy

The entries in the rust inbound_group_sessions2 object store are bigger (an encrypted byte array of length ~3700) than the legacy ones (JSON with a pickled key, a string of length ~700). So the rust db gets slower faster.

richvdh commented 10 months ago

Is this not a duplicate of #26873?

BillCarsonFr commented 10 months ago

Is this not a duplicate of #26873?

Could be one of the causes. This one is specifically about what happens after a backup reset.

richvdh commented 10 months ago

I actually meant https://github.com/element-hq/element-web/issues/26783.

But the issue is "Message sending is slow when there a lot of keys to back up", right? We should combine the two issues, imho.

richvdh commented 10 months ago

But the issue is "Message sending is slow when there a lot of keys to back up", right? We should combine the two issues, imho.

No, I misunderstood. The issue here is that the actual transaction to reset the key backup takes a very long time. (And if you restart during that time, everything is messed up.)

andybalaam commented 9 months ago

Plan:

andybalaam commented 9 months ago

Previous plan doesn't work without a migration because we can't efficiently query for records that don't have a matching most_recent... field.

Next idea: assign each backup version an order number, so each time we see a new backup version from the server, we set a current_backup_order value to the previous order + 1. We store against each inbound_group_session the order of the most recent backup that contains this session under e.g. backed_up_to. (0 == never backed up)

In the backup loop we query inbound_group_sessions where backed_up_to < current_backup_order.

The above would require a migration.

Since we don't want to combine this with a migration to reduce value size, we can avoid a further migration by preparing now.

Prep means: add the backed_up_to field and set it to -1. This means these records will appear in the backup loop, and we will migrate them during that loop (instead of in a migration).

The logic within the backup loop looks like:

if backed_up_to == -1:
    if needs_backup:
        backed_up_to = 0
    else:
        backed_up_to = 1

if backed_up_to < current_backup_order:
    back_up_this_session()
    backed_up_to = current_backup_order

[Updated to use backed_up_to:-1 instead of a separate property]
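The loop above can be written out as a small runnable sketch. This is just an illustration of the plan as described, not the actual rust implementation; the field names and the `back_up` callback are assumptions:

```python
def process_session_for_backup(session, current_backup_order, back_up):
    """Sketch of the backup-loop logic above (field names assumed).

    backed_up_to semantics: -1 = pre-migration record never visited by
    this loop, 0 = never backed up, n > 0 = last backed up by the
    backup whose order is n.
    """
    # Lazy migration: fold the legacy needs_backup flag into
    # backed_up_to the first time the loop visits this session,
    # instead of doing it in a schema migration.
    if session["backed_up_to"] == -1:
        session["backed_up_to"] = 0 if session["needs_backup"] else 1

    # Back up anything last backed up before the current backup.
    if session["backed_up_to"] < current_backup_order:
        back_up(session)
        session["backed_up_to"] = current_backup_order
```

Note that a pre-migration session with no needs_backup flag ends up with backed_up_to == 1, so it is skipped while the first backup (order 1) is current, exactly as the plan intends.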

BillCarsonFr commented 9 months ago

Some remarks on

Next idea: assign each backup version an order number, so each time we see a new backup version from the server, we set a current_backup_order value to the previous order + 1.

What happens in situations where new backup versions are created (without deleting the previous ones) while the client wasn't online to be aware of them, and then we start deleting them?

Something like this:

The client is aware of version "1" and gives it order 0. Then it misses versions "2" and "3", but gets version "4", which will then have order 1.

Now version "4" is deleted, making the old "3" current again. The client gets to know about it and will give it order 2.

I guess it still works? But can there be edge cases due to that?
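The scenario can be checked against a toy version of the order-assignment scheme. A hypothetical sketch (class and method names are mine, not from the codebase), where each version gets a fresh order the first time this client sees it:

```python
class BackupOrderTracker:
    """Toy model of the order-number scheme: each backup version gets
    a monotonically increasing order the first time this client sees
    it, regardless of versions it missed while offline."""

    def __init__(self):
        self._orders = {}  # backup version -> order

    def see_version(self, version):
        # Assign the next free order the first time we see a version;
        # re-seeing a known version keeps its original order.
        if version not in self._orders:
            self._orders[version] = len(self._orders)
        return self._orders[version]
```

In the scenario above, "1" gets order 0, "4" gets order 1, and when the old "3" becomes current it gets order 2. Since order("3") > order("4"), any session with backed_up_to == order("4") will be re-uploaded to "3" by the `backed_up_to < current_backup_order` check, which matches the answer below: possibly wasted effort, but it sorts itself out.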

richvdh commented 9 months ago

Now version "4" is deleted, making the old "3" as current. The client gets to know it and will give it order 2

That's fine. The most recent backup is then version "3", and we will try to upload all of our keys to it.

Of course, version "3" might already have all of our keys, so that would be somewhat wasted effort. But it'll sort itself out in the end.

andybalaam commented 9 months ago

Not blocking Element R release because if we do https://github.com/element-hq/element-web/issues/26930 we will be on-par with legacy crypto. (But we should still do this later.)

BillCarsonFr commented 9 months ago

FTR, with the new db format it took 3 minutes to reset all my keys (a bit less than 300k keys), instead of several tens of minutes previously.

andybalaam commented 8 months ago

To speed up key backup resets:

To remove legacy code and data:

andybalaam commented 8 months ago

I just had a conversation with @poljar and @richvdh about how we can implement this. Here are my notes from our conversation:

Thinking through the case when this has never happened before:

andybalaam commented 7 months ago

Had a further conversation with @richvdh today where we realised we can simplify this a bit, but still make use of the backed_up_to field we added in the last schema upgrade.

No need for backup_versions mapping

We realised we didn't need the backup_versions mapping. It had 2 purposes:

  1. To make version "1" special: backed_up_to==-1 && !needs_backup => assume backed_up_to==1. Here 1 has the special meaning of "the first backup version we encounter after we start using the new schema". I describe below how we track the "pre-existing backup".
  2. To allow us to compare versions using > or <, to prevent us "downgrading" a session to be backed up by an older version. During our conversation today we decided this can never happen, as I describe below.

So instead of storing a backup order in backed_up_to we can directly store the backup_version in there. The name is already OK, and while it's slightly weird that backed_up_to is currently populated with -1, it won't actually do any harm to store strings in there.

No need to compare orders

Previously, we were concerned about receiving an acknowledgement of a backed-up session for an old backup version, when actually a new backup was already in progress.

Today, we reviewed the code in backup() and mark_request_as_sent() in crates/matrix-sdk-crypto/src/backups/mod.rs and convinced ourselves that because the BackupMachine keeps track of the current backup request, it will always reject any outdated response because it doesn't match the request_id we are expecting. So old backup acks will be ignored.

Thus, when we receive a backup ack we can safely assume it is for the latest version, and store that version next to the session so we know it has been backed up. Also, when looking for sessions to back up we can also just look for rows whose backed_up_to != the supplied version.

We also considered the case where the user deletes a new backup and we start re-using an old one. In this case, the order comparison we suggested works against us: we would refuse to back up a session to an "older" backup than the current one, but when the newer one has been deleted, we actually want to back up to the older one. So again, the correct comparison is supplied_backup_version != backed_up_to.
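That comparison is small enough to pin down in a sketch (the function name is mine, for illustration only):

```python
def needs_upload(backed_up_to, current_version):
    """A session needs uploading whenever it was last backed up to a
    *different* version, not merely an older one. Using != rather than
    an order comparison means that if the user deletes the new backup
    and an older one becomes current again, sessions backed up only to
    the deleted version are still re-uploaded."""
    return backed_up_to != current_version
```

For example, a session with backed_up_to == "v2" still needs uploading after "v2" is deleted and "v1" becomes current again, while a session already at "v1" does not.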

Pre-existing backup

If we are looking for sessions to back up and we encounter one with backed_up_to==-1 this means it existed before we started writing to backed_up_to. (I.e. before the code I am working on now.)

In our previous idea we decided to treat backed_up_to==-1 && !needs_backup as if backed_up_to==1, and we knew that the first time we encountered a new backup_version we would set its order to 1.

Instead, we can simply track whether we are currently still on the first backup, or not. So, before reset_backup_state() is called, we assume that backed_up_to==-1 && !needs_backup means the session was backed up in the current backup, and otherwise it was not. So we only need special logic to handle backed_up_to==-1 when we have never reset the backup.

Note: when another client resets, deletes or adds a backup, we will call reset_backup_state(), which is exactly what we need here.

I will describe the code we will write in the next comment.

andybalaam commented 7 months ago

Code

DB migration

Create a store called backup_reset with no index.

inbound_group_sessions_for_backup

Check inside the backup_reset store for a backup_reset property. If it does not exist or its value is != true, then set the local variable backup_reset to false. Otherwise set it to true.

Search for sessions whose backed_up_to is != backup_version. On indexeddb this will require us to iterate everything < backup_version and then > backup_version, but this is doable.

If !backup_reset then filter the list, excluding entries where backed_up_to==-1 && !needs_backup.

Process and limit the list as before, and return it.
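The steps above can be sketched as one function. This is a simplified model, not the rust/indexeddb code: `store` stands in for the backup_reset object store, `sessions` for the inbound_group_sessions2 store, and the != check is written as a plain filter where the real code would scan the two index ranges below and above backup_version:

```python
def inbound_group_sessions_for_backup(store, sessions, backup_version, limit):
    """Sketch of the session lookup described above (names assumed)."""
    # Step 1: has reset_backup_state ever run? Missing or != true
    # means false.
    backup_reset = store.get("backup_reset") is True

    picked = []
    for s in sessions:
        # Step 2: skip sessions already backed up to this version.
        # (IndexedDB has no != query; the real code iterates the
        # ranges < backup_version and > backup_version instead.)
        if s["backed_up_to"] == backup_version:
            continue
        # Step 3: before any reset, a pre-migration session
        # (backed_up_to == -1) without the needs_backup flag counts
        # as already backed up to the current backup.
        if not backup_reset and s["backed_up_to"] == -1 and not s["needs_backup"]:
            continue
        # Step 4: limit the list as before.
        picked.append(s)
        if len(picked) == limit:
            break
    return picked
```

With no reset recorded, only pre-migration sessions that still carry needs_backup (plus anything backed up to another version) are returned; once backup_reset is true, the needs_backup filter is dropped and every session not on the supplied version is a candidate.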

mark_inbound_group_sessions_as_backed_up

To mark a session as backed up, remove its needs_backup flag as before, and also store backed_up_to=backup_version.

Add a comment to this code explaining that if we receive a call to this function with an old version then it can't be a race because of the request_id mechanism in BackupMachine, so it must be that the newer backup was deleted, so writing the supplied version is always correct.
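A minimal sketch of that write, with the explanatory comment included (the function signature is assumed for illustration; the real code operates on indexeddb records):

```python
def mark_inbound_group_sessions_as_backed_up(sessions_by_id, session_ids, backup_version):
    """Sketch: mark the given sessions as backed up to backup_version.

    If this is called with an "old" version, it can't be a race:
    BackupMachine's request_id mechanism already rejects stale
    responses, so an old version here means the newer backup was
    deleted, and writing the supplied version is always correct.
    """
    for sid in session_ids:
        session = sessions_by_id[sid]
        session["needs_backup"] = False          # legacy flag, as before
        session["backed_up_to"] = backup_version # new: record the version
```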

reset_backup_state

Insert an entry into the backup_reset store with the field backup_reset set to true. Optional: only do this if it is not already set to true.