element-hq / element-web

A glossy Matrix collaboration client for the web.
https://element.io
GNU Affero General Public License v3.0

Element-R | Resetting key backup takes a long time; and blocks the whole application #26892

Open BillCarsonFr opened 10 months ago

BillCarsonFr commented 10 months ago

Steps to reproduce

=> Starting from that point, if you type a message it won't be sent for several minutes, depending on the number of room keys in the database

What is happening

See here https://github.com/element-hq/element-web/issues/26783#issuecomment-1867413139

Basically, resetting the backup has to do a big write transaction that locks the inbound_group_sessions2 object store for several minutes. It also locks the whole database, because of the rust save_changes_lock that is taken when saving changes after a sync.

=> Sync will be blocked, sending will be blocked

ANOTHER PROBLEM: If you close the tab or refresh the page, the transaction won't be retried, so the keys won't be marked for upload to the new backup

Difference with legacy

The entries in the rust inbound_group_sessions2 object store are bigger (an encrypted byte array of length ~3700) than the legacy ones (JSON with a pickled key, a string of length ~700). So the rust db gets slower faster.

richvdh commented 10 months ago

Is this not a duplicate of #26873?

BillCarsonFr commented 10 months ago

Is this not a duplicate of #26873?

Could be one of the causes. This one is specifically about what happens after a backup reset.

richvdh commented 10 months ago

I actually meant https://github.com/element-hq/element-web/issues/26783.

But the issue is "Message sending is slow when there a lot of keys to back up", right? We should combine the two issues, imho.

richvdh commented 10 months ago

But the issue is "Message sending is slow when there a lot of keys to back up", right? We should combine the two issues, imho.

No, I misunderstood. The issue here is that the actual transaction to reset the key backup takes a very long time. (And if you restart during that time, everything is messed up.)

andybalaam commented 9 months ago

Plan:

andybalaam commented 9 months ago

Previous plan doesn't work without a migration because we can't efficiently query for records that don't have a matching most_recent... field.

Next idea: assign each backup version an order number, so each time we see a new backup version from the server, we set a current_backup_order value to the previous order + 1. We store against each inbound_group_session the order of the most recent backup that contains this session under e.g. backed_up_to. (0 == never backed up)

In the backup loop we query inbound_group_sessions where backed_up_to < current_backup_order.

The above would require a migration.

Since we don't want to combine this with a migration to reduce value size, we can avoid a further migration by preparing now.

Prep means: add the backed_up_to field and set it to -1. This means these records will appear in the backup loop, and we will migrate them during that loop (instead of in a migration).

The logic within the backup loop looks like:

if backed_up_to == -1:
    if needs_backup:
        backed_up_to = 0
    else:
        backed_up_to = 1

if backed_up_to < current_backup_order:
    back_up_this_session()
    backed_up_to = current_backup_order

[Updated to use backed_up_to:-1 instead of a separate property]
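The loop above can be written out as a small runnable sketch. This is just an illustration of the plan as described, not the actual rust implementation; the field names and the `back_up` callback are assumptions:

```python
def process_session_for_backup(session, current_backup_order, back_up):
    """Sketch of the backup-loop logic above (field names assumed).

    backed_up_to semantics: -1 = pre-migration record never visited by
    this loop, 0 = never backed up, n > 0 = last backed up by the
    backup whose order is n.
    """
    # Lazy migration: fold the legacy needs_backup flag into
    # backed_up_to the first time the loop visits this session,
    # instead of doing it in a schema migration.
    if session["backed_up_to"] == -1:
        session["backed_up_to"] = 0 if session["needs_backup"] else 1

    # Back up anything last backed up before the current backup.
    if session["backed_up_to"] < current_backup_order:
        back_up(session)
        session["backed_up_to"] = current_backup_order
```

Note that a pre-migration session with no needs_backup flag ends up with backed_up_to == 1, so it is skipped while the first backup (order 1) is current, exactly as the plan intends.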

BillCarsonFr commented 9 months ago

Some remarks on

Next idea: assign each backup version an order number, so each time we see a new backup version from the server, we set a current_backup_order value to the previous order + 1.

What happens in situations where new backup versions are created (without deleting the previous ones) while the client wasn't online to be aware of them, and then we start deleting them?

Something like this:

The client is aware of version "1" and gives it order 0. Then it misses versions "2" and "3", but gets version "4", which will then have order 1.

Now version "4" is deleted, making the old "3" current again. The client gets to know about it and will give it order 2.

I guess it still works? But can there be edge cases due to that?
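The scenario can be checked against a toy version of the order-assignment scheme. A hypothetical sketch (class and method names are mine, not from the codebase), where each version gets a fresh order the first time this client sees it:

```python
class BackupOrderTracker:
    """Toy model of the order-number scheme: each backup version gets
    a monotonically increasing order the first time this client sees
    it, regardless of versions it missed while offline."""

    def __init__(self):
        self._orders = {}  # backup version -> order

    def see_version(self, version):
        # Assign the next free order the first time we see a version;
        # re-seeing a known version keeps its original order.
        if version not in self._orders:
            self._orders[version] = len(self._orders)
        return self._orders[version]
```

In the scenario above, "1" gets order 0, "4" gets order 1, and when the old "3" becomes current it gets order 2. Since order("3") > order("4"), any session with backed_up_to == order("4") will be re-uploaded to "3" by the `backed_up_to < current_backup_order` check, which matches the answer below: possibly wasted effort, but it sorts itself out.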

richvdh commented 9 months ago

Now version "4" is deleted, making the old "3" as current. The client gets to know it and will give it order 2

That's fine. The most recent backup is then version "3", and we will try to upload all of our keys to it.

Of course, version "3" might already have all of our keys, so that would be somewhat wasted effort. But it'll sort itself out in the end.

andybalaam commented 9 months ago

Not blocking Element R release because if we do https://github.com/element-hq/element-web/issues/26930 we will be on-par with legacy crypto. (But we should still do this later.)

BillCarsonFr commented 9 months ago

FTR, with the new db format it took 3 minutes to reset all my keys (a bit less than 300k keys), instead of several tens of minutes previously.

andybalaam commented 8 months ago

To speed up key backup resets:

To remove legacy code and data:

andybalaam commented 8 months ago

I just had a conversation with @poljar and @richvdh about how we can implement this. Here are my notes from our conversation:

Thinking through the case when this has never happened before:

andybalaam commented 7 months ago

Had a further conversation with @richvdh today where we realised we can simplify this a bit, but still make use of the backed_up_to field we added in the last schema upgrade.

No need for backup_versions mapping

We realised we didn't need the backup_versions mapping. It had 2 purposes:

  1. To make version "1" special: backed_up_to==-1 && !needs_backup => assume backed_up_to==1. Here 1 has the special meaning of "the first backup version we encounter after we start using the new schema". I describe below how we track the "pre-existing backup".
  2. To allow us to compare versions using > or <, to prevent us "downgrading" a session to be backed up by an older version. During our conversation today we decided this can never happen, as I describe below.

So instead of storing a backup order in backed_up_to we can directly store the backup_version in there. The name is already OK, and while it's slightly weird that backed_up_to is currently populated with -1, it won't actually do any harm to store strings in there.

No need to compare orders

Previously, we were concerned about receiving an acknowledgement of a backed-up session for an old backup version, when actually a new backup was already in progress.

Today, we reviewed the code in backup() and mark_request_as_sent() in crates/matrix-sdk-crypto/src/backups/mod.rs and convinced ourselves that because the BackupMachine keeps track of the current backup request, it will always reject any outdated response because it doesn't match the request_id we are expecting. So old backup acks will be ignored.

Thus, when we receive a backup ack we can safely assume it is for the latest version, and store that version next to the session so we know it has been backed up. Also, when looking for sessions to back up we can also just look for rows whose backed_up_to != the supplied version.

We also considered the case where the user deletes a new backup and we start re-using an old one. In this case, the order comparison we suggested works against us: we would refuse to back up a session to an "older" backup than the current one, but when the newer one has been deleted, we actually want to back up to the older one. So again, the correct comparison is supplied_backup_version != backed_up_to.
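That comparison is small enough to pin down in a sketch (the function name is mine, for illustration only):

```python
def needs_upload(backed_up_to, current_version):
    """A session needs uploading whenever it was last backed up to a
    *different* version, not merely an older one. Using != rather than
    an order comparison means that if the user deletes the new backup
    and an older one becomes current again, sessions backed up only to
    the deleted version are still re-uploaded."""
    return backed_up_to != current_version
```

For example, a session with backed_up_to == "v2" still needs uploading after "v2" is deleted and "v1" becomes current again, while a session already at "v1" does not.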

Pre-existing backup

If we are looking for sessions to back up and we encounter one with backed_up_to==-1 this means it existed before we started writing to backed_up_to. (I.e. before the code I am working on now.)

In our previous idea we decided to treat backed_up_to==-1 && !needs_backup as if backed_up_to==1, and we knew that the first time we encountered a new backup_version we would set its order to 1.

Instead, we can simply track whether we are currently still on the first backup, or not. So, before reset_backup_state() is called, we assume that backed_up_to==-1 && !needs_backup means the session was backed up in the current backup, and otherwise it was not. So we only need special logic to handle backed_up_to==-1 when we have never reset the backup.

Note: when another client resets, deletes or adds a backup, we will call reset_backup_state(), which is exactly what we need here.

I will describe the code we will write in the next comment.

andybalaam commented 7 months ago

Code

DB migration

Create a store called backup_reset with no index.

inbound_group_sessions_for_backup

Check inside the backup_reset store for a backup_reset property. If it does not exist or its value is != true, then set the local variable backup_reset to false. Otherwise set it to true.

Search for sessions whose backed_up_to is != backup_version. On indexeddb this will require us to iterate everything < backup_version and then > backup_version, but this is doable.

If !backup_reset then filter the list, excluding entries where backed_up_to==-1 && !needs_backup.

Process and limit the list as before, and return it.
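The steps above can be sketched as one function. This is a simplified model, not the rust/indexeddb code: `store` stands in for the backup_reset object store, `sessions` for the inbound_group_sessions2 store, and the != check is written as a plain filter where the real code would scan the two index ranges below and above backup_version:

```python
def inbound_group_sessions_for_backup(store, sessions, backup_version, limit):
    """Sketch of the session lookup described above (names assumed)."""
    # Step 1: has reset_backup_state ever run? Missing or != true
    # means false.
    backup_reset = store.get("backup_reset") is True

    picked = []
    for s in sessions:
        # Step 2: skip sessions already backed up to this version.
        # (IndexedDB has no != query; the real code iterates the
        # ranges < backup_version and > backup_version instead.)
        if s["backed_up_to"] == backup_version:
            continue
        # Step 3: before any reset, a pre-migration session
        # (backed_up_to == -1) without the needs_backup flag counts
        # as already backed up to the current backup.
        if not backup_reset and s["backed_up_to"] == -1 and not s["needs_backup"]:
            continue
        # Step 4: limit the list as before.
        picked.append(s)
        if len(picked) == limit:
            break
    return picked
```

With no reset recorded, only pre-migration sessions that still carry needs_backup (plus anything backed up to another version) are returned; once backup_reset is true, the needs_backup filter is dropped and every session not on the supplied version is a candidate.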

mark_inbound_group_sessions_as_backed_up

To mark a session as backed up, remove its needs_backup flag as before, and also store backed_up_to=backup_version.

Add a comment to this code explaining that if we receive a call to this function with an old version then it can't be a race because of the request_id mechanism in BackupMachine, so it must be that the newer backup was deleted, so writing the supplied version is always correct.
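A minimal sketch of that write, with the explanatory comment included (the function signature is assumed for illustration; the real code operates on indexeddb records):

```python
def mark_inbound_group_sessions_as_backed_up(sessions_by_id, session_ids, backup_version):
    """Sketch: mark the given sessions as backed up to backup_version.

    If this is called with an "old" version, it can't be a race:
    BackupMachine's request_id mechanism already rejects stale
    responses, so an old version here means the newer backup was
    deleted, and writing the supplied version is always correct.
    """
    for sid in session_ids:
        session = sessions_by_id[sid]
        session["needs_backup"] = False          # legacy flag, as before
        session["backed_up_to"] = backup_version # new: record the version
```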

reset_backup_state

Insert an entry into the backup_reset store with the field backup_reset set to true. Optional: only do this if it is not already set to true.