Open BillCarsonFr opened 10 months ago
Is this not a duplicate of #26873?
Could be one of the causes. This one is specifically after a backup reset.
I actually meant https://github.com/element-hq/element-web/issues/26783.
But the issue is "Message sending is slow when there a lot of keys to back up", right? We should combine the two issues, imho.
no, I misunderstood. The issue here is that the actual transaction to reset the key backup takes a very long time. (And if you restart during that time, everything is messed up)
Plan:
Previous plan doesn't work without a migration, because we can't efficiently query for records that don't have a matching `most_recent...` field.

Next idea: assign each backup version an order number, so each time we see a new backup version from the server, we set a `current_backup_order` value to the previous order + 1. Against each inbound group session we store the order of the most recent backup that contains this session, under e.g. `backed_up_to` (0 == never backed up).

In the backup loop we query `inbound_group_sessions` where `backed_up_to < current_backup_order`.
The above would require a migration.
Since we don't want to combine this with a migration to reduce value size, we can avoid a further migration by preparing now.
Prep means: add the `backed_up_to` field and set it to -1. This means these records will appear in the backup loop, and we will migrate them during that loop (instead of in a migration).
The logic within the backup loop looks like:

```
if backed_up_to == -1:
    if needs_backup:
        backed_up_to = 0
    else:
        backed_up_to = 1

if backed_up_to < current_backup_order:
    back_up_this_session()
    backed_up_to = current_backup_order
```
[Updated to use backed_up_to:-1 instead of a separate property]
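The pseudocode above can be exercised as a small runnable sketch (Python stands in for the Rust store code; the `process_session` function and its tuple return value are invented for illustration):

```python
def process_session(backed_up_to, needs_backup, current_backup_order):
    """Decide whether to back up one session, following the loop above.

    Returns (should_back_up, new_backed_up_to).
    """
    if backed_up_to == -1:
        # Lazily migrate pre-existing records instead of running a schema
        # migration: 0 == never backed up, 1 == covered by the first backup.
        backed_up_to = 0 if needs_backup else 1
    if backed_up_to < current_backup_order:
        return True, current_backup_order
    return False, backed_up_to
```

A record with `backed_up_to == -1` and no `needs_backup` flag is treated as already covered by the first backup, so it is skipped until the backup order advances.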
Some remarks on:

> Next idea: assign each backup version an order number, so each time we see a new backup version from the server, we set a current_backup_order value to the previous order + 1.
What happens in situations where new backup versions are created (without deleting the previous ones) without the client having been online to be aware of them, and then we start to delete them?
Something like:

The client is aware of version "1" and gives it order 0. Then it misses versions "2" and "3", but gets version "4", which will then have order 1. Now version "4" is deleted, making the old "3" current. The client gets to know it and will give it order 2.
I guess it still works? But can there be some edge cases due to that?
> Now version "4" is deleted, making the old "3" as current. The client gets to know it and will give it order 2
That's fine. The most recent backup is then version "3", and we will try to upload all of our keys to it.
Of course, version "3" might already have all of our keys, so that would be somewhat wasted effort. But it'll sort itself out in the end.
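A toy simulation of that version/order bookkeeping (the `observe` helper and dict store are invented for illustration; it assumes each newly seen version simply gets the next order number):

```python
orders = {}  # backup version -> order number; we keep every entry we've seen

def observe(version):
    """Record a backup version the first time we see it and return its order."""
    if version not in orders:
        orders[version] = len(orders)  # previous order + 1 (starting at 0)
    return orders[version]

# The client sees version "1", misses "2" and "3", then sees "4";
# later "4" is deleted and the old "3" becomes current.
```

Missed versions never get an order number, and a resurfaced old version just gets the next one, so `backed_up_to < current_backup_order` still selects its sessions for upload.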
Not blocking Element R release because if we do https://github.com/element-hq/element-web/issues/26930 we will be on-par with legacy crypto. (But we should still do this later.)
FTR, with the new db format it took 3 minutes to reset all my keys (a bit less than 300k keys) instead of several tens of minutes previously.
To speed up key backup resets:

* … in the core store, or in `backup_keys`. Default it to 1 and the current backup ID if it is empty.

To remove legacy code and data:

* If … This means we don't need the `not_backed_up` property any more. Any non-migrated keys will be backed up again, but this won't do harm except being slow.
* Remove the `not_backed_up` property and its related index.

I just had a conversation with @poljar and @richvdh about how we can implement this. Here are my notes from our conversation:
* Change `inbound_group_sessions_for_backup` and `mark_inbound_group_sessions_as_backed_up` so they take in the backup version. We checked where they are called, and we do have access to the version in both places. It's stored somewhere deep inside `PendingBackupRequest`. We discussed and agreed that this is a sensible API change for all crypto storage backends, because in both cases we are talking about the "current backup", which was implicit, and this allows us to be more explicit (maybe even avoiding races) by saying exactly which backup we mean. https://github.com/element-hq/crypto-internal/issues/206
* Add a `backup_versions` store in Indexed DB that will hold a mapping of backup version -> backup order number. We decided we should keep all of them, not just the latest, so that if some old backup is still running we won't somehow understand that as a new backup and bump our version again. https://github.com/element-hq/crypto-internal/issues/207
* `MemoryStore`: https://github.com/element-hq/crypto-internal/issues/208

In `mark_inbound_group_sessions_as_backed_up`:

* Look up the order number for this version in `backup_versions`.
* Set the `backed_up_to` field for each entry we are saving.

In `inbound_group_sessions_for_backup`:

* Look up `backup_versions`.
* Use it as `current_backup_order` in the algorithm from the above comment.
* (… `mark_inbound_group_sessions_as_backed_up` is called.)

Thinking through the case when this has never happened before:

* The `backup_versions` object store is empty.
* We call `inbound_group_sessions_for_backup` and find no entry for the current backup. In this special case where `backup_versions` is empty, we should default to 1 as its order number. Then the algorithm from the comment above will still work.
* `mark_inbound_group_sessions_as_backed_up` is called. We will add an entry to `backup_versions` for this backup version. We should choose order number 1 because there are no other entries in the store. From now on things should work as planned, because order number 1 means "the backup was current when we started this process" when we don't know better.

Had a further conversation with @richvdh today where we realised we can simplify this a bit, but still make use of the `backed_up_to` field we added in the last schema upgrade.
We realised we didn't need the `backup_versions` mapping. It had 2 purposes:

* To handle `backed_up_to==-1 && !needs_backup` => assume `backed_up_to==1`. Here `1` has the special meaning of "the first backup version we encounter after we start using the new schema". I describe below how we track the "pre-existing backup".
* To allow comparing with `>` or `<`, to prevent us "downgrading" a session to be backed up by an older version. During our conversation today we decided this can never happen, as I describe below.

So instead of storing a backup order in `backed_up_to`, we can directly store the `backup_version` in there. The name is already OK, and while it's slightly weird that `backed_up_to` is already populated with `-1`, it won't actually do any harm to store strings in there.
Previously, we were concerned about receiving an acknowledgement of a backed-up session for an old backup version, when actually a new backup was already in progress.
Today, we reviewed the code in `backup()` and `mark_request_as_sent()` in `crates/matrix-sdk-crypto/src/backups/mod.rs` and convinced ourselves that because the `BackupMachine` keeps track of the current backup request, it will always reject any outdated response, because it doesn't match the `request_id` we are expecting. So old backup acks will be ignored.

Thus, when we receive a backup ack we can safely assume it is for the latest version, and store that version next to the session so we know it has been backed up. Also, when looking for sessions to back up we can just look for rows whose `backed_up_to` != the supplied version.
We also considered the case where the user deletes a new backup and we start re-using an old one. In this case, the order comparison we suggested was working against us: we would refuse to back up a session to an "older" backup than the current one, but when the newer one has been deleted, we actually want to back up to the older one. So again, the correct comparison is `supplied_backup_version != backed_up_to`.
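A minimal sketch of that selection rule (session records modelled as dicts; field names follow the discussion, everything else is illustrative):

```python
def sessions_for_backup(sessions, backup_version):
    """Select sessions not yet acknowledged for the supplied backup version.

    `!=` rather than `<`: if a newer backup was deleted and an older one
    became current again, we still want to upload to the older one.
    """
    return [s for s in sessions if s["backed_up_to"] != backup_version]
```

With this rule, a session acked against the deleted backup "v4" is still picked up when the surviving "v3" becomes current again.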
If we are looking for sessions to back up and we encounter one with `backed_up_to==-1`, this means it existed before we started writing to `backed_up_to`. (I.e. before the code I am working on now.)

In our previous idea we decided to treat `backed_up_to==-1 && !needs_backup` as if `backed_up_to==1`, and we knew that the first time we encountered a new `backup_version` we would set its order to 1.

Instead, we can simply track whether we are currently still on the first backup, or not. So, before `reset_backup_state()` is called, we assume that `backed_up_to==-1 && !needs_backup` means the session was backed up in the current backup, and otherwise it was not. So we only need special logic to handle `backed_up_to==-1` when we have never reset the backup.
Note: when another client resets, deletes or adds a backup, we will call `reset_backup_state()`, which is exactly what we need here.
I will describe the code we will write in the next comment.
Create a store called `backup_reset` with no index.
Check inside the `backup_reset` store for a `backup_reset` property. If it does not exist or its value is != true, set a local variable `backup_reset` to false. Otherwise set it to true.
Search for sessions whose `backed_up_to` is != `backup_version`. On indexeddb this will require us to iterate everything `< backup_version` and then `> backup_version`, but this is do-able.
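IndexedDB key ranges can't express "key != value" directly, hence the two range scans. An in-memory sketch, with a sorted list standing in for the index and `bisect` playing the role of the `IDBKeyRange` bounds (all keys assumed to be strings here):

```python
import bisect

def keys_not_equal(sorted_keys, backup_version):
    """Return index keys < backup_version followed by keys > backup_version."""
    lo = bisect.bisect_left(sorted_keys, backup_version)   # first key >= version
    hi = bisect.bisect_right(sorted_keys, backup_version)  # first key > version
    return sorted_keys[:lo] + sorted_keys[hi:]
```

In the real store this would be two cursor iterations over `IDBKeyRange.upperBound(v, true)` and `IDBKeyRange.lowerBound(v, true)` rather than list slices.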
If `!backup_reset`, filter the list, excluding entries where `backed_up_to==-1 && !needs_backup`.
Process and limit the list as before, and return it.
To mark a session as backed up, remove its `needs_backup` flag as before, and also store `backed_up_to=backup_version`.
Add a comment to this code explaining that if we receive a call to this function with an old version, it can't be a race because of the `request_id` mechanism in `BackupMachine`, so it must be that the newer backup was deleted, and writing the supplied version is always correct.
Insert an entry into the `backup_reset` store with field `backup_reset` and set it to true. Optional: only do this if it is not already set to true.
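The steps above can be put together as an in-memory sketch (plain dicts and lists stand in for the IndexedDB object stores; everything besides the store/field/function names from this plan is invented for illustration):

```python
backup_reset_store = {}  # stands in for the `backup_reset` object store
sessions = []            # stands in for the inbound group session store

def inbound_group_sessions_for_backup(backup_version, limit=10):
    backup_reset = backup_reset_store.get("backup_reset") is True
    out = []
    for s in sessions:
        if s["backed_up_to"] == backup_version:
            continue  # already acknowledged for this backup version
        if not backup_reset and s["backed_up_to"] == -1 and not s["needs_backup"]:
            continue  # pre-upgrade row, assumed covered by the current backup
        out.append(s)
        if len(out) >= limit:
            break
    return out

def mark_inbound_group_sessions_as_backed_up(marked, backup_version):
    # An "old" version here can't be a race (the request_id mechanism in
    # BackupMachine filters stale acks), so writing it is always correct.
    for s in marked:
        s["needs_backup"] = False
        s["backed_up_to"] = backup_version

def reset_backup_state():
    # The key win: no per-session writes on reset, just one flag.
    backup_reset_store["backup_reset"] = True
```

Resetting only flips the flag, so the multi-minute write transaction over every session disappears; sessions acked against the old version fall out of the `!=` comparison on the next backup loop.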
Steps to reproduce
=> Starting now, if you type a message, it won't be sent for several minutes, depending on the number of room keys in the database
What is happening
See here https://github.com/element-hq/element-web/issues/26783#issuecomment-1867413139
Basically, resetting the backup will have to do a big write transaction that will lock the `inbound_group_sessions2` object store for several minutes. It will also lock the whole database due to the Rust `save_changes_lock` that is taken when saving changes after a sync => sync will be blocked, sending will be blocked.
ANOTHER PROBLEM: If you close the tab or refresh the page, the transaction won't be retried, so the keys won't be marked to be uploaded to the new backup.
Difference with legacy
The `inbound_group_sessions2` Rust object store is bigger (encrypted byte array of length ~3700) than the legacy one (JSON with a pickled key, a string of length ~700), so the Rust db gets slower faster.