element-hq / element-desktop

A glossy Matrix collaboration client for desktop.
https://element.io
GNU Affero General Public License v3.0
1.13k stars 260 forks source link

"Unable to restore session" due to failure loading OlmAccount from indexedDB #1816

Open richvdh opened 4 months ago

richvdh commented 4 months ago

We've seen a few user reports of "Unable to restore session" errors with things like the following in the logs:

2024-04-27T15:40:54.673Z E Unable to load session invalid type: integer `123`, expected a string at line 1 column 4
2024-04-30T06:35:34.559Z E Unable to load session invalid type: map, expected a sequence at line 1 column 0
2024-04-30T00:39:25.404Z E Unable to load session invalid type: null, expected a sequence at line 1 column 4

All of these suggest a problem deserialising the pickled account data in the core.account row in indexeddb. The first one suggests that the pickleKey has been deleted from the matrix-react-sdk indexeddb. The other two suggest some sort of malformed data in the matrix-js-sdk::matrix-sdk-crypto store. I have no real ideas how we're getting into this state.

richvdh commented 3 months ago

We suspect some of these failure modes might be an interaction with user being a guest and then trying to log in later.

KrisVos130 commented 3 months ago

I have had this issue with Element Desktop about a dozen times over the past year or so. I submitted 3 rageshakes to https://github.com/element-hq/element-desktop/issues/1632.

As I get this issue every couple of weeks, if there's any additional logging I can turn on to help fix this issue, or answer any questions, please let me know.

kegsay commented 3 months ago

I've just seen your bug report: thank you so much for being detailed on what you did. Thanks to this, I can clearly see that Unable to load session invalid type: integer123`` is only happening on Electron (Desktop) clients.

This can be reproduced by deleting the pickleKey object store in the matrix-react-sdk database and then refreshing the client. What we're concerned about is whether this is A) uniquely caused by the new rust crypto stack, and if not B) is more frequent with the new rust crypto stack.

You say that you've had this "over the past year" - do you recall the earliest time and did you send a bug report when this happened? The earliest report I see from that user ID is Apr 22.

Also, do you ever try nightly builds of Element Desktop? How frequently do you upgrade versions? I'm wondering if you could be hitting https://github.com/electron/electron/issues/27734 ?

KrisVos130 commented 3 months ago

I did some digging in my own notes and chats, but couldn't find exact dates of every time it happened unfortunately. I see I sent a rageshake on 2023-07-22 around 9am CEST to https://github.com/vector-im/element-desktop/issues/692 (which I had completely forgotten about). Many other times, I can imagine I either didn't send a rageshake due to privacy concerns, or I was unable to.

So I don't know the earliest time this exact issue happened to me, but the earliest time I can recall and sent a rageshake was 2023-07-22.

I also see my first report in the issue on element-desktop still uses Olm, so I assume that report and the times I experienced it before that were not using the rust crypto stack. Could be wrong of course. I also don't remember if I enabled rust crypto in labs or not. I doubt it, but can't be sure.

I haven't noticed an increase of this bug recently. Once every couple of weeks.

I never use the nightly builds of Element Desktop. I update every 1-3 weeks, and as the package in the AUR is not usually updated right away, it's not rare to be 1-2 versions behind. In my latest report using the official tarballs, I hadn't updated Element in weeks, and hadn't updated packages in general in a few days. I also never downgrade Element, or mix and match different versions.

I have a few profiles in Element, but have exclusively used the default one for months.

KrisVos130 commented 3 months ago

I did some more digging, and cross-referenced inactive sessions with some logging I have, and was able to get the dates of every time I had this issue in the past year or so. I am pretty confident these are all of them. It also appears that I overestimated how often this happened total. I have likely had this outcome a dozen times, but not in the past year, and with multiple different causes.

Looking at the dates, it does appear that the issue started happening more often more recently. I don't know what versions/dates the rust crypto became available/default, so can't comment on the relation to that.

KrisVos130 commented 3 months ago

...and hopefully this will be my last message for a little while, but just had it happen again (rageshake sent). I was looking through the various tables in IndexedDB via devtools, noticed that the matrix-react-sdk pickleKey table was empty, was curious whether that meant it would be unable to restore the session, and sure enough. I was careful to not delete anything, so I didn't accidentally delete/edit/update/add anything in there. And it appears to always be empty, so even now with it being empty I can exit and re-open Element just fine...

And probably unrelated to this issue, but downloading logs via the submit debug logs crashed Element desktop twice for me.

kegsay commented 3 months ago

Thanks, this is all very valuable information. I would be very interested to know if any other tables are empty when this happens. Specifically:

kegsay commented 3 months ago

Worth noting that for Unable to load session invalid type: null, expected a sequence at line 1 column 4 we expect https://github.com/element-hq/element-web/issues/27520 is the root cause (browser downgrades, electron downgrades). Some users who reported that their computer crashed see this as well.

The missing pickleKey (error Unable to load session invalid type: integer123, expected a string at line 1 column 4) suggests things got nuked from the DB. Looking at Chrome Web Storage design docs, there is a clear difference between indexedDB and localStorage internally:

When an origin is evicted, all of its quota-managed data (Indexed DB, Cache API, etc) is deleted.

Screenshot 2024-06-18 at 15 04 40

This implies that indexedDB could be wiped but not local or session storage. This would put the app into an inconsistent state. This is at odds with MDN Eviction docs which states:

When an origin's data is evicted by the browser, all of its data, not parts of it, is deleted at the same time. If the origin had stored data by using IndexedDB and the Cache API for example, then both types of data are deleted.

Only deleting some of the origin's data could cause inconsistency problems.

If the app is in an inconsistent state I could see this causing some of these errors. Separately, why now. What is different with EW-R that is causing us to see this more frequently?

KrisVos130 commented 2 months ago

Just experienced the issue again.

Screenshot_20240627_061609

I took some screenshots of a few of the mentioned tables, so I can censor those and send those too if needed. And I made a backup of the .config/Electron , so could possibly answer additional questions regarding content of IndexedDB/Local Storage/Session Storage without waiting for the bug to happen (though I imagine parsing it won't be the easiest).

kegsay commented 2 months ago

Amazing, thank you! I'll look into this soon.

kegsay commented 2 months ago

Also seen:

kegsay commented 2 months ago

It also looks like this may have been an issue for a while - https://github.com/element-hq/element-web/issues/9338 seems eerily similar.

kegsay commented 2 months ago

My best guess on this is that hard crashes are corrupting the database. This has increased with rust SDK because the indexedDB implementation does not ask for durable writes because it can't.

I'm guessing that Electron is more affected because there's more opportunity for hard crashes - restarting a machine with chrome restores the tabs which I wonder if that behaves differently to closing an app in Electron.

richvdh commented 1 month ago

2024-04-27T15:40:54.673Z E Unable to load session invalid type: integer 123, expected a string at line 1 column 4

I believe this is what will happen on Element Desktop, if the pickle key is not provided by the system keychain when Element starts. (When you first log in, a key is stored in the system keyring, which is then used to encrypt the E2EE database. If that key is removed from the system keychain, then you'll get this error.) Certainly that is what seems to have happened in @KrisVos130's report of 2024-06-11. The clue is this line in the logs:

2024-06-11T17:54:24.990Z E Expected a pickle key, but none provided.  Encryption may not work.
richvdh commented 1 month ago

Also seen:

  • Failed to read large IndexedDB value: twice from different users on Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Element/1.11.69 Chrome/124.0.6367.207 Electron/30.0.5 Safari/537.36 - which appears to be "Electron Application 30 on Linux". One report mentions that they had a hard crash. See https://issues.chromium.org/issues/342779913 ?

Tracking this over at element-hq/element-web#27870 because I think it's different.

richvdh commented 1 month ago

I strongly suspect that at least Unable to load session invalid type: integer `123`, expected a string is related to https://github.com/element-hq/element-desktop/issues/1077. I think:

t3chguy commented 1 month ago

My best guess on this is that hard crashes are corrupting the database. This has increased with rust SDK because the indexedDB implementation does not ask for durable writes https://github.com/rustwasm/wasm-bindgen/issues/3959.

This issue has been closed, can we now use durable writes?

richvdh commented 1 month ago

Yes, I believe so. That's tracked over at https://github.com/matrix-org/matrix-rust-sdk/issues/3354.

I'm a bit unconvinced that indexeddb corruption is the actual problem here.

KrisVos130 commented 1 month ago

I have been using 1.11.68 for a month or so, and just downloaded 1.11.72 as it has some extra logging for this or similar issues. The first time I opened 1.11.72, I got the "Unable to restore session" screen.
I am not 100% sure if it is the same issue, or of it is related to updating, could be a coincidence.

The "Unable to load session" error is slightly different than the three listed in comment 1 of this issue, Unable to load session invalid type: sequence, expected a string at line 1 column 0. Previously, I think I only had the 123 one.

I also saw the following in journalctl:

element-desktop[6475]: Error getting the event index passphrase out of the secret store [Error: Can't find session /org/freedesktop/secrets/session/3]

I also saw something about the problem potentially being with the interaction with the wallet. So for some extra context, in the kdewallet, I can see passwords/keys for the sessions that have failed to load over the past few months. So if the issue is with the wallet, the password still exists at the very least. I also have had no issues with any other program so far.

richvdh commented 1 month ago

It's becoming clear to me that these are three completely different root causes.

2024-04-27T15:40:54.673Z E Unable to load session invalid type: integer `123`, expected a string at line 1 column 4

This is what used to happen when the database was encrypted, but the picklekey was missing: the account is stored as an array of bytes containing the json serialization of an EncryptedValueBase64, which starts [123 (123 == {) but the code is expecting to find an object corresponding to the JSON serialization of a PickledAccount. (Apparently serde_json doesn't mind that it is an array rather than an object.)

As of v1.11.70 (which does a bit more checking on the values it is trying to deserialise, this may be replaced by simply Unable to load session Error: An object failed to be decrypted while unpickling. Either:

2024-04-30T06:35:34.559Z E Unable to load session invalid type: map, expected a sequence at line 1 column 0

This is, probably, the opposite problem: the database is not encrypted, but we have a picklekey. The code is expecting to find an array of bytes, but in fact we have an unencrypted JSON serialization of a PickledAccount.

Again, this has changed as of v1.11.70: it will always report "An object failed to be decrypted while unpickling".

2024-04-30T00:39:25.404Z E Unable to load session invalid type: null, expected a sequence at line 1 column 4

This is a bit odd. Again, it happens when we do have a picklekey, and one thing that could cause it is when the stored "account" is just null, rather than an encrypted object. Once more, this will become "An object failed to be decrypted while unpickling" from v1.11.70.

It's rather unfortunate that everything will end up with the same error from v1.11.70 :/

richvdh commented 1 month ago

@KrisVos130: thanks for the update. Yes, this does look like basically the same problem; v1.11.70 added a database migration step, and if that step runs when the encryption key is missing it will spit out that error.

richvdh commented 1 month ago

I notice that this only seems to happen on the electron platform. I'll transfer it to the element-desktop repo.

richvdh commented 1 month ago

Also, I'm pretty sure it's related to https://github.com/element-hq/element-desktop/issues/1077.

richvdh commented 1 month ago

I think my next step here is going to be to add some sanity-checking to the bit where we try to load the access token from the database. Currently I think that is failing, but the error is being silently swallowed, and then we get this more cryptic error from the crypto stack. At least if we fail earlier we can give better error messages in the logs.