Messages received while client closed not decrypted

afriedmanGlacier commented 2 years ago

Describe the bug When re-opening the client after it has been closed, during which time encrypted messages have been sent to that user, the messages are not decrypted.

Note: This was originally part of https://github.com/conversejs/converse.js/issues/2241 but seems to be a different issue.

To Reproduce Steps to reproduce the behavior:

Close the client
Send OMEMO messages to that user from another device (MUC or 1:1 but seen more frequently in MUC)
Re-open the client
Messages sent while client was closed and fetched by MAM upon opening cannot be decrypted.

Expected behavior The decryption issue does not happen 100% of the time, but quite often. Messages should always be able to be decrypted.

Environment (please complete the following information):

Desktop - We are using a version of converse-desktop that has been updated to 8.0.1 plus most of the latest commits. It's possible that the web client doesn't have the issue we are seeing (it seems like it could be a timing issue on opening).
We are using IndexedDB, ejabberd 21.01, MAM

Additional context After the update for getting sender real JID (https://github.com/conversejs/converse.js/commit/33f3cb76aded2312888fc7a5d493813f850548c5), we are correctly receiving the JID, but the decryption still often fails when opening the app for messages that arrived while closed. In looking at the message attributes during the decryption process, I noticed that it fails with no warning in plugins/omemo/utils.parseEncryptedMessage because it doesn't have attrs.encrypted.key.

In the attrs: "is_encrypted":true,"encrypted":{"device_id":"406865841"}. In other words it doesn't have the iv, key, etc.

However, the correct rid is clearly among the keys in the message. In headless/shared/parsers.getEncryptionAttributes, the line const device_id = _converse.omemo_store?.get('device_id'); doesn't return a device_id. The store appears to exist but device_id doesn't seem to be accessible yet when the first MAM messages come in for some reason.

In #2241 , @licaon-kter made the comment "Messages sent while Converse was not connected are almost surely not decrypted. Note: no key change, no IndexDB clean up. The old messages (already decrypted) are shown fine though." which sounds like it might be related to this particular error.

licaon-kter commented 2 years ago

Yup, seeing the same but didn't repro when I was using debug :)

Basically they do get decrypted (afaics), since the notification shows them just fine (!!!) but on screen I get a red text with an error, then the text disappears and nothing is left.

Will try to grab a log asap.

afriedmanGlacier commented 2 years ago

FYI, I ran the same test, but instead of closing the client, I turned wifi off to the computer so that the client disconnected. When I turned wifi on and the client reconnected, I got the same result (not able to decrypt) for messages sent to both MUC and 1:1 while disconnected. Why would omemo_store not know it's own device_id after just a disconnect with the client still open? That seems really strange.

afriedmanGlacier commented 2 years ago

It looks like disconnecting/reconnecting (for instance in headless/shared/connection.reconnect) call a tearDown in headless/core, which in turn calls _converse.api.trigger('afterTearDown'); which then calls delete _converse.omemo_store.

Then, I think at some point during the process but after MAM has returned values that can't be decrypted because the omemo_store isn't repopulated, restoreOMEMOSession is called. That's about as far as I can get, I hope that's helpful!

jcbrand commented 2 years ago

Thanks for your detailed bug report and comments @afriedmanGlacier, it helps a lot.

Based on everything you've wrote, your suspicion that it might be a race condition and looking at the code, I have a hunch as to what the issue is.

It could be that MAM messages arrive and start to get parsed, before the OMEMO session has been fully restored, which is why attrs.encryption.key is undefined.

I'm working on a fix.

afriedmanGlacier commented 2 years ago

Initial tests look good! Sent messages while client was closed and while client was disconnected. We'll test a bit more, but it looks promising!

jcbrand commented 2 years ago

@afriedmanGlacier: I'm done with the work in this ticket's branch.

Sounds like the issue is solved now? If so I'll merge the branch and close.

afriedmanGlacier commented 2 years ago

Yep! That should be good, thanks!

On Wed, Nov 24, 2021 at 4:46 PM JC Brand @.***> wrote:

@afriedmanGlacier https://github.com/afriedmanGlacier: I'm done with the work in this ticket's branch.

Sounds like the issue is solved now? If so I'll merge the branch and close.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/conversejs/converse.js/issues/2733#issuecomment-978281020, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFMALWQYSR2D3I4ROPJCRNTUNVMJ3ANCNFSM5ITU3C7A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

futurealecks commented 2 years ago

Thanks everyone!

licaon-kter commented 2 years ago

So this is really fixed for you?

Still seeing this with f25f110e0ea2435c8d509e7ea72e905284470e62

I've written some messages to the Converse user while Converse was not online, so I login and see the errors: 1open

...but the notifications look fine: 2notificationisfine

Then the error text changes: 3errorchanges

And then it disappears totally.

Console, with debug says (kinda for each message):

Uncaught (in promise) MessageCounterError: Message key not found. The counter was repeated or the key was not filled.
    doDecryptWhisperMessage libsignal-protocol.js:1982
    promise callback*doDecryptWhisperMessage/< libsignal-protocol.js:1982
    promise callback*doDecryptWhisperMessage libsignal-protocol.js:1981
    decryptWithSessionList libsignal-protocol.js:1971
    decryptWhisperMessage libsignal-protocol.js:1973
    promise callback*decryptWhisperMessage/< libsignal-protocol.js:1972
    promise callback*queueJobForNumber libsignal-protocol.js:1995
    decryptWhisperMessage libsignal-protocol.js:1972
    Dw utils.js:386
    Dw utils.js:248
    e core.js:250
    promise callback*7002/hook/e< core.js:250
    hook core.js:250
    jd parsers.js:216
    messages utils.js:81
    wf utils.js:80
    Sf utils.js:146
    xf utils.js:215
    initialize index.js:53
    Ls events.js:281
    Ds events.js:264
    Os events.js:50
    trigger events.js:254
    trigger core.js:224
    announceReconnection model.js:313
    onReconnection model.js:320
    initialize index.js:48
    forEach collection.js:344
    initialize index.js:48
    Ls events.js:280
    Ds events.js:264
    Os events.js:50
    trigger events.js:254
    trigger core.js:224
    onConnected connection.js:157
    onConnectStatusChanged connection.js:282
    _changeConnectStatus core.js:2394
    _onSessionResultIQ core.js:3014
    run core.js:1264
    _dataRecv core.js:2526
    forEachChild core.js:316
    _dataRecv core.js:2515
    _onMessage websocket.js:506
libsignal-protocol.js:1982:151

afriedmanGlacier commented 2 years ago

Sorry, I have not been seeing that error. It's possible that we changed something in our local copy that affects that issue, but I'd have to go back and compare to see if that's true. When I get the chance to look, I'll let you know if I find anything that looks like it might be related. Perhaps put your comments above in another ticket?

afriedmanGlacier commented 2 years ago

Actually, we are seeing something like this again. I think it's the same issue. Can we re-open this ticket?

Again, we are using a version of converse-desktop that has been updated with most of the latest commits. We are using IndexedDB, websockets, ejabberd 21.01, MAM

When we lose connection due to desktop device going to sleep or loss of wifi, and messages are sent while there is no connection, and then we reconnect, we can still lose the messages in spite of the fixes in this ticket. This now seems to occur more in 1:1 messages than groups, and it may be due to timing. Here's the sequence I traced out:

After reconnection, "initOMEMO" is called in plugins/omemo/utils.js. At the end of this function, "api.trigger('OMEMOInitialized');" is called, and a number of other functions are supposed to wait for this before proceeding, but that doesn't seem to be the case.
Before initOMEMO is finished, a MAM message arrives and enters the parseEncryptedMessage function in plugins/omemo/utils.js, and from there calls "await api.omemo?.getDeviceID();"
getDeviceId (in plugins/omemo/api.js) has this line "await api.waitUntil('OMEMOInitialized'); and I think should wait for the initOMEMO function to finish and trigger "OMEMOInitialized" before proceeding, but it does not. I don't really understand JavaScript Promises and why the waitUntil function doesn't work, but it does not seem to be working correctly.
Because the initOMEMO function hasn't finished and getDeviceId doesn't wait for the OMEMOInitialized trigger, getDeviceId returns an error. It looks like:

ERROR: TypeError: Cannot read property 'get' of undefined at Object.getDeviceID Unhandled Promise rejection: Cannot read property 'get' of undefined ; Zone: ; Task: Promise.then ; Value: TypeError: Cannot read property 'get' of undefined at Object.getDeviceID (api.js:21) at async Object.parseEncryptedMessage [as callback] ...

When this happens, the message gets lost. For me, it brings up two questions:

Why isn't the "waitUntil" function working?
Is it necessary to "delete _converse.omemo_store" when connection is lost? What is the reasoning behind doing that? There's probably a good reason, but I figured I'd ask.

EDIT: for the 2nd question, I just realized that even if we don't delete omemo_store, the sessions probably also need to be rebuilt before it tries to decrypt?

afriedmanGlacier commented 2 years ago

I realized that promises are added to the api.promises array, but never removed. So, maybe once "OMEMOInitialized" happens the first time, the "waitUntil" method will always return immediately instead of waiting because the Promise has already been resolved? Maybe that Promise needs to be removed on disconnect and re-added?

EDIT: I ended up adding "api.promises.add(['OMEMOInitialized']);" right after "delete _converse.omemo_store;" after the connection teardown in order to replace the Promise with a new one. That worked to cause getDeviceId and other things to wait for OMEMOInitialized to happen again, but then it still failed in "decryptWhisperMessage". So I'm not sure if there are other things that need to be re-initialized or not.

licaon-kter commented 2 years ago

Yup, still an issue with 0f648ed1ebdfad4653e78a1e561faceab843a88e

afriedmanGlacier commented 2 years ago

@licaon-kter in my testing that latest commit does fix most of the issue under the hood, but the messages are still lost so its not any different for the user. I am hopeful that https://github.com/conversejs/converse.js/pull/2803 should complete the fix. It works in my testing, I look forward to seeing if it works for you.

licaon-kter commented 2 years ago

@afriedmanGlacier I'll rebuild...

jcbrand commented 2 years ago

@afriedmanGlacier: yesterday I made some small changes to what you did in #2803 to get the tests to pass.

The main issue was that the parsed messages need to be reassigned to result.messages otherwise other code which relies on the parsed messages break.

You might want to double check again that things work correctly for you.

See here: https://github.com/conversejs/converse.js/commit/46e6cfb0d4098d10f097eafb0369f148e56c0957

afriedmanGlacier commented 2 years ago

@jcbrand I tried almost that exact same thing before, and couldn't get it to work. Since then, I realized there are two different scenarios that react differently (and I was testing the one that didn't work).

1) With a loss of connection (such as WIFI going down), within a minute I'll start getting ERROR: Websocket closed unexcectedly messages. With the fixes you made, if I receive a message while offline and the connection comes back fairly quickly after the websocket closes, then I often have issues decrypting the message(s) again.

2) If the connection is closed for a longer period of time, when it comes back online, I am not losing messages and they are are decrypting correctly.

In almost all of my initial testing I was turning off the wifi and waiting for the Websocket closed log message (Scenario 1 above). I think this occurs when the client or server realizes the TCP connection has been lost. If connection is restored fairly quickly it will resume the session using smacks.

However, if the connection is gone for more than about 5 minutes, there is a message in converse saying that the session cannot be resumed and a new one will be created. There is a websocket_timeout configuration parameter in ejabberd that defaults to 5 minutes, or more likely a mod_stream_mgmt option of max_resume_timeout (also default 5 minutes) that I think cause this session reset.

In real life, the more likely scenario is the second one (being offline more than 5 minutes), such as when a computer is locked or goes to sleep overnight. In this case, in my testing so far, your fix seems to be working. However, if I receive a message after the websocket is closed but before the time period when the session is reset, it doesn't always work.

Another thing I noticed, if I am open to a conversation in the desktop app in scenario 1 described above, I am more likely to have the decryption issues on a message sent than if a message was sent to a different conversation.

I've tried various things and found one solution that seems to work in both/all cases. It's a bit different than your commit. Assuming it proves to be solid after we test it a bit more, I'll either submit a PR or just post it here to see what you think.

jcbrand commented 2 years ago

Thanks for the explanation @afriedmanGlacier.

I think the difference between scenario 1 and 2 might be that in scenario 1 the delayed message is not received via MAM.

If the session is resumed via SMACKS, then delayed messages are received as normal messages, not as MAM messages.

The fix you made recently was for MAM messages. So the fix for scenario 1 is likely somewhere else, unrelated to MAM. Probably again some kind of race condition.

licaon-kter commented 2 years ago

Sounds like what I saw, mentioned here: https://github.com/conversejs/converse.js/pull/2803#issuecomment-1029770720

afriedmanGlacier commented 2 years ago

Based on what I was seeing, that makes sense, but I think the race condition is actually related to MAM.

I noticed yesterday that messages were coming in via both MAM and normal method (testing scenario 1 above), but if I remember correctly, I think the MAM message was coming in more quickly and was not getting decrypted for some reason. Then when the message came in normally and was properly decrypted, it was considered a duplicate and was thrown out. I added something from my local testing to your fix, and it seems to work although I'm not sure it's the best solution.

In the handleMAMResult function of headless/plugins/mam/utils.js, I changed

messages.forEach(m => model.queueMessage(m));

to

messages.forEach(m => {
    if ((m.is_encrypted && m.plaintext) || !m.is_encrypted) {
        model.queueMessage(m);
    }
});

Thoughts?

jcbrand commented 2 years ago

I think the MAM message was coming in more quickly and was not getting decrypted for some reason.

That sounds like the real bug then. We need to figure out the "some reason", e.g. why certain messages don't get decrypted.

Concerning your fix, that seems to be a bandaid around the fact that the message doesn't get decrypted. Better than nothing, but not ideal because the underlying issue isn't fixed.

In your fix you're basically preventing the undecrypted message from being processed further and shown. You're however implicitly relying on the fact that the message will come again in a decryptable form, which isn't guaranteed.

jcbrand commented 2 years ago

Thinking about this some more... the double-ratchet mechanism of libsignal prevents the same message from being decrypted twice.

So it makes sense that when the same message comes in twice, once via MAM and once via SMACKS resumption, that decryption for one of them will fail. I think that might be the "some reason".

If this is the case, it means the first message comes in, gets decrypted, then the second comes in, can't get decrypted, but it then still overwrites the first message, thereby removing the decrypted plaintext.

In light of that, I think your fix might be on the right track. The code needs to be updated to be aware of the fact that an already decrypted message can't be decrypted again and it shouldn't be allowed to overwrite the already decrypted message (which seems to me is what's happening).

There is message deduplicating code already, perhaps that code needs to be updated to deal with this scenario as well.

afriedmanGlacier commented 2 years ago

Sounds like you have a much better idea of how to proceed than I do :)

jcbrand commented 2 years ago

I'll see if I can reproduce it with a test and come up with a fix.

jcbrand commented 2 years ago

@afriedmanGlacier Please check whether this commit fixes it for you: 906f8d63347b453d1d5b44dbe637f539d20d6966

EDIT: I noticed an error and force-pushed to the original commit

afriedmanGlacier commented 2 years ago

I was testing with the original push, and was about to respond that it wasn't working quite right, but I'll test the update tomorrow. Thanks!

afriedmanGlacier commented 2 years ago

Progress! But not quite there. Now it is decrypting, updating, and showing on screen briefly, but the message is still disappearing a few seconds later as if the is_ephemeral part isn't getting reset to false with the update.

I tried also adding this to the updated attributes, but it didn't help:

is_error: false,
type: attrs.type,

One thing I noticed is that the one message received after the session is resumed by SMACKS starts decrypting (getting into libsignal and thereby "winning" the race), but the other MAM message is received at almost the same time, fails and completes processing more quickly. So the message object that is saved is the failed one and should be updated, but I'm not sure why after it updates it still disappears as if it was ephemeral.

Every once in a while the non-MAM message finishes decryption before the MAM message starts processing, and then the message stays on screen and doesn't disappear, like we would expect.

afriedmanGlacier commented 2 years ago

I think I figured this last issue out. When Message is initialized (headles/plugins/chat/message.js), setTimerForEphemeralMessage is called. But this.on('change:is_ephemeral') is only set if the message is NOT ephemeral to start with. So if it is created with an error and is ephemeral to start with, the timer doesn't get cleared when it changes.

So I locally moved the below code outside of the if (this.isEphemeral()) conditional block, and it worked correctly. Was it inside only the else section for a reason? Will the change I made affect anything negatively?

this.on('change:is_ephemeral', () => 
        this.isEphemeral() ? setTimer() : clearTimeout(this.ephemeral_timer)
);

licaon-kter commented 2 years ago

Testing with HEAD (https://github.com/conversejs/converse.js/commit/0fb4dcac9693f6ce883daef8bfca23db94aba173) I saw the error again at one point, red text, notification with actual message decrypted.

After an OS restart I shared a link from another account to this user, then I logged in Converse.

That 1:1 chat had 2 pics shared.

Waiting for the pictures to appear the link appeared first, but was positioned above the pics, basically was shown 5h earlier than the actual post time

jcbrand commented 2 years ago

@afriedmanGlacier: Great work, thanks!

Was it inside only the else section for a reason? Will the change I made affect anything negatively?

Looks like a bug to me. I've pushed a commit that fixes it based on your analysis.

Please let me know whether this is enough to close this ticket now

afriedmanGlacier commented 2 years ago

Since you are listening for a change in type, would you need to update that along with the other attributes in getUpdatedMessageAttributes? On the same topic, does it make sense to update is_error to false in that case since we set the error attributes to undefined? Something like:

return Object.assign({}, attrs, {
        error_condition: undefined,
        error_message: undefined,
        error_text: undefined,
        error_type: undefined,
        is_archived: attrs.is_archived,
        is_error: false,
        type: attrs.type,
        is_ephemeral: false,
});

I'll test today and get back to you. Thanks!

jcbrand commented 2 years ago

type is already contained in attrs, so there's no need to explicitly state it again.

Looks like I missed is_error, so I added that as undefined, thanks for the tip.

afriedmanGlacier commented 2 years ago

Tests look good so far, I think this can be closed! Thank you @jcbrand!

licaon-kter commented 2 years ago

Looks okaish from here too

conversejs / converse.js

Messages received while client closed not decrypted #2733