Closed afriedmanGlacier closed 2 years ago
Yup, seeing the same but didn't repro when I was using debug :)
Basically they do get decrypted (afaics), since the notification shows them just fine (!!!) but on screen I get a red text with an error, then the text disappears and nothing is left.
Will try to grab a log asap.
FYI, I ran the same test, but instead of closing the client, I turned wifi off to the computer so that the client disconnected. When I turned wifi on and the client reconnected, I got the same result (not able to decrypt) for messages sent to both MUC and 1:1 while disconnected. Why would omemo_store
not know it's own device_id
after just a disconnect with the client still open? That seems really strange.
It looks like disconnecting/reconnecting (for instance in headless/shared/connection.reconnect
) call a tearDown
in headless/core, which in turn calls _converse.api.trigger('afterTearDown');
which then calls delete _converse.omemo_store
.
Then, I think at some point during the process but after MAM has returned values that can't be decrypted because the omemo_store isn't repopulated, restoreOMEMOSession
is called. That's about as far as I can get, I hope that's helpful!
Thanks for your detailed bug report and comments @afriedmanGlacier, it helps a lot.
Based on everything you've wrote, your suspicion that it might be a race condition and looking at the code, I have a hunch as to what the issue is.
It could be that MAM messages arrive and start to get parsed, before the OMEMO session has been fully restored, which is why attrs.encryption.key
is undefined.
I'm working on a fix.
Initial tests look good! Sent messages while client was closed and while client was disconnected. We'll test a bit more, but it looks promising!
@afriedmanGlacier: I'm done with the work in this ticket's branch.
Sounds like the issue is solved now? If so I'll merge the branch and close.
Yep! That should be good, thanks!
On Wed, Nov 24, 2021 at 4:46 PM JC Brand @.***> wrote:
@afriedmanGlacier https://github.com/afriedmanGlacier: I'm done with the work in this ticket's branch.
Sounds like the issue is solved now? If so I'll merge the branch and close.
— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/conversejs/converse.js/issues/2733#issuecomment-978281020, or unsubscribe https://github.com/notifications/unsubscribe-auth/AFMALWQYSR2D3I4ROPJCRNTUNVMJ3ANCNFSM5ITU3C7A . Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.
Thanks everyone!
So this is really fixed for you?
Still seeing this with f25f110e0ea2435c8d509e7ea72e905284470e62
I've written some messages to the Converse user while Converse was not online, so I login and see the errors:
...but the notifications look fine:
Then the error text changes:
And then it disappears totally.
Console, with debug says (kinda for each message):
Uncaught (in promise) MessageCounterError: Message key not found. The counter was repeated or the key was not filled.
doDecryptWhisperMessage libsignal-protocol.js:1982
promise callback*doDecryptWhisperMessage/< libsignal-protocol.js:1982
promise callback*doDecryptWhisperMessage libsignal-protocol.js:1981
decryptWithSessionList libsignal-protocol.js:1971
decryptWhisperMessage libsignal-protocol.js:1973
promise callback*decryptWhisperMessage/< libsignal-protocol.js:1972
promise callback*queueJobForNumber libsignal-protocol.js:1995
decryptWhisperMessage libsignal-protocol.js:1972
Dw utils.js:386
Dw utils.js:248
e core.js:250
promise callback*7002/hook/e< core.js:250
hook core.js:250
jd parsers.js:216
messages utils.js:81
wf utils.js:80
Sf utils.js:146
xf utils.js:215
initialize index.js:53
Ls events.js:281
Ds events.js:264
Os events.js:50
trigger events.js:254
trigger core.js:224
announceReconnection model.js:313
onReconnection model.js:320
initialize index.js:48
forEach collection.js:344
initialize index.js:48
Ls events.js:280
Ds events.js:264
Os events.js:50
trigger events.js:254
trigger core.js:224
onConnected connection.js:157
onConnectStatusChanged connection.js:282
_changeConnectStatus core.js:2394
_onSessionResultIQ core.js:3014
run core.js:1264
_dataRecv core.js:2526
forEachChild core.js:316
_dataRecv core.js:2515
_onMessage websocket.js:506
libsignal-protocol.js:1982:151
Sorry, I have not been seeing that error. It's possible that we changed something in our local copy that affects that issue, but I'd have to go back and compare to see if that's true. When I get the chance to look, I'll let you know if I find anything that looks like it might be related. Perhaps put your comments above in another ticket?
Actually, we are seeing something like this again. I think it's the same issue. Can we re-open this ticket?
Again, we are using a version of converse-desktop that has been updated with most of the latest commits. We are using IndexedDB, websockets, ejabberd 21.01, MAM
When we lose connection due to desktop device going to sleep or loss of wifi, and messages are sent while there is no connection, and then we reconnect, we can still lose the messages in spite of the fixes in this ticket. This now seems to occur more in 1:1 messages than groups, and it may be due to timing. Here's the sequence I traced out:
ERROR: TypeError: Cannot read property 'get' of undefined at Object.getDeviceID
Unhandled Promise rejection: Cannot read property 'get' of undefined ; Zone:
When this happens, the message gets lost. For me, it brings up two questions:
EDIT: for the 2nd question, I just realized that even if we don't delete omemo_store, the sessions probably also need to be rebuilt before it tries to decrypt?
I realized that promises are added to the api.promises array, but never removed. So, maybe once "OMEMOInitialized" happens the first time, the "waitUntil" method will always return immediately instead of waiting because the Promise has already been resolved? Maybe that Promise needs to be removed on disconnect and re-added?
EDIT: I ended up adding "api.promises.add(['OMEMOInitialized']);" right after "delete _converse.omemo_store;" after the connection teardown in order to replace the Promise with a new one. That worked to cause getDeviceId and other things to wait for OMEMOInitialized to happen again, but then it still failed in "decryptWhisperMessage". So I'm not sure if there are other things that need to be re-initialized or not.
Yup, still an issue with 0f648ed1ebdfad4653e78a1e561faceab843a88e
@licaon-kter in my testing that latest commit does fix most of the issue under the hood, but the messages are still lost so its not any different for the user. I am hopeful that https://github.com/conversejs/converse.js/pull/2803 should complete the fix. It works in my testing, I look forward to seeing if it works for you.
@afriedmanGlacier I'll rebuild...
@afriedmanGlacier: yesterday I made some small changes to what you did in #2803 to get the tests to pass.
The main issue was that the parsed messages need to be reassigned to result.messages
otherwise other code which relies on the parsed messages break.
You might want to double check again that things work correctly for you.
See here: https://github.com/conversejs/converse.js/commit/46e6cfb0d4098d10f097eafb0369f148e56c0957
@jcbrand I tried almost that exact same thing before, and couldn't get it to work. Since then, I realized there are two different scenarios that react differently (and I was testing the one that didn't work).
1) With a loss of connection (such as WIFI going down), within a minute I'll start getting ERROR: Websocket closed unexcectedly
messages. With the fixes you made, if I receive a message while offline and the connection comes back fairly quickly after the websocket closes, then I often have issues decrypting the message(s) again.
2) If the connection is closed for a longer period of time, when it comes back online, I am not losing messages and they are are decrypting correctly.
In almost all of my initial testing I was turning off the wifi and waiting for the Websocket closed
log message (Scenario 1 above). I think this occurs when the client or server realizes the TCP connection has been lost. If connection is restored fairly quickly it will resume the session using smacks
.
However, if the connection is gone for more than about 5 minutes, there is a message in converse saying that the session cannot be resumed and a new one will be created. There is a websocket_timeout
configuration parameter in ejabberd that defaults to 5 minutes, or more likely a mod_stream_mgmt
option of max_resume_timeout
(also default 5 minutes) that I think cause this session reset.
In real life, the more likely scenario is the second one (being offline more than 5 minutes), such as when a computer is locked or goes to sleep overnight. In this case, in my testing so far, your fix seems to be working. However, if I receive a message after the websocket is closed but before the time period when the session is reset, it doesn't always work.
Another thing I noticed, if I am open to a conversation in the desktop app in scenario 1 described above, I am more likely to have the decryption issues on a message sent than if a message was sent to a different conversation.
I've tried various things and found one solution that seems to work in both/all cases. It's a bit different than your commit. Assuming it proves to be solid after we test it a bit more, I'll either submit a PR or just post it here to see what you think.
Thanks for the explanation @afriedmanGlacier.
I think the difference between scenario 1 and 2 might be that in scenario 1 the delayed message is not received via MAM.
If the session is resumed via SMACKS, then delayed messages are received as normal messages, not as MAM messages.
The fix you made recently was for MAM messages. So the fix for scenario 1 is likely somewhere else, unrelated to MAM. Probably again some kind of race condition.
Sounds like what I saw, mentioned here: https://github.com/conversejs/converse.js/pull/2803#issuecomment-1029770720
Based on what I was seeing, that makes sense, but I think the race condition is actually related to MAM.
I noticed yesterday that messages were coming in via both MAM and normal method (testing scenario 1 above), but if I remember correctly, I think the MAM message was coming in more quickly and was not getting decrypted for some reason. Then when the message came in normally and was properly decrypted, it was considered a duplicate and was thrown out. I added something from my local testing to your fix, and it seems to work although I'm not sure it's the best solution.
In the handleMAMResult function of headless/plugins/mam/utils.js, I changed
messages.forEach(m => model.queueMessage(m));
to
messages.forEach(m => {
if ((m.is_encrypted && m.plaintext) || !m.is_encrypted) {
model.queueMessage(m);
}
});
Thoughts?
I think the MAM message was coming in more quickly and was not getting decrypted for some reason.
That sounds like the real bug then. We need to figure out the "some reason", e.g. why certain messages don't get decrypted.
Concerning your fix, that seems to be a bandaid around the fact that the message doesn't get decrypted. Better than nothing, but not ideal because the underlying issue isn't fixed.
In your fix you're basically preventing the undecrypted message from being processed further and shown. You're however implicitly relying on the fact that the message will come again in a decryptable form, which isn't guaranteed.
Thinking about this some more... the double-ratchet mechanism of libsignal prevents the same message from being decrypted twice.
So it makes sense that when the same message comes in twice, once via MAM and once via SMACKS resumption, that decryption for one of them will fail. I think that might be the "some reason".
If this is the case, it means the first message comes in, gets decrypted, then the second comes in, can't get decrypted, but it then still overwrites the first message, thereby removing the decrypted plaintext.
In light of that, I think your fix might be on the right track. The code needs to be updated to be aware of the fact that an already decrypted message can't be decrypted again and it shouldn't be allowed to overwrite the already decrypted message (which seems to me is what's happening).
There is message deduplicating code already, perhaps that code needs to be updated to deal with this scenario as well.
Sounds like you have a much better idea of how to proceed than I do :)
I'll see if I can reproduce it with a test and come up with a fix.
@afriedmanGlacier Please check whether this commit fixes it for you: 906f8d63347b453d1d5b44dbe637f539d20d6966
EDIT: I noticed an error and force-pushed to the original commit
I was testing with the original push, and was about to respond that it wasn't working quite right, but I'll test the update tomorrow. Thanks!
Progress! But not quite there. Now it is decrypting, updating, and showing on screen briefly, but the message is still disappearing a few seconds later as if the is_ephemeral
part isn't getting reset to false
with the update.
I tried also adding this to the updated attributes, but it didn't help:
is_error: false,
type: attrs.type,
One thing I noticed is that the one message received after the session is resumed by SMACKS starts decrypting (getting into libsignal and thereby "winning" the race), but the other MAM message is received at almost the same time, fails and completes processing more quickly. So the message object that is saved is the failed one and should be updated, but I'm not sure why after it updates it still disappears as if it was ephemeral.
Every once in a while the non-MAM message finishes decryption before the MAM message starts processing, and then the message stays on screen and doesn't disappear, like we would expect.
I think I figured this last issue out. When Message is initialized (headles/plugins/chat/message.js), setTimerForEphemeralMessage
is called. But this.on('change:is_ephemeral')
is only set if the message is NOT ephemeral to start with. So if it is created with an error and is ephemeral to start with, the timer doesn't get cleared when it changes.
So I locally moved the below code outside of the if (this.isEphemeral())
conditional block, and it worked correctly. Was it inside only the else
section for a reason? Will the change I made affect anything negatively?
this.on('change:is_ephemeral', () =>
this.isEphemeral() ? setTimer() : clearTimeout(this.ephemeral_timer)
);
Testing with HEAD (https://github.com/conversejs/converse.js/commit/0fb4dcac9693f6ce883daef8bfca23db94aba173) I saw the error again at one point, red text, notification with actual message decrypted.
After an OS restart I shared a link from another account to this user, then I logged in Converse.
That 1:1 chat had 2 pics shared.
Waiting for the pictures to appear the link appeared first, but was positioned above the pics, basically was shown 5h earlier than the actual post time
@afriedmanGlacier: Great work, thanks!
Was it inside only the else section for a reason? Will the change I made affect anything negatively?
Looks like a bug to me. I've pushed a commit that fixes it based on your analysis.
Please let me know whether this is enough to close this ticket now
Since you are listening for a change in type
, would you need to update that along with the other attributes in getUpdatedMessageAttributes
? On the same topic, does it make sense to update is_error
to false in that case since we set the error attributes to undefined? Something like:
return Object.assign({}, attrs, {
error_condition: undefined,
error_message: undefined,
error_text: undefined,
error_type: undefined,
is_archived: attrs.is_archived,
is_error: false,
type: attrs.type,
is_ephemeral: false,
});
I'll test today and get back to you. Thanks!
type
is already contained in attrs
, so there's no need to explicitly state it again.
Looks like I missed is_error
, so I added that as undefined
, thanks for the tip.
Tests look good so far, I think this can be closed! Thank you @jcbrand!
Looks okaish from here too
Describe the bug When re-opening the client after it has been closed, during which time encrypted messages have been sent to that user, the messages are not decrypted.
Note: This was originally part of https://github.com/conversejs/converse.js/issues/2241 but seems to be a different issue.
To Reproduce Steps to reproduce the behavior:
Expected behavior The decryption issue does not happen 100% of the time, but quite often. Messages should always be able to be decrypted.
Environment (please complete the following information):
Additional context After the update for getting sender real JID (https://github.com/conversejs/converse.js/commit/33f3cb76aded2312888fc7a5d493813f850548c5), we are correctly receiving the JID, but the decryption still often fails when opening the app for messages that arrived while closed. In looking at the message attributes during the decryption process, I noticed that it fails with no warning in plugins/omemo/utils.parseEncryptedMessage because it doesn't have attrs.encrypted.key.
In the attrs: "is_encrypted":true,"encrypted":{"device_id":"406865841"}. In other words it doesn't have the iv, key, etc.
However, the correct rid is clearly among the keys in the message. In headless/shared/parsers.getEncryptionAttributes, the line const device_id = _converse.omemo_store?.get('device_id'); doesn't return a device_id. The store appears to exist but device_id doesn't seem to be accessible yet when the first MAM messages come in for some reason.
In #2241 , @licaon-kter made the comment "Messages sent while Converse was not connected are almost surely not decrypted. Note: no key change, no IndexDB clean up. The old messages (already decrypted) are shown fine though." which sounds like it might be related to this particular error.