TryQuiet / quiet

A private, p2p alternative to Slack and Discord built on Tor & IPFS
https://www.tryquiet.org
GNU General Public License v3.0

Missing messages (unexplained) #381

Closed holmesworcester closed 2 years ago

holmesworcester commented 2 years ago

Right now some messages never display and we don't know why. It happens with users we've seen before, not just new users.

I posted some logs to Slack showing a case where this happens.

https://zbay.slack.com/files/UTAQELTJ8/F038LS34KCM/archive.zip

holmesworcester commented 2 years ago

the solution should include a nectar/waggle regression test for this, I think.

holmesworcester commented 2 years ago

Related: https://github.com/ZbayApp/monorepo/issues/394

holmesworcester commented 2 years ago

Here's an example of what I'm seeing. Note that it's from an account I've already seen messages from, which invalidates the hypothesis that missing messages are only due to slowness syncing the user table.

Also, it's the older instance that is missing the messages. So there's some other issue here.

image

holmesworcester commented 2 years ago

Ideas:

My memory is that we've been seeing this issue for a while.

siepra commented 2 years ago

It seems not to be a problem with sagas' logic for verification/filtering out messages https://github.com/ZbayApp/monorepo/pull/395

holmesworcester commented 2 years ago

This happened again in Quiet alpha 5. It happened when my Mac instance reconnected to the network after being asleep for a while. It synced some but not all new messages.

EmiM commented 2 years ago

Did those messages never arrive, or did they arrive but with a big lag?

Did you manage to find a repeatable way to reproduce this problem? You wrote that it happened after the computer was asleep for a while. Does it always happen this way?

holmesworcester commented 2 years ago

I don't have the machine where the issue happened, so I can't say. You could possibly check to confirm this by using the files I sent you for the data directories.

I didn't manage to find steps to reproduce it.

EmiM commented 2 years ago

Edit: I could only reproduce it on Windows (aka "second machine"). I am not sure if that's because of the OS or the fact that I unplugged the ethernet cable, but I couldn't reproduce it the other way around (Linux being disconnected, Windows sending messages).

I managed to reproduce a similar problem, but only by disconnecting one of the Quiet apps from the network without closing it. These are the steps I took:

After quiet3 and quiet2 reconnected, they were able to send and receive new messages, but the past messages were never replicated.

EmiM commented 2 years ago

New discovery: image (2)

Missing message "I received a message but Windows did not start replicating missing messages. Will it trigger now?"

The logs show that orbitdb did receive this entry, but it didn't trigger the replicate.progress event, and that's why the message is missing in our app. This may be a different case from the one I described in the comment above, because it was triggered by reopening quiet1 at some point during testing.

Attaching all logs from the app in the broken state and part of the logs from the app in the proper state: app1MissingMessage.log app2AllMessages.log app1MissingMessagesFinalSnapshot.log

The state didn't heal on sending or receiving messages, and it didn't heal on restarting the apps either. That makes sense, since the entry is already saved in the local orbitdb store. However, it's good news, because in this case we just have to implement a mechanism that makes sure we have gathered all needed entries.
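A minimal sketch of what such a mechanism could look like, assuming the app can list the entries in the local store and knows which entry hashes it has already displayed. The `LogEntry` shape and the `findUndisplayedEntries` name are hypothetical, made up for illustration; this is not Quiet's actual workaround, and how the stored entries are read from orbitdb is left out on purpose.

```typescript
// Hypothetical shape of a stored entry; real orbitdb entries carry more fields.
interface LogEntry {
  hash: string;
  payload: string;
}

// Returns the entries that are in the local store but were never displayed,
// e.g. because the event that should have announced them never fired.
function findUndisplayedEntries(
  storedEntries: LogEntry[],
  displayedHashes: Set<string>
): LogEntry[] {
  return storedEntries.filter((entry) => !displayedHashes.has(entry.hash));
}

// Example: the store holds three entries, but the event-driven path
// only delivered two of them to the UI.
const stored: LogEntry[] = [
  { hash: "a1", payload: "hello" },
  { hash: "b2", payload: "Will it trigger now?" },
  { hash: "c3", payload: "bye" },
];
const displayed = new Set<string>(["a1", "c3"]);

// These are the entries the app would need to re-deliver to the UI.
const missing = findUndisplayedEntries(stored, displayed);
console.log(missing.map((entry) => entry.hash));
```

Running a check like this after replication events, or on a timer, would not depend on every entry producing its own event, which is the part that failed in the case above.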

EmiM commented 2 years ago

https://github.com/orbitdb/orbit-db-store/issues/122 Created issue in orbit-db-store for the case described above ^

EmiM commented 2 years ago

We decided to close this task because we already have a workaround on our side, so it should not affect user experience anymore. The orbitdb maintainers don't know what could be causing this, but they will be working on a replicator rewrite anyway, so the best thing right now is to move on and wait for their refactoring.