esl / MongooseIM

MongooseIM is Erlang Solutions' robust, scalable and efficient XMPP server, aimed at large installations. Specifically designed for enterprise purposes, it is fault-tolerant and can utilise the resources of multiple clustered machines.
Other
1.66k stars 426 forks source link

Duplicated messages in mam_message table #4316

Open rthouvenin opened 1 month ago

rthouvenin commented 1 month ago

MongooseIM version: 6.2.1 Installed from: Docker image Erlang/OTP version: provided by the docker image

In case this is relevant: I am using JWT authentication

I am essentially facing the problem described in #4128.

On some occasions, all the messages of a conversation received by a user get duplicated in database in the mam_message table. This does not happen consistently (was not able to describe a reproducing scenario yet), and happens only when stream management is enabled and I think this happens when one of the users requests the conversation history. This makes me think there is a bug when there is a problem in delivering the history response that triggers a duplication of the history in DB.

When this happens for a user, it is likely to happen again soon, often resulting in dozens of copies of all received messages in the conversation. All the messages (same origin_id) are identical except for the timestamp, which seems to be the time when the duplication occured. So for example if a user received a single message in a conversation, and the bug occurs 3 times for that conversation, you would see something like the below in the mam_message table:

id |  origin_id   | direction | timestamp (extracted from mongooseimctl)
1  | message-id-1 | I         | 2024-07-04 14:53:42
2  | message-id-1 | I         | 2024-07-04 14:55:12
3  | message-id-1 | I         | 2024-07-04 15:03:27
4  | message-id-1 | I         | 2024-07-04 15:03:27
5  | message-id-1 | I         | 2024-07-04 18:25:51
6  | message-id-1 | I         | 2024-07-04 18:25:51
7  | message-id-1 | I         | 2024-07-04 18:25:51
8  | message-id-1 | I         | 2024-07-04 18:25:51
rthouvenin commented 1 month ago

I have not been able to find a reproducing scenario, but I was able to reproduce a few times by abruptly killing connections from the client side and creating new ones immediately after. This does not trigger the problem consistently though.

During these attempts I gathered as much traces and logs as I could to try to get clues about what's going on. Here is what I found that might be relevant:

Here is an example:

<!-- ORIGINAL -->
<message xmlns='jabber:client' 
         id='6d104193-5155-42c6-aaa2-e558c4079bc6' 
         to='c884302f-7797-4ca4-b68c-3a431a7d9024@localhost' 
         type='chat'>
    <origin-id xmlns='urn:xmpp:sid:0' id='6d104193-5155-42c6-aaa2-e558c4079bc6'/>
    <body>THIS IS MESSAGE 2!</body>
    <markable xmlns='urn:xmpp:chat-markers:0'/>
</message>

<!-- DUPLICATE -->
<message from='b4958e4b-fa52-46db-972e-9ad288776e62@localhost/1720-704391-9018-d7c41bad90715457' 
         to='c884302f-7797-4ca4-b68c-3a431a7d9024@localhost' 
         xmlns='jabber:client' 
         id='6d104193-5155-42c6-aaa2-e558c4079bc6' 
         type='chat'>
    <origin-id xmlns='urn:xmpp:sid:0' id='6d104193-5155-42c6-aaa2-e558c4079bc6'/>
    <body>THIS IS MESSAGE 2!</body>
    <markable xmlns='urn:xmpp:chat-markers:0'/>
    <stanza-id by='c884302f-7797-4ca4-b68c-3a431a7d9024@localhost' id='C77OREVP3K01' xmlns='urn:xmpp:sid:0'/>
</message>

The added stanza id is not the same for all duplicated messages. For instance, the message of my above example got duplicated 98 times, but there are 34 different stanza ids in the duplicate messages, each being present from 2 to 9 times. From what I can see the same stanza id is NOT used for the duplicates of two different original messages.

All of the above make me say: it looks like mongooseim keeps somewhere in memory or some transient storage a queue of stanzas even after they were sent and acknowledged (as mentioned in my original post, this seems to occur only when stream management is enabled), and these stanzas get routed as carbon copies to new clients of the same user. If they don't get acked by the new client, they get stored to the mam table again. This is of course only a rough idea of what I understand, in the hope that it may help someone with better knowledge of MongooseIM internals to understand the problem.

Let me know if there is more information that I can provide or experiments that I can do (bear in mind that I still don't know how to reproduce consistently). In my latest example, I have server logs (debug level), CSV dump of the mam_message table, websocket traces of clients.

JanuszJakubiec commented 1 week ago

Hi, thank you for bringing this issue to our attention. We are currently investigating it and will keep you updated on our progress.

rthouvenin commented 5 days ago

Thanks for looking at it. Something happened to me yesterday that might help the investigation:

I wanted to switch the storage format of the mam table to XML for a test instance, so I truncated the tables mam_message and inbox before changing the configuration. I noticed that some old messages re-appeared in the mam_message table, shortly after I emptied the table. The messages were all copies of a single old message (sent in July), which belonged to a user that had connected and disconnected just before I emptied the table. The disconnection was not clean (the user killed the client because of a bug).

I wanted to share this because it gives more evidence towards some of my guesses I made above : happens when a client abruptly disappears, and the messages are copied from a transient source (they could not be copied from DB since the table was empty). Hope this helps.