Frequent message loss with MQTT #830

opened 3 months ago

We have the following setup on production for MQTT.

  1. 5 EMQX brokers (Version 3.X)
  2. AWS load balancer to distribute load across the MQTT brokers (and HAProxy in some environments)
  3. Paho MQTT python client (Version 1.1)

We are noticing an issue where messages are frequently dropped (around 1 or 2 in every 100 messages).

MQTT connect configuration setup is as follows

client_id = "<random_int_from_1_to_100>_<current_hostname>"
clean_session = False
keep alive timeout = 60

How are the messages published? We have X Celery workers publishing to the same topic in parallel, with a message rate of at most 10/s. The client ID is unique to each Celery worker because it includes the hostname.

For the messages which are getting dropped or missed, paho MQTT library is returning a 0 on publish indicating the message was published successfully.

Sample code for publish

        (res, mid) = self.conn.publish(topic=topic, payload=payload, qos=qos)
        if res == 0:
            log.debug(f"Successfully published message::{res} with id {mid} for payload::{payload}")
        else:
            log.error(f"Error publishing message::{res} with id {mid} for payload::{payload}")

But there are no logs in EMQX (even with debug logs enabled) for the messages that have been dropped. This happens only in production, where multiple clients publish to the same topic; with a single client we haven't noticed the issue.

Is there an issue with the configuration above, or would upgrading to a newer version of the library help fix the issue? Or could this be something specific to the EMQX broker?


"Library version: 1.1" - this dates back to 2015, and there have been a considerable number of updates in the interim (some addressing issues that could potentially lead to message loss). I would suggest trying the latest release (but note that V2 has dropped support for Python 3.6). Unfortunately, I suspect you will struggle to find anyone prepared to attempt to diagnose the issue with a version of the library this old (especially as issues like this can be very hard to duplicate).

Paho MQTT library is returning a 0 on publish indicating...

Please note that publish will return before the transaction is complete (and, potentially, before the message is even sent if _max_inflight_messages applies).
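To illustrate the point, here is a minimal sketch of waiting for the publish to actually complete rather than trusting the immediate return code. It assumes a newer paho-mqtt release (1.6 or later, where `publish()` returns an `MQTTMessageInfo` object and `wait_for_publish()` accepts a timeout), that the network loop is already running (for example via `loop_start()`), and that QoS 1 or 2 is in use. The helper name `publish_and_confirm` and the 5-second timeout are hypothetical, not part of the library.

```python
MQTT_ERR_SUCCESS = 0  # same value as paho.mqtt.client.MQTT_ERR_SUCCESS


def publish_and_confirm(client, topic, payload, qos=1, timeout=5.0):
    """Publish and report whether the message was confirmed as sent.

    `client` is assumed to be a connected paho.mqtt.client.Client whose
    network loop is running in another thread; without a running loop
    the wait below can never succeed.
    """
    info = client.publish(topic, payload, qos=qos)
    if info.rc != MQTT_ERR_SUCCESS:
        # The message was not even queued (e.g. no connection, queue full).
        return False
    # rc == 0 only means "accepted by the client"; block until the
    # handshake for this QoS level finishes or the timeout expires.
    info.wait_for_publish(timeout)
    return info.is_published()
```

With the code from the question, this could be called as `publish_and_confirm(self.conn, topic, payload, qos=qos)`, logging an error whenever it returns `False`. An `on_publish` callback that records each completed `mid` would be an alternative that avoids blocking the worker.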

Other than that I cannot see anything obviously wrong with the code snippets provided (but you don't show the network loop etc). The fact that EMQX is not logging receipt of a message does seem to indicate that the issue is on the client side (access to the logs may be useful; it would be interesting to see if there is a gap in the message IDs).
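As a rough aid for the message ID check, the sketch below scans a list of `mid` values (for example, pulled from one client's publish log in the order they were logged) and reports any IDs that were skipped. It assumes the IDs come from a single client, that they increase by one per packet, and that they wrap from 65535 back to 1, since MQTT packet identifiers are 16-bit and non-zero. The function name `find_mid_gaps` is hypothetical. Note that subscribe and unsubscribe calls also consume IDs, so a reported gap is a hint rather than proof of a lost publish.

```python
MAX_MID = 65535  # MQTT packet identifiers are 16-bit and never zero


def find_mid_gaps(mids):
    """Return the IDs skipped between consecutive entries of `mids`."""
    missing = []
    for prev, cur in zip(mids, mids[1:]):
        if cur == prev:
            # Same ID logged twice (e.g. a retry); nothing was skipped.
            continue
        expected = prev % MAX_MID + 1  # wraps 65535 -> 1
        while expected != cur:
            missing.append(expected)
            expected = expected % MAX_MID + 1
    return missing
```

For example, `find_mid_gaps([1, 2, 3, 5, 6])` returns `[4]`. Comparing the IDs the client logged as published against the IDs the client logged as completed would show whether the missing messages ever left the client.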

