eclipse-mosquitto / mosquitto

Eclipse Mosquitto - An open source MQTT broker
https://mosquitto.org

Draining messages between two bridged mosquitto servers #1612

Open detyo opened 4 years ago

detyo commented 4 years ago

Hi, I'm testing the draining of QoS 2 messages between two Mosquitto MQTT brokers, and it fails once the overall message count exceeds a certain threshold. My setup is as follows: one Mosquitto server is configured to bridge to another one over plain MQTT/TCP, with no authentication. Initially the second Mosquitto server is shut down, and an external client (Eclipse Paho) sends 100k QoS 2 MQTT 3.1.1 messages, 200 bytes each, to the first Mosquitto server. After all messages have been published successfully, I start the second Mosquitto server and the bridge connection is established.

At this point the first Mosquitto server starts publishing the messages to the second one, but for some reason it does this so fast that responses do not seem to be processed in time: the "Publish received" (PUBREC) responses seem to be sent too slowly, and at some point the second Mosquitto server sees "TCP ZeroWindow" and the connection gets reset. This happens at roughly the 65,000th message, which leads me to believe I'm hitting some kind of threshold in either Mosquitto or the TCP stack. After that, the bridge connection is re-established and the first server tries to re-send the messages, which fails again, and the cycle repeats.

I'm running Mosquitto 1.6.9 built from source on Windows (both servers are on the same machine but on different ports). I have also tried running the first server on Debian (Raspberry Pi 3), and I get the same results.

Is there any configuration option I'm not aware of, or any spec restriction or internal Mosquitto behaviour, that could explain this?

I'm attaching the Mosquitto bridge configuration (mosquitto.conf.txt). I can attach logs if required, but this should be relatively easy to reproduce.

Thanks, Detelin

BrandtHill commented 4 years ago

MQTT publish/subscribe packets carry a 2-byte packet identifier when QoS > 0. That could explain the error occurring around the 64K (maximum uint16) mark. Maybe try it with QoS 0 and see if it still chokes at 64K.
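To make the arithmetic behind that suggestion concrete, here is a minimal Python sketch of the 2-byte packet identifier. It assumes only what MQTT 3.1.1 specifies (a non-zero, big-endian, 16-bit field); `encode_packet_id` and `next_packet_id` are hypothetical helper names for illustration and are not taken from the Mosquitto source.

```python
import struct

# Largest value that fits in the 2-byte packet identifier field.
MAX_PACKET_ID = 0xFFFF


def encode_packet_id(packet_id: int) -> bytes:
    """Encode a packet identifier as 2 bytes, big-endian; 0 is not allowed."""
    if not 1 <= packet_id <= MAX_PACKET_ID:
        raise ValueError("packet identifier must be in 1..65535")
    return struct.pack("!H", packet_id)


def next_packet_id(current: int) -> int:
    """Hypothetical allocator: advance by one and wrap from 65535 back to 1."""
    return 1 if current >= MAX_PACKET_ID else current + 1


backlog = 100_000  # message count from the report above

print(MAX_PACKET_ID)                  # 65535 distinct identifiers
print(backlog - MAX_PACKET_ID)        # 34465 messages beyond the identifier space
print(encode_packet_id(65535).hex())  # ffff
print(next_packet_id(65535))          # 1
```

If every one of the 100k queued messages needed its own unacknowledged identifier at the same time, the identifier space would run out after 65535 of them, which is close to where the failure was observed. Whether the bridge actually behaves this way is not confirmed in this thread.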

detyo commented 4 years ago

Hi @BrandtHill, I could not reproduce the issue with QoS 0, so you might be right about this. I'm not sure how to debug this further. I was wondering what internal timeout Mosquitto uses to decide that a message has not been delivered and to attempt re-delivery when publishing with QoS > 0. Perhaps if there is a way to increase it, this might work?

BrandtHill commented 4 years ago

I'm not sure how many republish attempts it makes, nor how long it waits before deciding to attempt a republish. I don't see anything that stands out in the configuration settings. I would look through the source code to find out exactly what it does. Maybe try dumping 60K messages instead of 100K; that way it would be under the 64K limit, and you could determine whether that was the culprit.

Also, if you set max_inflight_messages and max_queued_messages to actual limits, it might slow the sending broker down to the point where the broker/network can keep up and acknowledge the incoming messages.
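The throttling effect of an in-flight limit can be illustrated with a toy Python model. This is only a sketch under stated assumptions, not Mosquitto's implementation: `drain` is a hypothetical function, each loop iteration stands for one round in which the receiver completes `acks_per_tick` handshakes, and a limit of 0 is treated as "no maximum", mirroring the documented meaning of that value for max_inflight_messages.

```python
from collections import deque


def drain(total: int, max_inflight: int, acks_per_tick: int = 1) -> int:
    """Toy sender: fill the in-flight window, then wait for acknowledgements.

    Returns the largest number of unacknowledged messages seen at once.
    """
    queued = deque(range(total))
    inflight = deque()
    peak = 0
    while queued or inflight:
        # Send while the window has room (0 means no limit in this model).
        while queued and (max_inflight == 0 or len(inflight) < max_inflight):
            inflight.append(queued.popleft())
        peak = max(peak, len(inflight))
        # The receiver completes a few handshakes per round.
        for _ in range(min(acks_per_tick, len(inflight))):
            inflight.popleft()
    return peak


print(drain(100_000, 0))   # 100000: the whole backlog is sent before any acknowledgement
print(drain(100_000, 20))  # 20: the sender never runs ahead of the window
```

With a finite window the number of outstanding messages stays bounded regardless of backlog size, which is the behaviour the suggestion above relies on. The value 20 is an arbitrary example, and a finite max_queued_messages would also cap how many messages can be held while the remote broker is down, so both settings need to be chosen with the backlog size in mind.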