eclipse-mosquitto / mosquitto

Eclipse Mosquitto - An open source MQTT broker
https://mosquitto.org
Other
9.07k stars 2.39k forks source link

1.6 bridge won't reconnect on bridge connection loss #1334

Open caevv opened 5 years ago

caevv commented 5 years ago

After upgrading to 1.6.3 from 1.5.x I found that when the bridge connection is lost the bridge won't recover when the connection is re-established.

Only solution has been restarting mosquitto.

Mosquitto running on docker.

mosquitto.conf

port 1883

pid_file /var/run/mosquitto.pid

persistence true
persistence_location /mosquitto/data/

# queue unlimited messages
max_queued_messages 0

log_type all
log_dest file /mosquitto/log/mosquitto.log

connection bridge-01
address mqtt.host.com:8883
topic # out 2 v1/vessel/ v1.parenttopic/xxxx/
remote_clientid xxxx
notifications true
cleansession false
keepalive_interval 180
restart_timeout 60 300
bridge_cafile /mosquitto/config/mqtt-ca.cert
bridge_certfile /mosquitto/config/mqtt-bridge-client.cert
bridge_keyfile /mosquitto/config/mqtt-bridge-client.key

Bridge name: bridge-01 Connection lost at 1562929130 Connection re-established at 1562929523


1562929130: Bridge local.xxxx doing local SUBSCRIBE on topic v1/topic/#
1562929130: Connecting bridge bridge-01 (mqtt.host.com:8883)
1562929135: Bridge xxxx sending CONNECT
...
1562929267: Socket error on client local.xxxx, disconnecting.
1562929273: Received PINGREQ from gateway.publisher
...
1562929328: Bridge local.xxxx doing local SUBSCRIBE on topic v1/topic/#
1562929328: Connecting bridge bridge-01 (mqtt.host.com:8883)
1562929333: Bridge xxxx sending CONNECT
...
1562929523: Bridge local.xxxx doing local SUBSCRIBE on topic v1/topic/#
1562929523: Connecting bridge bridge-01 (mqtt.host.com:8883)
1562929524: Bridge xxxx sending CONNECT
1562929525: Saving in-memory database to /mosquitto/data/mosquitto.db.
1562929529: Received CONNACK on connection local.xxxx.
1562929529: Bridge local.xxxx sending UNSUBSCRIBE (Mid: 11973, Topic: v1.parenttopic/xxxx/#)
1562929529: Sending PUBLISH to local.xxxx (d1, q2, r0, m11954, 'v1.parenttopic/xxxx/subtopic', ... (154 bytes))```
minjatJ commented 5 years ago

Hi there,

I am having the same issue with mosquitto running on 1.6.3 and bridge on 1.6.3 or 1.5.8 after a client lost connection 1564044750: Client dev-someid has exceeded timeout, disconnecting.

Logs from the broker are very interesting as they show that messages are actually received but no PUBREL is sent back! Which is the case in working conditions. After a restart of my mosquitto client PUBREL are sent again and everything comes back to normal.

Expected logs (From a working state)

1564051594: New connection from x.x.x.x on port 8883.
1564051594: New bridge connected from x.x.x.x as dev-someid (p2, c0, k180, u'bridge.host.com').
1564051594: Will message specified (1 bytes) (r1, q1).
1564051594:     $SYS/broker/connection/dev-someid/state
1564051594: Sending CONNACK to dev-someid (1, 0)
1564051594: Received PUBLISH from dev-someid (d0, q1, r1, m1, '$SYS/broker/connection/dev-someid/state', ... (1 bytes))
1564051594: Sending PUBACK to dev-someid (m1, rc0)
1564051594: Sending PUBLISH to auto-DA5F9AAD-5070-C704-6D22-89ECCBE38A0B (d0, q0, r0, m0, '$SYS/broker/connection/dev-someid/state', ... (1 bytes))
1564051594: Sending PUBREC to dev-someid (m1870, rc0)
1564051594: Received UNSUBSCRIBE from dev-someid
1564051594:     some/topic/dev-someid/#
1564051594: dev-someid some/topic/dev-someid/#
1564051594: Sending UNSUBACK to dev-someid
1564051594: Received PUBREL from dev-someid (Mid: 1870)
1564051594: Sending PUBCOMP to dev-someid (m1870)
1564051594: Sending PUBREC to dev-someid (m1871, rc0)

Observed logs (No PUBREC is send)

1564048435: New connection from x.x.x.x on port 8883.
1564048435: New bridge connected from x.x.x.x as dev-someid (p2, c0, k180, u'bridge.host.com').
1564048435: Will message specified (1 bytes) (r1, q1).
1564048435:     $SYS/broker/connection/dev-someid/state
1564048435: Sending CONNACK to dev-someid (1, 0)
1564048435: Received PUBLISH from dev-someid (d0, q1, r1, m1973, '$SYS/broker/connection/dev-someid/state', ... (1 bytes))
1564048435: Sending PUBACK to dev-someid (m1973, rc0)
1564048435: Sending PUBLISH to auto-DA5F9AAD-5070-C704-6D22-89ECCBE38A0B (d0, q0, r0, m0, '$SYS/broker/connection/dev-someid/state', ... (1 bytes))
1564048435: Received UNSUBSCRIBE from dev-someid
1564048435:     some/topic/dev-someid/#
1564048435: dev-someid some/topic/dev-someid/#
1564048435: Sending UNSUBACK to dev-someid
1564048435: Received PUBLISH from dev-someid (d1, q2, r0, m1870, 'some/topic/dev-someid/my/topic', ... (79 bytes))
1564048435: Received PUBLISH from dev-someid (d1, q2, r0, m1871, 'some/topic/dev-someid/another/topic', ... (153 bytes))
1564048442: Received PUBLISH from dev-someid (d0, q2, r0, m1977, 'some/topic/dev-someid/not/the/same/topic', ... (74 bytes))
ralight commented 5 years ago

Thanks for the second report @minjatJ, I haven't yet reproduced this. Are you also running on Docker?

ralight commented 5 years ago

Thanks for the different reports both of you, I'm struggling to reproduce this. Do you have any more hints on what you're seeing?

minjatJ commented 5 years ago

Hi @ralight thanks for looking into this. I am running on Docker as well. Unfortunately the issue is very intermittent (Seems to be depending a lot on when the connection is dropped and comes back). I can't manage to reproduce it consistently. I am still trying to figure out what is the exact scenario. I'll post updates if I have any through the day.

m0wlheld commented 3 years ago

I can reproduce it:

Having an existing and working bridge, when restarting the broker on "bim", connection from "denise" dies and is not able to recover without restarting the Mosquitto instance on "denise".

Logs from denise around the time of restart of "bim":

1613997839: Client local.denise closed its connection.
1613997845: Connecting bridge bim (bim.fritz.box:1883)
1613997845: Client local.denise disconnected: Broken pipe.
1613997851: Connecting bridge bim (bim.fritz.box:1883)
1613997917: Connecting bridge bim (bim.fritz.box:1883)
1613997983: Connecting bridge bim (bim.fritz.box:1883)
1613998049: Connecting bridge bim (bim.fritz.box:1883)
1613998115: Connecting bridge bim (bim.fritz.box:188

Logs from bim after having it restarted:

1613997849: mosquitto version 1.6.9 starting
1613997849: Config loaded from /etc/mosquitto/mosquitto.conf.
1613997849: Opening ipv4 listen socket on port 1883.
1613997849: Opening ipv6 listen socket on port 1883.
1613997851: New connection from 192.168.33.51 on port 1883.
1613997854: New connection from 192.168.33.31 on port 1883.
1613997911: Socket error on client <unknown>, disconnecting.
1613997917: New connection from 192.168.33.51 on port 1883.
1613997977: Socket error on client <unknown>, disconnecting.
1613997983: New connection from 192.168.33.51 on port 1883.
1613998043: Socket error on client <unknown>, disconnecting.
1613998049: New connection from 192.168.33.51 on port 1883.
1613998109: Socket error on client <unknown>, disconnecting.
1613998115: New connection from 192.168.33.51 on port 1883.
ralight commented 3 years ago

@m0wlheld I've just tried that with a fresh remote ubuntu 20.04 VM and can't reproduce it. Is there anything special in your configuration files I should know about?

m0wlheld commented 3 years ago

Nothing special (I think), I'll share.

/etc/mosquitto/mosquitto.conf on Ubuntu host "bim.fritz.box":

pid_file /var/run/mosquitto.pid
persistence true
persistence_location /var/lib/mosquitto/
log_dest file /var/log/mosquitto/mosquitto.log
include_dir /etc/mosquitto/conf.d

/etc/mosquitto/conf.d/auth.conf on Ubuntu host "bim.fritz.box":

allow_anonymous false
password_file /etc/mosquitto/passwd

/mosquitto/config/mosquitto.conf for container instance on Raspbian host "denise.fritz.box":

listener 1883 0.0.0.0
persistence false
allow_anonymous false
password_file /mosquitto/config/mosquitto.passwd
connection bim
address bim.fritz.box:1883
remote_clientid denise
remote_password ***************
remote_username denise
ralight commented 3 years ago

Thanks - I agree it doesn't look much special, but it's always nice to have exactly the same conditions to work with.

toschoch commented 3 years ago

I observe the same problem here: The setup is a bridge connection:

Procedure:

  1. start A, then start B -> everything works fine: logs A:
    1614955646: mosquitto version 2.0.7 starting
    1614955646: Config loaded from /mosquitto/config/mosquitto.conf.
    1614955646: Opening ipv4 listen socket on port 1883.
    1614955646: mosquitto version 2.0.7 running
    1614955677: New connection from 172.20.0.1:33862 on port 1883.
    1614955677: New client connected from 172.20.0.1:33862 as A-bridge (p5, c1, k60, u'A').

logs B:

1614956548: mosquitto version 2.0.7 starting
1614956548: Config loaded from /mosquitto/config/mosquitto.conf.
1614956548: Opening ipv4 listen socket on port 1883.
1614956548: Connecting bridge A-bridge (<ip-A>:1883)
  1. start B, then start A -> bridge won't connect: logs A:
    1614956924: mosquitto version 2.0.7 starting
    1614956924: Config loaded from /mosquitto/config/mosquitto.conf.
    1614956924: Opening ipv4 listen socket on port 1883.
    1614956924: mosquitto version 2.0.7 running
    1614956936: New connection from 172.20.0.1:33936 on port 1883.
    1614956996: Client <unknown> closed its connection.
    1614957017: New connection from 172.20.0.1:33940 on port 1883.
    1614957077: Client <unknown> closed its connection.

    logs B:

    1614956753: mosquitto version 2.0.7 starting
    1614956753: Config loaded from /mosquitto/config/mosquitto.conf.
    1614956753: Opening ipv4 listen socket on port 1883.
    1614956753: Connecting bridge A-bridge (<ip-A>:1883)
    1614956753: mosquitto version 2.0.7 running
    1614956754: Client local.A-bridge disconnected: Broken pipe.
    1614956774: Connecting bridge A-bridge (<ip-A>:1883)
    1614956855: Connecting bridge A-bridge (<ip-A>:1883)
    1614956936: Connecting bridge A-bridge (<ip-A>:1883)

A: mosquitto.conf:

log_dest stdout
log_type error
log_type warning
log_type notice
log_type information
log_type subscribe
log_type unsubscribe
log_type websockets

password_file /run/secrets/passwd

listener 1883 0.0.0.0

allow_anonymous false

B: mosquitto.conf

persistence true
persistence_location /mosquitto/data/

log_dest stdout
log_type error
log_type warning
log_type notice
log_type information
log_type subscribe
log_type unsubscribe

password_file /mosquitto/config/passwd

listener 1883 0.0.0.0
allow_anonymous true

connection A-bridge
bridge_protocol_version mqttv50
address <ip-A>:1883
topic # both
remote_username A
remote_password *******
remote_clientid A-bridge
restart_timeout 20 600

notifications true
cleansession true
start_type automatic
try_private true
perfeldbak commented 3 years ago

I see exactly same problem as Toschoch with Mosquitto 2.07 installed on Windows 10. Anybody know how to make a quick workaround?

mike-sirs commented 3 years ago

I see exactly same problem as Toschoch with Mosquitto 2.07 installed on Windows 10. Anybody know how to make a quick workaround?

Ubuntu 20.04 mosquitto version 2.0.8

I've found a workaround using docker-compose healthcheck.

a couple of steps

version: "3.8"
services:
  autoheal:
    image: "willfarrell/autoheal:latest"
    tty: true
    container_name: "autoheal"
    restart: always
    volumes:
      - "/var/run/docker.sock:/var/run/docker.sock"

  mosquitto:
    image: "eclipse-mosquitto"
    ports:
      - "1883:1883/tcp"
      - "9001:9001/tcp"
    volumes:
      - "/host/mosquitto/config:/mosquitto/config"
      - "/host/mosquitto/data:/mosquitto/data"
      - "/host/mosquitto/log:/mosquitto/log"
    labels:
      - "autoheal=true"
    healthcheck:
      test: ["CMD", "/mosquitto/config/hc.sh"]
      interval: 5s
      timeout: 10s
      retries: 3
    restart: unless-stopped

/mosquitto/config/hc.sh

#!/bin/sh

function_to_fork() {
  HC=$( mosquitto_sub -h 127.0.0.1 -p 1883 -t "test/topic_hc" -u hass -P "MeGaPa$$wD" -C 1 )
  if test "$HC" != 'alive'; then
    echo "> $HC result"
    exit 1
  fi
}

function_to_fork &
mosquitto_pub -h 10.0.0.1 -p 1883 -t 'test/topic_hc' -m 'alive' -u hass -P 'MeGaPa$$wD'
wait -n
peter-vanpoucke commented 3 years ago

I had this issue in 2.0.10, but in 2.0.11 it seems fine. (using docker)

vomov commented 2 years ago

I'm still having the issue on 2.0.14, where I've got three brokers. When restarting everything (docker compose), two brokers start up at nearly the same time, which means that one isn't ready when the other is trying to connect. The other then never tries again. A workaround shouldn't be needed, right?

Here's how I've set it up: Broker A has a bridge set up to broker B Broker B has a bridge set up to broker C

If broker A restarts, it'll reconnect to broker B. If broker B restarts, it'll reconnect to broker C, but broker A will state Client broker_b closed its connection., and it never tries again. The same goes for broker C restarting: broker B will never try again.

In all mosquitto.conf I have the lines:

keepalive_interval 60
start_type automatic
restart_timeout 10 180

Why does it not reconnect?

vomov commented 2 years ago

Just noticed something odd.

I actually have a fourth broker, which I don't really use. This broker is also connected as a bridge, and this one DOES reconnect no matter what. Brokers A, B, and C are all running as docker containers on RPi's (one on a RPi Zero W), and none of them attempt to reconnect. Broker D is running as a docker container on Debian.

Can this explain the issue?

michaelfanta commented 2 years ago

Hi,

I have a similar problem.

I created a Broker on a Raspberry PI and connected it to my main MQTT.

The device connects to the Raspberry Broker and works without problems.

However, the Broker disconnects from the main MQTT after some time and no longer reconnects.

When I restart it, it works again until it disconnects again.

Does anyone know why this is happening?

This is a very strange behavior, communication between MQTT servers works smoothly in both directions until disconnected.

17:55:07:: 1649699707: New bridge connected from 172.17.0.1:40846 as bridge01 (p2, c1, k15, u'hassio'). 17:55:43:: 1649699743: Client bridge01 has exceeded timeout, disconnecting.

Is it necessary to set an interval for the check or turn it off completely or something similar?

Thank you in advance for any answer, M.

vomov commented 2 years ago

@michaelfanta; the timeout should be caught by keepalive_interval 60, which you may need to reduce to 30 or something. However, the restart_timeout 10 180 should make it reconnect as well.

michaelfanta commented 2 years ago

I just tried that and it still disconnects from the main MQTT server. Here I attach my config file. No need to set protocol type or something? It's weird that it always disconnects after a while.

listener 1883 0.0.0.0

persistence true
persistence_location /mosquitto/data/

allow_anonymous false
password_file /mosquitto/data/passwd

log_dest file /mosquitto/log/mosquitto.log

connection bridge01
address xxx.xxx.xxx.xxx:1883
topic # both 0
cleansession false
remote_username test
remote_password test
remote_clientid bridge01
restart_timeout 10 180
keepalive_interval 30

notifications true
start_type automatic
try_private true

I'm testing it on a Raspberry. I have there connected to MQTT one SonOff MINI (tasmota).

If I change the states of the switch, the connection seems to be stable. However, when the status stops changing, the MQTT Bridge disconnects from the main MQTT.

Shouldn't the MQTT Bridge somehow send regular data to the main MQTT to avoid disconnection?

I also tried to set "start_type" to "lazy" but without effect.

EDIT I also watched that when "keepalive_interval" was set to "30", it disconnected after 30 seconds. Therefore, I set this interval to the maximum value of 65535 and has not disconnected yet.

I also watched that the date and time did not display correctly in the log file. The time is displayed as: Sun Jan 100 4174038:2123955838:00 1900

So I set the value "privileged" to "true" in "docker-compose.yml" and the time and date are displayed correctly. Do you think this could be related to a previous disconnection?

vomov commented 2 years ago

Privileged containers is almost never a good idea, especially not when used to circumvent another bug. For system time in the container, you can best do this: ` volumes:

However, the fact that it doesn't reconnect after (wrongly) losing the connection is the same bug I am experiencing.

michaelfanta commented 2 years ago

Thank you so much, this is how it works without problems :)

As for MQTT, he probably didn't mind that there was no communication. Now that sensor states are being sent, the MQTT does not disconnect.

But as you say, when a connection is lost, it doesn't reconnect, which is weird.

I'm trying to solve something similar with Tasmot FW. It connects to the MQTT and when the MQTT is Offline for some time, the Tasmota device no longer logs on. Each time a connection attempt fails, the time is shifted until it finally stops trying to connect, which is incorrect.

I'm still trying to find some information about why "restart_timeout" isn't responding .. I was hoping that it would be enough to set "start_type" to "lazy", but even that doesn't affect me.

I even tried to set "threshold = 1" to respond to each message, but it will never reconnect.

For now, I'm dealing with a Python-based script, when the mqtt connection to the broker is lost, the MQTT on RPI restarts to reconnect, but it's not really a permanent solution.

greg-mcnamara commented 2 years ago

I have the same issue with 2.0.11 on both bridge and broker (Docker). I notice that sometimes when the bridge starts up and tries to connect the broker, it doesn't receive the CONNACK (even though I can see the broker sending the CONNACK in its logs, another problem I'm looking into). After that the keep alive fires every 30 seconds and the ping requests work fine on both ends, but the bridge never publishes a message to the broker after that and never tries to reconnect. When it does connect ok everything works perfectly. It's almost like it's waiting on the CONNACK forever. How can I get the bridge to retry the connection when it doesn't receive a CONNACK?

greg-mcnamara commented 2 years ago

Reading the mosquitto conf manual page here I see with the restart_timeout setting the default is to initially retry after 5 seconds and then increase up to maximum retry period of 30 seconds. So in theory I should see a reconnect attempt every 30 seconds (without configuring the restart_timeout) but I don't. I do see a keep alive (which is successful) and also happens to be set to 30 seconds. Is it possible the successful keep alive overrides the reconnect attempt and leaves the bridge in a permanent "happily disconnected" state?

vomov commented 2 years ago

"Happily disconnected" does seem appropriate in this case, as the broker never appears to retry connecting. I'm starting to suspect there are multiple problems here:

Let's say it does reconnect after five seconds, and it is indeed due to the keep_alive; this would mean that the broker drops messages for five seconds every time it tries to keep the connection alive.

I experienced the problem when restarting brokers, where the order in which they are restarted is dependent on which broker initiates the connection, and then only on a RPi Zero W.

vomov commented 2 years ago

I've recently found this to be an issue with a friends' setup as well. He's also using a rPi Zero W which doesn't reconnect.

everactivetim commented 2 years ago

I believe we may have a similar issue happening but it's intermittent (meaning if we restart mosquitto it may or may not happen) but it does seem roughly 50% of the time it happens.

I've noticed that for us it appears as though this log statement is always present when it happens:

2022-08-17T21:40:34Z mosquitto[24606]: 1660772434: Expiring client brXXXXXXX due to timeout.
everactivetim commented 2 years ago

Confirming that 1.5.8 does not exhibit the issue (just tested 5 times and would have expected to see it once).

everactivetim commented 2 years ago

1.6.15 did not exhibit the issue either. Actually I'm not sure this is the same issue for us. I think this broke for us between 2.0.14 and 2.0.15 when this (guessing):

/* FIXME - we should expire clients here if they have exceeded their time */

was replaced by this:

session_expiry__add_from_persistence(context, chunk.F.session_expiry_time);

Sorry for the noise on this issue.

vomov commented 2 years ago

Reading the mosquitto conf manual page here I see with the restart_timeout setting the default is to initially retry after 5 seconds and then increase up to maximum retry period of 30 seconds. So in theory I should see a reconnect attempt every 30 seconds (without configuring the restart_timeout) but I don't. I do see a keep alive (which is successful) and also happens to be set to 30 seconds. Is it possible the successful keep alive overrides the reconnect attempt and leaves the bridge in a permanent "happily disconnected" state?

/* FIXME - we should expire clients here if they have exceeded their time */

was replaced by this:

session_expiry__add_from_persistence(context, chunk.F.session_expiry_time);

I'd say it's likely that the problem originates from something like this. Could we somehow get this issue 'bumped'?

svet-b commented 2 years ago

I'm not sure if this is related or not, but on 2.0.15, we're experiencing the same issue that @everactivetim described. We have a number of bridged Mosquitto brokers, and in many cases (maybe close to 50%) the message

Expiring client local.<bridge-name> due to timeout.

appears when Mosquitto is starting up. There is no actual connection timeout at play here - which happens within a second of service start. The bridge is then never actually created.

We're reverting to 2.0.14 to see if the issue is present there also; it may be a regression due to https://github.com/eclipse/mosquitto/commit/c99502a2567bf9147b3d811cb17e38ad51f08660 as @everactivetim suggested. The most relevant parts of the configuration are (probably)

# General
persistence true

# Bridge
cleansession false
start_type automatic
restart_timeout 10 120
vomov commented 2 years ago

This may be a different problem, just like the problem mentioned by @everactivetim , but they all appear related.

What device/architecture are you using? I'm able to reproduce the problem on a(ny) RPi Zero W, but not on any other device.

svet-b commented 2 years ago

Yes after further investigation I do think it's a separate issue, caused by a regression in 2.0.15. Reverting to 2.0.14 appears to fix the Expiring client due to timeout issue I was experiencing - I'll create a ticket for this. This is on Raspberry Pi 3B+ / armv7l architecture.

blusewang commented 11 months ago

5 years age! problem still here!

vomov commented 11 months ago

Should be fixed with 2.0.16, actually. This issue simply wasn't closed automatically.

thedmi commented 9 months ago

This issue is NOT fixed. We're seeing lost bridge connections on 2.0.18.

Expiring client cloud-bridging-client due to timeout.

mosquitto.conf:

connection cloud

address my-cloud-address:8883
bridge_cafile /certs/ca.pem

bridge_protocol_version mqttv50
notifications_local_only true

local_clientid cloud-bridging-client

remote_clientid the-client
remote_password the-password

topic some/topic both 1 cloud/ ""
tfazli commented 9 months ago

That's true! We use 2.0.18 & the issue is still present:

Expiring client local.*** due to timeout.

user root
persistence true
persistent_client_expiration 1d
persistence_location /mosquitto/data/
log_dest file /mosquitto/log/mosquitto.log

listener {LOCAL_PORT} {LOCAL_HOST}
allow_anonymous true

connection bridge-name
address {REMOTE_HOST}:{REMOTE_PORT}
bridge_protocol_version mqttv50
restart_timeout 1
remote_clientid other-name
berlincount commented 2 months ago

i'm running into this problem again and again as well. Restarting the broker is a bad workaround, but apparently the only one available.