eclipse-mosquitto / mosquitto

Eclipse Mosquitto - An open source MQTT broker
https://mosquitto.org
Other
9.16k stars 2.41k forks source link

Upgrade from <= 2.0.15 to >= 2.0.16 breaks bridge first connection if persistence is enabled #3049

Open audricschiltknecht opened 7 months ago

audricschiltknecht commented 7 months ago

We struggled with a migration from 2.0.14 to 2.0.18 (version bump provided by our OS), where we would see mosquitto not connecting to the remote side of a bridge after the update. Restarting mosquitto would resolve the issue.

We managed to pinpoint it to a change between 2.0.15 and 2.0.16: if you have a bridge configured, persistence enabled and no clean session, then the first connection attempt after the upgrade will fail.

I've managed to produce the following repeatable scenario:

Using this configuration file:

# This instance is configured as a bridge to connect to another broker

listener 1883
allow_anonymous true
log_type all
log_dest stdout

# Enable persistence
persistence true
# persistence_location /var/lib/mosquitto/

# Bridge configuration
connection bridge
address test.mosquitto.org:1883
bridge_protocol_version mqttv50
remote_clientid RemoteClientId
try_private true

# Enable persistent session
cleansession false

# Configure notifications
notifications true
notifications_local_only false
notification_topic test-update-persistence/broker/state

## Topic mapping
topic # out 1 internal/ test-update-persistence/

start out with a 2.0.15 or below build:

1715028021: mosquitto version 2.0.14 starting
1715028021: Config loaded from mosquitto_bridge_minimal.conf.
1715028021: Opening ipv4 listen socket on port 1883.
1715028021: Opening ipv6 listen socket on port 1883.
1715028021: Bridge local.RemoteClientId doing local SUBSCRIBE on topic internal/#
1715028021: Connecting bridge (step 1) bridge (test.mosquitto.org:1883)
1715028021: mosquitto version 2.0.14 running
1715028021: Connecting bridge (step 2) bridge (test.mosquitto.org:1883)
1715028021: Bridge RemoteClientId sending CONNECT
1715028021: Received CONNACK on connection local.RemoteClientId.
1715028021: Bridge local.RemoteClientId sending UNSUBSCRIBE (Mid: 2, Topic: test-update-persistence/#)
1715028021: Received PUBACK from local.RemoteClientId (Mid: 1, RC:0)
1715028021: Received UNSUBACK from local.RemoteClientId
1715028081: Sending PINGREQ to local.RemoteClientId
1715028081: Received PINGRESP from local.RemoteClientId
1715028118: New connection from 127.0.0.1:48170 on port 1883.
1715028118: New client connected from 127.0.0.1:48170 as local_client (p2, c1, k60).
1715028118: No will message specified.
1715028118: Sending CONNACK to local_client (0, 0)
1715028118: Received PUBLISH from local_client (d0, q0, r0, m0, 'internal/test', ... (2 bytes))
1715028118: Sending PUBLISH to local.RemoteClientId (d0, q0, r0, m0, 'test-update-persistence/test', ... (2 bytes))
1715028118: Received DISCONNECT from local_client
1715028118: Client local_client disconnected.
1715028125: mosquitto version 2.0.14 terminating
1715028125: Saving in-memory database to mosquitto.db.

Process finished with exit code 0

Here everything works file, and the broker publishes its state to the remote bridge. We also have a local client (client id local_client in the log) that publishes a message that is received by a client subscribed to the bridge.

Then switch to a mosquitto build using any version >= 2.0.16:

1715028169: mosquitto version 2.0.18 starting
1715028169: Config loaded from mosquitto_bridge_minimal.conf.
1715028169: Opening ipv4 listen socket on port 1883.
1715028169: Opening ipv6 listen socket on port 1883.
1715028169: Bridge local.RemoteClientId doing local SUBSCRIBE on topic internal/#
1715028169: Connecting bridge (step 1) bridge (test.mosquitto.org:1883)
1715028169: mosquitto version 2.0.18 running
1715028170: Expiring client local.RemoteClientId due to timeout.

1715028182: New connection from 127.0.0.1:52110 on port 1883.
1715028182: New client connected from 127.0.0.1:52110 as local_client (p2, c1, k60).
1715028182: No will message specified.
1715028182: Sending CONNACK to local_client (0, 0)
1715028182: Received PUBLISH from local_client (d0, q0, r0, m0, 'internal/test', ... (2 bytes))
1715028182: Received DISCONNECT from local_client
1715028182: Client local_client disconnected.
1715028193: mosquitto version 2.0.18 terminating
1715028193: Saving in-memory database to mosquitto.db.

Process finished with exit code 0

Here mosquitto won't connect to the remote bridge and any message sent by the local client is not forwarded to the bridge side.

Restarting mosquitto makes everything go back to normal:

1715028202: mosquitto version 2.0.18 starting
1715028202: Config loaded from mosquitto_bridge_minimal.conf.
1715028202: Opening ipv4 listen socket on port 1883.
1715028202: Opening ipv6 listen socket on port 1883.
1715028202: Bridge local.RemoteClientId doing local SUBSCRIBE on topic internal/#
1715028202: Connecting bridge (step 1) bridge (test.mosquitto.org:1883)
1715028202: mosquitto version 2.0.18 running
1715028202: Connecting bridge (step 2) bridge (test.mosquitto.org:1883)
1715028202: Bridge RemoteClientId sending CONNECT
1715028203: Received CONNACK on connection local.RemoteClientId.
1715028203: Bridge local.RemoteClientId sending UNSUBSCRIBE (Mid: 2, Topic: test-update-persistence/#)
1715028203: Received PUBACK from local.RemoteClientId (Mid: 1, RC:0)
1715028203: Received UNSUBACK from local.RemoteClientId

The only differences between the 2 logs are the line Expiring client local.RemoteClientId due to timeout. when starting first after the update and the missing publication to the bridge.

I understand that this might not be considered as a bug but thought it would be helpful to some other users that are in the same scenario.

Daedaluz commented 7 months ago

I have a vague memory of a similar scenario a long time ago. https://github.com/eclipse/mosquitto/issues/2634

Relevant? Looks like there should be a fix in 2.0.16 2.0.17 and in 2.0.18 https://github.com/eclipse/mosquitto/commit/6d240a9d183413fca7d2a4f42eae803ba84af97c

audricschiltknecht commented 7 months ago

Yes, this looks like there are some similarities. #2634 affected us as well and was one of the reasons we were stuck on 2.0.14, and were waiting for a > 2.0.15 version to be available on our system.

However, I think #2434 was more systematic, while this issue really only occurs the first connection after an upgrade.