chirpstack / chirpstack-gateway-bridge

ChirpStack Gateway Bridge abstracts Packet Forwarder protocols into Protobuf or JSON over MQTT.
https://www.chirpstack.io
MIT License
423 stars 272 forks source link

3.14.3 Gets into a mqtt reconnection loop #226

Closed asanchezdelc closed 1 year ago

asanchezdelc commented 1 year ago

What happened?

The bridge enters into a reconnection loop, were it tries to connect multiple times in parallel, booting off already established connection in a goroutine. We are using Azure IoT Hub as backend and we never encounter this issue before. It seems that this might have been introduced in 3.14.3+.

The issue is resolved after restarting the chirpstack gateway bridge process.

What did you expect?

We expect the bridge to reconnect without any issues.

Steps to reproduce this issue

This is a rare one, we haven't been able to reproduce it. However, we noticed the same behavior across 3 different gateway manufacturers: Multitech, Tektelic, and Browan.

Could you share your log output?

time="2023-04-01T19:31:47.983487664Z" level=info msg="integration/mqtt: connected to mqtt broker"
 time="2023-04-01T19:31:47.984281264Z" level=error msg="[client]   Connect comms goroutine - error triggered EOF" module=mqtt
 time="2023-04-01T19:31:48.002510675Z" level=error msg="mqtt: connection error" error=EOF
 time="2023-04-01T19:31:48.648273374Z" level=warning msg="[store]    memorystore del: message 1382 not found" module=mqtt
 time="2023-04-01T19:31:48.754299477Z" level=info msg="integration/mqtt: subscribing to topic" qos=0 topic="devices/MASKED_GWY_14829/messages/devicebound/#"
 time="2023-04-01T19:31:49.429542874Z" level=warning msg="[store]    memorystore del: message 1383 not found" module=mqtt
 time="2023-04-01T19:31:52.142243147Z" level=info msg="integration/mqtt: connected to mqtt broker"
 time="2023-04-01T19:31:52.145047349Z" level=error msg="[client]   Connect comms goroutine - error triggered EOF" module=mqtt
 time="2023-04-01T19:31:52.154784595Z" level=info msg="integration/mqtt: subscribing to topic" qos=0 topic="devices/MASKED_GWY_14829/messages/devicebound/#"
 time="2023-04-01T19:31:52.161477279Z" level=error msg="mqtt: connection error" error=EOF
 time="2023-04-01T19:31:56.302178589Z" level=info msg="integration/mqtt: connected to mqtt broker"
 time="2023-04-01T19:31:56.304553271Z" level=error msg="integration/mqtt: subscribe gateway error" error="subscribe topic error: connection lost before Subscribe completed" gateway_id=MASKED_GWY_14829
 time="2023-04-01T19:31:56.307347433Z" level=error msg="[client]   Connect comms goroutine - error triggered EOF" module=mqtt
 time="2023-04-01T19:31:56.321877041Z" level=error msg="mqtt: connection error" error=EOF
 time="2023-04-01T19:31:56.416073216Z" level=info msg="integration/mqtt: subscribing to topic" qos=0 topic="devices/MASKED_GWY_14829/messages/devicebound/#"
 time="2023-04-01T19:31:56.977666026Z" level=warning msg="[store]    memorystore del: message 1384 not found" module=mqtt
 time="2023-04-01T19:31:56.982409069Z" level=error msg="A lookup for token 1384 returned nil\n" module=mqtt
 time="2023-04-01T19:31:59.504558228Z" level=warning msg="[store]    memorystore del: message 1385 not found" module=mqtt
 time="2023-04-01T19:32:00.179837224Z" level=info msg="integration/mqtt: publishing event" event=stats qos=0 stats_id=bfa8026a-36d9-4a6c-b8cb-688444bcba10 topic=devices/MASKED_GWY_14829/messages/events/stats
 time="2023-04-01T19:32:00.238854418Z" level=info msg="integration/mqtt: connected to mqtt broker"
 time="2023-04-01T19:32:00.24185854Z" level=error msg="[client]   Connect comms goroutine - error triggered EOF" module=mqtt
 time="2023-04-01T19:32:00.250236665Z" level=error msg="mqtt: connection error" error=EOF
 time="2023-04-01T19:32:00.315116423Z" level=info msg="integration/mqtt: subscribing to topic" qos=0 topic="devices/MASKED_GWY_14829/messages/devicebound/#"
brocaar commented 1 year ago

I'm not sure what caused this, but connection lost before Subscribe completed indicates that there was a connection issue while trying to subscribe to the MQTT topic. Also there is Connect comms goroutine - error triggered EOF. It looks like this was either an issue at the MQTT client (library) side, or the server is terminating the connection (causing the EOF).

I have just merged a PR which updates the Paho client library with some bugfixes. Maybe this fixes this issue too?

Btw, is this an unmodified version of the ChirpStack Gateway Bridge? Looking at the topic names devices/MASKED_GWY_14829/messages/devicebound/#, it looks like you have been making some modifications.

jburhenn commented 1 year ago

We suspect the issue might be related to the fix for the timeout here: https://github.com/chirpstack/chirpstack-gateway-bridge/commit/62c731607d034d57f24a86063caca5adb566aff9

If the connection is bad and the default 5 second timeout elapses in connect() the connectLoop() will try to reconnect. But if the connection still occurs after 5 seconds (because Paho has a 30 second timeout) it can end up with two clients dueling over the connection with the server constantly shutting them down as they each try to auto-reconnect while the other is already connected.

brocaar commented 1 year ago

Ah, that makes sense. Would it make sense to increase the token timeout in that case? Maybe to double the Paho timeout? The token timeout is a fallback to make sure things do not get stuck forever. Would you like to crate a PR for this fix?

jburhenn commented 1 year ago

Yeah, we are currently trying some tests with increasing the token wait in the toml file. I can make a PR with the default max_token_wait set to 60s if you'd like. Should I base it off master? We are currently still using the v3 API, would this change also make it into a v3 release? If so, would I need to make a PR based off a different branch than master for that?

I could also try disconnecting b.conn in connect() if it's not nil if you think that might be worth adding but not sure if that could cause other issues.

asanchezdelc commented 1 year ago

devices/MASKED_GWY_14829/messages/devicebound/#

No, I manually masked the gateway ID for privacy reasons.

philoliver commented 1 year ago

We are facing the same issue. One temporary solutions for us seems to be to set: terminate_on_connect_error = true When the issue occurs, the gateway bridge terminates. Monit then restarts it and the connection gets established well.

Is this a reasonable approach? Or can you @brocaar tell us if you will merge the proposed fix anytime soon? Do you think this will be fixed v4 Chirpstack gateway bridge? Thanks!

jxc876 commented 1 year ago

We are running 5,000 Tektelic Gateways in production and noticed the same issue occurring on about ~50. We are using EQMX as our message broker.

We noticed because we monitor the gateway CPU usage and noticed that the affected ones spiked to 90%.

We are on the latest version of the bridge client (3.14.6).

Our logs are similar to what was posted above:

# note: I masked our gateway (647fdaxxxxxxxxxx) and IP addresses

level=info msg="integration/mqtt: connected to mqtt broker"
level=info msg="integration/mqtt: subscribing to topic" qos=2 topic="gateway/647fdaxxxxxxxxxx/command/#"
level=error msg="[client]   Connect comms goroutine - error triggered EOF" module=mqtt
level=error msg="mqtt: connection error" error=EOF
level=error msg="integration/mqtt: subscribe gateway error" error="subscribe topic error: connection lost before Subscribe completed" gateway_id=647fdaxxxxxxxxxx
level=error msg="[client]   Connect comms goroutine - error triggered EOF" module=mqtt
level=info msg="integration/mqtt: subscribing to topic" qos=2 topic="gateway/647fdaxxxxxxxxxx/command/#"
level=info msg="integration/mqtt: connected to mqtt broker"
level=error msg="mqtt: connection error" error=EOF
level=warning msg="[store]    memorystore del: message 17666 not found" module=mqtt
level=info msg="integration/mqtt: publishing state" gateway_id=647fdaxxxxxxxxxx qos=2 state=conn topic=gateway/647fdaxxxxxxxxxx/state/conn
level=error msg="[client]   Connect comms goroutine - error triggered read tcp xx.xxx.xx.88:36623->xx.xx.xx.156:1883: read: connection reset by peer" module=mqtt

# ---

level=info msg="integration/mqtt: connected to mqtt broker"
# <repeats>
# ...
brocaar commented 1 year ago

The above fix #229 has been merged and I'll create a new v3 + v4 releases.