chirpstack / chirpstack-gateway-bridge

ChirpStack Gateway Bridge abstracts Packet Forwarder protocols into Protobuf or JSON over MQTT.
https://www.chirpstack.io
MIT License

Deadlock upon reconnection #103

Closed lglenat closed 5 years ago

lglenat commented 5 years ago

Is this a bug or a feature request?

This is a bug in Eclipse Paho v1.1.1.

What happened?

There is a possibility of deadlock if the MQTT connection drops while a message is being published (Paho is writing to the socket and there is a write timeout).

The reason is as follows: the message token is created when Publish() is called, the message is being written to the socket, and the publishing goroutine is waiting in token.Wait().

When calling token.Wait(), the lora-gateway-bridge goroutine publishing the message acquires a lock on the message (the code resides in Paho's token.go file). If the MQTT connection drops, the Paho client waits for its workers (goroutines) to exit before any reconnection attempt; see client.go. For this to happen, the MQTT publish worker needs to [set the error](https://github.com/eclipse/paho.mqtt.golang/blob/cb7eb9363b4469c601b1a714447653b720e4e43a/net.go#L165) on the message's token, and for that it must acquire the lock. It cannot, because the publishing goroutine still holds it, which results in a deadlock. Note that this only happens if
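The locking pattern described above can be modeled in a few lines of Go. This is only a simplified sketch: `lockingToken`, `setError`, `workerStuck` and `demoGrace` are hypothetical names made up for the illustration, not code copied from Paho. It shows a waiter that holds a mutex while blocked, and a worker that needs the same mutex to complete the token.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// demoGrace is an arbitrary delay chosen for this demonstration.
const demoGrace = 50 * time.Millisecond

// lockingToken is a hypothetical, simplified token whose Wait holds a
// mutex for the entire time it blocks.
type lockingToken struct {
	mu       sync.Mutex
	complete chan struct{}
	err      error
}

// Wait blocks until the token completes, holding mu the whole time.
func (t *lockingToken) Wait() {
	t.mu.Lock()
	defer t.mu.Unlock()
	<-t.complete
}

// setError records err and completes the token; it needs mu to do so.
func (t *lockingToken) setError(err error) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.err = err
	close(t.complete)
}

// workerStuck reports whether a worker calling setError is still blocked
// after the grace period while a publisher sits in Wait.
func workerStuck(grace time.Duration) bool {
	tok := &lockingToken{complete: make(chan struct{})}

	go tok.Wait()     // publisher: takes the lock, then blocks on complete
	time.Sleep(grace) // give the publisher time to acquire the lock

	done := make(chan struct{})
	go func() {
		// worker: must take the lock before it can complete the token
		tok.setError(errors.New("connection lost"))
		close(done)
	}()

	select {
	case <-done:
		return false
	case <-time.After(grace):
		return true
	}
}

func main() {
	fmt.Println("worker stuck:", workerStuck(demoGrace))
}
```

Neither goroutine can make progress here: the waiter releases the lock only once the token completes, and only the blocked worker can complete it. The demo leaves both goroutines blocked and simply exits.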

This is due to a bug in Paho that was fixed in this commit. The problem is that this fix has not been released yet.

I tried using the master branch of Paho instead of v1.1.1 and it solved the issue. While waiting for the new Paho version to be released, the token.Wait() call could be replaced by token.WaitTimeout(), so that if the above scenario ever happens, lora-gateway-bridge releases the lock after a while and the Paho goroutine can terminate properly.

What version are you using?
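A minimal sketch of the WaitTimeout() workaround follows. It is not the actual lora-gateway-bridge patch. To keep it runnable without a broker, it declares a local `token` interface with the two methods it needs (Paho's `mqtt.Token` has methods with these names and signatures) and uses a stand-in `fakeToken`. The helper `waitWithTimeout`, the error `errWaitTimeout` and the value of `demoTimeout` are assumptions made for the illustration.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// token is a local interface with the two methods this sketch needs.
// Paho's mqtt.Token provides methods with the same names and signatures.
type token interface {
	WaitTimeout(time.Duration) bool
	Error() error
}

// errWaitTimeout is a hypothetical sentinel error for this sketch.
var errWaitTimeout = errors.New("timed out waiting for MQTT token")

// demoTimeout is an arbitrary value chosen for the demonstration.
const demoTimeout = 50 * time.Millisecond

// waitWithTimeout waits at most d for tok to complete. It returns
// errWaitTimeout if the token did not complete in time, and otherwise
// whatever error the token carries (nil on success).
func waitWithTimeout(tok token, d time.Duration) error {
	if !tok.WaitTimeout(d) {
		return errWaitTimeout
	}
	return tok.Error()
}

// fakeToken stands in for a real token so the sketch runs without a broker.
type fakeToken struct {
	done chan struct{}
	err  error
}

func (f *fakeToken) WaitTimeout(d time.Duration) bool {
	select {
	case <-f.done:
		return true
	case <-time.After(d):
		return false
	}
}

func (f *fakeToken) Error() error { return f.err }

// newStalledToken returns a token that never completes.
func newStalledToken() *fakeToken {
	return &fakeToken{done: make(chan struct{})}
}

// newFinishedToken returns a token that has already completed with err.
func newFinishedToken(err error) *fakeToken {
	f := &fakeToken{done: make(chan struct{}), err: err}
	close(f.done)
	return f
}

func main() {
	fmt.Println("stalled: ", waitWithTimeout(newStalledToken(), demoTimeout))
	fmt.Println("finished:", waitWithTimeout(newFinishedToken(nil), demoTimeout))
}
```

With the real client, the token returned by Publish(), Subscribe() or Unsubscribe() could be passed to such a helper in place of a bare token.Wait() call. The timeout value is a trade-off: too short and slow publishes are reported as failures, too long and a stalled reconnection takes that long to recover.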

2.6.2

jyhyvari commented 5 years ago

This seems to be reproducible, in some form, with Eclipse Paho 1.2 as well.

What happened?

A TCP break while a Subscribe is ongoing causes the onConnectionLost handler to be called, but the connection loop halts because the function that called Subscribe/Publish is waiting on the token. The same could basically happen with Unsubscribe.

Possible solution

The issue could be fixed with WaitTimeout. I'm not very familiar with the threading of the bridge or with Go itself, but this lightly tested solution seemed to work in this scenario. Please review.

What version are you using?

3.0.1

deadlock-patch.diff.zip