khlystov closed this issue 1 year ago
I've closed off a previous issue (#536) that has the same cause.
As per the comments on that issue, it's something I was aware of (there are a few other similar issues): reusing the client after calling Disconnect is not fully threadsafe. For example, if a connection attempt is in progress, Disconnect may return before the connection function exits, leading to this panic if you then call Connect. However, as I currently have limited time and cannot see a use-case for this (the built-in reconnection functionality should handle reconnections reliably for you), it's not something I will be looking at in the short term (and, alternatively, you can just open a new Client instance). For the time being I have added a warning to the Disconnect function (and the README) so that users are alerted to the issue (and their options).
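For anyone hitting this, a minimal sketch of the "new Client instance" option mentioned above (the broker URL, client ID, and the newSession helper are placeholders, not part of the library):

```go
package main

import (
	mqtt "github.com/eclipse/paho.mqtt.golang"
)

// newSession creates a fresh client for each connect/disconnect cycle instead
// of reusing one client across Disconnect/Connect, which is the pattern that
// triggers the problem described above.
func newSession(broker, clientID string) (mqtt.Client, error) {
	opts := mqtt.NewClientOptions().
		AddBroker(broker). // e.g. "tcp://localhost:1883" (placeholder)
		SetClientID(clientID).
		SetAutoReconnect(true) // let the library handle reconnections itself
	c := mqtt.NewClient(opts)
	if token := c.Connect(); token.Wait() && token.Error() != nil {
		return nil, token.Error()
	}
	return c, nil
}

func main() {
	c, err := newSession("tcp://localhost:1883", "example-client")
	if err != nil {
		panic(err)
	}
	// ... use the client ...
	c.Disconnect(250) // wait up to 250 ms for outstanding work
	// If another session is needed later, call newSession again rather than
	// calling Connect on the old client.
}
```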
Note: the above makes some assumptions about how you are using the library (calling Disconnect and then Connect on the same client instance). If that is not the case then please provide some sample code showing what you are doing (and the options in use, etc.).
Happy to accept a pull request fixing this. However, I don't think it will be a simple fix; due to the way this library has evolved it's difficult to maintain thread safety (it's really easy to introduce unintended deadlocks), and the way client.status is used is somewhat problematic (for historical reasons).
Matt
It seems this can happen without incorrect library usage if there is a reconnection. Specifically, I call Disconnect() while, behind the scenes, a reconnection is happening.
```
WARNING: DATA RACE
Read at 0x00c000142680 by goroutine 9:
  github.com/eclipse/paho%2emqtt%2egolang.(*client).Disconnect()
      /go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.3.5/client.go:446 +0x2c4
  work/tc.(*Client).Close()
      /home/ptsneves-agent-4_2mywork/client.go:147 +0x230
  work/.(*MQTT).run·dwrap·7()
      /home/ptsneves-agent-4_2mywork/mqtt.go:207 +0x39
  work/.(*MQTT).run()
      /home/ptsneves-agent-4_2mywork/mqtt.go:220 +0x69f
  work/.NewMQTT·dwrap·2()
      /home/ptsneves-agent-4_2mywork/mqtt.go:92 +0x71

Previous write at 0x00c000142680 by goroutine 95:
  github.com/eclipse/paho%2emqtt%2egolang.(*client).startCommsWorkers()
      /go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.3.5/client.go:596 +0xb4c
  github.com/eclipse/paho%2emqtt%2egolang.(*client).reconnect()
      /go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.3.5/client.go:352 +0x586
  github.com/eclipse/paho%2emqtt%2egolang.(*client).internalConnLost.func1·dwrap·5()
      /go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.3.5/client.go:510 +0x39
```
OK, I had a look at the code and it is very tricky. We have a mutex (connMu) protecting the concurrent write c.commsStopped = make(chan struct{}), but we do nothing about read operations like close(c.commsStopped) and <-c.commsStopped.
I believe the issue is that we use close(c.commsStopped) to unblock both stopCommsWorkers and Disconnect. Unfortunately, this trick then forces us to create a new channel and write it to a member of a shared resource. I think that c.commsStopped needs to be a more complex object than just a channel used creatively, because a closed channel cannot be reopened, only replaced. Did I understand correctly?
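To illustrate what such a "more complex object" could look like in general (a sketch of the pattern only, not a proposed patch to the library), the replaceable channel can be hidden behind a small type whose reads and writes all go through a mutex:

```go
package main

import (
	"fmt"
	"sync"
)

// stopSignal is a re-armable "comms stopped" signal. Every access to the
// underlying channel goes through the mutex, so replacing the channel in
// Rearm can never race with Done or Stop.
type stopSignal struct {
	mu sync.Mutex
	ch chan struct{}
}

func newStopSignal() *stopSignal {
	return &stopSignal{ch: make(chan struct{})}
}

// Done returns the channel that will be closed when comms stop.
func (s *stopSignal) Done() <-chan struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.ch
}

// Stop closes the current channel (idempotence checks omitted for brevity).
func (s *stopSignal) Stop() {
	s.mu.Lock()
	defer s.mu.Unlock()
	close(s.ch)
}

// Rearm installs a fresh channel for the next connection attempt.
func (s *stopSignal) Rearm() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.ch = make(chan struct{})
}

func main() {
	s := newStopSignal()
	s.Stop()   // comms stopped
	<-s.Done() // returns immediately: the current channel is closed
	s.Rearm()  // safe to re-arm; callers of Done after this get the new channel
	fmt.Println("re-armed")
}
```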
You are right, the code is quite tricky (which is why I have been avoiding making changes: it's very easy to break something, and addressing this properly will not be a quick job). A lot of this is due to the way this library has evolved and attempting to maintain backwards compatibility. I don't use Disconnect personally (all of my applications connect at startup and retain the connection from there), so this is not an issue I face.
I think that your analysis is correct. In Disconnect we have:

```go
case <-c.commsStopped:
	WARN.Println("Disconnect packet could not be sent because comms stopped")
```
Should a reconnect be in progress when Disconnect is called, then there is a race with this line in startCommsWorkers:

```go
c.commsStopped = make(chan struct{})
```
Note that there is a check in place in reconnect() that will catch many situations that could otherwise lead to this issue (note that Disconnect sets the status to disconnected at the top):
```go
// Disconnect() must have been called while we were trying to reconnect.
if c.connectionStatus() == disconnected {
	if conn != nil {
		conn.Close()
	}
	DEBUG.Println(CLI, "Client moved to disconnected state while reconnecting, abandoning reconnect")
	return
}
```
So the situation you are encountering should be a rare occurrence (unless I'm missing something). I think the following would need to happen:

1. reconnect() is in progress and past the check c.connectionStatus() == disconnected.
2. Disconnect is called and proceeds to the select.
3. reconnect() has called startCommsWorkers, which sets c.commsStopped (race!).

I implemented the c.commsStopped mechanism to work around other deadlocks but missed this potential race (I obviously put some checks in place, but not enough!).
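The essence of that interleaving can be reduced to two goroutines touching the same channel field with no synchronisation: one replacing it (the reconnect path) and one receiving from it (the Disconnect path). A standalone sketch, detectable with go run -race (the names below are stand-ins, not the library's code):

```go
package main

import "time"

// fakeClient stands in for the real client; commsStopped mirrors the field
// that is written in startCommsWorkers and read in Disconnect.
type fakeClient struct {
	commsStopped chan struct{}
}

func main() {
	c := &fakeClient{commsStopped: make(chan struct{})}

	// Stand-in for reconnect() -> startCommsWorkers(): replaces the channel.
	go func() {
		c.commsStopped = make(chan struct{}) // unsynchronised write
	}()

	// Stand-in for Disconnect(): reads the same field in a select.
	go func() {
		select {
		case <-c.commsStopped: // unsynchronised read -> data race
		case <-time.After(10 * time.Millisecond):
		}
	}()

	time.Sleep(100 * time.Millisecond) // give the race detector time to see both accesses
}
```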
I guess one solution would be to remove the check of c.commsStopped if c.options.AutoReconnect is true. That way the code would send the packet if the connection was up, and otherwise would time out and close the network link. This is not ideal, but given how rare this set of circumstances is, it may be the best option?
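As a rough sketch of that idea (all of the names below are illustrative stand-ins, not the library's internals): only watch the stop channel when auto-reconnect is off, and otherwise just try to send the packet and fall back to the quiesce timeout:

```go
package main

import (
	"log"
	"time"
)

// disconnectSketch illustrates the suggested behaviour only; autoReconnect,
// stopCh, quiesce and sendDisconnect are hypothetical stand-ins.
func disconnectSketch(autoReconnect bool, stopCh <-chan struct{}, quiesce time.Duration, sendDisconnect func() error) {
	if !autoReconnect {
		// With auto-reconnect off, no reconnect goroutine can replace the
		// channel, so it is safe to check whether comms have stopped.
		select {
		case <-stopCh:
			log.Println("comms stopped; disconnect packet not sent")
			return
		default:
		}
	}
	// Otherwise attempt to send the packet and rely on the quiesce timeout.
	done := make(chan error, 1)
	go func() { done <- sendDisconnect() }()
	select {
	case err := <-done:
		if err != nil {
			log.Println("disconnect packet send failed:", err)
		}
	case <-time.After(quiesce):
		log.Println("disconnect packet not sent due to timeout")
	}
}

func main() {
	stop := make(chan struct{})
	disconnectSketch(true, stop, 250*time.Millisecond, func() error { return nil })
}
```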
Ideally the code handling status should be totally rewritten, but this is a big job and it would be really easy to introduce new deadlocks...
Thank you for your fast and thorough answer
> This is not ideal, but given how rare this set of circumstances is, it may be the best option?

We will try to measure this and, if possible, proceed with it.
Would you be open to accepting refactoring PRs of this code so it becomes easier to reason about? I found quite a lot of redundant code that makes certain races and execution flows less evident. I would then hope to submit a reduction of the concurrent code interaction.
@ptsneves thanks for the offer; I'll happily accept refactoring PRs that improve readability/resilience so long as they don't break existing usages. Note that it may take me a while to accept such PRs (review/testing often takes longer than implementation).
The error is always reproducible by running go test -race -run Test_autoreconnect -v.
@ptsneves let me know if you need some help with refactoring
After understanding the code better, and specifically the Disconnect() method, I find that just removing the following lines fixes the issue without any real adverse effects:
```go
case <-c.commsStopped:
	WARN.Println("Disconnect packet could not be sent because comms stopped")
```
The only downside is that if the comms have stopped we will not know it, and Disconnect will take at most the quiesce time. As the quiesce time is an argument supplied by the caller, the caller can just set it to 0 if they want. Otherwise, that line just waits and prints a warning. Would you accept a pull request to remove it?
Do you still want the check to happen if AutoReconnect is false?
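For reference on the quiesce point above: the argument to Disconnect is the number of milliseconds to wait for outstanding work, so a caller that does not care about the lost warning can simply pass 0. A trivial sketch (closeClient and the mqttutil package are made-up names; Disconnect itself is the library's real API):

```go
package mqttutil

import mqtt "github.com/eclipse/paho.mqtt.golang"

// closeClient is a hypothetical helper: Disconnect(0) returns without waiting
// for outstanding work, while Disconnect(250) waits up to 250 ms for it.
func closeClient(c mqtt.Client, waitForWork bool) {
	if waitForWork {
		c.Disconnect(250)
		return
	}
	c.Disconnect(0)
}
```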
I will, with reservations, accept a pull request that comments this out (with a comment noting that it results in a data race when auto-reconnect is enabled).
At some point we will need to fix the way status is handled, and at that point this warning (and skipping the delay) should be reintroduced. Being able to tell whether the disconnect packet was sent may be important in some use-cases (if the broker receives it, it will discard any Will Message for the current session).
@MattBrittan Thank you. Honestly, I have a big refactoring and fat-cutting patch privately, but this issue does not seem solvable without ugly synchronization primitives for the commsStopped channel.
The data race seems to happen in the client's OnConnect callback. I stumbled upon this issue when debugging emitter-io/go, and the data race only happens when I use OnConnect.
@bokunodev If setting an OnConnect handler leads to a data race then the issue is probably with the handler you are setting (the only impact of setting the OnConnect handler is that the handler gets called, in a goroutine, when the connection is up).
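For example (a minimal sketch; the broker address, client ID, and topic are placeholders), an OnConnect handler is set via the client options and runs in its own goroutine whenever the connection comes up, so anything it shares with the rest of the program needs its own synchronisation:

```go
package main

import (
	"log"
	"sync"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	var mu sync.Mutex
	connects := 0 // shared state that the handler goroutine touches

	opts := mqtt.NewClientOptions().
		AddBroker("tcp://localhost:1883"). // placeholder broker
		SetClientID("onconnect-example").
		SetAutoReconnect(true).
		SetOnConnectHandler(func(c mqtt.Client) {
			// Called in a goroutine each time the connection comes up, so
			// protect anything shared with other goroutines.
			mu.Lock()
			connects++
			mu.Unlock()
			// Re-subscribing here is a common pattern after a reconnect.
			if token := c.Subscribe("example/topic", 0, nil); token.Wait() && token.Error() != nil {
				log.Println("subscribe failed:", token.Error())
			}
		})

	client := mqtt.NewClient(opts)
	if token := client.Connect(); token.Wait() && token.Error() != nil {
		log.Fatal(token.Error())
	}
	// ... run the application ...
	client.Disconnect(250)
}
```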
Note that I have rewritten the status handling code recently (it's in master but not yet included in a release) and this should significantly reduce the chance of there being a data race within the library (but you can still easily cause one in your own code).
I've written a low-level implementation without use of the Go runtime (no mutexes, channels, allocations, or interface casting/asserting) at natiu-mqtt. Hopefully it can inspire a future v2 for this library?
@soypat - see the v5 client for the effective v2 of this client. Currently that only supports v5, but the reality is that most brokers support v5 (and a v3 implementation in the same form would be relatively simple). I think there is room for a range of approaches (a lot of users want something that handles all of the work for them, i.e. auto-reconnection etc., whereas others want full control over memory usage and maximum efficiency), and it's great to see another Go option!
I believe that this particular issue is probably fixed in the latest release (which significantly changes the way the connection state is managed). I don't expect there to be major changes to this library going forward (once persistence is implemented in the V5 client I'll probably move my systems over to that).
> a lot of users want something that handles all of the work for them, i.e. auto-reconnection etc., whereas others want full control over memory usage and maximum efficiency
My take on this is that I want both! When I started using this library last week I was frustrated with the bugs it had. When I tried to dive deep into the code to see what the problem was, I was bewildered by how tightly coupled everything was. There was really no way of navigating the project in a reasonable time.
So I built my own! Natiu as it stands is low level, but that's fine: anyone can build an extension on top of it to add, for example, a websocket implementation with auto-reconnect features and all the bells and whistles. This way it is easier to maintain the original implementation, as it's simple and lightweight. This in turn makes it easy for third parties to implement extensions for it; it's a win-win!
There are other successful projects that follow this methodology, like goldmark; by doing so you really cut down on the cost of maintenance. Goldmark today has 180 closed issues and 1 open issue. I suggest you try it out in one of your next projects! :smile:
I'm going to close this off, as the largish change to status handling should have resolved the issue (and a few similar problems) and this issue has been quiet for a while. If anyone does come across a similar problem then please raise a new issue (there have been significant changes to the code, so comments in this issue are no longer relevant).
Version: v1.3.5