khlystov closed this issue 1 year ago
I've closed off a previous issue (#536) that has the same cause.
As per the comments on that issue, it's something I was aware of (there are a few other similar issues): reusing the client after calling Disconnect is not fully threadsafe. For example, if a connection attempt is in progress, Disconnect may return before the connection function exits, leading to this panic if you then call Connect. However, as I currently have limited time and cannot see a use-case for this (the built-in reconnection functionality should handle reconnections reliably for you), it's not something I will be looking at in the short term (and, alternatively, you can just open a new Client instance). For the time being I have added a warning to the Disconnect function (and the README) so that users are alerted to the issue (and their options).
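For anyone hitting this, a minimal sketch of the "new Client instance" option mentioned above (the broker URL, client ID, and the newSession helper are placeholders, not part of the library):

```go
package main

import (
	mqtt "github.com/eclipse/paho.mqtt.golang"
)

// newSession creates a fresh client for each connect/disconnect cycle instead
// of reusing one client across Disconnect/Connect, which is the pattern that
// triggers the problem described above.
func newSession(broker, clientID string) (mqtt.Client, error) {
	opts := mqtt.NewClientOptions().
		AddBroker(broker). // e.g. "tcp://localhost:1883" (placeholder)
		SetClientID(clientID).
		SetAutoReconnect(true) // let the library handle reconnections itself
	c := mqtt.NewClient(opts)
	if token := c.Connect(); token.Wait() && token.Error() != nil {
		return nil, token.Error()
	}
	return c, nil
}

func main() {
	c, err := newSession("tcp://localhost:1883", "example-client")
	if err != nil {
		panic(err)
	}
	// ... use the client ...
	c.Disconnect(250) // wait up to 250 ms for outstanding work
	// If another session is needed later, call newSession again rather than
	// calling Connect on the old client.
}
```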
Note: the above makes some assumptions about how you are using the library (calling Disconnect and then Connect on the same client instance). If that is not the case then please provide some sample code showing what you are doing (and the options in use, etc.).
Happy to accept a pull request fixing this. However, I don't think it will be a simple fix; due to the way this library has evolved it's difficult to maintain thread safety (it's really easy to introduce unintended deadlocks), and the way client.status is used is somewhat problematic (for historical reasons).
Matt
It seems this can happen without incorrect library usage if there is a reconnection. Specifically, I call Disconnect() while, behind the scenes, a reconnection is happening.
```
WARNING: DATA RACE
Read at 0x00c000142680 by goroutine 9:
  github.com/eclipse/paho%2emqtt%2egolang.(*client).Disconnect()
      /go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.3.5/client.go:446 +0x2c4
  work/tc.(*Client).Close()
      /home/ptsneves-agent-4_2mywork/client.go:147 +0x230
  work/.(*MQTT).run·dwrap·7()
      /home/ptsneves-agent-4_2mywork/mqtt.go:207 +0x39
  work/.(*MQTT).run()
      /home/ptsneves-agent-4_2mywork/mqtt.go:220 +0x69f
  work/.NewMQTT·dwrap·2()
      /home/ptsneves-agent-4_2mywork/mqtt.go:92 +0x71

Previous write at 0x00c000142680 by goroutine 95:
  github.com/eclipse/paho%2emqtt%2egolang.(*client).startCommsWorkers()
      /go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.3.5/client.go:596 +0xb4c
  github.com/eclipse/paho%2emqtt%2egolang.(*client).reconnect()
      /go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.3.5/client.go:352 +0x586
  github.com/eclipse/paho%2emqtt%2egolang.(*client).internalConnLost.func1·dwrap·5()
      /go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.3.5/client.go:510 +0x39
```
OK, I had a look at the code and it is very tricky. We have a mutex (connMu) protecting the concurrent write c.commsStopped = make(chan struct{}), but we do nothing about read operations like close(c.commsStopped) and <-c.commsStopped.
I believe the issue is that we use close(c.commsStopped) to unblock both stopCommsWorkers and Disconnect. Unfortunately, this trick then forces us to create a new channel and write it to a member of a shared resource. I think that c.commsStopped needs to be a more complex object than just a channel used creatively, because a closed channel cannot be reopened, only replaced. Did I understand correctly?
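To illustrate what such a "more complex object" could look like in general (a sketch of the pattern only, not a proposed patch to the library), the replaceable channel can be hidden behind a small type whose reads and writes all go through a mutex:

```go
package main

import (
	"fmt"
	"sync"
)

// stopSignal is a re-armable "comms stopped" signal. Every access to the
// underlying channel goes through the mutex, so replacing the channel in
// Rearm can never race with Done or Stop.
type stopSignal struct {
	mu sync.Mutex
	ch chan struct{}
}

func newStopSignal() *stopSignal {
	return &stopSignal{ch: make(chan struct{})}
}

// Done returns the channel that will be closed when comms stop.
func (s *stopSignal) Done() <-chan struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.ch
}

// Stop closes the current channel (idempotence checks omitted for brevity).
func (s *stopSignal) Stop() {
	s.mu.Lock()
	defer s.mu.Unlock()
	close(s.ch)
}

// Rearm installs a fresh channel for the next connection attempt.
func (s *stopSignal) Rearm() {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.ch = make(chan struct{})
}

func main() {
	s := newStopSignal()
	s.Stop()   // comms stopped
	<-s.Done() // returns immediately: the current channel is closed
	s.Rearm()  // safe to re-arm; callers of Done after this get the new channel
	fmt.Println("re-armed")
}
```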
You are right, the code is quite tricky (which is why I have been avoiding making changes: it's very easy to break something, and addressing this properly will not be a quick job). A lot of this is due to the way this library has evolved and attempting to maintain backwards compatibility. I don't use Disconnect personally (all of my applications connect at startup and retain the connection from there), so this is not an issue I face.
I think that your analysis is correct. In Disconnect we have:

```go
case <-c.commsStopped:
	WARN.Println("Disconnect packet could not be sent because comms stopped")
```
Should a reconnect be in progress when Disconnect is called, then there is a race with this line in startCommsWorkers:

```go
c.commsStopped = make(chan struct{})
```
Note that there is a check in place in reconnect() that will catch many situations that could otherwise lead to this issue (note that Disconnect sets the status to disconnected at the top):
```go
// Disconnect() must have been called while we were trying to reconnect.
if c.connectionStatus() == disconnected {
	if conn != nil {
		conn.Close()
	}
	DEBUG.Println(CLI, "Client moved to disconnected state while reconnecting, abandoning reconnect")
	return
}
```
So the situation you are encountering should be a rare occurrence (unless I'm missing something). I think the following would need to happen:

1. reconnect() is in progress and past the check c.connectionStatus() == disconnected.
2. Disconnect is called and proceeds to the select.
3. reconnect() has called startCommsWorkers, which sets c.commsStopped (race!).

I implemented the c.commsStopped mechanism to work around other deadlocks but missed this potential race (I obviously put some checks in place, but not enough!).
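The essence of that interleaving can be reduced to two goroutines touching the same channel field with no synchronisation: one replacing it (the reconnect path) and one receiving from it (the Disconnect path). A standalone sketch, detectable with go run -race (the names below are stand-ins, not the library's code):

```go
package main

import "time"

// fakeClient stands in for the real client; commsStopped mirrors the field
// that is written in startCommsWorkers and read in Disconnect.
type fakeClient struct {
	commsStopped chan struct{}
}

func main() {
	c := &fakeClient{commsStopped: make(chan struct{})}

	// Stand-in for reconnect() -> startCommsWorkers(): replaces the channel.
	go func() {
		c.commsStopped = make(chan struct{}) // unsynchronised write
	}()

	// Stand-in for Disconnect(): reads the same field in a select.
	go func() {
		select {
		case <-c.commsStopped: // unsynchronised read -> data race
		case <-time.After(10 * time.Millisecond):
		}
	}()

	time.Sleep(100 * time.Millisecond) // give the race detector time to see both accesses
}
```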
I guess one solution would be to remove the check of c.commsStopped if c.options.AutoReconnect is true. That way the code would send the packet if the connection was up, and otherwise would time out and close the network link. This is not ideal, but given how rare this set of circumstances is, it may be the best option?
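As a rough sketch of that idea (all of the names below are illustrative stand-ins, not the library's internals): only watch the stop channel when auto-reconnect is off, and otherwise just try to send the packet and fall back to the quiesce timeout:

```go
package main

import (
	"log"
	"time"
)

// disconnectSketch illustrates the suggested behaviour only; autoReconnect,
// stopCh, quiesce and sendDisconnect are hypothetical stand-ins.
func disconnectSketch(autoReconnect bool, stopCh <-chan struct{}, quiesce time.Duration, sendDisconnect func() error) {
	if !autoReconnect {
		// With auto-reconnect off, no reconnect goroutine can replace the
		// channel, so it is safe to check whether comms have stopped.
		select {
		case <-stopCh:
			log.Println("comms stopped; disconnect packet not sent")
			return
		default:
		}
	}
	// Otherwise attempt to send the packet and rely on the quiesce timeout.
	done := make(chan error, 1)
	go func() { done <- sendDisconnect() }()
	select {
	case err := <-done:
		if err != nil {
			log.Println("disconnect packet send failed:", err)
		}
	case <-time.After(quiesce):
		log.Println("disconnect packet not sent due to timeout")
	}
}

func main() {
	stop := make(chan struct{})
	disconnectSketch(true, stop, 250*time.Millisecond, func() error { return nil })
}
```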
Ideally the code handling status should be totally rewritten, but this is a big job and it would be really easy to introduce new deadlocks...
Thank you for your fast and thorough answer
> This is not ideal, but given how rare this set of circumstances is, it may be the best option?

We will try to measure this and, if possible, proceed with it.
Would you be open to accepting refactoring PRs of this code so it becomes easier to reason about? I found quite a lot of redundant code that makes certain races and execution flows less evident. I would then hope to submit a reduction of the concurrent code interaction.
@ptsneves thanks for the offer; I'll happily accept refactoring PRs that improve readability/resilience so long as they don't break existing usages. Note that it may take me a while to accept such PRs (review/testing often takes longer than implementation).
The error is always reproducible by running go test -race -run Test_autoreconnect -v.
@ptsneves let me know if you need some help with refactoring
After understanding the code better, and specifically the Disconnect() method, I find that just removing the following lines fixes the issue without any real adverse effects:
```go
case <-c.commsStopped:
	WARN.Println("Disconnect packet could not be sent because comms stopped")
```
The only downside is that if the comms have stopped we will not know it, and Disconnect will take at most the quiesce time. As the quiesce time is an argument supplied by the caller, the caller can just set it to 0 if they want. Otherwise, that line just waits and prints a warning. Would you accept a pull request to remove it?
Do you still want the check to happen if AutoReconnect is false?
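For reference on the quiesce point above: the argument to Disconnect is the number of milliseconds to wait for outstanding work, so a caller that does not care about the lost warning can simply pass 0. A trivial sketch (closeClient and the mqttutil package are made-up names; Disconnect itself is the library's real API):

```go
package mqttutil

import mqtt "github.com/eclipse/paho.mqtt.golang"

// closeClient is a hypothetical helper: Disconnect(0) returns without waiting
// for outstanding work, while Disconnect(250) waits up to 250 ms for it.
func closeClient(c mqtt.Client, waitForWork bool) {
	if waitForWork {
		c.Disconnect(250)
		return
	}
	c.Disconnect(0)
}
```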
I will, with reservations, accept a pull request that comments this out (with a comment noting that it results in a data race when auto-reconnect is enabled).
At some point we will need to fix the way status is handled, and at that point this warning (and skipping the delay) should be reintroduced. Being able to tell whether the disconnect packet was sent may be important in some use-cases (if the broker receives it, it will discard any Will Message for the current session).
@MattBrittan Thank you. Honestly, I have a big refactoring and fat-cutting patch privately, but this issue does not seem solvable without ugly synchronization primitives for the commsStopped channel.
The data race seems to happen in the client's OnConnect callback. I stumbled upon this issue when debugging emitter-io/go, and the data race only happens when I use OnConnect.
@bokunodev If setting an OnConnect handler leads to a data race then the issue is probably with the handler you are setting (the only impact of setting the OnConnect handler is that the handler gets called, in a goroutine, when the connection is up).
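For example (a minimal sketch; the broker address, client ID, and topic are placeholders), an OnConnect handler is set via the client options and runs in its own goroutine whenever the connection comes up, so anything it shares with the rest of the program needs its own synchronisation:

```go
package main

import (
	"log"
	"sync"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	var mu sync.Mutex
	connects := 0 // shared state that the handler goroutine touches

	opts := mqtt.NewClientOptions().
		AddBroker("tcp://localhost:1883"). // placeholder broker
		SetClientID("onconnect-example").
		SetAutoReconnect(true).
		SetOnConnectHandler(func(c mqtt.Client) {
			// Called in a goroutine each time the connection comes up, so
			// protect anything shared with other goroutines.
			mu.Lock()
			connects++
			mu.Unlock()
			// Re-subscribing here is a common pattern after a reconnect.
			if token := c.Subscribe("example/topic", 0, nil); token.Wait() && token.Error() != nil {
				log.Println("subscribe failed:", token.Error())
			}
		})

	client := mqtt.NewClient(opts)
	if token := client.Connect(); token.Wait() && token.Error() != nil {
		log.Fatal(token.Error())
	}
	// ... run the application ...
	client.Disconnect(250)
}
```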
Note that I have rewritten the status handling code recently (it's in master but not yet included in a release) and this should significantly reduce the chance of there being a data race within the library (but you can still easily cause one in your own code).
I've written a low-level implementation without use of the Go runtime (no mutexes, channels, allocations, or interface casting/asserting) at natiu-mqtt. Hopefully it can inspire a future v2 for this library?
@soypat - see the v5 client for the effective v2 of this client. Currently that only supports v5, but the reality is that most brokers support v5 (and a v3 implementation in the same form would be relatively simple). I think there is room for a range of approaches (a lot of users want something that handles all of the work for them, i.e. auto-reconnection etc., whereas others want full control over memory usage and maximum efficiency), and it's great to see another Go option!
I believe that this particular issue is probably fixed in the latest release (which significantly changes the way the connection state is managed). I don't expect there to be major changes to this library going forward (once persistence is implemented in the V5 client I'll probably move my systems over to that).
> a lot of users want something that handles all of the work for them, i.e. auto-reconnection etc., whereas others want full control over memory usage and maximum efficiency
My take on this is that I want both! When I started using this library last week I was frustrated with the bugs it had. When I tried to dive deep into the code to see what the problem was, I was bewildered by how tightly coupled everything was. There was really no way of navigating the project in a reasonable time.
So I built my own! Natiu as it stands is low level, but that's fine: anyone can build an extension on top of it to add, for example, a websocket implementation with auto-reconnect features and all the bells and whistles. This way it is easier to maintain the original implementation, as it's simple and lightweight. This in turn makes it easy for third parties to implement extensions for it; it's a win-win!
There are other successful projects that follow this methodology, like goldmark; by doing so you really cut down on the cost of maintenance. Goldmark today has 180 closed issues and 1 open issue. I suggest you try it out in one of your next projects! :smile:
I'm going to close this off, as the largish change to status handling should have resolved the issue (and a few similar problems) and this issue has been quiet for a while. If anyone does come across a similar problem then please raise a new issue (there have been significant changes to the code, so comments in this issue are no longer relevant).
Version: v1.3.5