eclipse / paho.mqtt.golang


Data race condition #657

Closed (xeanhort closed this 1 month ago)

xeanhort commented 10 months ago

I'm using version 1.4.2 and the race detector sporadically reports a data race when I use the package. As the race only involves the library's internal functions, I don't think it's related to my code:

==================
WARNING: DATA RACE
Read at 0x00c000627bc1 by goroutine 307229:
  github.com/eclipse/paho.mqtt.golang/packets.(*FixedHeader).pack()
      /home/usr/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/packets/packets.go:267 +0x7a
  github.com/eclipse/paho.mqtt.golang/packets.(*PublishPacket).Write()
      /home/usr/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/packets/publish.go:47 +0x291
  github.com/eclipse/paho%2emqtt%2egolang.startOutgoingComms.func1()
      /home/usr/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/net.go:293 +0x1274
Previous write at 0x00c000627bc1 by goroutine 307196:
  github.com/eclipse/paho%2emqtt%2egolang.(*client).resume()
      /home/usr/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/client.go:1087 +0x145e
  github.com/eclipse/paho%2emqtt%2egolang.(*client).reconnect()
      /home/usr/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/client.go:339 +0x71d
  github.com/eclipse/paho%2emqtt%2egolang.(*client).internalConnLost.func1.1()
      /home/usr/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/client.go:560 +0x47
Goroutine 307229 (running) created at:
  github.com/eclipse/paho%2emqtt%2egolang.startOutgoingComms()
      /home/usr/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/net.go:264 +0x3e4
  github.com/eclipse/paho%2emqtt%2egolang.startComms()
      /home/usr/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/net.go:394 +0xfb
  github.com/eclipse/paho%2emqtt%2egolang.(*client).startCommsWorkers()
      /home/usr/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/client.go:656 +0xdd8
  github.com/eclipse/paho%2emqtt%2egolang.(*client).reconnect()
      /home/usr/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/client.go:338 +0x6e4
  github.com/eclipse/paho%2emqtt%2egolang.(*client).internalConnLost.func1.1()
      /home/usr/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/client.go:560 +0x47
Goroutine 307196 (running) created at:
  github.com/eclipse/paho%2emqtt%2egolang.(*client).internalConnLost.func1()
      /home/usr/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.2/client.go:560 +0x664
==================
MattBrittan commented 10 months ago

The read at paho.mqtt.golang@v1.4.2/packets/packets.go:267 is:

func (fh *FixedHeader) pack() bytes.Buffer {
    var header bytes.Buffer
    header.WriteByte(fh.MessageType<<4 | boolToByte(fh.Dup)<<3 | fh.Qos<<1 | boolToByte(fh.Retain)) // Line 267
    header.Write(encodeLength(fh.RemainingLength))
    return header
}

The previous write, at paho.mqtt.golang@v1.4.2/client.go:1087, is:

if p.Qos != 0 { // spec: The DUP flag MUST be set to 0 for all QoS 0 messages
    p.Dup = true   // Previous write HERE
}

So it looks like this issue relates to the way the Dup flag is set on the publish packet when it is resent upon reconnection. I'm guessing that you are using the memory store (the default)?
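For what it's worth, the pattern the detector is flagging can be reproduced in isolation with a small standalone sketch run under `go run -race`: one goroutine serialises the header while another flips Dup. The types below are simplified stand-ins of my own, not the library's actual code.

```go
// race_sketch.go - simplified stand-in types; run with: go run -race race_sketch.go
package main

import (
	"bytes"
	"sync"
)

// fixedHeader loosely mimics the shape of packets.FixedHeader, for illustration only.
type fixedHeader struct {
	MessageType byte
	Dup         bool
	Qos         byte
	Retain      bool
}

func boolToByte(b bool) byte {
	if b {
		return 1
	}
	return 0
}

// pack reads Dup (and the other flags) without any synchronization,
// mirroring the read reported at packets.go:267.
func (fh *fixedHeader) pack() bytes.Buffer {
	var header bytes.Buffer
	header.WriteByte(fh.MessageType<<4 | boolToByte(fh.Dup)<<3 | fh.Qos<<1 | boolToByte(fh.Retain))
	return header
}

func main() {
	p := &fixedHeader{MessageType: 3, Qos: 1}
	var wg sync.WaitGroup
	wg.Add(2)
	go func() { // plays the role of the old outgoing-comms goroutine serialising the packet
		defer wg.Done()
		_ = p.pack()
	}()
	go func() { // plays the role of resume() marking the packet as a duplicate
		defer wg.Done()
		p.Dup = true
	}()
	wg.Wait()
}
```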

What I think is happening (difficult to be sure without logs) is:

  1. Connection is up
  2. Publish is called; it checks c.status.ConnectionStatus() and sends the message to c.obound (which blocks because other messages are being processed)
  3. Connection drops before the message is picked up from the channel
  4. Connection comes back up and, as part of resume, we attempt to resend the publish packet (this is the write to Dup at client.go:1087)
  5. The old OutgoingComms goroutine picks the original publish up from the channel and attempts to send it, reading the flags as it serialises the packet (this is the read at packets.go:267; the send will fail because the old connection is down)

If I'm correct then this is not a big concern; the OutgoingComms routine remains running until it has emptied all of its input channels (this is deliberate, to avoid deadlocks), but it's attempting to send to a closed connection, so it will just get an error every time. As such, the packet with questionable integrity is effectively being thrown away anyway. Generally I'd expect the OutgoingComms goroutine to exit quickly when the connection has dropped (establishing a new connection should take longer); I'm not really sure why it's taking longer in your case. Potentially I could see it being due to:

Without the ability to replicate this (and/or logs) it's going to be difficult to fix. For historical reasons the library spins up quite a few goroutines and the interactions between them are complex (so it's easy to create new issues when fixing something like this).
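To make the drain-until-empty behaviour described above concrete, here's a rough sketch of my own (not the library's code): the old writer keeps pulling queued packets off its channel, every write to the dead connection errors out, and the packets are effectively discarded. The `outbound` channel and the `net.Pipe` stand-in for the broker connection are illustrative only.

```go
// drain_sketch.go - illustrative only.
package main

import (
	"fmt"
	"net"
)

func main() {
	server, client := net.Pipe() // in-process stand-in for the MQTT connection
	outbound := make(chan []byte, 4)

	// Fill the outgoing queue, then simulate the connection dropping.
	outbound <- []byte("msg1")
	outbound <- []byte("msg2")
	close(outbound)
	server.Close()
	client.Close()

	// The "old" writer keeps draining its channel even though the connection
	// is gone; every write just returns an error, so the queued packets are
	// thrown away rather than blocking anything.
	for pkt := range outbound {
		if _, err := client.Write(pkt); err != nil {
			fmt.Printf("dropping %q: %v\n", pkt, err)
		}
	}
}
```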

One option would be to modify Publish such that it fails more quickly when the connection drops. Unfortunately, as I'm not 100% sure that I'm right about the cause, this may or may not address the issue!
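As a sketch of that option (purely hypothetical; `enqueueOrFail`, `obound` and `connDone` are names I've made up, and nothing like this exists in the library today): queueing could select against a per-connection done channel, so the call returns an error instead of blocking once the connection is gone.

```go
package mqttsketch

import "errors"

// ErrConnLost is returned when the connection drops before the packet is queued.
var ErrConnLost = errors.New("connection lost before packet could be queued")

// enqueueOrFail is a hypothetical helper (not part of paho.mqtt.golang): it
// queues a packet for the outgoing-comms worker, but gives up as soon as the
// current connection's done channel is closed instead of blocking.
func enqueueOrFail(obound chan<- []byte, pkt []byte, connDone <-chan struct{}) error {
	select {
	case obound <- pkt:
		return nil
	case <-connDone:
		return ErrConnLost
	}
}
```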

xeanhort commented 10 months ago

Thank you for your feedback.

The connection is unstable but I saw this message a couple of times. I'll add the log handlers and try to reproduce it.
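For reference, I'm planning to wire the library's package-level loggers to the standard log package, roughly like this (DEBUG is verbose, but it should show the reconnect/resume sequence):

```go
package main

import (
	"log"
	"os"

	mqtt "github.com/eclipse/paho.mqtt.golang"
)

func main() {
	// Route the library's internal logging to stderr.
	mqtt.ERROR = log.New(os.Stderr, "[MQTT ERROR] ", log.LstdFlags)
	mqtt.CRITICAL = log.New(os.Stderr, "[MQTT CRIT] ", log.LstdFlags)
	mqtt.WARN = log.New(os.Stderr, "[MQTT WARN] ", log.LstdFlags)
	mqtt.DEBUG = log.New(os.Stderr, "[MQTT DEBUG] ", log.LstdFlags)

	// ...create the client and connect as usual...
}
```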

MattBrittan commented 1 month ago

I'm going to close this as it's been idle for quite some time (so I suspect the original issue has been found/resolved).