potential for token.Wait() to hang indefinitely

dehort commented 2 years ago

I recently ran into an issue where my paho based message publisher got into a state where it was blocked forever.

The process appears to have hung up on the token.Wait() call after Publish() was called. Here is a thread dump that shows the go routine blocked on Wait():

goroutine 138 [chan receive, 303 minutes]:
github.com/eclipse/paho%2emqtt%2egolang.(*baseToken).Wait(0x1)
    /opt/app-root/src/go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.3.5/token.go:73 +0x1f
github.com/RedHatInsights/cloud-connector/internal/mqtt.sendMessage({0x1259b28, 0xc0001246c0}, 0x1, {0xc0007633d0, 0x24}, 0xc0004e22e0, {0xc000a48b00, 0x3f}, 0x1, {0xdc16c0, ...})
    /go/src/app/internal/mqtt/utils.go:66 +0x37d
github.com/RedHatInsights/cloud-connector/internal/mqtt.SendReconnectMessageToClient({0x1259b28, 0xc0001246c0}, 0xc000752db8, 0xc000982520, 0x24, {0xc0007633d0, 0x24}, 0xc000459810)
    /go/src/app/internal/mqtt/utils.go:30 +0x2f8
github.com/RedHatInsights/cloud-connector/internal/cloud_connector.handleOnlineMessage(0x84efa6, {0x1259b28, 0xc0001246c0}, {0xc0007633d0, 0x24}, {{0xc00003ed98, 0x11}, {0xc00013d980, 0x24}, 0x1, ...}, ...)
    /go/src/app/internal/cloud_connector/message_handlers.go:125 +0x2e6
github.com/RedHatInsights/cloud-connector/internal/cloud_connector.handleConnectionStatusMessage(0xc0007535a0, {0x1259b28, 0xc0001246c0}, {0xc0007633d0, 0x24}, {{0xc00003ed98, 0x11}, {0xc00013d980, 0x24}, 0x1, ...}, ...)
    /go/src/app/internal/cloud_connector/message_handlers.go:94 +0x1f0
github.com/RedHatInsights/cloud-connector/internal/cloud_connector.HandleControlMessage.func1({0x1259b28, 0xc0001246c0}, {0xc0007633d0, 0x24}, {0xc000ac4f00, 0x65ff01})
    /go/src/app/internal/cloud_connector/message_handlers.go:67 +0x78e
main.handleMessage.func1(0xc0007c7320)
    /go/src/app/cmd/cloud-connector/kafka_message_consumer.go:217 +0x455
main.consumeMqttMessagesFromKafka(0x0, 0xc00015bc20, {0x1248370, 0xc000516980}, 0x0)
    /go/src/app/cmd/cloud-connector/kafka_message_consumer.go:238 +0x2e2
created by main.startKafkaMessageConsumer
    /go/src/app/cmd/cloud-connector/kafka_message_consumer.go:126 +0xaa9

My application is currently using paho 1.3.5.

Looking at the code, it looks like the Publish() method checks to see if the connection is down at the top of the method. If the connection is down, then it marks the token as completed (flowComplete() or setError() as setError() calls flowComplete()) and returns the token. This all happens around line 690 in the code.

However, it looks like there is the possibility that the connection is good at the top of the method, but by the time the go routine makes it down to line 724 (which checks the connection status again) the connection is either "connecting" or "reconnecting". When this happens (connection status == "connecting" or "reconnecting"), then nothing marks the token as complete. This will cause the caller to block indefinitely on token.Wait().

It looks like the tokens could be marked as complete by the messageids.cleanUp() method:

This method appears to be called when the client explicitly calls Disconnect(): https://github.com/eclipse/paho.mqtt.golang/blob/v1.3.5/client.go#L483

In my case, the client should run indefinitely, so it will not be calling Disconnect().

messageids.cleanUp() is also called here.

However, the logic here appears to be such that messagesids.cleanUp() will not be called if paho is trying to reconnect. In my case, c.options.AutoReconnect is true, c.options.CleanSession is true and I suspect c.connectionStatus() > connecting is true as well (I don't have a way to prove this actually is the case for the stack dump provided). This appears to make it so that there is nothing that calls messageids.cleanUp().

To boil all of this down, it appears to be possible for the connection status to change between the top of Publish() and the bottom, when this happens the token is not marked as completed. This will result in go routine hanging on token.Wait().

I believe this explains why I saw my publisher hang on token.Wait(). However, I could easily be wrong here. :)

I can attach the paho related go routine stack traces if that would be helpful.

I see that master has changed the logic around the call to c.messageIds.cleanUp(). Should cleanUp() get called now when a connection has reconnected?

I will upgrade to the latest release and see if I notice the error again (although it was very rare).

dehort commented 2 years ago

I think I will switch to using token.WaitTimeout() instead of token.Wait().

MattBrittan commented 2 years ago

I will upgrade to the latest release and see if I notice the error again Unfortunately with the number of changes since 1.3.5 I'm not so keen to try tracing your issue in the old code base. Please let us know how you get on with the later release - ideally upgrade to @master because I've recently made some significant changes to the way statuses are handled (but wanted to leave it running a while before creating a new release). I've been running the @master code in production for a month or so now and have not experienced any issues (but due to the number of options everyone's experience will differ!).

We do need to take another look at Publish once the new status code is fully proven because I think it can be simplified/improved.

apolloyang2017 commented 2 years ago

Unfortunately, I had encountered this issue before. After I upgraded to v1.3.5, it had appeared the day before yesterday

apolloyang2017 commented 2 years ago

@MattBrittan, why token.Wait() don't redirect to token.WaitTimeout() with a default time?

MattBrittan commented 2 years ago

why token.Wait() don't redirect to token.WaitTimeout() with a default time?

Because that's not what historical users would expect - this is quite an old library and we really don't want to introduce changes that might break existing code - if you want a timeout then use WaitTimeout() (that's why it was added).

Please try @master - as I mentioned above, this has a considerable number of changes to the way the connection status is handled (introduces extra connection status tracking and removes the use of atomic which was somewhat error prone). There is still a potential issue in Publish where the connection drops between the two status checks. I guess we could update the token at the second check in a way that mirrors the check at the top of the function (but would need more info on the issue you are facing in order to be able to say if that would help).

apolloyang2017 commented 2 years ago

@MattBrittan， thank for your response. I worry if will cause performance issue in high concurrences publish message case with WaitTimeout() ?

apolloyang2017 commented 2 years ago

@MattBrittan， thank for your response. I worry if will cause performance issue in high concurrences publish message case with WaitTimeout() ?

MattBrittan commented 2 years ago

@apolloyang2017 well if the library was changed to do this by default then the performance impact would be the same. I think it would be best if you raised a new issue; @dehort supplied a trace so it's clear what issue they were experiencing but you might be facing something different (without more info I'm just guessing). I cannot really comment on the performance impact but in general I would not expect creating a timer to be a huge draw (if you want you can get the channel with Done() and then do something else yourself - e.g. maintaining a map of in-progress publishes and then checking periodically to see what has completed).

nkvoll commented 1 year ago

I wonder if the unexpected behavior I'm seeing is related to this issue:

Stop the mosquitto server.
Connect() the client, end up in the connecting state (I'm creating the client with opts.SetConnectRetry(true))
Try to publish a message with qos=0, and it hangs in token.Wait()
Start the mosquitto server.
Client connects

At this point, I would expect the client to send the message and token.Wait() to return, but it keeps hanging. If I use qos=1, it works as expected, however. Did I miss somewhere in the documentation that states token.Wait() when publishing with qos = 0 has different behavior between a connected and connecting/reconnecting client?

MattBrittan commented 1 year ago

@nkvoll - there have been a number of releases since this issue was raised. The last release, 1.4.2, fundamentally changes that way the connection status is managed so it's really difficult to say if the two are related (as no logs were provided I don't know if the connection was being lost). Can you please raise a new issue and provide your test code (it's useful to see exactly what options you are using etc) and I'll try to add a test for this circumstance (checking the token in this one might do the job).

MattBrittan commented 10 months ago

Significant changes (including a revamp of status handling)) have been made to the library since this issue was reported; I'd appreciate it if anyone still encountering the original issue could add a comment.

jismithc commented 6 months ago

@MattBrittan We are seeing a similar behavior @nkvoll was seeing back in 2022:

mqtt client called to publish with Qos 0, and it hangs at token.Wait()

We believe this only happens when the device running the mqtt client loses and regains network connection, but are not 100% positive this is the root of the issue with the client.

We are using the most up to date version - v1.4.3

Here is a partial stack trace

[] github.com/MyRepo/common/src/mqtt.(*MqttClient).Publish(0x40004ec2c0, {0x400002eba0, 0x26}, {0x400063a580?, 0x400003f5b8?, 0x18130?}, 0x0, 0x0)
[]      <autogenerated>:1 +0x2c fp=0x40004e34c0 sp=0x40004e34a0 pc=0x3cfc7c
[] github.com/eclipse/paho%2emqtt%2egolang.(*PublishToken).Wait(0x40000ba508?)
[]      /go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.3/token.go:77
[] github.com/eclipse/paho%2emqtt%2egolang.(*baseToken).Wait(...)
[]      /go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.0.linux-arm64/src/runtime/chan.go:442 +0x14 fp=0x40004e34a0 sp=0x40004e3470 pc=0x17c64
[] runtime.chanrecv1(0x400047c200?, 0x40003fe6c0?)
[]      /go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.0.linux-arm64/src/runtime/chan.go:583 +0x404 fp=0x40004e3470 sp=0x40004e33f0 pc=0x180a4
[] runtime.chanrecv(0x40003fe6c0, 0x0, 0x1)
[]      /go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.0.linux-arm64/src/runtime/proc.go:402 +0xc8 fp=0x40004e33f0 sp=0x40004e33d0 pc=0x4c318
[] runtime.gopark(0x400003f488?, 0x1c5ec?, 0x0?, 0xe8?, 0x4ae8c0?)
[] goroutine 1 gp=0x40000021c0 m=nil [chan receive, 66 minutes]:

In lieu of this, we are wrapping the call to Publish as such:

func (c *MqttClient) Publish(topic string, payload []byte, qos byte, retain bool) error {
    optionsReader := c.mqttClient.OptionsReader()
    if token := c.mqttClient.Publish(topic, qos, retain, payload); (qos == 0 && token.WaitTimeout(optionsReader.WriteTimeout()) || qos > 0 && token.Wait()) && token.Error() != nil {
        return token.Error()
    }
    return nil
}

Is this check of QOS 0 expected? Should token.Wait() exit during client disconnection? Do we need to add custom code to the OnConnectHandler to free all active tokens in Wait()?

Thanks!

MattBrittan commented 6 months ago

@jismithc - can you please raise this as a new issue (I really should close this one off as I believe the original issue was fixed by the change to the way the connection status is managed).

With QOS 0 messages the token will be closed when the message is transmitted, an attempt to publish it fails, or publishWaitTimeout passes. If the the connection is dropped then there is a chance that the message will remain in the outbound channel (should then be picked up when the client reconnects). I've had a scan through the code, and it looks like the token should always be completed (as long as the connection comes back up eventually). However, based on what you are seeing, this is not necessarily the case so I'd like to gather as much info as possible to enable the issue to be traced (this kind of issue can be difficult to find and there is a chance the issue is in user code so really need logs or someway to duplicate the issue in order to trace it).

eclipse-paho / paho.mqtt.golang

potential for token.Wait() to hang indefinitely #613