Open dehort opened 2 years ago
I think I will switch to using token.WaitTimeout() instead of token.Wait().
I will upgrade to the latest release and see if I notice the error again Unfortunately with the number of changes since
1.3.5
I'm not so keen to try tracing your issue in the old code base. Please let us know how you get on with the later release - ideally upgrade to @master because I've recently made some significant changes to the way statuses are handled (but wanted to leave it running a while before creating a new release). I've been running the@master
code in production for a month or so now and have not experienced any issues (but due to the number of options everyone's experience will differ!).
We do need to take another look at Publish
once the new status code is fully proven because I think it can be simplified/improved.
Unfortunately, I had encountered this issue before. After I upgraded to v1.3.5, it had appeared the day before yesterday
@MattBrittan, why token.Wait() don't redirect to token.WaitTimeout() with a default time?
why token.Wait() don't redirect to token.WaitTimeout() with a default time?
Because that's not what historical users would expect - this is quite an old library and we really don't want to introduce changes that might break existing code - if you want a timeout then use WaitTimeout()
(that's why it was added).
Please try @master
- as I mentioned above, this has a considerable number of changes to the way the connection status is handled (introduces extra connection status tracking and removes the use of atomic
which was somewhat error prone). There is still a potential issue in Publish
where the connection drops between the two status checks. I guess we could update the token at the second check in a way that mirrors the check at the top of the function (but would need more info on the issue you are facing in order to be able to say if that would help).
@MattBrittan, thank for your response. I worry if will cause performance issue in high concurrences publish message case with WaitTimeout() ?
@MattBrittan, thank for your response. I worry if will cause performance issue in high concurrences publish message case with WaitTimeout() ?
@apolloyang2017 well if the library was changed to do this by default then the performance impact would be the same.
I think it would be best if you raised a new issue; @dehort supplied a trace so it's clear what issue they were experiencing but you might be facing something different (without more info I'm just guessing). I cannot really comment on the performance impact but in general I would not expect creating a timer to be a huge draw (if you want you can get the channel with Done()
and then do something else yourself - e.g. maintaining a map of in-progress publishes and then checking periodically to see what has completed).
I wonder if the unexpected behavior I'm seeing is related to this issue:
Connect()
the client, end up in the connecting
state (I'm creating the client with opts.SetConnectRetry(true)
)token.Wait()
At this point, I would expect the client to send the message and token.Wait()
to return, but it keeps hanging. If I use qos=1, it works as expected, however. Did I miss somewhere in the documentation that states token.Wait()
when publishing with qos = 0 has different behavior between a connected and connecting/reconnecting client?
@nkvoll - there have been a number of releases since this issue was raised. The last release, 1.4.2, fundamentally changes that way the connection status is managed so it's really difficult to say if the two are related (as no logs were provided I don't know if the connection was being lost). Can you please raise a new issue and provide your test code (it's useful to see exactly what options you are using etc) and I'll try to add a test for this circumstance (checking the token in this one might do the job).
Significant changes (including a revamp of status handling)) have been made to the library since this issue was reported; I'd appreciate it if anyone still encountering the original issue could add a comment.
@MattBrittan We are seeing a similar behavior @nkvoll was seeing back in 2022:
token.Wait()
We believe this only happens when the device running the mqtt client loses and regains network connection, but are not 100% positive this is the root of the issue with the client.
We are using the most up to date version - v1.4.3
Here is a partial stack trace
[] github.com/MyRepo/common/src/mqtt.(*MqttClient).Publish(0x40004ec2c0, {0x400002eba0, 0x26}, {0x400063a580?, 0x400003f5b8?, 0x18130?}, 0x0, 0x0)
[] <autogenerated>:1 +0x2c fp=0x40004e34c0 sp=0x40004e34a0 pc=0x3cfc7c
[] github.com/eclipse/paho%2emqtt%2egolang.(*PublishToken).Wait(0x40000ba508?)
[] /go/pkg/mod/github.com/eclipse/paho.mqtt.golang@v1.4.3/token.go:77
[] github.com/eclipse/paho%2emqtt%2egolang.(*baseToken).Wait(...)
[] /go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.0.linux-arm64/src/runtime/chan.go:442 +0x14 fp=0x40004e34a0 sp=0x40004e3470 pc=0x17c64
[] runtime.chanrecv1(0x400047c200?, 0x40003fe6c0?)
[] /go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.0.linux-arm64/src/runtime/chan.go:583 +0x404 fp=0x40004e3470 sp=0x40004e33f0 pc=0x180a4
[] runtime.chanrecv(0x40003fe6c0, 0x0, 0x1)
[] /go/pkg/mod/golang.org/toolchain@v0.0.1-go1.22.0.linux-arm64/src/runtime/proc.go:402 +0xc8 fp=0x40004e33f0 sp=0x40004e33d0 pc=0x4c318
[] runtime.gopark(0x400003f488?, 0x1c5ec?, 0x0?, 0xe8?, 0x4ae8c0?)
[] goroutine 1 gp=0x40000021c0 m=nil [chan receive, 66 minutes]:
In lieu of this, we are wrapping the call to Publish as such:
func (c *MqttClient) Publish(topic string, payload []byte, qos byte, retain bool) error {
optionsReader := c.mqttClient.OptionsReader()
if token := c.mqttClient.Publish(topic, qos, retain, payload); (qos == 0 && token.WaitTimeout(optionsReader.WriteTimeout()) || qos > 0 && token.Wait()) && token.Error() != nil {
return token.Error()
}
return nil
}
Is this check of QOS 0 expected? Should token.Wait()
exit during client disconnection? Do we need to add custom code to the OnConnectHandler
to free all active tokens in Wait()
?
Thanks!
@jismithc - can you please raise this as a new issue (I really should close this one off as I believe the original issue was fixed by the change to the way the connection status is managed).
With QOS 0 messages the token will be closed when the message is transmitted, an attempt to publish it fails, or publishWaitTimeout
passes. If the the connection is dropped then there is a chance that the message will remain in the outbound channel (should then be picked up when the client reconnects). I've had a scan through the code, and it looks like the token should always be completed (as long as the connection comes back up eventually). However, based on what you are seeing, this is not necessarily the case so I'd like to gather as much info as possible to enable the issue to be traced (this kind of issue can be difficult to find and there is a chance the issue is in user code so really need logs or someway to duplicate the issue in order to trace it).
I recently ran into an issue where my paho based message publisher got into a state where it was blocked forever.
The process appears to have hung up on the token.Wait() call after Publish() was called. Here is a thread dump that shows the go routine blocked on Wait():
My application is currently using paho 1.3.5.
Looking at the code, it looks like the Publish() method checks to see if the connection is down at the top of the method. If the connection is down, then it marks the token as completed (flowComplete() or setError() as setError() calls flowComplete()) and returns the token. This all happens around line 690 in the code.
However, it looks like there is the possibility that the connection is good at the top of the method, but by the time the go routine makes it down to line 724 (which checks the connection status again) the connection is either "connecting" or "reconnecting". When this happens (connection status == "connecting" or "reconnecting"), then nothing marks the token as complete. This will cause the caller to block indefinitely on token.Wait().
It looks like the tokens could be marked as complete by the messageids.cleanUp() method:
This method appears to be called when the client explicitly calls Disconnect(): https://github.com/eclipse/paho.mqtt.golang/blob/v1.3.5/client.go#L483
In my case, the client should run indefinitely, so it will not be calling Disconnect().
messageids.cleanUp() is also called here.
However, the logic here appears to be such that messagesids.cleanUp() will not be called if paho is trying to reconnect. In my case, c.options.AutoReconnect is true, c.options.CleanSession is true and I suspect c.connectionStatus() > connecting is true as well (I don't have a way to prove this actually is the case for the stack dump provided). This appears to make it so that there is nothing that calls messageids.cleanUp().
To boil all of this down, it appears to be possible for the connection status to change between the top of Publish() and the bottom, when this happens the token is not marked as completed. This will result in go routine hanging on token.Wait().
I believe this explains why I saw my publisher hang on token.Wait(). However, I could easily be wrong here. :)
I can attach the paho related go routine stack traces if that would be helpful.
I see that master has changed the logic around the call to c.messageIds.cleanUp(). Should
cleanUp()
get called now when a connection has reconnected?I will upgrade to the latest release and see if I notice the error again (although it was very rare).