How to deal with LWT and DISCONNECT reason

empicano / aiomqtt

The idiomatic asyncio MQTT client

https://aiomqtt.felixboehm.dev/

BSD 3-Clause "New" or "Revised" License

412 stars 73 forks source link

How to deal with LWT and DISCONNECT reason #28

Open sveinse opened 3 years ago

sveinse commented 3 years ago

I'm working out of the advanced example on the front page and was working on adding a last will testament (LWT). However, I noticed that asyncio_mqtt.Client with its usage of context manager is always cleaning up the connection by sending DISCONNECT. This has the effect that the LWT will not be sent out by the broker, when the program is handling any exceptions including keyboard interrupts. The only way to get LWT to work is to kill the program hard.

Consequently a "manual" LWT publish had to be added to ensure a LWT-like behavior on program exit. This was not immediately apparent, so it might be worth mentioning in the docs at some point.

I would propose that the asyncio_mqtt.Client.__aexit__ had some awareness (setable flag in class?) of the exit status and corresponding .disconnect() had support for non-zero exit code in order for the broker to sent an ordinary LWT.

# Create a LWT
will = Will("some/topic", payload="offline", qos=2, retain=False)

# Connect to the MQTT broker
client = Client("test.mosquitto.org", will=will)
await stack.enter_async_context(client)

# Implement manual LWT
stack.push_async_callback(client.publish, "some/topic", "offline")

frederikaalund commented 3 years ago

Thanks the posting this issue. Let me have a look. :)

However, I noticed that asyncio_mqtt.Client with its usage of context manager is always cleaning up the connection by sending DISCONNECT.

Yes, graceful disconnects are per design.

This has the effect that the LWT will not be sent out by the broker, when the program is handling any exceptions including keyboard interrupts. The only way to get LWT to work is to kill the program hard.

This is indeed the intended default behaviour. Specifically, we consider a keyboard interrupt an intended disconnect. Hence the server shouldn't send the will message. I emphasize default, because I think it's a good idea to change the API to allow for custom behaviour.

Short Note on the Will Feature

As far as I'm aware, the server should only send the will message on an unintended disconnect. Quoting the MQTT Version 5.0 Specification, section 3.1.2.5 Will Flag:

Situations in which the Will Message is published include, but are not limited to:

An I/O error or network failure detected by the Server.

The Client fails to communicate within the Keep Alive time.

The Client closes the Network Connection without first sending a DISCONNECT packet with a Reason Code 0x00 (Normal disconnection).

The Server closes the Network Connection without first receiving a DISCONNECT packet with a Reason Code 0x00 (Normal disconnection).

In other words, if the client disconnects gracefully, the server shouldn't send the will message. Per design, asyncio-mqtt is very likely to disconnect gracefully. You have to really try to get an ungraceful disconnect (e.g., with the SIGKILL signal). This is a good thing!

I guess that we can discuss at length what constitutes an "unintended disconnect". :) Let's defer that discussion for now.

See also: https://www.hivemq.com/blog/mqtt-essentials-part-9-last-will-and-testament/ Short quote from said article: "In MQTT, you use the Last Will and Testament (LWT) feature to notify other clients about an ungracefully disconnected client"

Back on Topic

I would propose that the asyncio_mqtt.Client.__aexit__ had some awareness (setable flag in class?) of the exit status and corresponding .disconnect() had support for non-zero exit code in order for the broker to sent an ordinary LWT.

That's a good idea. :+1: Specifically, reason code 0x04 (Disconnect with Will Message) sounds like it fits your use case.

I don't want to add this feature as a flag in the constructor. I think that we can have a more powerful API by instead letting the user choose:

When/if the reason code should change
What the reason code should change to

Maybe the simplest solution is to just add a reason_code keyword argument to the Client.disconnect method. Then, you can do someting like:

# Always exit with a non-standard reason code
stack.push_async_callback(client.disconnect, reason_code=ReasonCode.DisconnectWithWillMessage)

# Disconnect with a non-standard reason code on `KeyboardInterrupt`
@asynccontextmanager
def _disconnect()
    try:
        yield
    except KeyboardInterrupt:
        await client.disconnect(reason_code=ReasonCode.DisconnectWithWillMessage)
        raise

stack.enter_async_context(_disconnect)

This way, the when/if/what is completely up to the user.

I'm open to other API suggestions if you have some ideas. :)

sveinse commented 3 years ago

Thank you for a very prompt response. I generally agree with you, but I have some additional comments:

However, I noticed that asyncio_mqtt.Client with its usage of context manager is always cleaning up the connection by sending DISCONNECT.

Yes, graceful disconnects are per design.

Which is a good thing.

This has the effect that the LWT will not be sent out by the broker, when the program is handling any exceptions including keyboard interrupts. The only way to get LWT to work is to kill the program hard.

This is indeed the intended default behaviour. Specifically, we consider a keyboard interrupt an intended disconnect. Hence the server shouldn't send the will message. I emphasize default, because I think it's a good idea to change the API to allow for custom behaviour.

I can agree that keyboard interrupt can be defined as intended disconnect, but an arbitrary unhandled exception might not be. E.g. if some erroneous py code raises uncaught ValueError, the DISCONNECT with value 0 is sent and the broker accepts this as an expected disconnect and deletes the LWT, while the program has died with a traceback. This is not intended behavior IMHO.

I'm open to other API suggestions if you have some ideas. :)

Client.__aexit__ has access to the returning exception, so perhaps it should inspect the exception to determine what error code to DISCONNECT with? However, might there be any use cases where the user might want the exception to pass though the Client exit context but wants a successful DISCONNECT?

If someone needs to disconnect with other codes, perhaps its sufficient to let the user disconnect manually before leaving the context? The __aexit__ code is testing for already disconnected connection, so it should work fine.

This line is missing sending the function arguments to the mqtt Client. If this were fixed, the user can disconnect with any type of response. https://github.com/sbtinstruments/asyncio-mqtt/blob/master/asyncio_mqtt/client.py#L90

frederikaalund commented 3 years ago

If someone needs to disconnect with other codes, perhaps its sufficient to let the user disconnect manually before leaving the context?

That's what I'm suggesting. :) I think that could work for now.

This line is missing sending the function arguments to the mqtt Client. If this were fixed, the user can disconnect with any type of response. https://github.com/sbtinstruments/asyncio-mqtt/blob/master/asyncio_mqtt/client.py#L90

Yes, that's what I had in mind. Do you want to make a pull request?

EDIT: Accidentally sent the comment too early.

frederikaalund commented 3 years ago

E.g. if some erroneous py code raises uncaught ValueError, the DISCONNECT with value 0 is sent and the broker accepts this as an expected disconnect and deletes the LWT, while the program has died with a traceback. This is not intended behavior IMHO.

I think that this is application-specific. For some, the current graceful disconnect behaviour is what you want. For other applications, you'd want to customize this.

In any case, the proposed change will suit both use cases. :+1:

frederikaalund commented 3 years ago

Client.aexit has access to the returning exception, so perhaps it should inspect the exception to determine what error code to DISCONNECT with?

Technically, this is possible. :)

However, might there be any use cases where the user might want the exception to pass though the Client exit context but wants a successful DISCONNECT?

My guess is that this is what you want most of the time. That is just my guess, though. In any case, the proposed change allows you to customize this behaviour. :+1:

sveinse commented 3 years ago

Yes, that's what I had in mind. Do you want to make a pull request?

TBH I think you have an better idea on how you want to solve this than I do, ref below.

I think that this is application-specific. For some, the current graceful disconnect behaviour is what you want. For other applications, you'd want to customize this.

Interesting. I'm not going to dwell with this to any length, but I'm curious: How do you deal with crashes in your application seen from the other endpoint applications subscribing to the topics from this app? Can I assume that you're not using LWT?

In my interpretation of the MQTT standard, any internal failures is included under the unexpected loss of connection (logically, not the technical network). Simply because the program might end up in a state where it might not be able to perform its roles to MQTT and the remote subscribers. "Fail early, fail large". LWT exists precisely for those cases where the unexpected happen and you can bring a topic to a known state.

My guess is that this is what you want most of the time. That is just my guess, though. In any case, the proposed change allows you to customize this behaviour. 👍

I do understand that the other mqtt subscribers might not be interested in the vital signs of the client. E.g. delivering some sensor data. In this case a clean DISCONNECT is very handy. However, I suspect the LWT feature isn't used at all. LWT is all about the client's vital signs, and LWT is thus about exception handling and the returned status. Conversely, my claim is that LWT without some exception management and non-zero DISCONNECT makes little sense. (I did spend two full hours debugging the MQTT server and tracing in Wireshark before I figured out that it was the client context manager exit that was the culprit. It was not obvious.)

Anyways, this discussion is kind of a side track, and mostly about mqtt semantics. I've raised my concerns and now I'm done! As pointed out: the proposed ideas will do. Cheers!

frederikaalund commented 3 years ago

You raise some good points about the default behavior. :) I guess it would make sense to let the server know the reason (the reason code) for the disconnect so that it can activate the LWT behavior accordingly. But which exceptions should trigger this? I don't really want to maintain a list of intentional exceptions (like KeyboardInterrupt and SystemExit) but maybe that would be the right thing. Let me dwell on it a bit more.

I don't use LWT myself so I really appreciate your input in this area. Cheers! I also don't want to pull the BDFL card unnecessarily, so I encourage this discussion. On 28 Dec 2020, 19.06 +0100, Svein Seldal notifications@github.com, wrote:

Yes, that's what I had in mind. Do you want to make a pull request? TBH I think you have an better idea on how you want to solve this than I do, ref below. I think that this is application-specific. For some, the current graceful disconnect behaviour is what you want. For other applications, you'd want to customize this. Interesting. I'm not going to dwell with this to any length, but I'm curious: How do you deal with crashes in your application seen from the other endpoint applications subscribing to the topics from this app? Can I assume that you're not using LWT? In my interpretation of the MQTT standard, any internal failures is included under the unexpected loss of connection (logically, not the technical network). Simply because the program might end up in a state where it might not be able to perform its roles to MQTT and the remote subscribers. "Fail early, fail large". LWT exists precisely for those cases where the unexpected happen and you can bring a topic to a known state. My guess is that this is what you want most of the time. That is just my guess, though. In any case, the proposed change allows you to customize this behaviour. 👍 I do understand that the other mqtt subscribers might not be interested in the vital signs of the client. E.g. delivering some sensor data. In this case a clean DISCONNECT is very handy. However, I suspect the LWT feature isn't used at all. LWT is all about the client's vital signs, and LWT is thus about exception handling and the returned status. Conversely, my claim is that LWT without some exception management and non-zero DISCONNECT makes little sense. (I did spend two full hours debugging the MQTT server and tracing in Wireshark before I figured out that it was the client context manager exit that was the culprit. It was not obvious.) Anyways, this discussion is kind of a side track, and mostly about mqtt semantics. I've raised my concerns and now I'm done! As pointed out: the proposed ideas will do. Cheers! — You are receiving this because you commented. Reply to this email directly, view it on GitHub, or unsubscribe.

pichwo commented 3 years ago

hi - yes this encourages me .... ;-)

please let me drop my 2cents on this topic ...

it is vital for "server clients" to establish some sort of "life-cycle-topics".

a standard "real client" would on startup need to

1a) subscribe on a well kown "i-am-there" topic of the "server-client" (which might return further information-needed to continue) OR 1b) subscribe to some known topic which are regular heartbeats indicating "normal activity"

2a) publish to a well-known "are-you-there" topic of the "server-client" (maybe providing furhter information needed for next steps) OR 2b) do nothing

3) bail out on timeout if no messages on expected topics

4) subscribe AT-LEAST on last-will of "server-client" and go on since this indicates "stop immediately because all your work might be cancellable immediately" and/or "no use to wait for further protocol-activity" ....

LAST-WILL is some sort of "the other side gone insane". if protocols are robust for eventual reconnect and/or include strict timeout expectations, i admit last-will is unnecessary.

but think of "close interoperation", where 2 state-machines cannot be resyncronized with sane effort´ : a "hard reset" is more than welcome.

think also of events which occurr infrequently and possibly very seldom but sanity of both sides is simply relied on a keepalive session (indicates also the (im-)possibility of additional network-connections between peers relaying on mqtt-broker-host and simplifies timeout-logic on this fact a lot ;-)

and last not least think of "mqtt over autossh " were any security considerations are solved easily and since we are connecting to "known hosts" : we can omit all this tls stuff and may fall back to user-login on known shared secrets delegating all other things to ssh logic. maybe the fact of not yet having received last will after i-am-here can be utilized to be sure, that out-of-band data exchange over other known tcp-tunnels is still feasible ;-)

please keep also in mind, that in very common scenarios it is also desirable, that some "well-known" topics indicating a "soft-reset" and "i-am-shutting-down" are very common, eg. "server-side" says "had-to-reinitialize-this-and-that", or "client-side" wants "please-start-over-on-that-and-this".

LAST-WILL is just the ultimate and/or only message indicating "no more activity in current lifecycle start over please".

of course i am glad if i have a service which will also issue some "i-am-ready-now" which will follow the "i-am-there" which ideally eases initialization ... but i am also glad if i can rely on a last-will and do not have to implement unnecessary timeout logic on both sides.

i admit, protocols mapped either to simple i-am-there/last-will or including complex topic-wars are a matter of taste, but i would vote to support them ALL in your framework, which is very promising in its core "do it via context-managers and rely on exitstack".

there is no use to argue about "what is needed" or not, the framework should be open enough to COPE EASiLY and UNIVERSALLY. THIS is the key to its success, not philosophical things and their legitimacy ...

best greetings w.

... and last words : last will COULD be special insofar, that dynamic input crashed the other side and it is not feasible to continue any more. not just network timeouts. it is implementation-defined. no use to narrow the implemention by the framework ...

pichwo commented 3 years ago

one more last words ;-)

============== ============== ----------------- <TL,dr>

rationale: implementing "vpn for the paranoid poor" :

think of an existing "gateway-backbone" each running a special ssh gateway-client MONITOR (i.e. a python app utilizing asyncssh and reconnecting endlessly until stopped like AUTOSSH with hooks that will pub/sub to the mqtt-broker at localhost).

gateway-A --(whatever tunnels) --> central-gateway gateway-B --(whatever tunnels) --> central-gateway ...

the MONITOR would have to :

subscribe to a "ping"-topic on mqtt-localhost and publish a "pong"-topic on_message and on_ssh_client_connected_or_reconnected.
(re-) start a keepalive ssh-client carrying the WHATEVER-TUNNELS and dedicated additional tunnels to/from dedicated echo-server on gateway-host and reconnect if anything fails just as autossh does
to protect against coredump or sigkill issue "crashed" LAST-WILL before connecting to mqtt-broker on localhost
on ANY errors regarding mqtt-broker issues run the termination protocol (see below)
when encountering ANY irregularities in ssh restart or monitoring logic run termination protocol (see below)
when encountering any sigterm sigint KeyboardInterrupt SystemExit or unhandled exceptions run termination protocol (see below
when very unhappy and want to quit : issue os.kill(os.getpid(),signal.SIGKILL) to speed things really up
so, for the more patient programmer who thinks it is bad practice for an an app commit suicide but be pedantic with same effect
the termination protocol :
- optionally publish a "giveup" topic
- DEFINITELY ALSO publish the last-will topic "crashed" before disconnect from mqtt-broker :
- - NOTE: there is no semantic difference between exiting regularily or just die !!!
- exit such behaviour will reliably bump one level up ANY unhealthy or unsafe conditions ;-)

any CLIENT APP would have to :

regard ANY errors /w mqtt-broker fatal giveup
subscribe to "crashed" (last-will) and
regard ANY on_message "crashed" (last-will) a fatal error note :
subscribe to "pong" is necessary only if ever willing to publish "ping", but this functionality is superfluous
subscribe to "giveup" is necessary only if ever willing to distinguish between sudden death and exit protocol, but this functionality is superfluous

============== ============== ----------------- </TL,dr>

as a consequence of such design i raised a request for enhancement because it is VITAL in such a scenario to have proxy-support enabled :

for any gateway-X --(whatever tunnels) --> central-gateway it is vital to avoid long static lists which are a nightmare to maintain ....

so (whatever tunnels) boils down to ( -R ... -D ...)

hope this are really last words on last will ;-) wolfgang

frederikaalund commented 3 years ago

Thanks again to both of you for your input on this issue. Great to see keen interest on a technical topic such as this. :+1:

I've dwelt on this issue for a while. Now, I too think that the client should disconnect with the will message when the program raises an unintentional exception.

I vote that we do the following:

Maintain a list of intentional exceptions (like KeyboardInterrupt and SystemExit)
Check in Client.__aexit__ if the exit is intentional or not.
- If intentional, disconnect with reason code 0x00 (normal disconnect)
- If unintentional, disconnect with reason code 0x04 (disconnect with will message)
Add an intentional_exceptions keyword argument to Client.__init__. Use this to override the default list.

With the above changes, the client will per default disconnect with the will message if, e.g., a RuntimeError occurs.

Implementation-wise, it's a relatively simple change. This is a good thing. :)

I think this covers all the use cases mentioned in this issue thread so far. What do you think? Did I miss something? Let me know.

Again thanks for the input and discussion on this issue. :+1:

pichwo commented 3 years ago

Great. Thank you. hope to pip install soon ;-) wolfgang