espressif / esp-mqtt

ESP32 mqtt component
Apache License 2.0
603 stars 255 forks source link

mqtt: Don't enqueue QoS>0 messages when publish in !MQTT_STATE_CONNEC… (IDFGH-4272) #177

Closed AxelLin closed 3 years ago

AxelLin commented 3 years ago

…TED state

Enqueue QoS>0 messages in !MQTT_STATE_CONNECTED state may take too much outbox memory before reaching OUTBOX_EXPIRED_TIMEOUT_MS. The ESP32 has limited memory, a task keep publishing in a busy loop during offline can easily make device run into OOM. Just make it return error if publish in wrong state no matter which QoS is used. With this change now esp_mqtt_client_publish returns -1 for QoS>0 when offline, this way the caller knows the publish failed.

Signed-off-by: Axel Lin axel.lin@ingics.com

david-cermak commented 3 years ago

Agree that this seems like a useful change for some applications, but it introduces a different behaviour and could break others. Maybe adding a config option to queue_if_disconnected or something could be a solution?

AxelLin commented 3 years ago

For QoS0, it simply returns -1 when in !MQTT_STATE_CONNECTED state. I have no idea why it has different behavior for QoS>0 cases. Why not return -1 so the behavior is consistent with QoS0? And the request may be silently deleted due to reach OUTBOX_EXPIRED_TIMEOUT_MS if it takes too much time to reconnect.

Actually, I think this is a bug fix. Consider about the mqtt broker shutdown or some netowrking issue casue connect broker failure for a period, Allow enqueue QoS>1 request can make device run into OOM. (And OOM usually won't make device panic or reboot, it just make device stop working)

BTW, I'm wondering why it could break some application. I don't think people expect esp_mqtt_client_publish to work when it's not connected to broker.

david-cermak commented 3 years ago

Thanks for your further inputs. Indeed you're right, the change is helpful, make sense and could prevent the device getting into OOM. I might even agree that the publish API is inconsistent and not very much usable to keep publishing when disconnected just to queue messages into the outbox.

But this all is still besides the point that this PR is a breaking change and we cannot simply modify API behaviour like that. Moreover when we typically backport the submodule reference to all IDF release branches.

AxelLin commented 3 years ago

The reason I sent this fix is because I think current behavior is buggy. If fix the behavior is not allow, I'll just close this pull request and checking the mqtt status in my application side. Actually I belive most people use the mqtt this way, set a flag in MQTT_EVENT_CONNECTED/MQTT_EVENT_DISCONNECTED and do publish only when it is connected.

I don't like adding queue_if_disconnected configuration because most people won't really understand why you need such configuration. In additonal, it does not address current issue at all. I think it currently works just because of good luck because if reconnect takes longer time it will hit OOM.

BTW, this is probably off-topic, but actually the baviour was changed by commit 6a0d1e7bff1e3085669d41ae3412c649db20108e "(support for qos1 and qos2 message retrasmission on reconnect"). In the initial version esp_mqtt_client_publish returns -1 for any QoS type. see esp_mqtt_client_publish() in commit 083f8789ac0c20e8efce66b96181379665a00ab4 "(Add support outbox, changed API").

david-cermak commented 3 years ago

I'm afraid we cannot accept this PR as is, unless a config option to enable/disable this change is added. Again I couldn't agree more with what you say:

Actually I belive most people use the mqtt this way...

most people won't really understand

The thing is that we cannot only think of most of the users, but all of them and maintain backward compatibility whenever possible. The config option might help and we can even turn it on on future releases.

BTW, this is probably off-topic, but actually the baviour was changed by ...

Not an off topic. Thanks for bringing this up! Yes, that probably wasn't the brightest decision, nor that was a breaking change at the time as it introduced queueing and resending for QoS>0, but this simply formed the API as is and as such needs to be maintained (at least for the stable IDF releases)

So please close this PR if you don't want to introduce additional configuration. Thank you for the contribution, anyway, we will have to address this issue in some way, plus maybe also add an event/notification on the deleted items from the outbox.

AxelLin commented 3 years ago

nor that was a breaking change at the time as it introduced queueing and resending for QoS>0, but this simply formed the API as is and as such needs to be maintained (at least for the stable IDF releases)

No, it's indeed a breaking change. When it introduce queueing and resending for QoS>0, it didn't provide an config option to disable the new beahvior. IMHO, even with an config option, the default should be "disable queueing and resending for QoS>0".

But since this change has been there for years, so the default will become "enable queueing and resending for QoS>0" if you are going to add this config option.