FreeRTOS / iot-reference-stm32u5

MIT License
43 stars 29 forks source link

[BUG] vMQTTAgentTask deadlock on re-subscribing after MQTT disconnection #81

Closed grdSTM closed 7 months ago

grdSTM commented 1 year ago

A user investigating the vMQTTAgentTask behavior upon MQTT broker disconnection faced a deadlock in xLockSubCtx().

The test-disconnections have been triggered by playing with the MQTT_KEEP_ALIVE_TIMEOUT i.e. disabling for a certain time handleKeepAlive() in the client code.

What happens is that:
- At power-on, the MQTT client connects to the broker and subscribes to update-accepted/rejected/delta topics.
- After the 1st broker disconnect-reconnect event, the topics are re-subscribed.
- After the 2nd broker disconnect-reconnect event, the topics are already present because the session is persistent and that doesn’t require a new re-subscription.

At this point, an attempt to re-take a mutex already taken (xLockSubCtx() call) and never released brings the vMQTTAgentTask in a deadlock.

A quick fix in the code helps to overcome this situation. 

See the patch they made over the v202205.00 version: https://github.com/FreeRTOS/iot-reference-stm32u5/compare/main...grdSTM:lab-iot-reference-stm32u5:resubDlWa?expand=1

As far as I understand, the house-keeping of subscription locks is intended to be handled by MQTTAgent_CancelAll() callbacks.

Edit: The issue has been detected in a port relying on an alternative implementation of the MQTT transport interface. Compared to with the Wi-Fi driver of the iot-reference-stm32u5 project, it is possible that pending TCP segments are handled differently when the TCP connection gets closed.

Is the above patch the proper fix, or are there preferred alternatives?

Thanks!

archigup commented 1 year ago

Hi, we're looking into this. Thanks for bringing this up!

AniruddhaKanhere commented 1 year ago

Hello @grdSTM,

Thanks a lot for taking the time to report the bug to us. We appreciate it :)

Trying to follow along - let me know if I miss something or did not understand correctly.

In case of clean session (After the 1st broker disconnect-reconnect event), the mutex is relinquished here. After which the agent does its job in the command loop here.

After this, the broker is disconnected from the client and we reacquire the mutex here. Following this, we restart from top of the loop where MQTT_Connect is called. This time however, the clean session flag is set to false and we call prvHandleResubscribe here. Which does NOT try to take the mutex and just relinquishes it in case of a failure.

I am not sure that I am seeing a deadlock here. Can you help e find my mistake? Did I miss something or overlook some case?

Thank you for reporting the bug to us.

Thanks, Aniruddha

gmarcolinosr commented 1 year ago

Hello @AniruddhaKanhere,

sorry to jump abruptly into the discussion, I had stumbled in the deadlock issue some time ago and notified the issue to @grdSTM. Referring to your comment, what i have seen is that after the 2nd disconnect-reconnect, the session flag is set to true because the session is persistent. That means that prvHandleResubscribe() is not called.

The temporary solution we adopted is to explicitly handle the case ( xSessionPresent == true) by unlocking the mutex, see the patch made over the v202205.00 version:

https://github.com/FreeRTOS/iot-reference-stm32u5/compare/main...grdSTM:lab-iot-reference-stm32u5:resubDlWa?expand=1

AniruddhaKanhere commented 7 months ago

The above PR was merged. Thank you for the contribution @grdSTM and @gmarcolinosr!

I shall be closing Issue now.