espressif / esp-mqtt


Random Disconnect (IDFGH-1968) #102

Closed homeit-hq closed 4 years ago

homeit-hq commented 5 years ago

TAG: ESP-MQTT_FOR_IDF_3.1

Hi, I'm using this lib to connect to an MQTT broker over SSL.

The problem I'm facing is that I sometimes get random disconnects.

Sometimes I get:

TRANS_SSL: mbedtls_ssl_write error, errno=Success
MQTT_CLIENT: Error write data or timeout, written len = -27648

Other times I get:

TRANS_SSL: mbedtls_ssl_write error, errno=Bad file number
MQTT_CLIENT: Error write data or timeout, written len = -27648

This is the log for the "Bad file number" error:

I (603531) MQTT CONN: 0x3ffe68d8 - Publish to  *****/************/******/reply | data = {"method":"status_ok","args":{"msg":"this_info"}}
D (603541) MQTT_CLIENT: mqtt_enqueue id: 61710, type=3 successful
D (603541) OUTBOX: ENQUEUE msgid=61710, msg_type=3, len=88, size=88
D (603551) MQTT_CLIENT: msg_type=6, msg_id=1
D (603551) MQTT_CLIENT: received MQTT_MSG_TYPE_PUBREL
E (603561) TRANS_SSL: mbedtls_ssl_write error, errno=Bad file number
E (603561) MQTT_CLIENT: Error write data or timeout, written len = -27648
D (603801) MQTT_CLIENT: msg_type=5, msg_id=61710
D (603801) MQTT_CLIENT: received MQTT_MSG_TYPE_PUBREC
E (603831) MQTT_CLIENT: Read error or end of stream
D (603831) TRANS_SSL: Cleanup mbedtls
I (603841) MQTT_CLIENT: Reconnect after 10000 ms
I (603841) MQTT CONN: Event: DISCONNECTED

And this is the log (full verbose mode) for the "Success" error:

I (24871) MQTT CONN: 0x3ffe68d8 - Publish to  *****/************/******/reply | data = {"method":"some_info_reply","args":[{"int_id":1,"ext_id":*******,"n_locks":1,"n_keys":0},{"int_id":2,"ext_id":*******,"n_locks":1,"n_keys":1}]}
D (24881) MQTT_CLIENT: mqtt_enqueue id: 46950, type=3 successful
D (24891) OUTBOX: ENQUEUE msgid=46950, msg_type=3, len=181, size=660
D (25041) MQTT_CLIENT: msg_type=5, msg_id=49681
D (25041) MQTT_CLIENT: received MQTT_MSG_TYPE_PUBREC
D (25081) MQTT_CLIENT: msg_type=5, msg_id=46950
D (25081) MQTT_CLIENT: received MQTT_MSG_TYPE_PUBREC
D (25341) MQTT_CLIENT: msg_type=7, msg_id=49681
D (25341) MQTT_CLIENT: received MQTT_MSG_TYPE_PUBCOMP
D (25341) MQTT_CLIENT: pending_id=46950, pending_msg_count = 11
D (25351) OUTBOX: DELETED msgid=49681, msg_type=3, remain size=574
D (25361) MQTT_CLIENT: Receive MQTT_MSG_TYPE_PUBCOMP, finish QoS2 publish
D (25391) MQTT_CLIENT: msg_type=7, msg_id=46950
D (25391) MQTT_CLIENT: received MQTT_MSG_TYPE_PUBCOMP
D (25391) MQTT_CLIENT: pending_id=46950, pending_msg_count = 10
D (25391) OUTBOX: DELETED msgid=46950, msg_type=3, remain size=393
D (25401) MQTT_CLIENT: Receive MQTT_MSG_TYPE_PUBCOMP, finish QoS2 publish
D (25961) MQTT_CLIENT: msg_type=3, msg_id=3
D (25961) MQTT_CLIENT: Queue response QoS: 2
I (25961) MQTT_CLIENT: deliver_publish, message_length_read=61, message_length=61
D (25961) MQTT_CLIENT: Get data len= 22, topic len=33
I (26001) UI: LOCK
I (26001) MQTT CONN: 0x3ffe68d8 - Publish to *****/************/******/reply | data = {"method":"status_ok","args":{"msg":"this_info"}}
D (26011) MQTT_CLIENT: mqtt_enqueue id: 64355, type=3 successful
D (26011) OUTBOX: ENQUEUE msgid=64355, msg_type=3, len=86, size=479
D (26021) MQTT_CLIENT: msg_type=6, msg_id=3
D (26021) MQTT_CLIENT: received MQTT_MSG_TYPE_PUBREL
E (26031) TRANS_SSL: mbedtls_ssl_write error, errno=Success
E (26031) MQTT_CLIENT: Error write data or timeout, written len = -27648
I (26041) DOOR_TASK: received. MSG_id=78
E (26051) MQTT_CLIENT: Error to public data to topic= *****/************/******/reply, qos=2
I (26061) BOX_CONFIG: received. MSG_id=79
I (26061) UI: N locks = 3
I (26071) MQTT CONN: 0x3ffe68d8 - Publish to  *****/************/******/reply | data = {"method":"lock_info_reply","args":[{"mac":"00:00:00:00:00:02","int_id":[1],"time":1},{"mac":"00:00:00:00:00:03","int_id":[],"time":4},{"mac":"**:**:**:**:**:**","int_id":[2],"time":0}]}
D (26091) MQTT_CLIENT: mqtt_enqueue id: 29522, type=3 successful
D (26091) OUTBOX: ENQUEUE msgid=29522, msg_type=3, len=224, size=703
E (26921) MQTT_CLIENT: Read error or end of stream
D (26921) TRANS_SSL: Cleanup mbedtls
I (26921) MQTT_CLIENT: Reconnect after 10000 ms
I (26921) MQTT CONN: Event: DISCONNECTED
I (26951) MQTT TASK: State: DISCONNECTED
I (27011) IO: received. MSG_id=17
W (27011) IO mqtt status: status 0
I (27241) main: received. MSG_id=17
D (41921) MQTT_CLIENT: Reconnecting...
D (46931) TRANS_SSL: Connect to XXXXXXXXXXX:8883
D (46961) TRANS_SSL: Performing the SSL/TLS handshake...
D (49341) TRANS_SSL: Verifying peer X.509 certificate...
D (49341) TRANS_SSL: Certificate verified.
D (49341) TRANS_SSL: Cipher suite is TLS-ECDHE-RSA-WITH-AES-256-GCM-SHA384
D (49351) MQTT_CLIENT: Transport connected to mqtts://XXXXXXXXX:8883
I (49351) MQTT_CLIENT: Sending MQTT CONNECT message, type: 1, id: 0000
D (49651) MQTT_CLIENT: Connected
I (49651) MQTT CONN: Event: CONNECTED

I've checked, and I'm only calling/using the lib from the same single task. Can you help me? Thanks.
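
For reference, my setup boils down to something like the following (a sketch using the IDF 3.x-era API; the broker URI, certificate symbol, topic and helper names are placeholders, not my actual code):

```c
#include "esp_log.h"
#include "mqtt_client.h"

static const char *TAG = "MQTT CONN";

/* Broker CA certificate embedded in the binary (symbol name is illustrative). */
extern const char broker_pem_start[] asm("_binary_broker_pem_start");

static esp_err_t mqtt_event_handler(esp_mqtt_event_handle_t event)
{
    switch (event->event_id) {
    case MQTT_EVENT_CONNECTED:
        ESP_LOGI(TAG, "Event: CONNECTED");
        break;
    case MQTT_EVENT_DISCONNECTED:
        ESP_LOGI(TAG, "Event: DISCONNECTED");
        break;
    default:
        break;
    }
    return ESP_OK;
}

static esp_mqtt_client_handle_t start_mqtt(void)
{
    const esp_mqtt_client_config_t cfg = {
        .uri = "mqtts://broker.example.com:8883",   /* placeholder broker */
        .cert_pem = broker_pem_start,
        .event_handle = mqtt_event_handler,         /* IDF 3.x-style callback */
    };
    esp_mqtt_client_handle_t client = esp_mqtt_client_init(&cfg);
    esp_mqtt_client_start(client);
    return client;
}

/* Called only from the single MQTT task: QoS 2 publish, no retain. */
static void publish_reply(esp_mqtt_client_handle_t client, const char *payload)
{
    esp_mqtt_client_publish(client, "device/reply", payload, 0, 2, 0);
}
```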

bverhoeven commented 5 years ago

I'm running into similar issues where tls_write eventually returns -27648, an mbedtls library error that basically says "context is invalid, destroy it and try again".

From some basic debugging, I've found that this seems to be caused by concurrent calls to tls_write from multiple threads. I can easily reproduce the issue by sending two messages immediately after each other: the responses to those requests don't make it through and the TLS connection breaks instantly, after which the MQTT client eventually reconnects.

It seems that we may both be facing the issue described in https://github.com/espressif/esp-mqtt/issues/90, and maybe others, and while not identical, I do think it basically boils down to the fact that the MQTT client is not yet thread-safe, yet performs some threaded operations itself.

It seems that this recent commit might fix the issue, though: if I understand it correctly, it introduces locking and prevents two threads/tasks from writing at the same time: https://github.com/espressif/esp-mqtt/commit/752953dc3be007cca4255b66a35d3087e61f6a54

Update: I've done some preliminary testing, and this commit seems to fix the issue I was seeing; I am no longer able to break the connection by rapidly sending requests.
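
As a stopgap until that commit is in place (this is my own idea, not something from the library), one can also serialize all application publishes through a single mutex so that two tasks never reach tls_write at the same time. Note that this only covers application-side calls; the client's own task still writes acknowledgements and pings, which is what the referenced commit additionally serializes.

```c
#include "freertos/FreeRTOS.h"
#include "freertos/semphr.h"
#include "mqtt_client.h"

static SemaphoreHandle_t s_publish_lock;   /* created once at startup */

void publish_lock_init(void)
{
    s_publish_lock = xSemaphoreCreateMutex();
}

/* Wrapper so that only one task at a time can enter esp_mqtt_client_publish(). */
int publish_locked(esp_mqtt_client_handle_t client, const char *topic,
                   const char *data, int len, int qos, int retain)
{
    int msg_id = -1;
    if (xSemaphoreTake(s_publish_lock, portMAX_DELAY) == pdTRUE) {
        msg_id = esp_mqtt_client_publish(client, topic, data, len, qos, retain);
        xSemaphoreGive(s_publish_lock);
    }
    return msg_id;
}
```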

homeit-hq commented 5 years ago

More info on the issue:

@bverhoeven we could be onto something...

I had two (similar, but different) issues.

First a little information:

The 1st problem was related to the async nature of the MQTT lib: I was publishing topics, and as soon as the MQTT lib's publish function returned, my MQTT task was ready to accept other data to send.

The 2nd is worse, and not yet resolved.

The connection drops if I receive a QoS 2 message and, during its 4-way handshake, I publish some data (again at QoS 2).

This event should only be triggered after the handshake for the selected QoS has completed.

I will look at the commit you mention, and if it solves this second issue I will backport it to the ESP-MQTT_FOR_IDF_3.1 tag.

But I still think that a different approach is needed...
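
One way to handle the first problem (a sketch with hypothetical helper names, and msg_id matching omitted for brevity) is to gate the next QoS 2 publish on MQTT_EVENT_PUBLISHED, which fires once the PUBREC/PUBREL/PUBCOMP exchange for a message has completed:

```c
#include "freertos/FreeRTOS.h"
#include "freertos/event_groups.h"
#include "mqtt_client.h"

#define PUBLISH_DONE_BIT (1 << 0)

static EventGroupHandle_t s_mqtt_events;   /* created with xEventGroupCreate() at startup */

/* Registered via esp_mqtt_client_config_t.event_handle (IDF 3.x-style callback). */
static esp_err_t mqtt_event_handler(esp_mqtt_event_handle_t event)
{
    if (event->event_id == MQTT_EVENT_PUBLISHED) {
        xEventGroupSetBits(s_mqtt_events, PUBLISH_DONE_BIT);
    }
    return ESP_OK;
}

/* Publish at QoS 2 and block (with a timeout) until the broker has acknowledged it. */
int publish_qos2_blocking(esp_mqtt_client_handle_t client,
                          const char *topic, const char *data)
{
    xEventGroupClearBits(s_mqtt_events, PUBLISH_DONE_BIT);
    int msg_id = esp_mqtt_client_publish(client, topic, data, 0, 2, 0);
    if (msg_id >= 0) {
        xEventGroupWaitBits(s_mqtt_events, PUBLISH_DONE_BIT,
                            pdTRUE, pdTRUE, pdMS_TO_TICKS(5000));
    }
    return msg_id;
}
```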

david-cermak commented 5 years ago

Hi @homeit-hq

It seems the issues you are experiencing are caused by the library not being thread-safe in the referenced version. As @bverhoeven mentioned, this is addressed in recent commits using a mutex to protect the API.

The version you are using with IDF 3.1 has many issues which have been resolved for IDF 3.2, and it's not easy to backport them due to updates in tcp_transport, which became a component in 3.2.

However, the commit https://github.com/espressif/esp-mqtt/commit/752953dc3be007cca4255b66a35d3087e61f6a54 does seem possible to cherry-pick. Have you tried testing with this change?

david-cermak commented 5 years ago

FYI, let me attach here a patch applying this "locking" commit to ESP-MQTT_FOR_IDF_3.1. It seems to solve some problems, so it might be useful for somebody...

0001-added-mqtt-api-locks-so-methods-can-be-executed-from.patch.txt

david-cermak commented 5 years ago

Hi @homeit-hq

Any update on this issue? Have you had a chance to test the patch? Or, better still, to use IDF 3.2?

Xasin commented 4 years ago

Just checking in to say that I am experiencing very similar random disconnects on ESP-IDF v3.3. I'm not sure whether this is a similar issue or the same underlying problem, since I do not use TLS.

I'm running a standard mosquitto broker on the Raspberry Pi and have tried connecting via both MQTT and WebSockets; on both connection types I get this error after a random number of publishes or received messages:

E (282679) MQTT_CLIENT: mqtt_message_receive: transport_read() error: errno=128
E (282679) MQTT_CLIENT: mqtt_process_receive: mqtt_message_receive() returned -1

I do not send a lot of messages (somewhere in the 1 Hz range), but the error appears fairly frequently and is quite a nuisance, since reconnecting takes a while with the default settings and there is no way to change the reconnect delay aside from tampering with mqtt_config.h.
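
The tweak I mean looks roughly like this (a sketch only; the macro name is from the 3.x-era mqtt_config.h and may differ between versions, so check the local header):

```c
/* mqtt_config.h in the esp-mqtt component.
 * The default of (10*1000) matches the "Reconnect after 10000 ms" seen in the logs above. */
#define MQTT_RECONNECT_TIMEOUT_MS   (2 * 1000)
```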

I also sometimes notice that a lot of ESP32s disconnect simultaneously; however, this does NOT happen with any other MQTT client connected to the same broker. Something is definitely still a bit broken in this library, which is a real shame.

wijnsema commented 4 years ago

I have the same problem, disconnects every few hours. Log from my ESP:

I (1569482) MQTT_EXAMPLE: MQTT_EVENT_DATA
TOPIC=temperature/formatted/kantoor
DATA=21.6
E (1569492) MQTT_CLIENT: Read error or end of stream
I (1569492) MQTT_EXAMPLE: MQTT_EVENT_DISCONNECTED
I (1584502) MQTT_EXAMPLE: Other event id:7
I (1584512) MQTT_CLIENT: Sending MQTT CONNECT message, type: 1, id: 0000
I (1584512) MQTT_EXAMPLE: MQTT_EVENT_CONNECTED
I (1584522) MQTT_EXAMPLE: MQTT_EVENT_SUBSCRIBED, msg_id=37393
I (1584522) MQTT_EXAMPLE: MQTT_EVENT_DATA
TOPIC=system/datetime/formattedtime

Notice the Read error or end of stream.

At the same time on my Mosquitto log from my Raspberry Pi:

1570512658: Saving in-memory database to /var/lib/mosquitto/mosquitto.db.
1570514459: Saving in-memory database to /var/lib/mosquitto/mosquitto.db.
1570514913: Socket error on client ESP32_c10560, disconnecting.
1570514928: New connection from 10.1.1.199 on port 1883.
1570514928: New client connected from 10.1.1.199 as ESP32_c10560 (c1, k120).
1570516260: Saving in-memory database to /var/lib/mosquitto/mosquitto.db.

Notice the Socket error on client ESP32_c10560, disconnecting.

Using:

Moved away from knolleary/pubsubclient because that would lock up my ESP32. This library has its own problems though...

wijnsema commented 4 years ago

Let's try test.mosquitto.org then. After 2 hours of running fine, suddenly the watchdog triggered:

I (8592362) MQTT_EXAMPLE: MQTT_EVENT_DATA
TOPIC=system/datetime/formattedtime
DATA=15:17
E (8597362) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
E (8597362) task_wdt:  - IDLE0 (CPU 0)
E (8597362) task_wdt:  - IDLE1 (CPU 1)
E (8597362) task_wdt: Tasks currently running:
E (8597362) task_wdt: CPU 0: rssi
E (8597362) task_wdt: CPU 1: mqtt_task
E (8602362) task_wdt: Task watchdog got triggered. The following tasks did not reset the watchdog in time:
E (8602362) task_wdt:  - IDLE0 (CPU 0)
E (8602362) task_wdt:  - IDLE1 (CPU 1)
E (8602362) task_wdt: Tasks currently running:
E (8602362) task_wdt: CPU 0: rssi
E (8602362) task_wdt: CPU 1: mqtt_task

What's going on here?

Xasin commented 4 years ago

@wijnsema, since this isn't an actual disconnect from MQTT, it's probably worth a separate issue. All disconnects I've experienced were instantaneous and did not affect the task scheduling.

Yours could be some kind of weird deadlock situation between the two threads, but since RSSI is involved it might not be MQTT's fault? Hard to say.

david-cermak commented 4 years ago

@wijnsema @Xasin Thanks for sharing these disconnection problems. Could you please try to manually update the esp-mqtt submodule to the latest master and repeat your test? The IDF 3.3 release still references quite an old commit, and many bugs have been fixed since then, including incorrect reception of messages and a deadlock when closing/reconnecting.

Will soon update the submodules for all IDF releases, as mentioned in https://github.com/espressif/esp-mqtt/issues/135.

wijnsema commented 4 years ago

Thanks for picking this up @david-cermak! I just started testing with master of esp-mqtt within esp-idf 3.3. The test has been running for 10 minutes now; so far so good...

Xasin commented 4 years ago

@david-cermak, I fear my ESP still has problems with the network. I've just had two disconnects close to each other.

Here's the log:

E (282906) MQTT_CLIENT: mqtt_message_receive: transport_read() error: errno=128
E (282906) MQTT_CLIENT: mqtt_process_receive: mqtt_message_receive() returned -1
W (282910) XAQTT: Disconnected from broker!
I (293802) LZR::Core: Battery level: 4209
E (301062) MQTT_CLIENT: Error transport connect
W (301063) XAQTT: Disconnected from broker!
I (311812) LZR::Core: Battery level: 4191
I (313077) XAQTT: Subscribing to Lasertag/Players/BC.DD.C2.D0.63.F8/CFG/#
I (313079) XAQTT: Subscribing to Lasertag/Players/BC.DD.C2.D0.63.F8/Sound/#
I (313081) XAQTT: Subscribing to Lasertag/Players/BC.DD.C2.D0.63.F8/Ping
I (313086) XAQTT: Reconnected and subscribed
I (331610) LZR::Core: Battery level: 4186
I (351415) LZR::Core: Battery level: 4197
I (371199) LZR::Core: Battery level: 4248
I (389212) LZR::Core: Battery level: 4238
...
I (580013) LZR::Core: Battery level: 4208
E (590719) MQTT_CLIENT: mqtt_message_receive: transport_read() error: errno=128
E (590719) MQTT_CLIENT: mqtt_process_receive: mqtt_message_receive() returned -1
W (590724) XAQTT: Disconnected from broker!
I (599798) LZR::Core: Battery level: 4219
E (609076) MQTT_CLIENT: Error transport connect
W (609077) XAQTT: Disconnected from broker!

I made sure to re-flash the ESP with the latest esp-mqtt version pulled from master. The ESP is connected to a WiFi network hosted by a Raspberry Pi Zero, which is also running the MQTT broker (Mosquitto version 1.4.10). The connection uses the standard port (1883), no SSL. No other MQTT clients ever have these disconnects with this broker, and the problem has occurred with other brokers as well.

The problem always seems to occur when sending a packet and can be exacerbated by either sending or receiving more packets in a short amount of time.

The packets I send/receive on a regular basis are mostly QoS 0; however, a few QoS 1 packets are present. I will try capping the subscribe and publish QoS to 0, but I'm afraid I don't know enough about the internals of the library to say what's going on.
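
The QoS cap itself is just the last arguments of the subscribe/publish calls (a sketch; topics follow my logs, and `client` is assumed to be the handle returned by esp_mqtt_client_init()):

```c
#include "mqtt_client.h"

static void use_qos0(esp_mqtt_client_handle_t client, const char *payload)
{
    /* Last argument of subscribe is the requested QoS. */
    esp_mqtt_client_subscribe(client, "Lasertag/Players/BC.DD.C2.D0.63.F8/CFG/#", 0);
    /* Arguments: topic, data, len (0 = use strlen), qos, retain. */
    esp_mqtt_client_publish(client, "Lasertag/Players/BC.DD.C2.D0.63.F8/Ping",
                            payload, 0, 0, 0);
}
```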

david-cermak commented 4 years ago

@Xasin Can you please check the logs from the broker to see the reason for the disconnection? errno=128 means "socket not connected", so it looks like Mosquitto actively disconnects the client (for a reason), perhaps after data is published (frequently), as you say the issue occurs when sending data.

Xasin commented 4 years ago

I'm checking the mosquitto log right now; I just restarted it with log_type all. I'm already noticing that the ESP does not send any PINGREQ to the server, but it is publishing just fine.

1570625334: Sending PUBLISH to ESP32_d063F8 (d0, q0, r0, m0, 'Lasertag/Players/BC.DD.C2.D0.63.F8/Ping', ... (4 bytes))
1570625345: Received PINGREQ from mqtt_a91d6d8.724ad9
1570625345: Sending PINGRESP to mqtt_a91d6d8.724ad9
1570625349: Client ESP32_d063F8 has exceeded timeout, disconnecting.
1570625349: Socket error on client ESP32_d063F8, disconnecting.
1570625349: Sending PUBLISH to local.zero-mobile.test-bridge (d0, q1, r1, m38645, 'Lasertag/Players/BC.DD.C2.D0.63.F8/Connection', ... (0 bytes))

Idea: the ESP32 does not send any PINGREQ while publishing with normal PUBLISH packets. Perhaps the Mosquitto broker misinterprets this and triggers a timeout, even though it is still receiving activity via the PUBLISHes? If so, I'd just interject a regular PINGREQ every few seconds or so.

Although, I am just noticing in the timestamps of the log... a PUBLISH from the ESP happens every three seconds, yet the last PUBLISH before the disconnect was 15 seconds before the timeout! It seems that something in the ESP library stops sending packets.

david-cermak commented 4 years ago

This should not be an issue, as the control packets seem frequent enough that PINGREQs are not needed (see the keep-alive part of the spec). The keep-alive interval itself comes from the client configuration; see the sketch at the end of this comment.

So you are actually saying the messages are published with no errors according to the ESP log, but some of them don't appear in the broker's log -- so they are not delivered, correct?
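
For reference, this is where the keep-alive interval is set (a sketch; the URI is a placeholder). The client only needs to send a PINGREQ if no other control packet has gone out within this window:

```c
#include "mqtt_client.h"

const esp_mqtt_client_config_t cfg = {
    .uri = "mqtt://192.0.2.1:1883",  /* placeholder broker address */
    .keepalive = 120,                /* seconds; if left unset, the library default is used */
};
```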

wijnsema commented 4 years ago

Just an interim score from my test lab: no disconnects after 4 hours; this version definitely seems to be an improvement!

I will keep this running for now.

Xasin commented 4 years ago

It seems so, yes. I have turned up the MQTT module's log level a bit to debug, and ...

D (272205) MQTT_CLIENT: mqtt_message_receive: read_len=45
D (272208) MQTT_CLIENT: mqtt_message_receive: transport_read():47 47
D (272211) MQTT_CLIENT: msg_type=3, msg_id=0
D (272214) MQTT_CLIENT: deliver_publish, message_length_read=47, message_length=47
D (272219) MQTT_CLIENT: deliver_publish: msg_topic_len=39
D (272222) MQTT_CLIENT: Get data len= 4, topic len=39, total_data: 4 offset: 0
D (274010) MQTT_CLIENT: mqtt_message_receive: first byte: 0x30
D (274010) MQTT_CLIENT: mqtt_message_receive: read "remaining length" byte: 0x2d
D (274011) MQTT_CLIENT: mqtt_message_receive: total message length: 47 (already read: 2)
D (274016) MQTT_CLIENT: mqtt_message_receive: read_len=45
D (274019) MQTT_CLIENT: mqtt_message_receive: transport_read():47 47
D (274023) MQTT_CLIENT: msg_type=3, msg_id=0
D (274026) MQTT_CLIENT: deliver_publish, message_length_read=47, message_length=47
D (274030) MQTT_CLIENT: deliver_publish: msg_topic_len=39
D (274034) MQTT_CLIENT: Get data len= 4, topic len=39, total_data: 4 offset: 0
E (282715) MQTT_CLIENT: mqtt_message_receive: transport_read() error: errno=128
E (282715) MQTT_CLIENT: mqtt_process_receive: mqtt_message_receive() returned -1
D (282719) MQTT_CLIENT: Reconnect after 100 ms
W (282721) XAQTT: Disconnected from broker!
D (282824) MQTT_CLIENT: Reconnecting...
I (284799) LZR::Core: Battery level: 4202

The ESP MQTT layer definitely stops sending out messages for 15 seconds, including any keep-alive signals. It does not seem to be at fault, though. I will take a look at my own code (it might hang on something and stop sending messages for 15 s), but even then the internal MQTT library should send a PINGREQ at regular intervals to keep the connection alive.

There does not seem to be a packet delivery failure though, which makes this even weirder.

Xasin commented 4 years ago

I (275797) TEST: Sending ping req!
I (277587) XAQTT: Publishing to Lasertag/Players/BC.DD.C2.D0.63.F8/Ping
I (277588) TEST: Sending ping req!
I (279386) XAQTT: Publishing to Lasertag/Players/BC.DD.C2.D0.63.F8/Ping
I (279387) TEST: Sending ping req!
I (281208) XAQTT: Publishing to Lasertag/Players/BC.DD.C2.D0.63.F8/Ping
I (281209) TEST: Sending ping req!
E (282641) MQTT_CLIENT: mqtt_message_receive: transport_read() error: errno=128
E (282641) MQTT_CLIENT: mqtt_process_receive: mqtt_message_receive() returned -1
W (282645) XAQTT: Disconnected from broker!
I (284790) LZR::Core: Battery level: 4215

Nope, my own code seems to be working fine. I should probably check the ESP's return status for the publish function, but there should have been a debug log about a packet being dropped for whatever reason.

Edit: I just ran a check on the return value of the MQTT publish function, and it does not return any error. It seems as if MQTT packets are silently dropped and sending them isn't even attempted (since there were no debug messages about sending a packet). Maybe it's because I am publishing with QoS 0, but then it would have to be a network error eating them up; right now, however, it really just looks like the MQTT library silently drops them and doesn't send a PINGREQ instead.
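
For reference, the return-value check looks roughly like this (topic taken from my logs, helper name is illustrative): esp_mqtt_client_publish() returns the message id on success (always 0 for QoS 0) and -1 on failure, so a refused publish would show up here.

```c
#include "esp_log.h"
#include "mqtt_client.h"

static const char *TAG = "XAQTT";

static void publish_ping(esp_mqtt_client_handle_t client, const char *payload)
{
    int msg_id = esp_mqtt_client_publish(client,
            "Lasertag/Players/BC.DD.C2.D0.63.F8/Ping", payload, 0, 0, 0);
    if (msg_id < 0) {
        ESP_LOGE(TAG, "Publish failed (msg_id=%d)", msg_id);
    }
}
```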

david-cermak commented 4 years ago

@Xasin This library does not silently drop packets; if no error message appears in the log, it means the published data were written OK to the socket. One exception, though: when the client is not connected and QoS > 0, messages are not sent out to the network but are just stored in the outbox (to be sent later) -- but that's not your case, as the client IS connected.

Could you please insert a debug message into mqtt_write_data() to check that the data are successfully written to the network? (It probably makes sense to have debug/verbose-level logging here -- I will update the library!) I would expect the MQTT library to write successfully to the socket. If it does, the next step would be to check the socket layer (enabling socket debugging... SOCKETS_DEBUG = LWIP_DBG_ON...).
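
A sketch of what I mean (only the ESP_LOGD line is new; the surrounding call and variable names are assumptions about the local mqtt_client.c and may differ between versions -- in the 3.1-era sources the transport call is transport_write() rather than esp_transport_write()):

```c
/* Inside mqtt_write_data() in mqtt_client.c */
int write_len = esp_transport_write(client->transport,
                                    (char *)client->mqtt_state.outbound_message->data,
                                    client->mqtt_state.outbound_message->length,
                                    client->config->network_timeout_ms);
/* Added debug line: confirm how many bytes actually reached the transport layer. */
ESP_LOGD(TAG, "mqtt_write_data: wrote %d of %d bytes, errno=%d",
         write_len, client->mqtt_state.outbound_message->length, errno);
```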

wijnsema commented 4 years ago

After running for more than 22 hours without disconnects, I dare say this is stable now.

Hopefully the updates and fixes to this library quickly make it into the mainstream versions of ESP-IDF, because frankly, without them this library is unusable IMHO.

Thanks for the help @david-cermak, and @Xasin, good luck with fixing your disconnects.

Xasin commented 4 years ago

@david-cermak

I (273988) MQTT_CLIENT: Calling MQTT write...
D (273993) MQTT_CLIENT: MQTT Pub finished!
I (273994) TEST: Sending ping req!
V (274000) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (274100) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (274200) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (274301) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (274401) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (274501) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (274601) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (274701) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (274801) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (274901) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (275001) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (275101) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (275201) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
I (275201) MQTT_CLIENT: Calling MQTT write...
D (275202) MQTT_CLIENT: Sent PING successful
D (275204) MQTT_CLIENT: PING sent
V (275306) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (275407) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (275507) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (275607) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (275707) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
I (275789) XAQTT: Publishing to Lasertag/Players/BC.DD.C2.D0.63.F8/Ping
D (275789) MQTT_CLIENT: Starting send with 4 bytes
I (275789) MQTT_CLIENT: Calling MQTT write...
D (275793) MQTT_CLIENT: MQTT Pub finished!
I (275795) TEST: Sending ping req!
V (275807) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (275907) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (276007) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (276107) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (276207) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (276307) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (276407) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (276507) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (276607) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
V (276707) MQTT_CLIENT: mqtt_message_receive: transport_read(): no data or EOF
E (276788) MQTT_CLIENT: mqtt_message_receive: transport_read() error: errno=128
E (276789) MQTT_CLIENT: mqtt_process_receive: mqtt_message_receive() returned -1
D (276792) MQTT_CLIENT: Reconnect after 100 ms
W (276794) XAQTT: Disconnected from broker!
D (276897) MQTT_CLIENT: Reconnecting...

A first assessment shows that the MQTT library might not be at fault. Transmission of data happens all the way up to the sudden disconnect, with all functions, including mqtt_write_data, being called appropriately and not catching any errors. The last PING sent by the library was at 275200 and the disconnect was at 276700 (about 1.6 s apart), with the keep-alive set to 5 s.

I'll try pointing the MQTT connection at a public test broker to rule out the server as the source of the problem, but keep in mind that no other MQTT client suffers these kinds of disconnects on the exact same broker.

Xasin commented 4 years ago

Alright, I think I have to look for a different culprit here. I've switched the connection to a different MQTT server, which didn't cause any further problems. Moreover, I've used a different WiFi network that had access to the same (formerly faulty) server, and I haven't encountered any problems yet.

This could be a weird artefact of the WiFi network I am using, in which the ESP does not actually lose its WiFi connection but does lose the connection to the MQTT server itself.

For clarification, the WiFi AP I am using is the network hosted by the Raspberry Pi Zero itself, which is running in dual AP/STA mode and is simultaneously connected to the local WiFi (giving it internet access). It could be that this causes problems with the ESP; I'll have to play around more. It would definitely explain the simultaneous disconnects of multiple ESPs at once.

Sorry for the inconvenience, and thanks a lot for the good help here. If anything comes up that is definitively the MQTT library's fault, I will poke my nose back in!

david-cermak commented 4 years ago

Thank you @Xasin and @wijnsema for your updates; good to hear that the random disconnects got resolved! Just a note about the manual submodule update -- this works only for the tcp and ssl transports, not for websockets, which depends on other fixes in IDF (the tcp_transport component). If the ws or wss transport is needed, please wait until the referencing commits get merged/backported to the IDF release branches.

david-cermak commented 4 years ago

Closing, as this was resolved and the referenced commit was backported to all IDF releases.