RIOT-OS / RIOT

RIOT - The friendly OS for IoT
https://riot-os.org
GNU Lesser General Public License v2.1

cpu/esp8266: Tracking open problems of esp_wifi netdev driver #10861

Open gschorcht opened 5 years ago

gschorcht commented 5 years ago

Description

During stress tests of the esp_wifi module on the esp8266, including deauthentication attacks, the following issues occurred sporadically:

  1. Reconnecting may fail after deauthentication and lead to a system crash while excessive traffic is being sent to the esp8266. If the AP sends a deauthentication, the esp8266 tries to reconnect automatically. Under normal network load, the reconnect works as expected. However, if excessive traffic is being sent to the esp8266, it cannot reconnect and keeps retrying until the memory is exhausted and it crashes. The memory seems to be consumed by the Espressif SDK.

    [esp_wifi] disconnected from ssid BSHS1, reason 7 (ASSOCED)
    [esp_wifi] heap: 15416 (used 5928, free 9488)
    [esp_wifi] disconnected from ssid BSHS1, reason 202 (FAIL)
    [esp_wifi] heap: 15416 (used 6128, free 9288)
    [esp_wifi] disconnected from ssid BSHS1, reason 2 (AUTH_EXPIRE)
    [esp_wifi] heap: 15416 (used 7568, free 7848)
    [esp_wifi] disconnected from ssid BSHS1, reason 2 (AUTH_EXPIRE)
    [esp_wifi] heap: 15416 (used 10576, free 4840)
    [esp_wifi] disconnected from ssid BSHS1, reason 2 (AUTH_EXPIRE)
    [esp_wifi] heap: 15416 (used 13584, free 1832)
    [esp_wifi] trying to reconnect to ssid BSHS1
    heap: 15416 (used 14936, free 480)
    E:M 40

    The problem might be related to problem 6.

  2. ~The send function may block completely under very heavy network load. Disconnecting and reconnecting helps sometimes, but not always; otherwise, the esp8266 has to be rebooted.~ Solved with PR #10862.

  3. ~Sporadically, a LoadProhibitedCause exception occurs under very heavy network load.~ Seems to be solved by PR #10869.

  4. ~The GNRC packet buffer fills up under very heavy network load because packets get stuck in it. Communication with the esp8266 is then no longer possible. The packet buffer can be inspected with the pktbuf shell command provided by the gnrc_pktbuf_cmd module.~ Seems to be solved by PR #10862.

  5. ~Sporadically, the error message dev 1500 appears under very heavy network load, and the esp8266 then crashes with a LoadProhibitedCause exception.~ Seems to be solved by PR #10869.

  6. Connecting to the access point while excessive traffic is being sent to the esp8266 often fails, and the error message LmacRxBlk:1 appears repeatedly. The esp8266 is then completely unusable and has to be reset. This might be related to problem 1, where reconnecting is attempted while excessive traffic is being sent to the esp8266.

    The problem can be reproduced if at least one host pings the esp8266 with the maximum data size and an interval of 0 while the esp8266 is trying to connect to the AP. Start pinging first, then reset the esp8266.

    According to online sources, the error message LmacRxBlk:1 means that the internal MAC-layer buffer has overflowed. The problem normally occurs when an interrupt service routine takes longer than the allowed 10 µs. It may also be that the esp8266 simply lacks the performance to handle such a large number of frames while connecting, see https://github.com/peterhinch/micropython-mqtt/issues/3#issuecomment-354245006.

    As things stand, this problem cannot be solved with the means provided by the SDK.
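The heap lines in the problem 1 log can be cross-checked with simple arithmetic: in every line, used + free equals the 15416-byte total, and the used heap grows by 200, 1440, 3008, 3008 and 1352 bytes between consecutive lines, about 9 kB in total. The host-side C sketch below only re-derives those numbers from the logged values; the names `heap_sample`, `heap_log_consistent` and `heap_delta` are hypothetical and not part of RIOT or the Espressif SDK.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* One "heap:" line from the problem 1 log (values in bytes).
 * Hypothetical helper type, not a RIOT or SDK structure. */
struct heap_sample {
    unsigned total;
    unsigned used;
    unsigned free_bytes;
};

/* Values copied from the log, in the order they were printed. */
static const struct heap_sample heap_log[] = {
    { 15416,  5928, 9488 },
    { 15416,  6128, 9288 },
    { 15416,  7568, 7848 },
    { 15416, 10576, 4840 },
    { 15416, 13584, 1832 },
    { 15416, 14936,  480 },
};
#define HEAP_LOG_LEN (sizeof(heap_log) / sizeof(heap_log[0]))

/* True if used + free equals the reported total in every sample. */
static bool heap_log_consistent(const struct heap_sample *s, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        if (s[i].used + s[i].free_bytes != s[i].total) {
            return false;
        }
    }
    return true;
}

/* Growth of the used heap from sample i - 1 to sample i (i >= 1). */
static long heap_delta(const struct heap_sample *s, size_t i)
{
    assert(i >= 1);
    return (long)s[i].used - (long)s[i - 1].used;
}
```

For example, `heap_delta(heap_log, 3)` yields 3008; the largest jumps, 3008 bytes each, occur between the successive AUTH_EXPIRE disconnects.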

Steps to reproduce the issue

Ping one esp8266 node from three different machines with different data sizes as fast as possible:

term1> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0
term2> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s512 -i 0
term3> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s52 -i 0

Expected results

All of the problems above occur only under very heavy network load. Under normal conditions, esp_wifi works stably, for example under the following conditions:

term1> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s1392 -i 0.15
term2> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s512 -i 0.15
term3> sudo ping6 fe80::5ecf:7fff:fe80:3f08 -Ieth0 -s52 -i 0.05
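To put these "normal conditions" into numbers, the offered load of the three ping streams can be estimated. The sketch below is a rough model under stated assumptions: it counts only echo requests at the IPv6 layer (payload from `-s` plus an assumed 8-byte ICMPv6 header and 40-byte IPv6 header), ignores 802.11 framing and the echo replies the esp8266 sends back, and assumes one request per `-i` interval. All helper names are hypothetical. The `-i 0` runs used to reproduce the problems have no fixed interval, so this model does not apply to them.

```c
#include <assert.h>
#include <stddef.h>

/* Assumed per-request overhead on top of the ping payload (-s).
 * Link-layer (802.11) framing is deliberately ignored. */
#define ICMPV6_HDR_LEN 8.0
#define IPV6_HDR_LEN   40.0

/* One ping6 invocation: payload size (-s) and send interval (-i). */
struct ping_stream {
    double payload_bytes;
    double interval_s;
};

/* IPv6-layer bytes per second of echo requests for one stream. */
static double stream_rate(const struct ping_stream *p)
{
    assert(p->interval_s > 0.0);  /* -i 0 has no fixed rate */
    return (p->payload_bytes + ICMPV6_HDR_LEN + IPV6_HDR_LEN) / p->interval_s;
}

/* Echo requests per second for one stream. */
static double stream_pps(const struct ping_stream *p)
{
    assert(p->interval_s > 0.0);
    return 1.0 / p->interval_s;
}

/* The three streams from the "Expected results" commands. */
static const struct ping_stream normal_load[] = {
    { 1392.0, 0.15 },
    {  512.0, 0.15 },
    {   52.0, 0.05 },
};
#define NORMAL_LOAD_LEN (sizeof(normal_load) / sizeof(normal_load[0]))

/* Total bytes per second over all streams. */
static double total_rate(const struct ping_stream *p, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += stream_rate(&p[i]);
    }
    return sum;
}

/* Total echo requests per second over all streams. */
static double total_pps(const struct ping_stream *p, size_t n)
{
    double sum = 0.0;
    for (size_t i = 0; i < n; i++) {
        sum += stream_pps(&p[i]);
    }
    return sum;
}
```

Under these assumptions, the normal-load setup offers roughly 33 echo requests and about 15.3 kB (around 123 kbit/s) of request traffic per second.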

benpicco commented 3 years ago

I noticed that with esp_now the esp8266 would lock up after a few minutes (when connected to a border router). It only prints

2021-05-03 15:13:17,642 # scandone
2021-05-03 15:13:27,381 # LmacRxBlk:0
2021-05-03 15:13:28,382 # LmacRxBlk:0
2021-05-03 15:13:29,383 # LmacRxBlk:0
2021-05-03 15:13:30,384 # LmacRxBlk:0
2021-05-03 15:13:31,384 # LmacRxBlk:0
2021-05-03 15:13:32,385 # LmacRxBlk:0
2021-05-03 15:13:33,386 # LmacRxBlk:0
2021-05-03 15:13:34,387 # LmacRxBlk:0
2021-05-03 15:13:35,387 # LmacRxBlk:0
2021-05-03 15:13:36,388 # LmacRxBlk:0

and does not react to shell input anymore.

(This can be triggered by running ping -f against the esp8266's address.)
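Both problem 6 and the comment above show a stream of LmacRxBlk messages, which the issue attributes to an overflow of the esp8266's internal MAC-layer receive buffer. The C sketch below is a generic, hypothetical model of a bounded receive queue, not the Espressif SDK's implementation; the capacity `RX_SLOTS` and all names are made up for illustration. It only shows the mechanism: as long as frames are drained as fast as they arrive, nothing is lost, but once arrivals exceed the drain rate the queue stays full and a frame is dropped on every step.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical capacity, unrelated to the SDK's real buffer count. */
#define RX_SLOTS 4u

struct rx_queue {
    unsigned head;      /* index of the oldest queued frame */
    unsigned count;     /* frames currently queued */
    unsigned dropped;   /* frames rejected because the queue was full */
    int slots[RX_SLOTS];
};

/* Producer side (e.g. a receive interrupt): enqueue, or drop if full. */
static bool rx_push(struct rx_queue *q, int frame)
{
    if (q->count == RX_SLOTS) {
        q->dropped++;   /* in this model, the analogue of an overflow */
        return false;
    }
    q->slots[(q->head + q->count) % RX_SLOTS] = frame;
    q->count++;
    return true;
}

/* Consumer side (e.g. a network thread): dequeue one frame if any. */
static bool rx_pop(struct rx_queue *q, int *frame)
{
    if (q->count == 0) {
        return false;
    }
    if (frame != NULL) {
        *frame = q->slots[q->head];
    }
    q->head = (q->head + 1) % RX_SLOTS;
    q->count--;
    return true;
}

/* Run `steps` rounds; in each, `arrivals` frames arrive and at most
 * `drains` frames are consumed. Returns the number of dropped frames. */
static unsigned rx_simulate(unsigned steps, unsigned arrivals, unsigned drains)
{
    struct rx_queue q = { 0 };
    int next_frame = 0;

    for (unsigned t = 0; t < steps; t++) {
        for (unsigned a = 0; a < arrivals; a++) {
            (void)rx_push(&q, next_frame++);
        }
        for (unsigned d = 0; d < drains; d++) {
            if (!rx_pop(&q, NULL)) {
                break;  /* queue empty */
            }
        }
    }
    return q.dropped;
}
```

For example, `rx_simulate(100, 2, 2)` returns 0, while `rx_simulate(10, 3, 2)` returns 8: the queue fills within the first two steps and then drops one frame per step.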