Open djangoa opened 6 months ago
Further to my initial comments, I've caught the select() timeout issue, with more debugging enabled (CONFIG_LWIP_DEBUG
, CONFIG_LWIP_ETHARP_DEBUG
, CONFIG_LWIP_SOCKETS_DEBUG
, CONFIG_LWIP_IP_DEBUG
, CONFIG_LWIP_TCP_DEBUG
), both during and after connecting to a MQTT broker:
timeout-pre-conn_debug.log timeout-post-conn_debug.log timeout_tcp_debug.log
In the last log, additionally TCP_INPUT_DEBUG
, TCP_OUPUT_DEBUG
is enabled and around Line 430 the device has a timeout. It looks like it's receiving broadcasts from another IP but not from the broker. So maybe if I'm interpreting the log correctly the TWT teardown is a red herring unless directly after the setting up TWT. I think I'm also suffering from https://github.com/espressif/esp-idf/issues/6607#issuecomment-799239990 where LWIP_TCP_TMR_INTERVAL
, LWIP_TCP_RTO_TIME
are set too high by default (750ms) for the first SYN-SENT to be retransmitted before my application invokes it's own mqtt_client event timeout. It would be good to know if there is any expectation of having a delay between tearing down a TWT and the wifi fabric becoming usable for connections?
Regarding the power issue I've adapted https://github.com/espressif/esp-idf/issues/4628#issuecomment-575535394 to suspend the lwip tcpip task before auto light sleep and wake it after. This has resolved the current spikes due to tcp_slowtmr
and etharp_timer
but I would have thought that auto light sleep should do this automatically unless there's some technical reason this shouldn't be done? I'm getting around 50uA, at the board level, in light sleep now:
I'm still experiencing issues though with auto light sleep being unreliable (with & without the suspending the lwip tcpip task) and sometimes consuming 16mA and other times 38mA:
On the topic of very high power consumption in auto light sleep after TWT setup, I'm seeing this with debugging on:
D (904564) sleep: RTC_PERIPH: AUTO(OFF)
...
D (904660) sleep: RTC_PERIPH: AUTO(OFF)
D (904669) sleep: RTC_PERIPH: AUTO(OFF)
D (904678) sleep: RTC_PERIPH: AUTO(OFF)
D (904687) sleep: RTC_PERIPH: AUTO(OFF)
D (904696) sleep: RTC_PERIPH: AUTO(OFF)
...
D (955206) sleep: RTC_PERIPH: AUTO(OFF)
Seems to get stuck in a loop until waking from auto sleep and occurs intermittently.
Hi @djangoa, Thanks for reporting. For the very high power consumption, Did you just run it on itwt example without adding other code? I will try to reproduce it in my testbed. And in previous tests, this auto-light sleep was normal.
@xuxiao111 Hi thanks for replying. No I've not yet tried isolating the issue in the itwt example. I think we really new a TWT example that's more realistic of how it might be used in an application that transfers data over wifi to catch a lot of these issues I'm having.
I tried isolating and couldn't replicate, I carefully looked through sdkconfig and found for some reason these were enabled:
CONFIG_PM_SLP_IRAM_OPT=y CONFIG_PM_RTOS_IDLE_OPT=y
Disabling them has resolved the intermittent high current draw I was experiencin, apologies for going off on a tangent.
In the process, I adapted the original itwt example to suspend/ resume the lwip tcpip task thus reducing current consumption to that advertised for the C6. I also enhanced event handling and added a loop to test setting up and tearing down TWT agreements. Might be useful for other devs to see how the logic of TWT compares to deep sleep and could be the basis for a future update to the existing trivial example:
The FIXME:
in my enhanced example line:396 vTaskDelay();
reproduces reliable a teardown bug, with 46875 << 13 service period, at least on/with my network/AP. With no delay between tearing down and setting up a TWT, the teardown fails with a ESP_ERR_INVALID_ARG:
E (32092) itwt: itwt teardown failed, err:0x102 ESP_ERROR_CHECK failed: esp_err_t 0x102 (ESP_ERR_INVALID_ARG) at 0x4200bbc2 file: "./main/itwt.c" line 382
either caused by the previous setup using a new flow_id or the teardown causing an internal wifi error:
E (32240) wifi:(itwt)twt id:0 is already exist, please change to another value
I assume this might be a timing issues with TWT and is probably related to my own application having intermittent connection timeouts after tearing down a TWT and trying to use wifi too quickly. I reiterate that it would be good to know if there is any expectation of having a delay between tearing down a TWT and the wifi fabric becoming usable for connections?
@xuxiao111: Hopefully that gives you more to go off in tracking things down.
@djangoa Hi, I just reproduce itwt-enchanced-example.zip in my testbed, and I found that we should call "xEventGroupClearBits(wifi_event_group, ITWT_SETUP_BIT | ITWT_TEARDOWN_BIT);" after "itwt setup and itwt teardown". After that you can remove "line:396 vTaskDelay();"
@xuxiao111 Sorry to waste your time, I should have been more careful :(
Could you also test this version that integrates the ping example: itwt-ping-example.zip
If you could try randomly resetting the device, in addition to with / without DISABLE_PM_WHEN_ACTIVE
and see if you notice any strange behaviour like I'm seeing, that would be great:
I (113436) itwt: <WIFI_EVENT_ITWT_TEARDOWN> flow_id 1
D (113409) wifi:mms: 0->0
I (113438) wifi:Set ps type: 0, coexist: 0
I (113441) itwt: <PING_EVENT_SUCCESS> 56 bytes from 192.168.2.1 icmp_seq=1 ttl=64 time=2 ms
D (113512) wifi:mms: 0->0
I (113691) itwt: <PING_EVENT_SUCCESS> 56 bytes from 192.168.2.1 icmp_seq=2 ttl=64 time=2 ms
D (113716) wifi:mms: 0->0
I (113941) itwt: <PING_EVENT_SUCCESS> 56 bytes from 192.168.2.1 icmp_seq=3 ttl=64 time=2 ms
D (114024) wifi:mms: 0->0
I (114192) itwt: <PING_EVENT_SUCCESS> 56 bytes from 192.168.2.1 icmp_seq=4 ttl=64 time=3 ms
D (114331) wifi:mms: 0->0
I (114439) itwt: <PING_EVENT_END> 4 packets transmitted, 4 received, 0% packet loss, time 9ms
I (114439) wifi:Set ps type: 1, coexist: 0
W (114513) wifi:not hit beacon in 65535 sp(s)
D (114513) wifi:mms: 0->2
D (114514) event: running post WIFI_EVENT:28 with handler 0x4200b40a and context 0x40833ff4 on loop 0x4087e5a8
I (114514) itwt: <WIFI_EVENT_ITWT_SETUP> twt_id:0, flow_id:1, trigger-enabled, announced, wake_dura:255, wake_dura_unit:0, wake_invl_e:13, wake_invl_m:46875
I (114515) itwt: <WIFI_EVENT_ITWT_SETUP> target wake time:2771322503168, wake duration:65280 us, service period:384000000 us
D (114517) sleep: RTC_PERIPH: AUTO(OFF)
D (114638) sleep: RTC_PERIPH: AUTO(OFF)
D (115442) sleep: RTC_PERIPH: AUTO(OFF)
D (124520) wifi:mms: 2->0
D (124521) event: running post WIFI_EVENT:29 with handler 0x4200b3a4 and context 0x40834024 on loop 0x4087e5a8
I (124570) itwt: <WIFI_EVENT_ITWT_TEARDOWN> flow_id 1
I (124571) wifi:Set ps type: 0, coexist: 0
D (124605) wifi:mms: 0->0
E (124773) itwt: <PING_EVENT_TIMEOUT> From 192.168.2.1 icmp_seq=1 timeout
D (124913) wifi:mms: 0->0
E (125023) itwt: <PING_EVENT_TIMEOUT> From 192.168.2.1 icmp_seq=2 timeout
D (125117) wifi:mms: 0->0
D (125220) wifi:mms: 0->0
E (125273) itwt: <PING_EVENT_TIMEOUT> From 192.168.2.1 icmp_seq=3 timeout
E (125522) itwt: <PING_EVENT_TIMEOUT> From 192.168.2.1 icmp_seq=4 timeout
D (125527) wifi:mms: 0->0
E (125572) itwt: <PING_EVENT_END> 4 packets transmitted, 0 received, 100% packet loss, time 797mn
If not then it's probably an issue with my AP or network and we can close. Thanks again for your support and time looking into this.
Also what does the "i" stand for in "itwt"? :)
Hi @djangoa, I have checked the itwt-ping-example.zip, seems like the ping timeout time is too small, advises adding it to 1000ms, and try it again.
@xuxiao111 Hi again, I reduced the timeouts so you don't have to sit and wait for a long time for the intermittent issue to arise. I've tried with higher values the same still happens just the that timeout lasts longer; 1 second or 10 seconds etc. it doesn't matter.
The physical and the data link layer of the Wifi doesn't seem to either retransmit packets or able to receive for long durations. My expectations are that maybe missing one or two pings is normal but to get 100% packet loss over consecutive pings is not and does not happen to other devices attached to the AP.
Hi @djangoa, Can you help capture the ping frame sent by ESP32C6 (Wireshark)? That we can know whether the AP received the ping request but did not send the ping reply, or the station could not send any frame during this period.
And the 'i' stand means individual TWT.
Hi @xuxiao111,
Apologies for the late reply. I have ssh remote captured from the Ubiquity AP using tcpdump and the packets never arrive as shown below. My observations; resetting the device seems to cause the intermittent connectivity issue to occur more regularly often with no packets getting through. Other devices attached to the same SSID on that AP I have tested have ~<0.1% loss with 99.9% of RTTs under 50ms (this is a local LAN after all).
I (2008) esp_netif_handlers: sta ip: 192.168.2.20, mask: 255.255.255.0, gw: 192.168.2.1
D (2009) event: running post IP_EVENT:0 with handler 0x4200b6e2 and context 0x40833fc4 on loop 0x4087e5a8
W (2083) wifi:not hit beacon in 65535 sp(s)
D (2083) wifi:mms: 0->2
D (2084) wifi:mms: 2->2
D (2085) event: running post WIFI_EVENT:28 with handler 0x4200b40a and context 0x40833ff4 on loop 0x4087e5a8
I (2085) itwt: <WIFI_EVENT_ITWT_SETUP> twt_id:0, flow_id:1, trigger-enabled, announced, wake_dura:255, wake_dura_unit:0, wake_invl_e:13, wake_invl_m:46875
I (2086) itwt: <WIFI_EVENT_ITWT_SETUP> target wake time:3348043399168, wake duration:65280 us, service period:384000000 us
D (2089) wifi:mms: 2->0
D (2090) event: running post WIFI_EVENT:29 with handler 0x4200b3a4 and context 0x40834024 on loop 0x4087e5a8
I (2090) itwt: <WIFI_EVENT_ITWT_TEARDOWN> flow_id 1
I (2091) wifi:Set ps type: 0, coexist: 0
W (2108) wifi:<ba-add>idx:0, ifx:0, tid:6, TAHI:0x1000fd3, TALO:0x6d322260, (ssn:2, win:64, cur_ssn:2), CONF:0xc0006005
I (2123) itwt: <PING_EVENT_SUCCESS> 56 bytes from 192.168.2.2 icmp_seq=1 ttl=128 time=30 ms
W (2126) wifi:<ba-add>idx:1, ifx:0, tid:0, TAHI:0x1000fd3, TALO:0x6d322260, (ssn:1, win:64, cur_ssn:1), CONF:0xc0000005
D (2157) wifi:mms: 0->0
I (2368) itwt: <PING_EVENT_SUCCESS> 56 bytes from 192.168.2.2 icmp_seq=2 ttl=128 time=26 ms
D (2464) wifi:mms: 0->0
D (2567) wifi:mms: 0->0
D (2771) wifi:mms: 0->0
E (2792) itwt: <PING_EVENT_TIMEOUT> From 192.168.2.2 icmp_seq=3 timeout
D (2976) wifi:mms: 0->0
E (3042) itwt: <PING_EVENT_TIMEOUT> From 192.168.2.2 icmp_seq=4 timeout
D (3079) wifi:mms: 0->0
E (3092) itwt: <PING_EVENT_END> 4 packets transmitted, 2 received, 50% packet loss, time 454mn
I (3092) wifi:Set ps type: 1, coexist: 0
W (3165) wifi:not hit beacon in 65535 sp(s)
D (3165) wifi:mms: 0->2
D (3165) event: running post WIFI_EVENT:28 with handler 0x4200b40a and context 0x40833ff4 on loop 0x4087e5a8
I (3166) itwt: <WIFI_EVENT_ITWT_SETUP> twt_id:0, flow_id:1, trigger-enabled, announced, wake_dura:255, wake_dura_unit:0, wake_invl_e:13, wake_invl_m:46875
I (3167) itwt: <WIFI_EVENT_ITWT_SETUP> target wake time:3348044513280, wake duration:65280 us, service period:384000000 us
D (3169) sleep: RTC_PERIPH: AUTO(OFF)
D (3266) sleep: RTC_PERIPH: AUTO(OFF)
D (4095) sleep: RTC_PERIPH: AUTO(OFF)
D (13172) wifi:mms: 2->0
D (13173) event: running post WIFI_EVENT:29 with handler 0x4200b3a4 and context 0x40834024 on loop 0x4087e5a8
I (13222) itwt: <WIFI_EVENT_ITWT_TEARDOWN> flow_id 1
I (13223) wifi:Set ps type: 0, coexist: 0
D (13261) wifi:mms: 0->0
I (13272) itwt: <PING_EVENT_SUCCESS> 56 bytes from 192.168.2.2 icmp_seq=1 ttl=128 time=48 ms
D (13363) wifi:mms: 0->0
D (13670) wifi:mms: 0->0
E (13674) itwt: <PING_EVENT_TIMEOUT> From 192.168.2.2 icmp_seq=2 timeout
I (13734) itwt: <PING_EVENT_SUCCESS> 56 bytes from 192.168.2.2 icmp_seq=3 ttl=128 time=9 ms
D (13978) wifi:mms: 0->0
I (13987) itwt: <PING_EVENT_SUCCESS> 56 bytes from 192.168.2.2 icmp_seq=4 ttl=128 time=13 ms
E (14224) itwt: <PING_EVENT_END> 4 packets transmitted, 3 received, 25% packet loss, time 269mn
I (14224) wifi:Set ps type: 1, coexist: 0
W (14297) wifi:not hit beacon in 65535 sp(s)
D (14297) wifi:mms: 0->2
D (14297) wifi:mms: 2->2
D (14298) event: running post WIFI_EVENT:28 with handler 0x4200b40a and context 0x40833ff4 on loop 0x4087e5a8
I (14299) itwt: <WIFI_EVENT_ITWT_SETUP> twt_id:0, flow_id:1, trigger-enabled, announced, wake_dura:255, wake_dura_unit:0, wake_invl_e:13, wake_invl_m:46875
I (14300) itwt: <WIFI_EVENT_ITWT_SETUP> target wake time:3348055556096, wake duration:65280 us, service period:384000000 us
D (14302) sleep: RTC_PERIPH: AUTO(OFF)
D (15232) sleep: RTC_PERIPH: AUTO(OFF)
Hi @djangoa, Thank you for your reply, but this captured wired packet cannot determine whether C6 did not send a ping request. Some APs may think that the STA is still in the TWT state and do not reply to the STA.
I don't have the equipment to capture real Wifi frames between the device and AP as per: https://wiki.wireshark.org/CaptureSetup/WLAN:
netsh wlan show wirelesscapabilities
...
Network monitor mode : Not supported
Hence my confusion in sending you a wire capture from the AP but at least you can omit other network related issues causing the packet loss. Did you try to replicate with?:
Could you also test this version that integrates the ping example: itwt-ping-example.zip
Regarding:
Some APs may think that the STA is still in the TWT state and do not reply to the STA.
Which APs have been tested so it can be documented here for when other developers evaluate whether to use TWT in a future product? Is there no workaround for that before returning during tear down of the TWT agreement? As it stands esp_wifi_sta_itwt_teardown()
returns instantly:
I (3152444) wireless: itwt teardown, flow_id: 8, err: 0x0
I (3152445) wireless: WIFI_EVENT_ITWT_TEARDOWN flow_id 8(all twt)
while esp_wifi_sta_itwt_setup()
takes ~70ms to return.
I hope you can see that this issue dramatically impacts battery operated devices with a limited energy budget and given https://github.com/espressif/esp-idf/issues/13312#issue-2168108301 / https://github.com/espressif/esp-idf/issues/13310#issue-2167628047, I don't see any feasible way to use the current TWT implementation in that context, unless we can find a way to address this issue.
Further to above when seeing the issue, calling esp_wifi_get_tx_statistics()
reports nothing was transmitted:
I (2933) wireless: Get tx statistics
I (2934) wireless: [aci:2] tb(suc:0, ack:0, err:0), count(suc:0, ack:0, err:0, tot:0, max_sent:0)
I (2935) wireless: [aci:2] enable:0, complete:0, tb_times:0, tb_last:0, edca:0, succ:0, fail(0,0,0, cts:0/nan%, ack:0/nan%, tot:0, nan%), edca(ack:0, ba:0), tb(hw-ba:0, sw-ba:0)
I (2937) wireless: [aci:2] txFrames:0, s-mpdu:0(nan%), bitmap(max:0, min:0, tot:0, avg:nan), retry(edca:0, tb:0, nan%), collision:0, timeout:0
I (2939) wireless: [aci:2] seqno_rtt[0,0], hw_rtt[0, 0], muedca[enable:0, times:0, 0.00, 0.00, tot:0.00], avg:nan ms, tot:0.000 secs
so looks like "the station could not send any frame during this period"
Hi @djangoa, Sorry for the late response, we already do the ping example itwt-ping-example.zip in our testbed, and the AP we used is AUSU-56U. And it works fine in our testbed, you can take a look at the log files: twt_ping_example.log
Also, we call the esp_wifi_get_tx_statistics() after the ping test is complete. Can you tell us which AP you are using? And we can use the same AP in our testbed.
Hi @xuxiao111,
Thanks for getting back in touch. The AP is a: Ubiquiti U6-Enterprise.
I can see you're using ~500ms service period:524288 us
, can you try with a longer wake interval to replicate a more sensible use case? This might trigger the issue where I see no txFrames in the statistics when my dev board resets.
I also see strange behaviour in your log file where twt setup is failing many times and causing resets, I've also experienced this issue:
E (25283) wifi:(itwt)twt id:0 is already exist, please change to another value
Most importantly I can see in your log multiple instances where PING is intermittently timing out as in my environment:
I (516434) itwt: <WIFI_EVENT_ITWT_TEARDOWN> flow_id 8(all twt)
I (516485) wifi:Set ps type: 0, coexist: 0
E (516688) itwt: <PING_EVENT_TIMEOUT> From 192.168.50.1 icmp_seq=1 timeout
E (517083) itwt: <PING_EVENT_TIMEOUT> From 192.168.50.1 icmp_seq=2 timeout
I (517216) itwt: <PING_EVENT_SUCCESS> 56 bytes from 192.168.50.1 icmp_seq=3 ttl=64 time=132 ms
I (517251) itwt: <PING_EVENT_SUCCESS> 56 bytes from 192.168.50.1 icmp_seq=4 ttl=64 time=13 ms
E (517487) itwt: <PING_EVENT_END> 4 packets transmitted, 2 received, 50% packet loss, time 690mn
W (517487) stats: Get tx statistics
(test)rx_trig:0, tx_bfrpt:2, tb_times:0, tb_qos_null:0, tb_qos_data:0, tb_cca_cancel:0, tb_sifs_abort:0, tb_pwr_outof_range:0
(test)aci:2, tb(suc:0, ack:0, err:0), count(suc:0, ack:0, err:0, tot:0, max_sent:0)
(test)aci:2, enable:4, complete:4, tb_times:0, tb_last:0, edca:4, succ:4, fail(0,0,0, cts:0/0.00%, ack:0/0.00%, tot:0, 0.00%), edca(ack:0, ba:4), tb(hw-ba:0, sw-ba:0)
(test)aci:2, txFrames:4, s-mpdu:0(0.00%), bitmap(max:1, min:1, tot:4, avg:1.00), retry(edca:0, tb:0, 0.00%), collision:0, timeout:0
(test)aci:2, seqno_rtt[321,77102], hw_rtt[321, 77102], muedca[enable:0, times:0, 0.00, 0.00, tot:0.00], avg:24.872 ms, tot:0.099 secs
In my opinion this should not be happening with your RSSI of -46 in an environment with average interference and confirms for me that you are also experiencing the same intermittent problems on your AP as I am.
Hi @djangoa,
I have changed the service period to 118489088us (~118s), and the test log is
twt_ping_example_2.log
, and still works fine. The "E (25283) wifi:(itwt)twt id:0 is already exist, please change to another value", is an internal error, we already found the root cause and will fix it ASAP. Thanks for reporting this issue.
I (516434) itwt:
E (516688) itwt:
(test)rx_trig:0, tx_bfrpt:2, tb_times:0, tb_qos_null:0, tb_qos_data:0, tb_cca_cancel:0, tb_sifs_abort:0, tb_pwr_outof_range:0 (test)aci:2, tb(suc:0, ack:0, err:0), count(suc:0, ack:0, err:0, tot:0, max_sent:0) (test)aci:2, enable:4, complete:4, tb_times:0, tb_last:0, edca:4, succ:4, fail(0,0,0, cts:0/0.00%, ack:0/0.00%, tot:0, 0.00%), edca(ack:0, ba:4), tb(hw-ba:0, sw-ba:0) (test)aci:2, txFrames:4, s-mpdu:0(0.00%), bitmap(max:1, min:1, tot:4, avg:1.00), retry(edca:0, tb:0, 0.00%), collision:0, timeout:0 (test)aci:2, seqno_rtt[321,77102], hw_rtt[321, 77102], muedca[enable:0, times:0, 0.00, 0.00, tot:0.00], avg:24.872 ms, tot:0.099 secs As for ping timeout, from this log, we can see, that enabled to transmit 4 data frames, and the success is also 4, the success means that the station successfully transmits the data frame and receives the ACK frame from AP. So seems like the reason why ping timeout is no ping replay received within 250 ms.
Also, I have done the ping test using iperf example under the same test env, We can see the ping delay between 4ms ~ 225ms, the main reason is that my environment is relatively noisy and there will be many wireless devices, which will interfere with each other. And I will do a long test to see if the same problem exists.
Hi @xuxiao111,
The "E (25283) wifi:(itwt)twt id:0 is already exist, please change to another value", is an internal error, we already found the root cause and will fix it ASAP. Thanks for reporting this issue.
That's great, thank you. Look forward to the fix. The ping issue is probably a red herring / distraction, and this internal error in combination with my AP might be causing the problem when my device resets.
We can see the ping delay between 4ms ~ 225ms, the main reason is that my environment is relatively noisy and there will be many wireless devices, which will interfere with each other. And I will do a long test to see if the same problem exists.
Makes sense, thanks for taking time to test this.
Hi @xuxiao111,
I've been doing further testing and come across another issue:
A strange edge case that occurs infrequently where calling the esp_wifi_sta_itwt_teardown(FLOW_ID_ALL)
function returns no error and WIFI_EVENT_ITWT_TEARDOWN
event callback is not invoked and thus my app code times out:
I (47088082) wireless: itwt teardown, flow_id: 8, err: 0x0
E (47088333) app_main: Timeout waiting for wireless::ITWT_TEARDOWN_BIT event after 250ms
Is there anything that the esp_wifi_sta_itwt_teardown(FLOW_ID_ALL)
does that could cause this hang? Normally this function triggers the callback near instantly. I'm in the process of testing whether invoking esp_wifi_sta_itwt_teardown(FLOW_ID_ALL)
a second time as a workaround after this timeout occurs causes further issues.
Calling a subsequent time works around the issue but isn't ideal. Please can you consider enhancing:
/** Argument structure for WIFI_EVENT_TWT_TEARDOWN event */
typedef struct {
uint8_t flow_id; /**< flow id */
} wifi_event_sta_itwt_teardown_t;
with any details of internal errors as per the wifi_event_sta_itwt_setup_t
struct so future itwt developers dont trip up over the same problem?
Unfortunately I was too quick to comment, calling esp_wifi_sta_itwt_teardown(FLOW_ID_ALL)
more than once causes strange behaviour, auto light sleep no longer functions. It seems RX stops retrying (The device gets more regular socket timeouts in my application code, RX retries go from ~20% - ~0%) and the wifi rx stats change from:
[tid 0] legacy:4, su(su:18, su_txbf:11, su_stbc:7, su_retry:2/11.11%, ersu:0, ersu_dcm:0, su_noeb:0/0.00%)
to
[tid 0] legacy:669, su(su:20, su_txbf:20, su_stbc:0, su_retry:0/0.00%, ersu:0, ersu_dcm:0, su_noeb:5/25.00%)
What is su_noeb?
So I'm stuck resetting the device when esp_wifi_sta_itwt_teardown(FLOW_ID_ALL)
hangs but maybe this is also related to "id:0 is already exist" issue.
Hi @xuxiao111,
Furthermore, we're experiencing issues with several other aspects of HE wifi on the C6, these are some rx stats during longer testing:
I (20458958) wireless: [tid 0] legacy:9, su(su:8, su_txbf:5, su_stbc:3, su_retry:0/0.00%, ersu:0, ersu_dcm:0, su_noeb:0/0.00%)
...
I (20508968) wireless: [tid 0] legacy:0, su(su:36, su_txbf:0, su_stbc:36, su_retry:35/97.22%, ersu:0, ersu_dcm:0, su_noeb:0/0.00%)
...
I (22709735) wireless: [tid 0] legacy:0, su(su:36, su_txbf:0, su_stbc:36, su_retry:35/97.22%, ersu:0, ersu_dcm:0, su_noeb:0/0.00%)
...
W (24040213) wireless: No rx statistics available
I (24040214) wireless: [tid 0] legacy:1, ht(ht:0, ht_retry:0/0.00%, ht_noeb:0/0.00%)
I (24040215) wireless: [tid 0] legacy:1, su(su:0, su_txbf:0, su_stbc:0, su_retry:0/0.00%, ersu:147, ersu_dcm:147, su_noeb:0/0.00%)
I (24040216) wireless: [tid 0] legacy:1, mu(mu:0, mimo:0, non-mimo:0, txbf:0, stbc:0, mu_retry:0/0.00%, mu_noeb:0/0.00%)
They don't reflect the TX retries from the AP that hover around ~15% and as can be seen in the log above after some time the device starts to see nearly 100% retry rates and then stops reporting anything other than ersu*. This coincides with a jump in heap high water mark of 159136K vs 147316K. Additionally the AP reports the RX rate of the device as 7.3 Mbps (MCS0 with 3.2 µs GI) from which it never recovers. A reset resolves the issue and the PHY rate goes back to normal 86 Mbps.
Additionally esp_wifi_get_rx_mu_statistics(&rx_mu_stats);
stats report nothing, so can't see details regarding MCS:
W (2026) wireless: No rx mimo statistics
W (2026) wireless: (mimo)dut rx:0
W (2026) wireless: (mimo)sigb, mcs0:0(nan%), mcs1:0(nan%), mcs2:0(nan%), mcs3:0(nan%), mcs4:0(nan%), mcs5:0(nan%)
W (2028) wireless: (mimo)users num2:0(nan%), num3:0(nan%), num4:0(nan%), num5:0(nan%), num6:0(nan%), num7:0(nan%), num8:0(nan%)
W (2029) wireless: No rx non mimo statistics
W (2030) wireless: (nonmimo)dut rx:0
W (2030) wireless: (nonmimo)ru_alloc_96_num_2046:0, ru_alloc_112_num_2046:0
W (2031) wireless: (nonmimo)sigb, mcs0:0(nan%), mcs1:0(nan%), mcs2:0(nan%), mcs3:0(nan%), mcs4:0(nan%), mcs5:0(nan%)
W (2033) wireless: (nonmimo)users, num1:0(nan%), num2:0(nan%), num3:0(nan%), num4:0(nan%), num5:0(nan%), num6:0(nan%), num7:0(nan%), num8:0(nan%), num9:0(nan%)
Finally, excessive (>50%) TX retry rates occur when the C6 is close to our AP, reducing the tx power down from 20db -> 16.5db mitigates.
I'm running another long duration test with TWT disabled to see if the same issues occur with the stuck PHY rate, if it does I'll create a new bug report, else it's probably best to keep all these teardown/setup issues here?
What hardware and setup are you using to capture wifi frames? Maybe running that here can facilitate analysing these problems.
Hi @xuxiao111,
I've been doing further testing and come across another issue:
A strange edge case that occurs infrequently where calling the
esp_wifi_sta_itwt_teardown(FLOW_ID_ALL)
function returns no error andWIFI_EVENT_ITWT_TEARDOWN
event callback is not invoked and thus my app code times out:I (47088082) wireless: itwt teardown, flow_id: 8, err: 0x0 E (47088333) app_main: Timeout waiting for wireless::ITWT_TEARDOWN_BIT event after 250ms
Is there anything that the
esp_wifi_sta_itwt_teardown(FLOW_ID_ALL)
does that could cause this hang? Normally this function triggers the callback near instantly. I'm in the process of testing whether invokingesp_wifi_sta_itwt_teardown(FLOW_ID_ALL)
a second time as a workaround after this timeout occurs causes further issues.
Hi, @djangoa.
After we call the esp_wifi_sta_itwt_teardown(FLOW_ID_ALL)
functions, STA will send a TWT teardown frame to inform the AP that the STA has exited the TWT (Target Wake Time) state. The STA will post a teardown event only after successfully sending this frame. From the observation, it seems that the teardown frame failed to be sent, thus resulting in the absence of a teardown event.
Calling a subsequent time works around the issue but isn't ideal. Please can you consider enhancing:
/** Argument structure for WIFI_EVENT_TWT_TEARDOWN event */ typedef struct { uint8_t flow_id; /**< flow id */ } wifi_event_sta_itwt_teardown_t;
with any details of internal errors as per the
wifi_event_sta_itwt_setup_t
struct so future itwt developers dont trip up over the same problem?
yeah, I think we need to add some enhancement here, regardless of whether the TWT teardown frame is successfully sent or not, a teardown event should be posted to inform the user whether the teardown process was successful or not.
Unfortunately I was too quick to comment, calling
esp_wifi_sta_itwt_teardown(FLOW_ID_ALL)
more than once causes strange behaviour, auto light sleep no longer functions. It seems RX stops retrying (The device gets more regular socket timeouts in my application code, RX retries go from ~20% - ~0%) and the wifi rx stats change from:
[tid 0] legacy:4, su(su:18, su_txbf:11, su_stbc:7, su_retry:2/11.11%, ersu:0, ersu_dcm:0, su_noeb:0/0.00%)
to
[tid 0] legacy:669, su(su:20, su_txbf:20, su_stbc:0, su_retry:0/0.00%, ersu:0, ersu_dcm:0, su_noeb:5/25.00%)
What is su_noeb?
So I'm stuck resetting the device when
esp_wifi_sta_itwt_teardown(FLOW_ID_ALL)
hangs but maybe this is also related to "id:0 is already exist" issue.
Can you please provide more logs? Is it sufficient to trigger this behavior by simply calling esp_wifi_sta_itwt_teardown(FLOW_ID_ALL) multiple times?
Hi @xuxiao111,
Furthermore, we're experiencing issues with several other aspects of HE wifi on the C6, these are some rx stats during longer testing:
I (20458958) wireless: [tid 0] legacy:9, su(su:8, su_txbf:5, su_stbc:3, su_retry:0/0.00%, ersu:0, ersu_dcm:0, su_noeb:0/0.00%) ... I (20508968) wireless: [tid 0] legacy:0, su(su:36, su_txbf:0, su_stbc:36, su_retry:35/97.22%, ersu:0, ersu_dcm:0, su_noeb:0/0.00%) ... I (22709735) wireless: [tid 0] legacy:0, su(su:36, su_txbf:0, su_stbc:36, su_retry:35/97.22%, ersu:0, ersu_dcm:0, su_noeb:0/0.00%) ... W (24040213) wireless: No rx statistics available I (24040214) wireless: [tid 0] legacy:1, ht(ht:0, ht_retry:0/0.00%, ht_noeb:0/0.00%) I (24040215) wireless: [tid 0] legacy:1, su(su:0, su_txbf:0, su_stbc:0, su_retry:0/0.00%, ersu:147, ersu_dcm:147, su_noeb:0/0.00%) I (24040216) wireless: [tid 0] legacy:1, mu(mu:0, mimo:0, non-mimo:0, txbf:0, stbc:0, mu_retry:0/0.00%, mu_noeb:0/0.00%)
They don't reflect the TX retries from the AP that hover around ~15% and as can be seen in the log above after some time the device starts to see nearly 100% retry rates and then stops reporting anything other than ersu*. This coincides with a jump in heap high water mark of 159136K vs 147316K. Additionally the AP reports the RX rate of the device as 7.3 Mbps (MCS0 with 3.2 µs GI) from which it never recovers. A reset resolves the issue and the PHY rate goes back to normal 86 Mbps.
Additionally
esp_wifi_get_rx_mu_statistics(&rx_mu_stats);
stats report nothing, so can't see details regarding MCS:W (2026) wireless: No rx mimo statistics W (2026) wireless: (mimo)dut rx:0 W (2026) wireless: (mimo)sigb, mcs0:0(nan%), mcs1:0(nan%), mcs2:0(nan%), mcs3:0(nan%), mcs4:0(nan%), mcs5:0(nan%) W (2028) wireless: (mimo)users num2:0(nan%), num3:0(nan%), num4:0(nan%), num5:0(nan%), num6:0(nan%), num7:0(nan%), num8:0(nan%) W (2029) wireless: No rx non mimo statistics W (2030) wireless: (nonmimo)dut rx:0 W (2030) wireless: (nonmimo)ru_alloc_96_num_2046:0, ru_alloc_112_num_2046:0 W (2031) wireless: (nonmimo)sigb, mcs0:0(nan%), mcs1:0(nan%), mcs2:0(nan%), mcs3:0(nan%), mcs4:0(nan%), mcs5:0(nan%) W (2033) wireless: (nonmimo)users, num1:0(nan%), num2:0(nan%), num3:0(nan%), num4:0(nan%), num5:0(nan%), num6:0(nan%), num7:0(nan%), num8:0(nan%), num9:0(nan%)
Finally, excessive (>50%) TX retry rates occur when the C6 is close to our AP, reducing the tx power down from 20db -> 16.5db mitigates.
I'm running another long duration test with TWT disabled to see if the same issues occur with the stuck PHY rate, if it does I'll create a new bug report, else it's probably best to keep all these teardown/setup issues here?
What hardware and setup are you using to capture wifi frames? Maybe running that here can facilitate analysing these problems.
Furthermore, we're experiencing issues with several other aspects of HE wifi on the C6, these are some rx stats during longer testing: Hi, for esp_wifi_get_rx_mu_statistics(&rx_mu_stats), only when the STA receives an MU-PPDU, calling esp_wifi_get_rx_mu_statistics(&rx_mu_stats) will have data available. Please fell free to open a new bug to track this issues. We are using intel AX200 wifi card to capture wifi frame, you can fellow this guide: https://wifilogic.wordpress.com/2019/09/21/how-to-setup-wifi6-sniffer-in-ubuntu/
Can you please provide more logs? Is it sufficient to trigger this behavior by simply calling esp_wifi_sta_itwt_teardown(FLOW_ID_ALL) multiple times?
It takes about a day to get the issue to trigger, I'll post the debug logs when possible.
Here's the verbose log, not sure it's helpful other than showing the failed teardown and lack of sleeping (wifi:mms: 2->2
) between wake-up iterations.
...
I (190934) wireless: Flow_id bitmap: 00000010
V (190934) esp_adapter: thread sem get: sem=0x4082b0b4
I (190935) wireless: itwt teardown, flow_id: 1, err: 0x0
D (191093) wifi:mms: 2->2
E (191186) app_main: Timeout waiting for wireless::ITWT_TEARDOWN_BIT event after 250ms, attempt 1
V (191186) esp_adapter: thread sem get: sem=0x4082b0b4
I (191186) wireless: Flow_id bitmap: 00000010
V (191187) esp_adapter: thread sem get: sem=0x4082b0b4
I (191188) wireless: itwt teardown, flow_id: 1, err: 0x0
D (191188) wifi:mms: 2->0
D (191189) event: running post WIFI_EVENT:29 with handler 0x4200eaca and context 0x4083749c on loop 0x4087e6c8
I (191190) wireless: WIFI_EVENT_ITWT_TEARDOWN flow_id 1
V (191190) esp_adapter: thread sem get: sem=0x40837dc0
I (191191) wireless: Flow_id bitmap: 00000000
V (191192) esp_adapter: thread sem get: sem=0x4082b0b4
I (191192) wifi:Set ps type: 0, coexist: 0
D (191193) tiT: sys_restart_timeouts() complete
...
...
D (191195) wifi:mms: 0->0
I (191202) wifi:Set ps type: 1, coexist: 0
V (191202) esp_adapter: thread sem get: sem=0x4082b0b4
I (191202) wifi:Set ps type: 1, coexist: 0
V (191203) app_main: Suspended LWIP TCPIP thread: 'tiT'
V (191203) esp_adapter: thread sem get: sem=0x4082b0b4
I (191204) wireless: Flow_id bitmap: 00000000
V (191205) esp_adapter: thread sem get: sem=0x4082b0b4
I (191206) wireless: itwt setup, flow_id: 1, err: 0x0
W (191278) wifi:not hit beacon in 65535 sp(s)
D (191278) wifi:mms: 0->2
W (191278) wifi:awake:1, dream:0, sleep:0
D (191279) event: running post WIFI_EVENT:28 with handler 0x4200e7d8 and context 0x4083746c on loop 0x4087e6c8
I (191279) wireless: WIFI_EVENT_ITWT_SETUP twt_id: 0, flow_id: 1, trigger-enabled, announced, wake_dura: 255, wake_dura_unit: 0, wake_invl_e: 13, wake_invl_m: 46875
I (191281) wireless: WIFI_EVENT_ITWT_SETUP target wake time: 198265864192, wake duration: 65280us, service period: 384000000us, exec timer 74.70ms
V (191283) esp_adapter: thread sem get: sem=0x40837dc0
I (191284) wireless: Flow_id bitmap: 00000010
V (191284) esp_adapter: thread sem get: sem=0x4082b0b4
W (191285) wireless: Get tx statistics
I (191286) wireless: [aci:2] enable:18, complete:18, tb_times:0, tb_last:0, edca:18, succ:14, fail(0,0,0, cts:0/0.00%, ack:4/22.22%, tot:4, 22.22%), edca(ack:4, ba:10)
I (191287) wireless: [aci:2] txFrames:14, s-mpdu:4(22.22%), bitmap(max:1, min:1, tot:10, avg:1.00), retry(edca:8, tb:0, 57.14%), collision:0, timeout:0
I (191289) wireless: [aci:2] seqno_rtt[147,2267], hw_rtt[147, 2267], avg:0.444 ms, tot:0.008 secs
V (191290) esp_adapter: thread sem get: sem=0x4082b0b4
W (191291) wireless: Get rx statistics
I (191292) wireless: [tid 0] legacy:8, su(su:14, su_txbf:11, su_stbc:3, su_retry:3/21.43%, ersu:0, ersu_dcm:0, su_noeb:0/0.00%)
V (191293) esp_adapter: thread sem get: sem=0x4082b0b4
I (191294) wireless: stats, tx_retry: 57.14/49.66/49.66%, rx_retry: 21.43/27.04/27.04%, avg_rtt: 0.444/0.403/0.403ms, samples: 10/20/20
W (191295) wireless: Clear statistics
V (191296) esp_adapter: thread sem get: sem=0x4082b0b4
V (191297) esp_adapter: thread sem get: sem=0x4082b0b4
V (191297) esp_adapter: thread sem get: sem=0x4082b0b4
D (191298) wifi:mms: 2->2
V (191299) esp_adapter: thread sem get: sem=0x4082b0b4
V (191299) esp_adapter: thread sem get: sem=0x4082b0b4
V (191300) esp_adapter: thread sem get: sem=0x4082b0b4
V (191300) app_main: Iteration exec time: 433ms
W (191301) app_main: Heap free/max/total : 283740 / 117560 / 395480B
V (191302) app_main: Stack max alloc: 5464 / 8192B
D (191303) app_main: Delay until next iteration: 9567ms
D (191502) wifi:mms: 2->2
D (191810) wifi:mms: 2->2
D (191935) temperature_sensor: range changed, change to index 3
D (192118) wifi:mms: 2->2
D (192220) wifi:mms: 2->2
D (192424) wifi:mms: 2->2
D (192731) wifi:mms: 2->2
D (192935) temperature_sensor: range changed, change to index 4
D (193038) wifi:mms: 2->2
D (193141) wifi:mms: 2->2
D (193346) wifi:mms: 2->2
D (193550) wifi:mms: 2->2
D (193935) temperature_sensor: range changed, change to index 4
D (193960) wifi:mms: 2->2
D (194063) wifi:mms: 2->2
D (194165) wifi:mms: 2->2
D (194370) wifi:mms: 2->2
D (194472) wifi:mms: 2->2
D (194935) temperature_sensor: range changed, change to index 4
D (195087) wifi:mms: 2->2
D (195394) wifi:mms: 2->2
D (195598) wifi:mms: 2->2
D (195701) wifi:mms: 2->2
D (195935) temperature_sensor: range changed, change to index 4
D (196315) wifi:mms: 2->2
D (196622) wifi:mms: 2->2
D (196930) wifi:mms: 2->2
D (196935) temperature_sensor: range changed, change to index 4
D (197237) wifi:mms: 2->2
D (197544) wifi:mms: 2->2
D (197935) temperature_sensor: range changed, change to index 4
D (198670) wifi:mms: 2->2
D (198773) wifi:mms: 2->2
D (198935) temperature_sensor: range changed, change to index 4
D (199080) wifi:mms: 2->2
D (199387) wifi:mms: 2->2
D (199694) wifi:mms: 2->2
D (199935) temperature_sensor: range changed, change to index 4
D (200002) wifi:mms: 2->2
D (200206) wifi:mms: 2->2
D (200514) wifi:mms: 2->2
D (200616) wifi:mms: 2->2
D (200821) wifi:mms: 2->2
V (200870) app_main: Next iteration starting
...
Answers checklist.
IDF version.
5.1.3
Espressif SoC revision.
ESP32-C6-FH4
Operating System used.
Windows
How did you build your project?
Eclipse IDE
If you are using Windows, please specify command line type.
None
Development Kit.
DFROBOT FireBeetle 2 ESP32 C6 IoT Development Board
Power Supply used.
USB
What is the expected behavior?
Calling _esp_wifi_sta_itwt_teardown(flowid) tears down the selected Targeted Wake Time and allows network traffic to flow freely through the wifi interface on receiving a WIFI_EVENT_ITWT_TEARDOWN event.
What is the actual behavior?
Calling _esp_wifi_sta_itwt_teardown(flowid) intermittently causes select() timeouts when creating new connections.
See the attached DEBUG Logs of an application I've been working on that integrates the itwt and mqtt tcp examples and sits in a loop publishing data from a sensor to a MQTT broker then sleeping for 5minutes.
In the Debug Logs:
Steps to reproduce.
Setup and start wifi. Create a non trivial TWT agreement with an AP using a service period of around 5mins (e.g. 46875 << 13) , then in a loop:
Debug Logs.
More Information.
I've tried many things to work round the issue including turning off various power saving techniques to omit them as a possible cause:
I've also tried idling after a teardown with vTaskDelay() before creating a connecting. Finally, I've tried using _esp_wifi_sta_itwt_teardown(FLOW_IDALL) to no effect. It also doesn't matter what timeout I give (100ms or 10000ms) when creating a tcp mqtt connection, it still fails taking the full timeout duration.
For context the wifi connection seems OK and is to a Ubiquiti UniFi U6 Enterprise AP with ~-40dbm RSSI and ~20% TX retries. Finally and anecdotally, the errors "seem" to come in waves as if related to interference? I've attached my sdkconfig.zip from the application I'm experiencing the issue in case I've missed anything.
On a related note, I'm also experiencing strange behaviour and fairly high power consumption in auto light sleep during TWT, ~110ua vs ~35ua as advertised, with a period of 1s:
These don't seem to be ARP related as per: https://github.com/espressif/esp-idf/issues/12441#issuecomment-1775029218. Setting ARP_TMR_INTERVAL to any other value, currently set at 2000ms in the graph, does not change the interval shown on the graph. I've also seen strange transitions to auto light sleep after setting up a TWT that looks similar to https://github.com/espressif/esp-idf/issues/11858#issuecomment-1640244734 where the MODEM power domain is not powering down and thus current draw is around ~300ua. This can be seen earlier in the same current trace as above in the previous iteration:
Furthermore, I've tried switching dev boards vendors (both exhibit the same power draw), disconnected all peripherals and disabled/floated/hold all relevant GPIO to no avail. I've also edited sleep_modem.c with _esp_sleep_cpu_retentioninit() and get the same power draw without initialising wifi and manually invoking light sleep but strangely without the 1 second pulse in current consumption.
Hopefully someone can point me in the right direction, if I haven't provide enough detail.