Open knight-ryu12 opened 1 year ago
Update on this:
if, LWIP Enable reassembly incoming fragmented IP4 packets
are enabled, above example will not work. if disabled, it suddenly work now.
Hmm.
I'm having the same issue (dpp-enrollee yields ESP_ERR_DPP_TX_FAILURE). For me it is strangely persistent but unpredictable. That is, I've had it work correctly in the past, but somehow once a given device gets into this state it will yield this error over and over no matter what I do, even if I reset the flash. I believe @knight-ryu12 is describing something similar where a seemingly unrelated change "fixed it". Looks like others have also experienced this: https://esp32.com/viewtopic.php?t=28573.
Needless to say this unpredictability makes it impossible to actually ship something that uses DPP, so we need to find some predictable way of detecting and working around this condition.
For reference, my ESP32-C3-DevKit-M1U is working, my ESP32-C3-DevKit-M1 isn't. Both are using v4.4.4.
Hi @jasta , we are trying to reproduce it locally. Are you using the example?
Please share your sdkconfig as well as IDF version commit.
Yes, I'm able to reproduce consistently with the unmodified example on the current v4.4.4 tag of esp-idf. I am using the default settings (channel 6, no provided key/device info). Here's the sdkconfig: https://gist.github.com/42ef0f07990ca812bba8b541685ef798
The bug really smells like a race condition IMO. I had it all working perfectly and changed nothing of significance about my app and it just started giving me ESP_ERR_DPP_TX_FAILURE over and over. Then no amount of reverting my code could fix it, including reverting all the way back to the unmodified example which is where I started. It of course used to work with the sample and even my considerably more robust full app. I suspect that the "condition" that changes to cause it to become persistently broken is literally in the air -- something about my Wi-Fi setup must be able to consistently reproduce an "unlikely" race condition outcome.
No hardware has been deliberately modified or replaced since it was once working, so the only possibilities in my view are environmental (wireless signals themselves changing) or through automated software updates of either my router (Google WiFi infrastructure) or my phone (Android Pixel 6)
Well that's interesting, even though I get ESP_ERR_DPP_TX_FAILURE pretty much every time, I now just got ESP_ERR_DPP_INVALID_ATTR. Adds quite a bit of evidence to my theory that this is a race :)
I swear it is seeming like the difference between the two errors is whether I have the esp32 on my desk (DPP_INVALID_ATTR) or in my component drawer (DPP_TX_FAILURE). Bizarre :)
As an aside, I am the one working on adding DPP support to Rust and ran into this issue maturing the implementation even though I didn't change anything functionally interesting with respect to esp_* calls: https://github.com/esp-rs/esp-idf-svc/pull/228 .
I have logs with the full debugging turned up but I'm not posting here as I believe they will contain my Wi-Fi creds. Lmk and I can share them privately or reproduce with a dummy network (though I suspect changing my network around will change the results)
I have logs with the full debugging turned up but I'm not posting here as I believe they will contain my Wi-Fi creds. Lmk and I can share them privately or reproduce with a dummy network (though I suspect changing my network around will change the results)
Please black out wifi cred from the logs and share the rest.
I can't actually tell what is and isn't sensitive about this, but I'm interested enough in getting this solved that I'll risk it hehe: https://gist.github.com/09fe320e7b549967b37088170e59c5cb. Here is the updated sdkconfig after I enabled logging: https://gist.github.com/8ab02d9c9ca064861e9e1cdf22261545
I confirmed again this morning it still repros. I seemingly have a 100% reliable repro (it's happened at least the last 20 or 30 times I've tried) on a devkit module that once worked just fine. One maybe important detail though is that in the example I'm unable to scan the QR code in the console (my phone never recognizes it at as a QR code), so I'm copying the QR text into qtqr and generating one there. I can confirm that if I fudge the QR text I get a different error ("No matching own bootstrapping key found as responder - ignoring message"), and then it yields ESP_ERR_DPP_INVALID_ATTR (different than the INVALID_ATTR case I got when I was randomly moving the device around physically).
I don't know for sure, but it crashes Wi-Fi event handler hardly afterwards. I cannot receive any events after DPP had failed. I assume that might have to do with how LWIP is handling packets? (Or is it just my condition being bad...)
I don't know for sure, but it crashes Wi-Fi event handler hardly afterwards. I cannot receive any events after DPP had failed. I assume that might have to do with how LWIP is handling packets? (Or is it just my condition being bad...)
I see the same thing re the hard crash, but I don't believe it is related to LWIP packet fragmentation. These aren't even IP packets AFAICT, they're action frame packets in 802.11. Further, I'm seeing essentially identical behavior to you with the unmodified example, which I'm guessing is the same thing our friends at Espressif are testing with but seeing different results. Highly likely an environmental condition causing the difference.
From my testing today, I've found that if the channel the esp is currently listening on happens to match whichever channel my phone was connected to AP on, the chance of ESP_ERR_DPP_TX_FAILURE
dramatically drops and would happen rarely.
This would explain the issue being intermittent and also not fixing itself for a while, since AP's dont really change their channels too often, if at all.
However that doesn't get us very far since it means we need to somehow know the channel before we know what AP or anything. Which also doesn't make sense since the phone already sends the first frame to the esp on the correct channel to begin with (channel 6 in the example), and then attempts to send it back to the phone on that same channel.
From what I can tell, the issue comes from the call esp_wifi_action_tx_req
on line 96 inside esp_send_action_frame
Although the error isn't immediately sent to the callback in the lines below and instead that callback is sent the error only when the WIFI_EVENT_ACTION_TX_STATUS
wifi event is triggered on line 447 in offchan_event_handler
So from what I can tell, the issue comes down to just being purely a channel mismatch somehow and the esp not wanting to send to a different channel than its listening to(just a wild guess on that last part though).
And, it seems that there's a fix that works perfectly(from my testing atleast), all it takes is enabling Multi Band Support in your sdkconfig. So essentially all you have to do is:
CONFIG_WPA_11KV_SUPPORT=y
CONFIG_WPA_SCAN_CACHE=y
CONFIG_WPA_MBO_SUPPORT=y
CONFIG_WPA_DPP_SUPPORT=y
Instead of only doing CONFIG_WPA_DPP_SUPPORT=y
(11KV_SUPPORT and SCAN_CACHE are both required by MBO_SUPPORT)
This fix seems to work well enough that I can even have the esp listening only on channel 10, my phone can be sitting on channel 149, and I can get the esp to connect to a network on channel 1 with no issue
I have no idea if this is meant to fix it, or if it's going to break somewhere down the like, but nothing I can do can cause the ESP_ERR_DPP_TX_FAILURE
as long as I have MBO Support enabled
I don't know for sure, but it crashes Wi-Fi event handler hardly afterwards. I cannot receive any events after DPP had failed. I assume that might have to do with how LWIP is handling packets? (Or is it just my condition being bad...)
I see the same thing re the hard crash, but I don't believe it is related to LWIP packet fragmentation. These aren't even IP packets AFAICT, they're action frame packets in 802.11. Further, I'm seeing essentially identical behavior to you with the unmodified example, which I'm guessing is the same thing our friends at Espressif are testing with but seeing different results. Highly likely an environmental condition causing the difference.
Afaik it doesnt actually crash anything, it just sets s_dpp_stop_listening
to true after getting the first packet, which stops it from sending more ROC events recursively, with the intention of setting s_dpp_stop_listening
to false upon a successful dpp configure, and then sending a ROC event to start the recursive loop of them using esp_supp_dpp_start_listen()
From my testing, the wifi driver and event loop are all still fully functioning even when that error happens, its just there's no events actually going on to show up in the log but if you set s_dpp_stop_listening
to false and call the start_listen function again, or if you manually post events, they'll show up normally
Also I doubt lwip actually has anything to do with the issue since i dont remember any mention of lwip in esp_dpp.c, and iirc lwip is a tcp/ip stack, and during dpp configuration it wouldnt make sense for there to be a tcp/ip stack considering only a limited amount of raw frames are sent between the two devices anyways.
Changing lwip settings likely only fixes things due to how inconsistent the ESP_ERR_DPP_TX_FAILURE
error is without CONFIG_WPA_MBO_SUPPORT=y
since i know atleast for me changing things like that would fix it for exactly the first time i ran it after flashing and no more(even if i reset the nvs once the error reappears)
Afaik it doesnt actually crash anything, it just sets
s_dpp_stop_listening
to true after getting the first packet, which stops it from sending more ROC events recursively, with the intention of settings_dpp_stop_listening
to false upon a successful dpp configure, and then sending a ROC event to start the recursive loop of them usingesp_supp_dpp_start_listen()
Ahh good catch, however what I'm seeing is that esp_supp_dpp_start_listen()
does not actually once again start listening. That is, if you hit Retry in the Android Wi-Fi UI it will fail without printing any more debug output on the esp32. Looking deeper from your analysis, it looks like the reason is because esp_supp_dpp_start_listen()
doesn't clear the s_dpp_stop_listening
flag so even if you call esp_supp_dpp_start_listen()
again the call to offchan_event_handler
trying to trigger SIG_DPP_LISTEN_NEXT_CHANNEL
will be silently ignored (because s_dpp_stop_listening
is still true). The code is subtle in that no matter what it'll get one SIG_DPP_LISTEN_NEXT_CHANNEL
event handled after esp_supp_dpp_start_listen()
, but the timeout is 500ms and it will not try again even on the same channel.
@kapilkedawat, so at the very least we've identified one clear bug that needs fixing: esp_supp_dpp_start_listen()
needs to call s_dpp_stop_listening = false
. I've tested this locally and confirmed that it indeed lets us try again any number of times using the example code.
@gayafhannah, I'm going to go try your MBO workaround now and report back, awesome sleuthing BTW!
@jasta I've also just noticed that since i'm not testing on the master branch and am instead on, in the master branch it seems to be called ESP_WIFI_MBO_SUPPORT
in Kconfig, but WPA_MBO_SUPPORT
in 5.0.0
I'll have to try grab myself the master branch and see how that goes
@gayafhannah, MBO support with v4.4.4 at least didn't seem to affect the results, though I still think you're onto something with your analysis. My network set-up is that my phone is connected to a Google Wi-Fi network on 5GHz ch 149, but the 2.4GHz network is on channel 1. I've tried a bunch of configurations of different channels for DPP to use though and still can't get it to work. I'm a bit stumped still.
Note: I used idf.py menuconfig to enable MBO support so the settings should match my local checkout. I'm using the v4.4.4 tag in esp-idf.
@gayafhannah, MBO support with v4.4.4 at least didn't seem to affect the results, though I still think you're onto something with your analysis. My network set-up is that my phone is connected to a Google Wi-Fi network on 5GHz ch 149, but the 2.4GHz network is on channel 1. I've tried a bunch of configurations of different channels for DPP to use though and still can't get it to work. I'm a bit stumped still.
Note: I used idf.py menuconfig to enable MBO support so the settings should match my local checkout. I'm using the v4.4.4 tag in esp-idf.
I know that without MBO support, i was able to get it to be more consistent at not having the error(although not 100%) if the only channel dpp was configured to listen on was the channel of the wifi network, and with my phone also connected to that network, and therefore on that same channel, which is what led me to mbo support when snooping around menuconfig.
Disabling MBO and setting channels to 1,6 worked once, but then I rebooted the device and tried one more time and it failed. Then rebooted and tried again several times until it finally worked (which I can only do without rebooting because of my patch in esp_supp_dpp_start_listen()
. How bizarre.
I've attached an abridged log comparing the failed attempt vs the successful attempt (no reboot occurred, I just tried again on the phone): https://gist.github.com/ca10e95e786b79fe847c3a13b297e732
The interesting bit from the logs is that seemingly nothing is different about the failure case and successful one. Both using Channel-6, exchanging seemingly the same data with the same timings, etc.
Have thrown together a pull request #10812 that should work for the latest version of espidf with the changes i made to get fully functional dpp I should note though that i haven't tried using the example in the repo and instead have been testing with my own heavily modified version, although there shouldn't be much difference in how they work in the end. Also @jasta there may be a chance that on v4.4.4 MBO support may not work quite the same, especially considering its an entire major release difference. 99% of my testing has been done on v5.0.0
Have thrown together a pull request #10812 that should work for the latest version of espidf with the changes i made to get fully functional dpp
Looks good, you might want to also include my fix to esp_supp_dsp_start_listen which resets the s_dpp_stop_listening flag. This is what enables retries to work as was intended by the example (i.e. not needing deinit/init before it can be retried).
I should note though that i haven't tried using the example in the repo and instead have been testing with my own heavily modified version, although there shouldn't be much difference in how they work in the end. Also @jasta there may be a chance that on v4.4.4 MBO support may not work quite the same, especially considering its an entire major release difference. 99% of my testing has been done on v5.0.0
Planning to retest all this on v5.0.1 but was sticking to v4.4.4 because that's where it once worked then stopped so I figured we could learn the most about what exact environmental difference actually matters.
Looks good, you might want to also include my fix to esp_supp_dsp_start_listen which resets the s_dpp_stop_listening flag. This is what enables retries to work as was intended by the example (i.e. not needing deinit/init before it can be retried).
I'll have to give your fix a shot in the morning That fix was actually the first thing I tried, and while it did allow me to retry even after significant errors, it had some... strange side effects. I can't explain what caused them exactly but that could just be because I was really fucking with esp_dpp.c to figure out just where was it all going wrong and somehow because of that little fix I was getting too many ROC events every second to even bother counting. Hopefully that change works better with a cleaner and working esp_dpp.c than it did this morning.
Also my guess for environmental differences causing the issues is just the phone/AP switching channels. It's the only thing I can come up with since like I said in earlier comments, setting the listen channel to be the same as the ap and phone channel, maybe 80-90% success rate atleast. Instead of the not even 1% chance if it wasn't the same. A fun way to test it was to in the channel listen list that you put into one of the functions, just up all channels 1 through to 13, I also added a little thing in esp_dpp to show the current channel as it was cycling through. If I managed to get the timing pressing the button on my phone just right to land on channel 1 on the esp(channel 1 also what the ap and my phone were on) then it'd almost always connect fine, but if I missed that channel(difficult to get it right when the timeout is 500ms) then it'd fail to connect except for maybe exactly one time.
Also if I remember correctly, setting the channel to be near the same as the ap, if the channel width for the ap is 40MHz, I noticed even being +-2 channels from the centre channel, it had moderate success rates. Definitely not the same as setting the listen channel directly ontop of the centre channel though. Also 40MHz wide ones were slightly less successful on the centre channel than 20MHz wide(probably because 20MHz is literally just 1 channel wide). I have no explanation for this stuff to do with channel width, it's just something I noticed during testing since the extender here is 40MHz wide and the main ap is 20MHz wide and I was constantly switching networks. I feel like I'm going crazy because I have no solid evidence and it could all just be a coincidence and there was some other environmental stuff going on, but I spent alot of time testing things and that pattern of width playing a small part(and being directly ontop of the right channel being an even bigger part) have been constants the entire time. This would also likely explain why it sounds like others have been having much more inconsistent errors, since if the target ap has a wide channel and being close to a wide one increases chances but only a bit, whereas for me having a network where if I'm not right on it, it doesn't work, until I use the extender but then it works only sometimes
I'll have to give your fix a shot in the morning That fix was actually the first thing I tried, and while it did allow me to retry even after significant errors, it had some... strange side effects. I can't explain what caused them exactly but that could just be because I was really fucking with esp_dpp.c to figure out just where was it all going wrong and somehow because of that little fix I was getting too many ROC events every second to even bother counting. Hopefully that change works better with a cleaner and working esp_dpp.c than it did this morning.
I'm not observing any strange side effects from the fix FWIW. And more importantly I'm noticing that my phone will actually automatically retry now with this setting on, indicating that the DPP standard actually expects some intermittent failures (at least due to RF interference would be my guess, but maybe even this channel weirdness). That suggests to me that the lack of working retry is a big part of the root cause here. Imagine what would happen to our global TCP infrastructure if you disabled packet retransmits, for example...
Also my guess for environmental differences causing the issues is just the phone/AP switching channels. It's the only thing I can come up with since like I said in earlier comments, setting the listen channel to be the same as the ap and phone channel, maybe 80-90% success rate atleast. Instead of the not even 1% chance if it wasn't the same. A fun way to test it was to in the channel listen list that you put into one of the functions, just up all channels 1 through to 13, I also added a little thing in esp_dpp to show the current channel as it was cycling through. If I managed to get the timing pressing the button on my phone just right to land on channel 1 on the esp(channel 1 also what the ap and my phone were on) then it'd almost always connect fine, but if I missed that channel(difficult to get it right when the timeout is 500ms) then it'd fail to connect except for maybe exactly one time.
Definitely plausible and consistent with my experience. If I just sit there and hit Retry on my Android phone over and over with my fix it will eventually work. Not ideal UX by any stretch, though, and I'd be quite surprised if we (or maybe just Espressif) couldn't come up with a more reliable way to fix this.
Also if I remember correctly, setting the channel to be near the same as the ap, if the channel width for the ap is 40MHz, I noticed even being +-2 channels from the centre channel, it had moderate success rates. Definitely not the same as setting the listen channel directly ontop of the centre channel though. Also 40MHz wide ones were slightly less successful on the centre channel than 20MHz wide(probably because 20MHz is literally just 1 channel wide). I have no explanation for this stuff to do with channel width, it's just something I noticed during testing since the extender here is 40MHz wide and the main ap is 20MHz wide and I was constantly switching networks. I feel like I'm going crazy because I have no solid evidence and it could all just be a coincidence and there was some other environmental stuff going on, but I spent alot of time testing things and that pattern of width playing a small part(and being directly ontop of the right channel being an even bigger part) have been constants the entire time. This would also likely explain why it sounds like others have been having much more inconsistent errors, since if the target ap has a wide channel and being close to a wide one increases chances but only a bit, whereas for me having a network where if I'm not right on it, it doesn't work, until I use the extender but then it works only sometimes
I can't really say for sure, I'd probably need to dive deep into how this stuff is even supposed to function at the standard level before I could speculate why that would play a role. One curious thing I did notice however was that the QR code does in fact include the channel that the enrollee (the esp32) is listening on, so it doesn't make any sense why the configurator (the phone) would have any troubles sending/receiving messages on the correct channel.
I can't really say for sure, I'd probably need to dive deep into how this stuff is even supposed to function at the standard level before I could speculate why that would play a role. One curious thing I did notice however was that the QR code does in fact include the channel that the enrollee (the esp32) is listening on, so it doesn't make any sense why the configurator (the phone) would have any troubles sending/receiving messages on the correct channel.
I noticed the channels being listed in the qr code aswell, and your exact reasons are why I'm hesitant that my guess is actually correct. It's just that pattern being there made it really seem like it has to have something to do with the channel being wrong. Especially because the error afaik is caused by the esp failing to send the packet "offchannel" and not in receiving any packets, as even with the error happening, it still successful receives the first packet/frame from the phone just phone. I'd love to give it a try with my phone not connected to any network while doing dpp so it's not bound to a channel(although when my phone can handle 2 wifi connections like it's nothing that's hard to believe it's an issue) except thanks to Samsung's Oneui being borked, I can barely access the dpp provisioning to begin with, and I can't select any network other than the current one for provisioning, so I can't test that.
@kapilkedawat The plot thickens trying to work around these bugs. It looks like if you try to work around this issue by calling esp_supp_dpp_deinit/init again and just starting all over, you will quickly run into a race condition in esp_dpp.c in which this assert fails:
https://gist.github.com/a067c88e511dfff52e7f704d469e157f
I am using unmodified ESP-IDF 4.4.4 to cause this behaviour.
The reason appears to be because of an inherent race condition in which esp_supp_dpp_init makes no effort to check if the previous s_dpp_evt_queue and associated task is drained and cleaned up before proceeding to start up esp_dpp_task again. Illustrated as such:
Caller Task | dppT Task #1 | dppT Task #2 | |
---|---|---|---|
dpp_init() | |||
xQueueReceive (blocked) | |||
dpp_deinit() | |||
Handling SIG_DPP_DEL_TASK | |||
dpp_init() | |||
xQueueReceive (blocked) | |||
s_dpp_evt_queue = NULL; | |||
xQueueReceive crashes: s_dpp_evt_queue is NULL) |
It seems to me like the fix would be to have a condition variable and associated state that deinit sets true, SIG_DPP_DEL_TASK sets false, and init checks for / waits until it is false. A workaround that seems reliable enough is to sleep for 1s between deinit/init though of course this isn't really a fix.
This also suggests to me the proposed fix of clearing s_dpp_stop_listening in esp_supp_dpp_start_listen is even more important.
EDIT: Filed a separate issue for this, #10879
@gayafhannah Did you ever test on the master branch? I started prepping my separate s_dpp_stop_listening fix and noticed that while v5.0.1 still has issues, the master branch is actually working correctly! I checked in wireshark on a promiscuous third party device and discovered that in fact the OTA packets are distinct between v4.4.4/v5.0.1 and master. For the former, there are several action frame transmits in response to the initial auth request from my phone but they are all ignored by my phone. The phone eventually times out and gives up, despite in theory the packets being made available. For master, there is exactly one transmit of auth response from the esp32 and the phone picks it up right away and the whole process completes successfully really fast.
The only difference I can think of here is if the wifi libs got updated in master and fixed some subtle but important bug in the action frame tx path.
Next step I was going to try setting up wpa_supplicant on my PC so I can increase the debug logs and understand why the packets I can see sent from the esp32 are not being accepted/received by the peer, but this is slow going :)
Hey @gayafhannah @knight-ryu12 @jasta we are actively looking into this issue. For us, it is little bit hard to stimulate one consistent behavior . Also @gayafhannah regarding your PR we are trying to make more changes and make the state machine cleaner. Meanwhile, you can try out these changes
BOOTSTRAP_ROC_WAIT_TIME
in components/wpa_supplicant/esp_supplicant/src/esp_dpp_i.h
to 650Note that I was able to make things work with https://github.com/espressif/esp-idf/pull/10865. I also dug into Wireshark a bit and saw some fairly surprising results (packets going out but not picked up by the configurator phone) which made me think this is why they changed the way the initial listening phase works in DPP V2. I think the right path to go down now is to upgrade wpa_supplicant.
Hey, I am experiencing the same issue. DPP fails to connect. Is there a workaround without problems?
Hey, I am experiencing the same issue. DPP fails to connect. Is there a workaround without problems?
Can you try with a version that has my patches applied (looks like espressif backported this to v4.3, v4.4 and v5.0 branches, but I don't think any versioned releases were made with the fix)? What you should see is that there are many automatic retries if you test with the dpp-enrollee sample and it will eventually work. In either case, can you report your findings here?
Hey, I am experiencing the same issue. DPP fails to connect. Is there a workaround without problems?
Can you try with a version that has my patches applied (looks like espressif backported this to v4.3, v4.4 and v5.0 branches, but I don't think any versioned releases were made with the fix)? What you should see is that there are many automatic retries if you test with the dpp-enrollee sample and it will eventually work. In either case, can you report your findings here?
Thanks for pointing out that you patched this. We decided to go with smartConfig instead. Maybe I will return to this in the future.
Hmm I'm also experiencing the same issue when running the dpp-enrollee sample, which has the fix, using:
I got it properly working on the master branch (V5.2), but I had to return to V5.0 because of incompatibility of other dependencies. Now I get inconsistent results. I got it to work once, and other times I get errors like:
I (888) phy_init: phy_version 610,2bff4c8,Jul 27 2023,20:22:14
I (928) wifi:mode : sta (f4:12:fa:db:91:d8)
I (928) wifi:enable tsf
I (938) wifi dpp-enrollee: Started listening for DPP Authentication
I (16848) wifi dpp-enrollee: DPP Auth failed (Reason: ESP_ERR_DPP_FAILURE), retry...
Shouldn't it auto-retry here (at least 5 times in the example?)? If I keep on trying in the UI on my phone, it eventually works, but this isn't workable.
I also tried to set the config options:
CONFIG_WPA_11KV_SUPPORT=y
CONFIG_WPA_SCAN_CACHE=y
CONFIG_WPA_MBO_SUPPORT=y
CONFIG_WPA_DPP_SUPPORT=y
But that didn't resolve it either.
Any ideas?
Hey @Stencht @jasta @gayafhannah We have prepared a patch which we think will fix this issue and need your help to test it. This patch is based on v5.0.3 (e004907bd53b68e6db2e29ae335d8b38cbb39472)
esp-idf
using command git am dpp_off_tx.patch
esp-idf/components/esp_wifi/lib
using command git am --binary dpp_off_tx_lib.patch
@Shreyas0-7 I tried your patch with an ESP32 + Android 14 (Lineage OS) phone.
Half the time it works. The other half, nothing is printed on the serial console, Android times out with Couldn't add device
and offers me to Retry. Clicking the Retry button always leads to another timeout. The only way to get a success in this case is to reboot the ESP32 and scan the QR code again.
@michael-betz Can you please set ESP into debug mode and attach ESP serial logs if possible
Hi, sorry for the delay.
Here's the complete debug log of dpp failing 5 times (I push the retry button on my phone 5 times)
Hi @michael-betz, thanks for the logs. Was the mobile connected to the AP during this time, and was the AP operating on another channel? From the logs, it looks like the mobile didn't receive the frames sent by the device (it didn't wait for sufficient time). T
his issue was later realized, and DPP R2 has provisions to mitigate this type of scenario.
Could you also try setting the ESP listen channel to be the same as the AP's channel and test it again?
@michael-betz Any update regarding the comment in https://github.com/espressif/esp-idf/issues/10615#issuecomment-2271641093 ?
Okay, I gave it another shot, this time I'm testing against a FRITZ!Box 7590 on channel 6.
Previously I was testing against an Android Hotspot (running on another phone, not the one I use for scanning the QR code).
I've set EXAMPLE_DPP_LISTEN_CHANNEL_LIST "6"
to match the channel of the target wifi.
The first 3 attempts failed as before.
Then I enabled WPA_DEBUG_PRINT
in menuconfig and re-compiled.
Surprisingly, the next 15 attempts all succeeded. Even after changing EXAMPLE_DPP_LISTEN_CHANNEL_LIST
to 12 and letting the phone connect to the wifi.
Then I disabled WPA_DEBUG_PRINT
, re-compiled and only 4 out of 8 attempts succeeded.
What a strange result.
So I've just repeated this test one more time.
EXAMPLE_DPP_LISTEN_CHANNEL_LIST "12"
With WPA_DEBUG_PRINT
enabled: 7 / 8 attempts succeeded
Same setup but WPA_DEBUG_PRINT
disabled: 2 / 8 attempts succeeded.
Could there be a race condition somewhere?
2024.09.06
My problem is solved. The actual problem is that the previous URI is used as the QRCode data source, which causes a mismatch.
IDF version: 5.3
Same problem, but when i read code, i found that caused by hash compare of keys: https://github.com/espressif/esp-idf/blob/0bbd72819602b4b8cf786a8833d0ee9c922268bf/components/wpa_supplicant/esp_supplicant/src/esp_dpp.c#L165-L172
The log start with DPP key pair generation until share AP secret failed from Android with QR code scan.
D (8736) wpa: DPP: Generating a keypair
E (8736) wpa: curve: secp256r1
E (8736) wpa: prime len is 32
E (8736) wpa: Q_x: - hexdump(len=32):
E (8737) wpa: da 13 d2 0d 14 a4 bc c4 4c 60 52 1b bb e2 99 d4
E (8738) wpa: 33 4f 6f 05 fa c5 84 0d a7 78 83 f7 01 dc 8e 5b
E (8738) wpa: Q_y: - hexdump(len=32):
E (8739) wpa: 23 59 c0 83 8a 32 57 a5 f1 e2 cb ca 4f 97 09 c5
E (8740) wpa: 2e 40 0b 22 24 1b b2 f1 87 80 a4 c8 37 21 d1 32
E (8740) wpa: d: - hexdump(len=32):
E (8741) wpa: 6b e5 79 52 ce bc d4 83 2f ba 29 39 43 3c d8 94
E (8742) wpa: 59 e7 ad 8e d2 a7 ab 17 cd e9 37 57 a1 41 7a 01
D (8743) wpa: DPP: Compressed public key (DER) - hexdump(len=59):
D (8744) wpa: 30 39 30 13 06 07 2a 86 48 ce 3d 02 01 06 08 2a
D (8744) wpa: 86 48 ce 3d 03 01 07 03 22 00 02 da 13 d2 0d 14
D (8745) wpa: a4 bc c4 4c 60 52 1b bb e2 99 d4 33 4f 6f 05 fa
D (8746) wpa: c5 84 0d a7 78 83 f7 01 dc 8e 5b
D (8746) wpa: DPP: Public key hash - hexdump(len=32):
D (8747) wpa: 77 e7 21 a5 b5 a4 e4 ab 94 cc d9 cb 29 29 99 24
D (8748) wpa: a0 60 20 7a 37 6d 64 77 72 b7 ac 52 41 71 18 eb
D (8748) wpa: DPP: URI mac: 30:30:f9:5a:8a:54
D (9251) wpa: DPP: Sent event 3 to DPP task
~~ Repeat ~~
D (11258) wpa: DPP: Sent event 3 to DPP task
I (01/01/70 00:00:10.598) [ scr:wifi_sc ] update qrcode done
D (11760) wpa: DPP: Sent event 3 to DPP task
~~ Repeat ~~
D (15772) wpa: DPP: Sent event 3 to DPP task
D (15788) wpa: DPP: Sent event 2 to DPP task
D (15788) wpa: DPP: Rx Public Action frame: action - 9
I (15789) wpa: DPP: Authentication Request from b6:c0:af:07:30:b6
V (15790) wpa: DPP: Responder Bootstrapping Key Hash - hexdump(len=32):
V (15790) wpa: 09 bb 6f cc 65 d0 90 e7 61 e5 00 d5 a7 91 8e f3
V (15791) wpa: 38 19 16 07 d0 5b a4 28 b5 a7 1b ff 1d 48 94 4a
V (15791) wpa: DPP: Initiator Bootstrapping Key Hash - hexdump(len=32):
V (15792) wpa: a9 18 43 12 b6 8b fa 58 b9 f8 3c 00 b7 6f 99 b2
V (15793) wpa: 42 a1 4c d2 da 56 d8 55 46 68 fb 6c 38 4f 92 07
I (15794) wpa: DPP: No matching own bootstrapping key found as responder - ignore message
D (15795) wpa: DPP: Sent event 3 to DPP task
E (01/01/70 00:00:14.657) [ antx:idf:wifi ] DPP auth fail, reason: ESP_ERR_DPP_INVALID_ATTR, data: 0x3099
I have experienced issues with DPP running on IDF v5.2.1, out of 47 attempts I experienced a failure rate of 51%. 40 attempts were made with verbose logging enabled, and WPA logs disabled, and 7 attempts were made with both verbose logging and WPA logs enabled.
v5.2.1
ESP32-S3-DevKitC-1
Motorola Edge 2021
Android 13
Logs of all failures can be found here
Then I enabled
WPA_DEBUG_PRINT
in menuconfig and re-compiled.Surprisingly, the next 15 attempts all succeeded. Even after changing
EXAMPLE_DPP_LISTEN_CHANNEL_LIST
to 12 and letting the phone connect to the wifi.Then I disabled
WPA_DEBUG_PRINT
, re-compiled and only 4 out of 8 attempts succeeded.What a strange result.
With
WPA_DEBUG_PRINT
enabled: 7 / 8 attempts succeededSame setup but
WPA_DEBUG_PRINT
disabled: 2 / 8 attempts succeeded.
As @michael-betz has also experienced, changing the value of WPA_DEBUG_PRINT
changes the results a lot, without any other code changed.
Hi @Kaden-mpc , I am sharing a debug build for dpp_enrollee example on v5.2.1 here with some extra logs and few fixes. Will you please test with this build and share the logs and sniffer capture?
The file flash_cmd contains command for flashing. Please use that to flash the device. Thanks.
@Aditi-Lonkar I have collected 20 logs of the build you shared, 8/20 failed. I have not yet collected any network packets. I will get back to you with those, and additional logs when I get them. The logs can be found here. Thank you for jumping on this issue!
@Kaden-mpc Thanks for trying the earlier build and sharing the logs. We are actively working on the issue, however we are not able to reproduce exact same results in our setup.
I am attaching one more build here with some more fixes added. Will you please test and share the logs and sniffer capture with this too?
The fixes includes 1) retrying sending auth response for 5 times on same channel in 1second interval if we didn't get auth confirm. 2) starting timer of 1second for getting auth confirm after getting ack for auth_response 3) staying on current channel until the whole auth process finishes 4) stopping listen when dpp returns failure and some reason code related changes. 5) keep rf enable until the offchannel transmission is done 6) starting a timer to fail if config not received in 120 secs
dpp_test_build_521_2.zip The file flash_cmd contains command for flashing. Please use that to flash the device.
Thanks.
@Aditi-Lonkar thank you for the responsiveness, @Kaden-mpc and I have the ESP logs from the new version but do not have the sniffer capture we are able to share yet due to our company data security policy. We are working on an alternative way to get the packet capture and will get that over to you early next week.
Thank you again for the help!
Answers checklist.
IDF version.
v4.4.3
Operating System used.
Windows
How did you build your project?
VS Code IDE
If you are using Windows, please specify command line type.
PowerShell
Development Kit.
ESP32-WROVER-DevkitC
Power Supply used.
USB
What is the expected behavior?
I expected it to correctly authenticate with Accesspoint.
What is the actual behavior?
It does not authenticate with accesspoint at all; Seems that it fails with WPA being
D (66148) wpa: Mgmt Tx Status - 1, Cookie - 0x400e036c
, and returns ESP_ERR_DPP_TX_FAILURE no matter what accesspoint it uses.Steps to reproduce.
Code It is 1:1 to the example, expect I removed the key part. which should just work without it.
Debug Logs.
More Information.
Seems afterwards it crashes Wi-Fi module.