Closed nand-nor closed 2 months ago
Thanks! I guess having these checks won't hurt even once we identify the root cause (but definitely finding that would be great).
Do you see the warn
from https://github.com/esp-rs/esp-hal/blob/6b6e628940eec34e9a4cf0f68bfc3a7b9aac76e6/esp-ieee802154/src/raw.rs#L399 when this happens? Because in that code we previously already enqueued the faulty frame for processing 🤔
Thanks! I guess having these checks won't hurt even once we identify the root cause (but definitely finding that would be great). Do you see the
warn
from https://github.com/esp-rs/esp-hal/blob/6b6e628940eec34e9a4cf0f68bfc3a7b9aac76e6/esp-ieee802154/src/raw.rs#L399 when this happens? Because in that code we previously already enqueued the faulty frame for processing 🤔
Yea I do see that log since I applied that esp-hal
fix locally, so that is the main driver behind my suspicion that something is happening at lower layer processing that passes up a bad frame (or, at least a frame with a bad length). So I think that somewhat helps divide the problem space a bit as long as it is not a red herring.
Hopefully will have free time soon to do some static analysis of the crates ( esp-ieee802154
or and it's dependencies like the ieee802154
crate) involved in the rcv path to see if that knocks loose ideas for further root causing. Would probably also be good to dust off my (very minimal) OpenOCD skills and see if I can set up gdb...
Description
Some minor band-aid fixes to these PRs in
esp-hal
, both panics have shown up on esp32c6 and esp32h2 devices.No root cause yet (I have not had much time to spend on this lately) but may be worth noting this arose with the fixes applied in the linked PRs. So this may be indication that its a bug in the phy / mac layer logic, potentially in the
ieee802154
crate or somewhere inesp-ieee802154
or one of the other dependencies. I have very limited evidence to base this on other than the output at the time of the panic, log shows whereesp-ieee802154
caught and handled it without modifying the buffer, so the (invalid?) frame was passed as-is toesp-openthread
for further processing & panics via similar logic. May be a red herring but sharing in case it is helpful.More details
Theres two fixes included in this commit, one for avoiding a panic while processing a received frame and one for avoiding a panic during rssi parsing. Similar to this this fixes a panic that arises when
esp-openthread
is passed a frame with a 0th index containing a length larger than the receiving buffer. The other fix handles a panic I saw a handful of times, only on esp32h2 devices, at first start up, where checking for rssi panics because the value atraw.data[0]
is 0, so have included a check for that. I can share log snippets too if helpful.Also worth noting for these band-aids I opted to prioritize logging in the checks even though it makes things terribly long / messy to look at. Im on the fence as to whether it is actually worth it, so if there is a preference let me know & I'll fixup. For example the rcv frame parse check could be distilled down to something much cleaner like:
Testing
The test for the rcv buffer fix is not exhaustive as it is not yet possible for me to consistently reproduce, but I have run the fix with the
esp-openthread
example bin and have not observed any regressions. The test checking the other rssi-related fix was easier as that did pop up somewhat frequently, once every 10 runs or so only on esp32h2 platforms. Im running longevity tests now and will follow up as needed.