esp-rs / esp-openthread

Fixes for array access panics #12

Closed · nand-nor closed this 2 months ago

nand-nor commented 2 months ago

Description

Some minor band-aid fixes related to these PRs in esp-hal; both panics have shown up on esp32c6 and esp32h2 devices.

No root cause yet (I have not had much time to spend on this lately), but it may be worth noting that this arose with the fixes from the linked PRs applied. That may indicate a bug in the PHY/MAC layer logic, potentially in the ieee802154 crate, in esp-ieee802154, or in one of the other dependencies. I have very limited evidence for this beyond the output at the time of the panic: the log shows where esp-ieee802154 caught and handled the condition without modifying the buffer, so the (invalid?) frame was passed as-is to esp-openthread for further processing, which then panics via similar indexing logic. This may be a red herring, but I'm sharing it in case it is helpful.

More details

There are two fixes included in this commit: one avoiding a panic while processing a received frame, and one avoiding a panic during RSSI parsing. Similar to this, the first fixes a panic that arises when esp-openthread is passed a frame whose 0th index contains a length larger than the receiving buffer. The second handles a panic I saw a handful of times, only on esp32h2 devices, at first start-up, where the RSSI check panics because the value at raw.data[0] is 0, so I have included a guard for that. I can share log snippets too if helpful.
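As a rough sketch (not the exact code in this PR), the RSSI guard might look like the below; the function name and the assumed frame layout (length byte at data[0], RSSI byte trailing the PSDU) are illustrative assumptions only:

```rust
// Hypothetical sketch of the RSSI guard, assuming the frame layout puts the
// PSDU length at data[0] and the radio-appended RSSI byte right after the
// PSDU; none of these names come from the actual PR.
fn parse_rssi(data: &[u8]) -> Option<i8> {
    let len = *data.first()? as usize;
    // A zero (or oversized) length byte means there is no valid trailing
    // RSSI byte to read, so bail out instead of indexing and panicking.
    if len == 0 || len >= data.len() {
        log::warn!("invalid frame length byte {len}, skipping RSSI parse");
        return None;
    }
    Some(data[len] as i8)
}
```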

Also worth noting: for these band-aids I opted to prioritize logging in the checks, even though it makes things terribly long/messy to look at. I'm on the fence as to whether it is actually worth it, so if there is a preference let me know & I'll fixup. For example, the rcv frame parse check could be distilled down to something much cleaner like:

```rust
let len = (raw.data[0] as usize).min(RCV_FRAME_PSDU.len()).min(raw.data[1..].len());
```
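For contrast, a hypothetical expansion of that one-liner with a warn at each clamp might look like the following; the helper name and log messages are illustrative, not the PR's exact wording:

```rust
// Hypothetical expansion of the one-liner above: same clamping, but with a
// warn at each step so the log records which bound was violated. `data`
// stands in for raw.data and `psdu_cap` for RCV_FRAME_PSDU.len().
fn clamped_frame_len(data: &[u8], psdu_cap: usize) -> usize {
    let mut len = data[0] as usize;
    if len > psdu_cap {
        log::warn!("length byte {len} exceeds PSDU buffer {psdu_cap}, clamping");
        len = psdu_cap;
    }
    let available = data[1..].len();
    if len > available {
        log::warn!("length byte {len} exceeds received bytes {available}, clamping");
        len = available;
    }
    len
}
```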

Testing

Testing of the rcv buffer fix is not exhaustive, as I cannot yet reproduce the panic consistently, but I have run the fix with the esp-openthread example bin and have not observed any regressions. Testing the RSSI-related fix was easier, as that panic popped up somewhat frequently, roughly once every 10 runs, and only on esp32h2 platforms. I'm running longevity tests now and will follow up as needed.

bjoernQ commented 2 months ago

Thanks! I guess having these checks won't hurt even once we identify the root cause (but definitely finding that would be great). Do you see the warn from https://github.com/esp-rs/esp-hal/blob/6b6e628940eec34e9a4cf0f68bfc3a7b9aac76e6/esp-ieee802154/src/raw.rs#L399 when this happens? Because in that code we previously already enqueued the faulty frame for processing 🤔

nand-nor commented 2 months ago

> Thanks! I guess having these checks won't hurt even once we identify the root cause (but definitely finding that would be great). Do you see the warn from https://github.com/esp-rs/esp-hal/blob/6b6e628940eec34e9a4cf0f68bfc3a7b9aac76e6/esp-ieee802154/src/raw.rs#L399 when this happens? Because in that code we previously already enqueued the faulty frame for processing 🤔

Yea, I do see that log since I applied that esp-hal fix locally; that is the main driver behind my suspicion that something in lower-layer processing is passing up a bad frame (or at least a frame with a bad length byte). So I think that helps divide the problem space a bit, as long as it is not a red herring.
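To make that suspected flow concrete, here is a minimal sketch of the "warn but forward anyway" pattern; this is an assumed shape for illustration, not the actual esp-ieee802154 code behind the linked warn:

```rust
// Illustration only: assumed shape of the lower-layer receive path. A warn
// flags the suspect length byte but the frame is still forwarded unmodified,
// so any upper layer that trusts data[0] will hit the same bad value.
fn forward_frame(frame: &[u8], upper_layer_queue: &mut Vec<Vec<u8>>) {
    let len = frame.first().copied().unwrap_or(0) as usize;
    if len == 0 || len >= frame.len() {
        log::warn!("suspect frame length byte {len}; forwarding as-is");
    }
    // No drop, no clamp: the (invalid?) frame reaches esp-openthread intact.
    upper_layer_queue.push(frame.to_vec());
}
```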

Hopefully I'll have free time soon to do some static analysis of the crates involved in the rcv path (esp-ieee802154 and its dependencies, like the ieee802154 crate) to see if that shakes loose ideas for further root-causing. It would probably also be good to dust off my (very minimal) OpenOCD skills and see if I can set up gdb...