betrusted-io / betrusted-ec

Betrusted embedded controller (UP5K)
Other
44 stars 6 forks source link

EC firmware contains many unbounded `while...{}` loops that could block the control loop #3

Open samblenny opened 3 years ago

samblenny commented 3 years ago

The EC firmware has a lot of unbounded while loops that could block the control loop in the event of a peripheral getting into an unanticipated state.

Converting the while loops to for loops with timeout error handling seems like a potentially fruitful opportunity to improve the stability and responsiveness of the EC firmware control loop.

As an example of what I mean by unbounded loops:

$ cd betrusted-ec
$ grep -r 'while.*{}' *
sw/wfx_rs/src/hal_wf200.rs:            while wifi_csr.rf(utra::wifi::STATUS_TIP) == 1 {}
sw/wfx_rs/src/hal_wf200.rs:                while wifi_csr.rf(utra::wifi::STATUS_TIP) == 1 {}
sw/wfx_rs/src/hal_wf200.rs:                while wifi_csr.rf(utra::wifi::STATUS_TIP) == 1 {}
sw/wfx_rs/src/hal_wf200/debug.rs:        while uart_csr.rf(utra::uart::TXFULL_TXFULL) != 0 {}
sw/betrusted-hal/src/api_tusb320.rs:        while i2c.i2c_controller(TUSB320LAI_ADDR, Some(&txbuf), Some(&mut rxbuf), TUSB320_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_tusb320.rs:        while i2c.i2c_controller(TUSB320LAI_ADDR, Some(&txbuf), Some(&mut rxrev), TUSB320_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_tusb320.rs:        while i2c.i2c_controller(TUSB320LAI_ADDR, Some(&txwrbuf), None, TUSB320_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_tusb320.rs:        while i2c.i2c_controller(TUSB320LAI_ADDR, Some(&txwrbuf), None, TUSB320_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_tusb320.rs:        while i2c.i2c_controller(TUSB320LAI_ADDR, Some(&txbuf), Some(&mut status_regs), TUSB320_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_tusb320.rs:        while i2c.i2c_controller(TUSB320LAI_ADDR, Some(&txwrbuf), None, TUSB320_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_tusb320.rs:            while i2c.i2c_controller(TUSB320LAI_ADDR, Some(&txbuf), Some(&mut status_regs), TUSB320_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_tusb320.rs:            while i2c.i2c_controller(TUSB320LAI_ADDR, Some(&txwrbuf), None, TUSB320_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_charger.rs:            while i2c.i2c_controller(BQ24157_ADDR, Some(&txbuf), Some(&mut rxbuf), CHG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_charger.rs:        while i2c.i2c_controller(BQ24157_ADDR, Some(&txbuf), Some(&mut rxbuf), CHG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_charger.rs:        while i2c.i2c_controller(BQ24157_ADDR, Some(&txbuf), None, CHG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_charger.rs:        while i2c.i2c_controller(BQ24157_ADDR, Some(&txbuf), None, CHG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_charger.rs:        while i2c.i2c_controller(BQ24157_ADDR, Some(&txbuf), None, CHG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_charger.rs:        while i2c.i2c_controller(BQ24157_ADDR, Some(&txbuf), None, CHG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_charger.rs:        while i2c.i2c_controller(BQ24157_ADDR, Some(&txbuf2), None, CHG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_charger.rs:        while i2c.i2c_controller(BQ24157_ADDR, Some(&txbuf3), None, CHG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_charger.rs:        while i2c.i2c_controller(BQ24157_ADDR, Some(&txbuf), None, CHG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_bq25618.rs:            while i2c.i2c_controller(BQ25618_ADDR, Some(&txbuf), Some(&mut rxbuf), CHG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_bq25618.rs:            while i2c.i2c_controller(BQ25618_ADDR, Some(&txbuf), Some(&mut rxbuf), CHG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_bq25618.rs:        while i2c.i2c_controller(BQ25618_ADDR, Some(&txbuf), Some(&mut rxbuf), CHG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_bq25618.rs:        while i2c.i2c_controller(BQ25618_ADDR, Some(&txbuf), None, CHG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_bq25618.rs:        while i2c.i2c_controller(BQ25618_ADDR, Some(&txbuf), None, CHG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_bq25618.rs:        while i2c.i2c_controller(BQ25618_ADDR, Some(&txbuf), None, CHG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_bq25618.rs:        while i2c.i2c_controller(BQ25618_ADDR, Some(&txbuf), None, CHG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_bq25618.rs:        while i2c.i2c_controller(BQ25618_ADDR, Some(&txbuf), None, CHG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_bq25618.rs:        while i2c.i2c_controller(BQ25618_ADDR, Some(&txbuf), None, CHG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_bq25618.rs:        while i2c.i2c_controller(BQ25618_ADDR, Some(&txbuf), None, CHG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_lm3509.rs:            while i2c.i2c_controller(LM3509_ADDR, Some(&txbuf), None, BL_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_lm3509.rs:            while i2c.i2c_controller(LM3509_ADDR, Some(&txbuf), None, BL_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_lm3509.rs:            while i2c.i2c_controller(LM3509_ADDR, Some(&txbuf), None, BL_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_lm3509.rs:            while i2c.i2c_controller(LM3509_ADDR, Some(&txbuf), None, BL_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_lm3509.rs:                while i2c.i2c_controller(LM3509_ADDR, Some(&txbuf), None, BL_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_lm3509.rs:            while i2c.i2c_controller(LM3509_ADDR, Some(&txbuf), None, BL_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_gasgauge.rs:    while i2c.i2c_controller(BQ27421_ADDR, Some(&txbuf), None, GG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_gasgauge.rs:    while i2c.i2c_controller(BQ27421_ADDR, Some(&txbuf), None, GG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_gasgauge.rs:    while i2c.i2c_controller(BQ27421_ADDR, Some(&txbuf), Some(&mut rxbuf), GG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/api_gasgauge.rs:    while i2c.i2c_controller(BQ27421_ADDR, Some(&txbuf), Some(&mut rxbuf), GG_TIMEOUT_MS) != 0 {}
sw/betrusted-hal/src/debug.rs:        while uart_csr.rf(utra::uart::TXFULL_TXFULL) != 0 {}
sw/src/debug.rs:        while uart_csr.rf(utra::uart::TXFULL_TXFULL) != 0 {}
bunnie commented 3 years ago

All of the i2c_controller() calls already have a timeout embedded in them. I don't think it makes sense to wrap a time-out around a loop that already contains a timeout - it takes the time-out code and splatters it across 50 instances instead of putting it in one place to maintain.

If you are willing to accept that the i2c timeout structure is done correctly, that leaves two types of unbounded loops:

UART TX FULL could use a timeout on it, but it will lead to lost characters if the receiving computer is slow or not starting up. The work-around was to not print stuff at all when not debugging -- this has the added benefit of improving the performance of the EC; executing print statements add significant lag to the EC's performance, but sure, some sort of timeout could make sense there. I think I may have seen some PR you made to fix that anyways.

Wifi STATUS TIP - this one is more ambiguous if there is a benefit to doing a timeout. Small wifi commands send in about 16 CPU cycles or about 8 instructions, so pretty much by the time you run through one iteration of a checked loop you would be moving on. So it's more a performance concern. If the wifi link ends up being unstable, then adding something here that also has an error handler of resetting the wifi chip in the event of a time out makes a lot of sense.

However, simply timing out and "moving on" without noting that a wifi command had failed somehow may improve the responsiveness of the EC but would probably also make debugging network connectivity issues a lot harder. I think in a way maybe having it hang on or near the offending Wifi command is a feature not a bug because at least we can stop the EC execution and look at the current address in GDB (assuming you've left the debug bridge compiled into the gateware) and determine what the offending sequence of events was, as opposed to only figuring out something broke much later and trying to trace back through that.

bunnie commented 3 years ago

maybe you could turn this into a PR?

https://github.com/samblenny/betrusted-ec/commit/f56c1719e775e61dc478f596560790f74864ca3c

samblenny commented 3 years ago

For context, I'm thinking of rule number 2 from Gerard J. Holzmann's Power of 10 rules for developing safety critical code:

  1. All loops must have fixed bounds. This prevents runaway code.

The problem is that the functions in the conditionals for the while loops are not guaranteed to ever return the ending value. Some unexpected thing could happen (perhaps a loose GPIO wire at the top FPC pads?) to invalidate the assumptions that normally made the while loop safe, and then the firmware would get wedged in an infinite loop. Perhaps the while loops could be moved inside of the timeout checks? Or, switching to something like for _ in 0..999 { if ... { break; } } would provide a failsafe.

But, failing that safety check would need to trigger some type of handler for serious errors. Maybe a reset and retry, logging an error code if that was possible under the circumstances, or perhaps something as extreme as halting the CPU. A related data point is that, based on my experiments with serial port stuff, it seems that tight while loops that bang on the wishbone bus may be able to DoS wishbone-tool and prevent it from connecting.

That said, I've been reading the code for wrapping the hard I2C block, and I see that messing with the timing around the I2C calls in any way would probably be very problematic. I'm not excited about opening that can of worms.

For the serial TX buffer stuff, I think that commit is still incomplete. It seems like some of the hal drivers use another copy of debug.rs (sw/src/debug.rs vs. sw/betrusted-hal/src/debug.rs). Also, my working assumption is that, when forced to chose, it's probably better to drop serial debug characters than to get stuck in an infinite loop that blocks access to the COM bus.

Retaining the ability to to inform Xous about errors over the COM bus might be the biggest incentive to avoid getting stuck in loops. If the EC can talk to Xous, Xous can show an error message on the screen. That will be helpful for dealing with errors that happen in the wild rather than strictly on a workbench.

If you're interested in a PR that's limited to better error recovery around full TX buffers, I can look into that.

bunnie commented 3 years ago

But, failing that safety check would need to trigger some type of handler for serious errors. Maybe a reset and retry, logging an error code if that was possible under the circumstances, or perhaps something as extreme as halting the CPU. A related data point is that, based on my experiments with serial port stuff, it seems that tight while loops that bang on the wishbone bus may be able to DoS wishbone-tool and prevent it from connecting.

I think that's the big question -- what, then, do you do if the loop times out? Simply returning to the waiting function doesn't make things better. There is a WDT function, you could have a timed-out loop just reset the firmware, I suppose. But keep in mind you're running on like...a 5MIPS MCU, and error handling code has a real performance penalty.

Retaining the ability to to inform Xous about errors over the COM bus might be the biggest incentive to avoid getting stuck in loops. If the EC can talk to Xous, Xous can show an error message on the screen.

Sure...if you're excited about writing an error handling and reporting stack for the EC, all the more power to you! But doing this is probably beyond my ability. I am still not terribly comfortable with Rust's built-in error handling mechanisms.

A related data point is that, based on my experiments with serial port stuff, it seems that tight while loops that bang on the wishbone bus may be able to DoS wishbone-tool and prevent it from connecting.

Very likely. Wishbone-tool has a lot of stability issues, but it's a deugging tool and those tend to be held together with tape and glue. I usually just get them to work only well enough to debug the problem and no further.

If you're interested in a PR that's limited to better error recovery around full TX buffers, I can look into that.

Yes, targeted PRs are great. Small incremental changes that are easy to merge are the best.

If even the commit you referenced was a PR that'd be great. I guess I could also cherry-pick the commit out but generally PRs are preferred to cherry-picked commits, they reduce the friction to incorporate code.

samblenny commented 3 years ago

Okay, I will try to work up a PR that specifically addresses the TX buffer thing. I will also try to figure out if I can reasonably consolidate the two copies of debug.rs

samblenny commented 3 years ago

Sure...if you're excited about writing an error handling and reporting stack for the EC, all the more power to you!

I expect I may need to build something along those lines in order to have any hope of making reasonable progress with the network stack. Serial debug might be enough. Not sure.

samblenny commented 3 years ago

TX buffer flow control PR is here: Add TX buffer flow control to EC serial debug #4

bunnie commented 3 years ago

thanks a ton for that. I really appreciate the extra effort to format the patch!

samblenny commented 3 years ago

You're welcome.

samblenny commented 2 years ago

As far as I'm concerned this can be closed, unless you want to keep it around. Seems like the EC UART stability issues have been fixed and the I2C code appears to work reliably as-is.