dhiltonp / hexbright

The easiest way to start programming your hexbright.
BSD 2-Clause "Simplified" License
124 stars 114 forks source link

Intermittent hour-long hangs due to timestamp-overflow #38

Closed AustinSpafford closed 11 years ago

AustinSpafford commented 11 years ago

When micros() wraps around (roughly once an hour), hexbright::update() will intermittently hang or resume processing due to "now" wrapping around while "continue_time" is stuck near the high-end. Thankfully the fix is simple, and has been tested by using the modified hexbright library for a few days of continuous operation without issue.

This code:

    do {
      now = micros();
    } while (continue_time > now); // not ready for update

...needs to be changed to:

    do {
      now = micros();
    } while ((signed long)(continue_time - now) > 0); // not ready for update

I hope that helps! :)

dhiltonp commented 11 years ago

I'll have to go over the continue_time related code later to double check for the same bug (and revisit alternative implementations), but for the time being, I've applied your fix.

AustinSpafford commented 11 years ago

I'm glad to help! :)

By the way, for a more detailed explanation of what happens, below is the email I sent to my dad just after trying out the fix. As for the project where this occurred, as part of a recovery from a serious accident, I used the hexbright to build an animated tron-themed cane to help remove some of the stigma. It was fantastically successful (garnered tons of praise at PAX Prime), and I intend to post a video of it and share the source code once it gets approved by the lawyer for my case (widely-shared information is apparently something to be careful about).

Thank you for making the hexbright easier to use! Given how people responded, the end product felt like a magic show. :)

Playing with numbers here told me what to data-storage and time-units to look for: http://www.wolframalpha.com/input/?i=2%5E32+microseconds

Then I found this throttle in the library I'm using, which uses 32-bit unsigned storage and microseconds, exactly the suspect I'm looking for! do { now = micros(); } while (continue_time > now); // not ready for update

This is a classic bug, usually seen with occasional hangs in programs that have been running for 49.7 days (2^32 milliseconds). The fixed code is simply: do { now = micros(); } while ((signed long)(continue_time - now) > 0); // not ready for update

It's an evil bug, because each hour when the values returned by "micros()" wrap around, there's a random chance that "now" overshoots "continue_time", wrapping around and then being an entire hour behind it (which is why the cane appeared to lock up for about an hour). Even worse, when that hour is up, there's a small chance it will overshoot again and be hung for another hour.

I'll test it tomorrow, but it should be easier for me to sleep knowing that I'm damn-sure I found the bug!

Once I test out the fix and become damn certain it's correct, I'll notify the hexbright programmer so he can fix the bug in their library. :)

-Austin

P.S. While this was a relatively easy bug to fix, it's definitely rejuvenating to do something like that!

On Fri, Sep 6, 2013 at 5:23 AM, David Hilton notifications@github.comwrote:

I'll have to go over the continue_time related code later to double check for the same bug (and revisit alternative implementations), but for the time being, I've applied your fix.

— Reply to this email directly or view it on GitHubhttps://github.com/dhiltonp/hexbright/issues/38#issuecomment-23936129 .

dhiltonp commented 11 years ago

As soon as you sent in the bug I knew what condition you were talking about; I'm mostly surprised as I've never seen the bug in practice.

I've run code for at least 8 hours and not hit this bug. That's got to be due to a slight timing issue. Maybe your code is taking a little too long to run? If you set DEBUG to DEBUG_LOOP in libraries/hexbright/hexbright.h, it will tell you your average loop time.

I've just run a sample program; In a non-debug mode, continue_time is initially set to 1136 microseconds. I plugged in the exact formula into a loop that would cycle, and print the difference from maxint if we were within 20000 of maxint. Here is the output (how close continue_time gets to maxint): starting values: 1136 starting values: 9470 starting values: 17803 countdown to overflow: -19711 countdown to overflow: -11263 countdown to overflow: -2815 overflowed!: 1 overflowed!: 8334 overflowed!: 16667 countdown to overflow: -12287 countdown to overflow: -3839 overflowed!: 1 overflowed!: 8334 overflowed!: 16667 countdown to overflow: -12287 countdown to overflow: -3839 overflowed!: 1 overflowed!: 8334 overflowed!: 16667 countdown to overflow: -12287 countdown to overflow: -3839 overflowed!: 1 overflowed!: 8334 overflowed!: 16667 countdown to overflow: -12287 countdown to overflow: -3839

Here is the relevant code: const float update_delay = 8.3333333; // in lock-step with the accelerometer unsigned long continue_time = 1136; // actual startup time unsigned long maxint = 4294967295; while(true) { continue_time = continue_time+(1000*update_delay); if(continue_time>4294947296) { Serial.print("countdown to overflow: -"); Serial.println(maxint-continue_time); } else if (continue_time<20000) { Serial.print("overflowed!: "); Serial.println(continue_time-maxint); } }

I'm actually quite surprised by the lack of proper overflow - though it does make this analysis quite simple. It does seem to indicate that your program is taking a little too long to execute - effectively, you're missing some of the accelerometer samples and light change operations take a little longer than requested (and you hit this freezing bug).

By the way, there is an onboard reset switch that could have been useful.

AustinSpafford commented 11 years ago

Indeed, at the time I was running into it because some lazy serial-output was dragging the flashlight down to 10 Hz. Come to think of it, that explains quite clearly why I was seeing it reliably once an hour, and why it wasn't sticking for subsequent hours (the first iteration was lapsing in my heinously slow output code, but the throttle was breaking out due to the shear update-frequency). Hmm, it even explains why it seemed to only stick while in awake-mode (versus sleep-mode), since awake-mode was pouring out about three times the debug text (especially trying to disambiguate walking-steps from other motions).

That night I of course went through and trimmed down all the serial output to be just event-based messages. Besides, the spam of orientation and light-level data was no longer needed.

However, I learned my lesson that day, since I had not brought the tools needed to open the cane and get to the flashlight to reset it. It did lead to a funny situation where I said "I'm sorry, but my cane appears to have crashed!", much to the amazement of those I was apologizing to.

On Fri, Sep 6, 2013 at 2:43 PM, David Hilton notifications@github.comwrote:

As soon as you sent in the bug I knew what condition you were talking about; I'm mostly surprised I've never seen the bug in practice.

I've run code for at least 8 hours and not hit this bug. That's got to be due to a slight timing issue. Maybe your code is taking a little too long to run? If you set DEBUG to DEBUG_LOOP in libraries/hexbright/hexbright.h, it will tell you your average loop time.

I've just run a sample program; In a non-debug mode, continue_time is initially set to 1136 microseconds. I plugged in the exact formula into a loop that would cycle, and print the difference from maxint if we were within 20000 of maxint. Here is the output (how close continue_time gets to maxint): starting values: 1136 starting values: 9470 starting values: 17803 countdown to overflow: -19711 countdown to overflow: -11263 countdown to overflow: -2815 overflowed!: 1 overflowed!: 8334 overflowed!: 16667 countdown to overflow: -12287 countdown to overflow: -3839 overflowed!: 1 overflowed!: 8334 overflowed!: 16667

I'm actually quite surprised by the lack of proper overflow - though it does make this analysis quite simple. It does seem to indicate that your program is taking a little too long to execute - effectively, you're missing some of the accelerometer samples and light change operations take a little longer than requested (and you hit this freezing bug).

By the way, there is an onboard reset switch that could have been useful.

— Reply to this email directly or view it on GitHubhttps://github.com/dhiltonp/hexbright/issues/38#issuecomment-23971330 .

dhiltonp commented 11 years ago

So, this is executed as a floating point operation, and it has the reset to 1 problem. continue_time = continue_time+(1000*update_delay);

This doesn't have the problem, though. continue_time = continue_time+(int)(1000*update_delay);

I've applied this fix.

AustinSpafford commented 11 years ago

I'm confused, since continue_time has to wrap around about once an hour, as the restriction is micros() itself. Are you saying that this was in addition to fixing the while loop?

That's a good find regarding the implicit float-to-int cast. I'm not sure how 8333 compares versus the resolution near (float)2^32, but the primary danger would be continue_time failing to increment, rather than it wrapping around (large-float plus tiny-float equals unchanged large-float).

Hmm, attacking the root of the problem, would it make sense to instead redefine update_delay as update_delay_micros?

In any case, thank you for writing the library, and good luck!

On Sep 7, 2013, at 6:12 AM, David Hilton notifications@github.com wrote:

So, this is executed as a floating point operation, and it has the reset to 1 problem. continue_time = continue_time+(1000*update_delay);

This doesn't have the problem, though. continue_time = continue_time+(int)(1000*update_delay);

I've applied this fix.

— Reply to this email directly or view it on GitHub.