Heavy use of rotary encoders may cause badge to lock up

smcameron commented 1 year ago

A couple of times, I've seen the badge lock up with heavy use of rotary encoders. Not sure if the encoders caused it. Difficult to reproduce.

smcameron commented 1 year ago

So, the Tank vs. Tank game is reasonably OK at reproducing this bug, though it is still difficult. I tried a patch to track alarms, on the suspicion that we are somehow missing/racing an alarm in the button debounce code, and with this alarm tracking code, it seems much more difficult to reproduce the problem. Looking for the problem seems to some extent to cure the problem. Attached is debug-debounce.txt the patch I used to try to track the alarms.

smcameron commented 1 year ago

Hmm, thinking more about it, I doubt the problem is a dropped alarm. The alarm handler sets up the gpio irq callback for just one button. So if we just missed one alarm, that means one button should go dead. But they all go dead, and in the tank vs. tank game, the tanks stop moving. If it were just an alarm missed, only one button should go dead, and the tanks should continue moving. So that makes me think it is not a missed alarm, or at least not only a missed alarm.

paulbruggeman commented 1 year ago

I need to look at the code.

The interupt stack maybe getting overwritten because they aren't being serviced often enough when fast changes are made.

Another thought is that the rotary filtering is failing and returning values that shouldn't exist.

On Thu, Jun 01, 2023 05:13:50AM -0700, smcameron wrote:

Hmm, thinking more about it, I doubt the problem is a dropped alarm. The alarm handler sets up the gpio irq callback for just one button. So if we just missed one alarm, that means one button should go dead. But they all go dead, and in the tank vs. tank game, the tanks stop moving. If it were just an alarm missed, only one button should go dead, and the tanks should continue moving. So that makes me think it is not a missed alarm, or at least not only a missed alarm.

-- Reply to this email directly or view it on GitHub: https://github.com/HackRVA/badge2023/issues/12#issuecomment-1571943455 You are receiving this because you are subscribed to this thread.

Message ID: @.***>

smcameron commented 1 year ago

in source/hal/button_rp2040.c, there is some stuff around the buttons that uses critical_section_enter_blocking(), and critical_section_exit().

If those are intended to prevent racing with the interrupt handler, I don't think they do that, or at least, I could not find how they would. critical_section_enter_blocking() eventually comes down to spin_lock_blocking(), the body of which amounts to a single line: "lock->locked = true;"

void PICO_WEAK_FUNCTION_IMPL_NAME(spin_lock_unsafe_blocking)(spin_lock_t *lock) {
    lock->locked = true;
}

PICO_WEAK_FUNCTION_DEF(spin_lock_blocking)

uint32_t PICO_WEAK_FUNCTION_IMPL_NAME(spin_lock_blocking)(spin_lock_t *lock) {
    spin_lock_unsafe_blocking(lock);
    return 1; // todo wrong value
}

Also I found it weird that "./_deps/pico_sdk-src/src/host/hardware_sync/sync_core0_only.c" contains this:

uint32_t PICO_WEAK_FUNCTION_IMPL_NAME(save_and_disable_interrupts)() {
    return 0;
}

Er, how does that disable interrupts?

paulbruggeman commented 1 year ago

Hopefully lock->lock is examined by the interupt routine to prevent interupting the interupt (I think)

On Thu, Jun 01, 2023 05:40:59AM -0700, smcameron wrote:

in source/hal/button_rp2040.c, there is some stuff around the buttons that uses critical_section_enter_blocking(), and critical_section_exit().

If those are intended to prevent racing with the interrupt handler, I don't think they do that, or at least, I could not find how they would. critical_section_enter_blocking() eventually comes down to spin_lock_blocking(), the body of which is a single line: "lock->lock = 1;"

Also I found it weird that "./_deps/pico_sdk-src/src/host/hardware_sync/sync_core0_only.c" contains this:
uint32_t PICO_WEAK_FUNCTION_IMPL_NAME(save_and_disable_interrupts)() {
   return 0;
}
Er, how does that disable interrupts?

-- Reply to this email directly or view it on GitHub: https://github.com/HackRVA/badge2023/issues/12#issuecomment-1571979638 You are receiving this because you commented.

Message ID: @.***>

paulbruggeman commented 1 year ago

I googled "rp2040 interrupts not working"

https://github.com/arduino/ArduinoCore-mbed/issues/253

also lots of people having encoder issues

https://github.com/firmata/ConfigurableFirmata/issues/133 https://github.com/arduino/ArduinoCore-mbed/issues/253

paulbruggeman commented 1 year ago

It maybe the pin pull up/down needs to be changed even though I don't think badge is using arduino code.

https://github.com/arduino/ArduinoCore-mbed/issues/253

" the issue here is about the pull mode being overridden when attachInterrupt is being called. In fact, mbed removes the pull configuration here https://github.com/ARMmbed/mbed-os/blob/master/drivers/source/InterruptIn.cpp#L34 ; we try to restore it here based on some heuristics but it doesn't perfectly match the normal behaviour. I'll take a close look at other possible workarounds; in the meantime, moving pinMode(INPUT, PULLUP) after attachInterrupt(CHANGE) will do the trick (ugly stuff, I must admit) "

On Thu, Jun 01, 2023 11:13:29AM -0400, @.*** wrote:

I googled "rp2040 interrupts not working"

https://github.com/arduino/ArduinoCore-mbed/issues/253

also lots of people having encoder issues

https://github.com/firmata/ConfigurableFirmata/issues/133 https://github.com/arduino/ArduinoCore-mbed/issues/253

paulbruggeman commented 1 year ago

it looks like: int64_t alarm_callback(attribute((unused)) alarm_id_t id, void* user_data)

could be called while the loop in: void gpio_callback(uint gpio, attribute((unused)) uint32_t events) alarm_pool_add_alarm_in_ms(...)

is still going, so an alarm may be added: alarm_pool_add_alarm_in_ms() while one is running alarm_callback(...)

This doesn't seem like a good idea to me.

On Thu, Jun 01, 2023 11:49:32AM -0400, @.*** wrote:

It maybe the pin pull up/down needs to be changed even though I don't think badge is using arduino code.

https://github.com/arduino/ArduinoCore-mbed/issues/253

" the issue here is about the pull mode being overridden when attachInterrupt is being called. In fact, mbed removes the pull configuration here https://github.com/ARMmbed/mbed-os/blob/master/drivers/source/InterruptIn.cpp#L34 ; we try to restore it here based on some heuristics but it doesn't perfectly match the normal behaviour. I'll take a close look at other possible workarounds; in the meantime, moving pinMode(INPUT, PULLUP) after attachInterrupt(CHANGE) will do the trick (ugly stuff, I must admit) "

On Thu, Jun 01, 2023 11:13:29AM -0400, @.*** wrote:

I googled "rp2040 interrupts not working"

https://github.com/arduino/ArduinoCore-mbed/issues/253

also lots of people having encoder issues

https://github.com/firmata/ConfigurableFirmata/issues/133 https://github.com/arduino/ArduinoCore-mbed/issues/253

smcameron commented 1 year ago

There's this though, which claims it is safe, I think:

https://github.com/raspberrypi/pico-sdk/blob/master/src/common/pico_time/include/pico/time.h#L487

alarm_id_t alarm_pool_add_alarm_at(alarm_pool_t *pool, absolute_time_t time, alarm_callback_t callback, void *user_data, bool fire_if_past);

/*!
 * \brief Add an alarm callback to be called at or after a specific time
 * \ingroup alarm
 *
 * The callback is called as soon as possible after the time specified from an IRQ handler
 * on the core the alarm pool was created on. Unlike \ref alarm_pool_add_alarm_at, this method
 * guarantees to call the callback from that core even if the time is during this method call or in the past.
 *
 * \note It is safe to call this method from an IRQ handler (including alarm callbacks), and from either core.
 *
 * @param pool the alarm pool to use for scheduling the callback (this determines which hardware alarm is used, and which core calls the callback)
 * @param time the timestamp when (after which) the callback should fire
 * @param callback the callback function
 * @param user_data user data to pass to the callback function
 * @return >0 the alarm id for an active (at the time of return) alarm
 * @return -1 if there were no alarm slots available
 */

paulbruggeman commented 1 year ago

https://electronics.stackexchange.com/questions/631232/alarm-gets-instantly-fired-on-a-raspberry-rp2040

Seems this could be relevant. He also claims to have to multiply by 10 to get the real delay time.

"You should evaluate all return codes. Even if you know their value, you should at least verify it using an assertion. For instance add_alarm_in_ms() returns -1 if there were no alarm slots available, and 0 if the alarm time passed before or during the call AND there is no active alarm to return the id of"

Is it locking up the badge?

This line from the docs below is kind of unexpected to me. Seems it could create problems with debounce_alarm_count++ and debounce_alarm_count--

guarantees to call the callback from that core even if the time is during this method call or in the past.

On Thu, Jun 01, 2023 11:15:40AM -0700, smcameron wrote:

There's this though, which claims it is safe, I think:

https://github.com/raspberrypi/pico-sdk/blob/master/src/common/pico_time/include/pico/time.h#L487

alarm_id_t alarm_pool_add_alarm_at(alarm_pool_t *pool, absolute_time_t time, alarm_callback_t callback, void *user_data, bool fire_if_past);

/*!
* \brief Add an alarm callback to be called at or after a specific time
* \ingroup alarm
*
* The callback is called as soon as possible after the time specified from an IRQ handler
* on the core the alarm pool was created on. Unlike \ref alarm_pool_add_alarm_at, this method
* guarantees to call the callback from that core even if the time is during this method call or in the past.
*
* \note It is safe to call this method from an IRQ handler (including alarm callbacks), and from either core.
*
* @param pool the alarm pool to use for scheduling the callback (this determines which hardware alarm is used, and which core calls the callback)
* @param time the timestamp when (after which) the callback should fire
* @param callback the callback function
* @param user_data user data to pass to the callback function
* @return >0 the alarm id for an active (at the time of return) alarm
* @return -1 if there were no alarm slots available
*/

-- Reply to this email directly or view it on GitHub: https://github.com/HackRVA/badge2023/issues/12#issuecomment-1572561968 You are receiving this because you commented.

Message ID: @.***>

smcameron commented 1 year ago

So what makes me think it's not alarm related is that if it just dropped one alarm, I would kind of expect the symptom to be one button quits working because the gpio IRQ doesn't get re-enabled. Instead, all buttons quit working, and the screen doesn't update, which to me feels more like a deadlock where something is stuck on trying to lock an already locked spinlock, and nothing unlocks it. But, I could be talking cross-eyed badger spit.

The problem is also pretty hard to duplicate, but I've got the badge locked up right now in front of me, and it's unresponsive, but, hard to tell what's going on inside it. Oh, interesting, the screensaver still triggers though if you just wait long enough for it.

Maybe we could use some of the watchdog stuff to detect lockups: https://github.com/raspberrypi/pico-sdk/blob/master/src/rp2_common/hardware_watchdog/include/hardware/watchdog.h -- though since the screensaver still triggers, I'm guessing the watchdog would not fire in this instance.

paulbruggeman commented 1 year ago

in alarm_callback()

shouldn't: process_rotary_pin_state(gpio_enum, state);

be: process_rotary_pin_state(gpio, state);

On Thu, Jun 01, 2023 11:48:54AM -0700, smcameron wrote:

So what makes me think it's not alarm related is that if it just dropped one alarm, I would kind of expect the symptom to be one button quits working because the gpio IRQ doesn't get re-enabled. Instead, all buttons quit working, and the screen doesn't update, which to me feels more like a deadlock where something is stuck on trying to lock an already locked spinlock, and nothing unlocks it. But, I could be talking cross-eyed badger spit.

The problem is also pretty hard to duplicate, but I've got the badge locked up right now in front of me, and it's unresponsive, but, hard to tell what's going on inside it. Oh, interesting, the screensaver still triggers though if you just wait long enough for it.

Maybe we could use some of the watchdog stuff to detect lockups: https://github.com/raspberrypi/pico-sdk/blob/master/src/rp2_common/hardware_watchdog/include/hardware/watchdog.h -- though since the screensaver still triggers, I'm guessing the watchdog would not fire in this instance.

-- Reply to this email directly or view it on GitHub: https://github.com/HackRVA/badge2023/issues/12#issuecomment-1572602176 You are receiving this because you commented.

Message ID: @.***>

smcameron commented 1 year ago

It does kind of look that way... but how could it even begin to work if it's that wrong? LOL. Is this going to be one of those bugs that now that we've noticed it, everything stops working?

And now I have a locked up badge, and the screensaver is not activating this time.

smcameron commented 1 year ago

Just tried this change:

diff --git a/source/hal/button_rp2040.c b/source/hal/button_rp2040.c
index 9abc028d..83395717 100644
--- a/source/hal/button_rp2040.c
+++ b/source/hal/button_rp2040.c
@@ -91,7 +91,7 @@ int64_t alarm_callback(__attribute__((unused)) alarm_id_t id, void* user_data) {
         if (user_cb) {
             user_cb(gpio_enum, state);
         }
-        process_rotary_pin_state(gpio_enum, state);
+        process_rotary_pin_state(gpio, state);
         last_change = rtc_get_ms_since_boot();
     }

Rotary encoders quit working... which is both surprising and not surprising.

smcameron commented 1 year ago

Ok, I see

gpio_enum should really be called badge_button_enum, the enums that start BADGEBUTTON... are indexes into button_gpios[] and the values in button_gpios[] are the gpio pin numbers.

process_rotary_pin_state() wants the index, not the gpio pin number, as you can see it comparing gpio against various BADGEBUTTON... enum values.

smcameron commented 1 year ago

Do we have something like this that can debug the badge: https://www.raspberrypi.com/documentation/microcontrollers/debug-probe.html

paulbruggeman commented 1 year ago

lol it works because the enums for BADGE_BUTTON are the same as the const button_gpios[BADGE_BUTTON_MAX] veryyyy sketchy

Are you testing with USB power? Maybe try testing with the battery?

It is possible the badge is using too much power when spinning the encoder. Doesn't seem possible, but a spike in processing could Do Somethine Bad.

On Thu, Jun 01, 2023 12:03:47PM -0700, smcameron wrote:

It does kind of look that way... but how could it even begin to work if it's that wrong? LOL. Is this going to be one of those bugs that now that we've noticed it, everything stops working?

And now I have a locked up badge, and the screensaver is not activating this time.

-- Reply to this email directly or view it on GitHub: https://github.com/HackRVA/badge2023/issues/12#issuecomment-1572624162 You are receiving this because you commented.

Message ID: @.***>

paulbruggeman commented 1 year ago

Yes, I have one that works with openOCD. It was quite difficult to use last time I worked with it. Kinda like gdb but command line only.

On Thu, Jun 01, 2023 12:15:12PM -0700, smcameron wrote:

Do we have something like this that can debug the badge: https://www.raspberrypi.com/documentation/microcontrollers/debug-probe.html

-- Reply to this email directly or view it on GitHub: https://github.com/HackRVA/badge2023/issues/12#issuecomment-1572637686 You are receiving this because you commented.

Message ID: @.***>

smcameron commented 1 year ago

I was using USB, but just tried it with battery and was able to lock it up after a few minutes playing with the tank vs tank game.

paulbruggeman commented 1 year ago

I just tested with my badge from sunday and it locked up lots of noise when just spinning the encoder on the menus.

On Thu, Jun 01, 2023 12:26:02PM -0700, smcameron wrote:

I was using USB, but just tried it with battery and was able to lock it up after a few minutes playing with the tank vs tank game.

-- Reply to this email directly or view it on GitHub: https://github.com/HackRVA/badge2023/issues/12#issuecomment-1572652358 You are receiving this because you commented.

Message ID: @.***>

smcameron commented 1 year ago

I just tested with my badge from sunday and it locked up lots of noise when just spinning the encoder on the menus.

We determined last night that this particular badge had very old firmware, and were not able to duplicate the problem last night on that badge once the firmware was updated.

This morning I just played Tank vs. Tank for 30 solid minutes with no problems on this very same (unnumbered) badge Paul was using with no problems at all. So now I am inclined to think that the problem I was seeing before is something specific to badge number 1 that I was previously using.

smcameron commented 1 year ago

I added the following patch to detect running out of alarms:

diff --git a/source/hal/button_rp2040.c b/source/hal/button_rp2040.c
index 905ba2d0..f9a4f800 100644
--- a/source/hal/button_rp2040.c
+++ b/source/hal/button_rp2040.c
@@ -120,11 +120,13 @@ void gpio_callback(uint gpio, __attribute__((unused)) uint32_t events) {
              * See: pico-sdk-src/src/common/pico_time/include/pico/time.h line 477
              * and surrounding comments.
              */
-            alarm_pool_add_alarm_in_ms(alarm_pool_get_default(),
+            alarm_id_t a = alarm_pool_add_alarm_in_ms(alarm_pool_get_default(),
                                        DEBOUNCE_DELAY_MS,
                                        alarm_callback,
                                        (void*) button,
                                        true);
+            if ((int) a == -1)
+               printf("gpio_callback: out of alarms\n");
         }
     }
 }

I ran the badge with this patch and monitored it with minicom, and twisted the rotary knobs as fast as I could (short of wrapping a string around the knob and pull-starting the badge) and never saw any messages about running out of alarms.

smcameron commented 1 year ago

I'm going to close this issue now. If we see similar problems on other badges, we can re-open it.

HackRVA / badge2023

Heavy use of rotary encoders may cause badge to lock up #12