adafruit / Adafruit_TinyUSB_Arduino

Arduino library for TinyUSB
MIT License
465 stars 120 forks source link

Serial crash when receiving incoming data faster than handled #293

Open pludov opened 1 year ago

pludov commented 1 year ago

Operating System

Linux

Arduino IDE version

Platform.io with

Board

pico

ArduinoCore version

earlephilhower#master

TinyUSB Library version

2.1.0

Sketch as ATTACHED TXT

The attached program just returns input or output '.' when no output is available, as fast as possible. It limits the rate to 1000 characters per second, by using a delay in the loop.

serial-crash.zip

Using that program writing data from the host will quickly crash the pico (like by pasting 10 lines of codes again and again) When crashed, the pico won't reboot for sketch upload.

Removing the delay in the loop seems to avoid crash of the program.

I've tried running on the second core, with the same results.

Compiled Log as ATTACHED TXT

No log available

What happened ?

The pico becomes unresponsive. Serial seems dead.

How to reproduce ?

Run the attached platform io program Watch the serial monitor Send bunch of data

The output will stop at some point, the pico won't even reboot.

Debug Log

No response

Screenshots

No response

pludov commented 1 year ago

I could plug a debugger and got the following stack-trace on a crashed pico. It seems the ISR hangs when trying to acquire a lock that is already acquired by the interrupted code.

Maybe TinyUSB_Device_FlushCDC should acquire __usb_mutex from Adafruit_TinyUSB_rp2040.cpp, like it's done for TinyUSB_Device_Task ?

Stacktrace of the stuck pico :

    __best_effort_wfe_or_timeout_veneer@0x20000788 (Unknown Source:0)
    mutex_enter_block_until@0x2000041c (/mutex_enter_block_until.dbgasm:34)
    mutex_enter_timeout_ms@0x20000458 (/mutex_enter_timeout_ms.dbgasm:24)
    osal_mutex_lock@0x10008308 (/home/ludovic/.platformio/packages/framework-arduinopico/libraries/Adafruit_TinyUSB_Arduino/src/osal/osal_pico.h:89)
        mutex_hdl = 0x2000206c <_ubsd_mutexdef>
    tu_edpt_claim@0x10008308 (/home/ludovic/.platformio/packages/framework-arduinopico/libraries/Adafruit_TinyUSB_Arduino/src/tusb.c:121)
    usbd_edpt_claim@0x10007240 (/home/ludovic/.platformio/packages/framework-arduinopico/libraries/Adafruit_TinyUSB_Arduino/src/device/usbd.c:1216)
    _prep_out_transaction@0x10004a4e (/home/ludovic/.platformio/packages/framework-arduinopico/libraries/Adafruit_TinyUSB_Arduino/src/class/cdc/cdc_device.c:93)
    cdcd_xfer_cb@0x10004ff2 (/home/ludovic/.platformio/packages/framework-arduinopico/libraries/Adafruit_TinyUSB_Arduino/src/class/cdc/cdc_device.c:453)
    tud_task_ext@0x10007858 (/home/ludovic/.platformio/packages/framework-arduinopico/libraries/Adafruit_TinyUSB_Arduino/src/device/usbd.c:556)
    tud_task@0x100048ee (/home/ludovic/.platformio/packages/framework-arduinopico/libraries/Adafruit_TinyUSB_Arduino/src/device/usbd.h:55)
    usb_task_irq@0x100048ee (/home/ludovic/.platformio/packages/framework-arduinopico/libraries/Adafruit_TinyUSB_Arduino/src/arduino/ports/rp2040/Adafruit_TinyUSB_rp2040.cpp:86)
    <signal handler called>@0xfffffff9 (Unknown Source:0)
    mutex_enter_block_until@0x20000406 (/mutex_enter_block_until.dbgasm:26)
    mutex_enter_timeout_ms@0x20000458 (/mutex_enter_timeout_ms.dbgasm:24)
    osal_mutex_lock@0x10008308 (/home/ludovic/.platformio/packages/framework-arduinopico/libraries/Adafruit_TinyUSB_Arduino/src/osal/osal_pico.h:89)
        mutex_hdl = 0x2000206c <_ubsd_mutexdef>
    tu_edpt_claim@0x10008308 (/home/ludovic/.platformio/packages/framework-arduinopico/libraries/Adafruit_TinyUSB_Arduino/src/tusb.c:121)
    usbd_edpt_claim@0x10007240 (/home/ludovic/.platformio/packages/framework-arduinopico/libraries/Adafruit_TinyUSB_Arduino/src/device/usbd.c:1216)
    tud_cdc_n_write_flush@0x10004b8c (/home/ludovic/.platformio/packages/framework-arduinopico/libraries/Adafruit_TinyUSB_Arduino/src/class/cdc/cdc_device.c:194)
    TinyUSB_Device_FlushCDC@0x10003f0e (/home/ludovic/.platformio/packages/framework-arduinopico/libraries/Adafruit_TinyUSB_Arduino/src/arduino/Adafruit_TinyUSB_API.cpp:54)
    yield@0x10008516 (/home/ludovic/.platformio/packages/framework-arduinopico/cores/rp2040/delay.cpp:52)
    __loop@0x10008796 (/home/ludovic/.platformio/packages/framework-arduinopico/cores/rp2040/main.cpp:66)
anrp-tri commented 1 year ago

FYI I ran in to a similar looking issue, so I speculatively tried this on a similar (?) codebase, ^ and it appears to have significantly improved on the issue of serial getting stuck (note, it still appears to happen, but less often.)

doctea commented 1 year ago

Hi, I'm also running into a similar problem, where the RP2040 appears to lock up/freeze when connected over USB (serial and MIDI device). If I put a debugger on it, I'm also seeing that it seems to get stuck forever in __best_effort_wfe_or_timeout. The time it takes before this happens seems to vary, but nonetheless happens a lot.

Serial isn't actually necessary for my app, as its mainly a MIDI thing, but nevertheless I find that if anything is sent on the serial connection but the host doesn't open the connection, the rp2040 will freeze; while it also seems to randomly freeze after a while anyway even with the host servicing the serial, but nothing being sent to it.

@anrp-tri, I've attempted to apply your patch hoping it would ameliorate the problem, but when trying to compile the latest TinyUSB version with your patch applied, I hit up against problems. Are there any changes necessary to make it work with the up-to-date TinyUSB lib?

doctea commented 1 year ago

Well, I managed to get @anrp-tri 's patch to apply OK in the end, but it hasn't seemed to improve much for me.

I'm unclear if the problem I was running into is the same one reported here, but for me I was getting repeatedly and intermittently getting stuck in _best_effort_wfe_ortimeout and finding the timeout set to OSAL_TIMEOUT_WAIT_FOREVER, so I guess I was stuck in a deadlock and never returning.

Applying the patch below seems to have worked around this issue for me -- no other side effects observed yet, but maybe this causes problems elsewhere?

index 95cb55a..7dccf4e 100644
--- a/src/tusb.c
+++ b/src/tusb.c
@@ -118,7 +118,7 @@ bool tu_edpt_claim(tu_edpt_state_t* ep_state, osal_mutex_t mutex)

   // pre-check to help reducing mutex lock
   TU_VERIFY((ep_state->busy == 0) && (ep_state->claimed == 0));
-  (void) osal_mutex_lock(mutex, OSAL_TIMEOUT_WAIT_FOREVER);
+  (void) osal_mutex_lock(mutex, OSAL_TIMEOUT_NORMAL); //OSAL_TIMEOUT_WAIT_FOREVER);^M

   // can only claim the endpoint if it is not busy and not claimed yet.
   bool const available = (ep_state->busy == 0) && (ep_state->claimed == 0);
SeongGino commented 9 months ago

Oh my god is this actually my issue the whole time?

I'm currently working on an open source lightgun project using the RP2040 as my primary target and reference, using TinyUSB. At a certain point, one of the feature requests I've gotten is implementing support for MAMEHOOKER, which is a PC-side program that, among other things, sends a constant stream of serial commands.

After some time, however, the same exact thing being reported here happens; the serial line dies, the board becomes unresponsive, whatever buttons that were depressed stay stuck on/off, and I have to physically pull and reinsert the USB connection to get it back up and running.

This only happens with serial, but not just with ingest - even with adding just a simple Serial.println("text"); in the main program loop, doing this and making the device send USB data at the same time (in this case, mouse/keyboard updates or presses) will cause it to crash randomly. It's completely unpredictable when, or even if this happens.

I've tried to limit how often either serial is read or the button signals are sent (at this point now, serial and USB data essentially takes turns), but no matter what I do the crashing will happen eventually. And in this case, it's a key cornerstone of a feature request I've been adamant on implementing lately.

SeongGino commented 9 months ago

I did start hosting a fork with anrp-tri's fix combined with doctea's addendum in the hopes that it resolves the issue in the short term.

However, I did just receive a crash (lockup) report, even with the patched core. Is @doctea's patch applied on top of anrp-tri's modified core, or to upstream?

Now I'm wondering; does this happen when using Pico SDK's USB stack? Or is this really exclusive to TinyUSB? I would test, but some of the input libraries I use rely on TinyUSB.

I've also also noticed we're all using RP2040 hardware - is this exclusive to the Pico and its derivatives? Or has it also happened on other boards when using the TinyUSB stack?

It seems like #238 is related.

doctea commented 9 months ago

I did start hosting a fork with anrp-tri's fix combined with doctea's addendum in the hopes that it resolves the issue in the short term.

However, I did just receive a crash (lockup) report, even with the patched core. Is @doctea's patch applied on top of anrp-tri's modified core, or to upstream?

My patch applies fine onto the upstream without needing to apply anrp-tri's patches first.

I've just ended up reinstalling my libraries and so lost all the patches I'd previously applied to it, and had the dreaded return of these intermittent freezes. Since applying my patch from up above, I haven't had a freeze so far. As the freezes were intermittent anyway its too early to say for sure, but this might have solved the problem entirely for me. Quite possible that @anrp-tri's patch fixes something important too, or that there are side-effects from my fix that I haven't noticed yet.

I'll post back if I start getting crashes again!

(edit -- I also seem to be able to print to the USB serial monitor without crashing now too -- not sure if this is because of this fix, or something else that's changed, but definitely not something I was able to do before!)

SeongGino commented 9 months ago

I did start hosting a fork with anrp-tri's fix combined with doctea's addendum in the hopes that it resolves the issue in the short term. However, I did just receive a crash (lockup) report, even with the patched core. Is @doctea's patch applied on top of anrp-tri's modified core, or to upstream?

My patch applies fine onto the upstream without needing to apply anrp-tri's patches first.

I've just ended up reinstalling my libraries and so lost all the patches I'd previously applied to it, and had the dreaded return of these intermittent freezes. Since applying my patch from up above, I haven't had a freeze so far. As the freezes were intermittent anyway its too early to say for sure, but this might have solved the problem entirely for me. Quite possible that @anrp-tri's patch fixes something important too, or that there are side-effects from my fix that I haven't noticed yet.

I'll post back if I start getting crashes again!

(edit -- I also seem to be able to print to the USB serial monitor without crashing now too -- not sure if this is because of this fix, or something else that's changed, but definitely not something I was able to do before!)

I see! It may have been a mere misunderstanding on my part, because my initial instinct was to merge both your (and anrp-tri's) patches, but doing so I still got a report about those freezes.

I've since made a second package, just based on the latest 3.6.1 core from upstream with only your one-line patch in the tusb.c files, and so far it doesn't seem to cause an issue? My tester hasn't gotten back to me yet, but I hope it will at least hold things over in the short term--this problem was driving me up a darn wall otherwise. (^^;

EDIT: Got another crash report. I don't know if it's just that my use case is much more severe or what, in terms of data being sent in both directions, but this remains a showstopping issue.

doctea commented 9 months ago

EDIT: Got another crash report. I don't know if it's just that my use case is much more severe or what, in terms of data being sent in both directions, but this remains a showstopping issue.

Dang, sorry to hear that :(.

FWIW my patch was originally intended to be applied in conjunction with @anrp-tri 's, so you weren't mistaken there, but my patch didn't rely on anything in anrp-tri's.

Maybe I've just got lucky that I've not seen any crashes in my project lately. The only other things I can think that I've tweaked in my code since the last crashes is to be more careful about enabling/disabling interrupts in critical sections of code, perhaps that's contributed to the apparently increased stability.

SeongGino commented 9 months ago

FWIW, my code doesn't use interrupts--minus one hardware timer that updates a camera tick flag every ~4ms, but that's all.

I've looked at my external libraries that I use, and nothing else in either those libraries or the main sketch uses interrupts; there's only one other while() loop that gets triggered (for clearing an I2C buffer just-in-case for the camera) - the major things the code does is receives then processes serial data, and sends commands for the Arduino mouse/keyboard down the USB cable.

Perhaps if either I or my sole tester had a different board that wasn't an RP2040, it might be even more helpful now, aha. But it just so happens we both only have Pico-adjacent boards, as does everyone in this thread. (^^;

I could probably try and see if the Pico SDK USB stack changes things? But that would require redoing my project's input handling because we use a custom library tailored to TinyUSB for the absmouse & keyboard handling (and I'd probably explore that more if I didn't have other life obligations rn. Adulting is hard).

doctea commented 7 months ago

Just wanted comment back here again as I've just spent several days trying to figure out why my project was crashing when running it from USB MIDI clock again, until reading through relevant issues here reminded me of my above patch and that I hadn't re-applied it since reinstalling libraries. Doing so seems again to have solved the problem with crashes.

I also note that the same workaround/fix (changing OSAL_TIMEOUT_WAIT_FOREVER to OSAL_TIMEOUT_NORMAL) solved the problem for them too in #238 .

What are the ramifications of this? Is there a more proper fix?

SeongGino commented 7 months ago

I can say I haven't run into crashes personally when using a forked version of Earle's Pico core with the workaround applied (I've heard one conflicting report otherwise, but have yet to hear back from them), and at least for my purpose it hasn't affected the stability of what I'm trying to do either.

Still, not an ideal circumstance, for sure.

SeongGino commented 7 months ago

Okay, update: I'm indeed running into the problem anew. Turns out it's because I had the TinyUSB library somehow installed by a dependency, which was overriding the fixed version built-in to the core. Removing the separate library version caused a change in file size and made it not hang again.

palthedog commented 1 month ago

I've looked into the issue today so let me share some thoughts.

The workaround @doctea shared long time ago: OSAL_TIMEOUT_WAIT_FOREVER -> OSAL_TIMEOUT_NORMAL hides the problem (prevent the deadlock) but may cause even more complecated problems.

I think there can be at least 2 issues:

  1. tu_edpt_claim() ignores the mutex-lock result. As a consequence, multiple user codes can claim a same endpoint at a same time. (broken data can be sent/received, I guess. I'm not familiar with TinyUSB implementation )
  2. Your application code freezes for 10ms very often (as often as the deadlock happens w/o the patch). Yeah, OSAL_TIMEOUT_NORMAL means to wait for 10ms instead of forever.

A reason why the mutex lock isn't released forever is very simple. It happens if:

  1. tu_edpt_claim() is called via Adafruit API (e.g. Adafruit_USBD_HID::sendReport). It locks the _usbd_mutex.
  2. Before tu_edpt_claim() releases the mutex, the device receives USBCTL_IRC interrupt request. It invokes tud_task(). See: TinyUSB_Port_InitDevice and usb_task_irq for RP2040
  3. tud_task() calls tu_edpt_claim() as well and it waits for forever because of OSAL_TIMEOUT_FOREVER.
  4. However, the mutex will never be released because the mutex is locked by the same core at 1.

Temporal fix for my project is to prevent the IRQ while calling Adafruit_USBD_HID::sendReport like:

irq_set_enabled(USBCTRL_IRQ, false);
bool result = usb_hid.sendReport(0, &report, sizeof(report));
irq_set_enabled(USBCTRL_IRQ, true);
// Might be a good idea to process events here?
// TinyUSB_Device_Task();

With this change, I don't see any deadlock/frame-drop on my device. However, I don't think it's feasible solution for many of you guys in this thread.

Finally, my question to @hathach is: Is it actually safe to call tud_task from IRQ handler? There seems to be many conflicted functions in other APIs (like Adafruit_USBD_HID::sendReport) RP2040 port seems to be written at 3 years ago, so you may have a better idea in your mind?

palthedog commented 1 month ago

Let me note that a reason why it happens only if we use multiple USB classes (e.g. Serial and HID, Serial and MIDI) is; There are some safe guards (w/o mutex) which prevents locking same mutex for same ENDPOINT. Like

TU_VERIFY((ep_state->busy == 0) && (ep_state->claimed == 0));
doctea commented 1 month ago

I've looked into the issue today so let me share some thoughts.

The workaround @doctea shared long time ago: OSAL_TIMEOUT_WAIT_FOREVER -> OSAL_TIMEOUT_NORMAL hides the problem (prevent the deadlock) but may cause even more complecated problems.

Thank you for looking into this and identifying that my simple workaround this isn't a proper fix! Hoping this leads to a proper fix :).

Wondering if the 10ms timeout on the deadlock explains some timing problems I see in my project, too (it is a MIDI app and the clock runs ever-so-slightly slower if a USB cable is connected..).

With this change, I don't see any deadlock/frame-drop on my device. However, I don't think it's feasible solution for many of you guys in this thread.

Indeed, my project doesn't call sendReport itself, so I don't think I can add the same "protection" to my project in this way? Would calling irq_set_enabled before I send data from my project perhaps be enough?

Excited to see progress on a real solution!

palthedog commented 1 month ago

@doctea I am grateful to you. I didn't lose my motivation while prototyping because of your workaround :)

Indeed, my project doesn't call sendReport itself, so I don't think I can add the same "protection" to my project in this way? Would calling irq_set_enabled before I send data from my project perhaps be enough?

I'm not familiar with MIDI implementation but as far as I checked, Adafruit_USBD_MIDI::write calls usbd_edpt_claim in its invocation chain, so suppressing IRQ handler befor write might help you.

Also don't forget to re-enable IRQ handler after write.

SeongGino commented 1 month ago

Temporal fix for my project is to prevent the IRQ while calling Adafruit_USBD_HID::sendReport like:

irq_set_enabled(USBCTRL_IRQ, false);
bool result = usb_hid.sendReport(0, &report, sizeof(report));
irq_set_enabled(USBCTRL_IRQ, true);
// Might be a good idea to process events here?
// TinyUSB_Device_Task();

With this change, I don't see any deadlock/frame-drop on my device. However, I don't think it's feasible solution for many of you guys in this thread.

I'll keep this info in mind! In my case, it might be useful; though it is possible that in my project, both cores might be trying to send input reports at around the same time (with one core sending mouse coords from camera processing, and the other handling button presses and analog polling). At least doctea's hacksaw fix has been working adequately for my needs in the meantime until I either perform all reports on a single thread (where IRQ flip-flopping would be safe) or the underlying deadlock is resolved.

palthedog commented 1 month ago

though it is possible that in my project, both cores might be trying to send input reports at around the same time

I'm sure that sending reports from multi-cores at the same time isn't problematic because mutex-lock works properly in this case. Like; the core-A can keep processing even while the core-B is waiting for the lock. Then, at some point, the core-A will release the lock. USBCTRL_IRQ is problematic because the interrupt happens on the same core.

fred-rabelo commented 1 month ago

I'm not familiar with MIDI implementation but as far as I checked, Adafruit_USBD_MIDI::write calls usbd_edpt_claim in its invocation chain, so suppressing IRQ handler befor write might help you.

Also don't forget to re-enable IRQ handler after write.

unfotunatly doing this did not work for me:

irq_set_enabled(USBCTRL_IRQ, false);  
PID::MIDI.sendControlChange(byte, byte, int);  // which calls ::send(const MidiMessage& inMessage), which calls mTransport.write();
irq_set_enabled(USBCTRL_IRQ, true);

board still crashes for my application

i was not able to find what .write() actually does though. I´m not a programmer and am having a difficult time digging in to the library.

for now i will use the timeout_normal solution to finish prototyping.

schollz commented 1 month ago

Just another data point - I am using the rp2040 with C code (no Adafruit libraries) using the pico-sdk v1.5.1 which bundles TinyUSB v0.15.0. I found that simply calling tud_midi_n_stream_read would result in a crash of my device after 2-5 minutes when listening to MIDI timing messages (source).

I implemented the OSAL_TIMEOUT_WAIT_FOREVER to OSAL_TIMEOUT_NORMAL change on tinyusb v0.15.0 and I'm not having any more crashes. Thank you @doctea :heart: