adafruit / circuitpython

CircuitPython - a Python implementation for teaching coding with microcontrollers
https://circuitpython.org
Other
4.04k stars 1.19k forks source link

hard fault on serial disconnect on MacroPad RP2040 #8824

Closed tlyu closed 6 months ago

tlyu commented 7 months ago

CircuitPython version

Adafruit CircuitPython 9.0.0-alpha.6-33-g9f016796e0-dirty on 2024-01-22; Adafruit Macropad RP2040 with rp2040

Code/REPL

# N/A; also happens with freshly reformatted CIRCUITPY drive

Behavior

soft reboot

Auto-reload is off.
Running in safe mode! Not running saved code.

You are in safe mode because:
CircuitPython core code crashed hard. Whoops!
Hard fault: memory access or instruction error.
Please file an issue with your program at github.com/adafruit/circuitpython/issues.
Press reset to exit safe mode.

Press any key to enter the REPL. Use CTRL-D to reload.

Description

Happens after some random number of serial disconnects, either at the REPL or at the "code done running" prompt. Only tested using tio on macOS 13.6.3. The CIRCUITPY volume also unmounts, with a warning about unsafe unmount, so I'm guessing it actually detached from USB in the process.

Additional information

No response

tlyu commented 7 months ago

Also apparently happens with

Adafruit CircuitPython 8.2.9 on 2023-12-06; Adafruit Macropad RP2040 with rp2040

so possibly not a new bug?

tlyu commented 7 months ago

Does not happen with nRF52840:

Adafruit CircuitPython 8.2.9 on 2023-12-06; Adafruit ItsyBitsy nRF52840 Express with nRF52840

(it was a board that I conveniently already soldered a debug harness to)

I'm not sure I have a an RP2040 board with easily accessible SWD traces to hook up a debugger to.

tannewt commented 7 months ago

Happens after some random number of serial disconnects, either at the REPL or at the "code done running" prompt.

What is causing these disconnects?

tlyu commented 7 months ago

I'm causing the disconnects by exiting tio by using Ctrl-T Q. That shouldn't cause a hard fault in CircuitPython.

tlyu commented 7 months ago

This also happens on:

Adafruit CircuitPython 8.2.9 on 2023-12-06; Adafruit KB2040 with rp2040

so I'm guessing it's generally true of the RP2040 port.

tlyu commented 7 months ago

Replicated on macOS 10.5.7 on an Intel MacBook Pro with

Adafruit CircuitPython 9.0.0-beta.0 on 2024-01-27; Adafruit Macropad RP2040 with rp2040

and tio v2.4.

tlyu commented 7 months ago

I'd be happy to get a stack trace or even do a bit of debugging once I get a RP2040 board with accessible SWD pins, which I don't currently have. It might be a while before I get around to obtaining one.

If you have a spare Pico or even better, a Feather RP2040 with SWD header that you can send my way, that would be great. Otherwise, I probably won't get one until the next time I have an Adafruit order to add on to.

tlyu commented 7 months ago

Additional observation: running this loop on a M2 MacBook Air will succeed for a long time (several minutes before I stopped it), but will tend to cause the crash once interrupted with Ctrl-C:

while :; do stty -a < /dev/cu.usbmodem1101; done

In the past, when tracing USB CDC traffic on a different device with a Beagle, I noticed that macOS will sometimes abort a CDC-related control transfer partway through, and I wonder if that's related.

ure commented 7 months ago

Same issue here ... ;(

dhalbert commented 7 months ago

@tlyu what version of macOS are you using? @ure What is your OS, and if macOS, what kind of Mac?

tlyu commented 7 months ago

Mentioned above: 13.6.3 on the M2 MBA; 10.5.7 on the Intel MBP. If it does happen to be a timing issue, CPU speed and system load might be a factor.

ure commented 7 months ago

Al running sonoma 14.3 one mac mini m1, one studio m2, macbook m1. I can confirm that this does not occur on linux, don't have access to a windows computer

tlyu commented 7 months ago

It's likely to be a lower-level problem, possibly with the Pico SDK. On a hunch, I uploaded a Blink sketch from the Arduino IDE, Philhower core v3.6.2. connecting and then disconnecting with tio causes the program to freeze and the device to become unresponsive on USB CDC until physically reset.

ure commented 7 months ago

@tlyu ok super, if fixed upstream this will follow in a future release, nice ;)

tlyu commented 7 months ago

Updated to 3.7.0 of the Philhower core; bug still present.

tlyu commented 7 months ago

With the Pico SDK, using a blink example modified to enable CDC for stdio, though not outputting anything to it, connecting and disconnecting from the CDC serial port a few times using tio freezes the program and makes CDC unresponsive.

dhalbert commented 7 months ago

With the Pico SDK, using a blink example modified to enable CDC for stdio, though not outputting anything to it, connecting and disconnecting from the CDC serial port a few times using tio freezes the program and makes CDC unresponsive.

This is very interesting. Pinging @hathach on this: this problem appears specific to macOS. This could be an issued opened either on https://github.com/hathach or https://github.com/raspberrypi/pico-sdk. @hathach would you prefer one repo or the other.

tlyu commented 7 months ago

Update: the hello_world/usb Pico SDK example also fails in a similar way, though somewhat less obviously than a blinking LED ceasing to blink.

tannewt commented 6 months ago

@hathach Mind looking at this?

eightycc commented 6 months ago

Crash is readily reproducible using tio when disconnecting by ctrl-T Q. Running MacOS Sonoma 14.4 beta 4.

So far, I've found the crash is due to panic() being called under one or the other of these conditions:

tlyu commented 6 months ago

This is starting to look like a possible hardware erratum where the device controller doesn't clear USB_BUF_CTRL_AVAIL on EP0 when a SETUP packet arrives before the OUT packet from a preceding control write. In the first case, there's a panic because TinyUSB finds the flag already set. In the second case, TinyUSB gets a completion interrupt from the OUT packet arriving before it configures EP0 to receive the data stage, because the device controller accepts the OUT DATA due to the flag already being set from the previous (aborted) transfer.

If this is an erratum, it can't be worked around in the general case, because there's no guarantee that the aborted control transfer has the same data length as the current one. Also, there's no guarantee that TinyUSB can clear the errant flag before the device controller acts on it.

I still don't have a RP2040 board with accessible SWD, so this is analysis based on the TinyUSB source code and the RP2040 datasheet.

tlyu commented 6 months ago

@eightycc thanks for the debugging! Could you possibly get full stack traces for those?

eightycc commented 6 months ago

Here is a backtrace from a breakpoint at panic():

Thread 1 hit Breakpoint 1, panic (fmt=0x0) at sdk/src/rp2_common/pico_runtime/runtime.c:307
307 void __attribute__((noreturn)) __printflike(1, 0) panic(const char *fmt, ...) {
(gdb) bt
#0  panic (fmt=0x0) at sdk/src/rp2_common/pico_runtime/runtime.c:307
#1  0x20000ba2 in _hw_endpoint_buffer_control_update32 (ep=ep@entry=0x2000d0cc <hw_endpoints+32>, and_mask=and_mask@entry=0, 
    or_mask=or_mask@entry=54272) at ../../lib/tinyusb/src/portable/raspberrypi/rp2040/rp2040_usb.c:115
#2  0x20000c06 in _hw_endpoint_buffer_control_set_value32 (value=54272, ep=0x2000d0cc <hw_endpoints+32>)
    at ../../lib/tinyusb/src/portable/raspberrypi/rp2040/rp2040_usb.h:122
#3  hw_endpoint_start_next_buffer (ep=ep@entry=0x2000d0cc <hw_endpoints+32>)
    at ../../lib/tinyusb/src/portable/raspberrypi/rp2040/rp2040_usb.c:212
#4  0x10032596 in hw_endpoint_xfer_start (ep=ep@entry=0x2000d0cc <hw_endpoints+32>, buffer=buffer@entry=0x0, total_len=total_len@entry=0)
    at ../../lib/tinyusb/src/portable/raspberrypi/rp2040/rp2040_usb.c:245
#5  0x100322ac in hw_endpoint_xfer (ep_addr=128 '\200', buffer=buffer@entry=0x0, total_bytes=total_bytes@entry=0)
    at ../../lib/tinyusb/src/portable/raspberrypi/rp2040/dcd_rp2040.c:188
#6  0x1003245a in dcd_edpt_xfer (rhport=0 '\000', ep_addr=<optimized out>, buffer=buffer@entry=0x0, total_bytes=total_bytes@entry=0)
    at ../../lib/tinyusb/src/portable/raspberrypi/rp2040/dcd_rp2040.c:531
#7  0x10022eb0 in usbd_edpt_xfer (rhport=<optimized out>, ep_addr=<optimized out>, buffer=buffer@entry=0x0, total_bytes=total_bytes@entry=0)
    at ../../lib/tinyusb/src/device/usbd.c:1249
#8  0x100235a4 in _status_stage_xact (rhport=rhport@entry=0 '\000', request=request@entry=0x20040f08)
    at ../../lib/tinyusb/src/device/usbd_control.c:70
#9  0x1002362c in tud_control_status (rhport=0 '\000', request=request@entry=0x20040f08) at ../../lib/tinyusb/src/device/usbd_control.c:81
#10 0x10022660 in cdcd_control_xfer_cb (rhport=<optimized out>, stage=<optimized out>, request=0x20040f08)
    at ../../lib/tinyusb/src/class/cdc/cdc_device.c:383
#11 0x100228e6 in invoke_class_control (rhport=rhport@entry=0 '\000', driver=0x100dbe40 <_usbd_driver>, request=request@entry=0x20040f08)
    at ../../lib/tinyusb/src/device/usbd.c:611
#12 0x100231e0 in process_control_request (rhport=<optimized out>, p_request=p_request@entry=0x20040f08)
    at ../../lib/tinyusb/src/device/usbd.c:774
#13 0x1002341a in tud_task_ext (timeout_ms=timeout_ms@entry=4294967295, in_isr=in_isr@entry=false) at ../../lib/tinyusb/src/device/usbd.c:522
#14 0x10023e10 in tud_task () at ../../lib/tinyusb/src/device/usbd.h:54
#15 usb_background () at ../../supervisor/shared/usb/usb.c:170
#16 0x10023e34 in usb_background_do (unused=<optimized out>) at ../../supervisor/shared/usb/usb.c:190
#17 0x2000704e in background_callback_run_all () at ../../supervisor/shared/background_callback.c:102
#18 0x10020d86 in mp_hal_stdin_rx_chr () at ../../supervisor/shared/micropython.c:49
#19 0x1009082a in readline (line=line@entry=0x20040f88, prompt=prompt@entry=0x100cfa0c ">>> ") at ../../shared/readline/readline.c:616
#20 0x10095c3c in pyexec_friendly_repl () at ../../shared/runtime/pyexec.c:674
#21 0x1001f106 in run_repl (safe_mode=SAFE_MODE_NONE) at ../../main.c:962
#22 0x1001f762 in main () at ../../main.c:1100

And some light tracing in TinyUSB:

Adafruit CircuitPython 9.0.0-beta.2-6-g89437cf031-dirty on 2024-02-22; Raspberry Pi Pico W with rp2040hw_endpoint_xfer 82 0x20009d9c 64
hw_endpoint_xfer_start ep 2 in 64
hw_endpoint_xfer 82 0x20009d9c 38
hw_endpoint_xfer_start ep 2 in 38

hw_endpoint_xfer 82 0x20009d9c 2
hw_endpoint_xfer_start ep 2 in 2
>>> hw_endpoint_xfer 82 0x20009d9c 4
hw_endpoint_xfer_start ep 2 in 4
hw_endpoint_xfer 00 0x2000ac94 7
hw_endpoint_xfer_start ep 0 out 7
hw_endpoint_xfer 80 0x0 0
hw_endpoint_xfer_start ep 0 in 0
hw_endpoint_xfer 00 0x2000ac94 7
hw_endpoint_xfer_start ep 0 out 7
hw_endpoint_xfer 80 0x0 0
hw_endpoint_xfer_start ep 0 in 0
hw_endpoint_xfer 00 0x2000ac94 7
hw_endpoint_xfer_start ep 0 out 7
hw_endpoint_xfer 80 0x0 0
hw_endpoint_xfer_start ep 0 in 0
hw_endpoint_xfer 00 0x2000ac94 7
hw_endpoint_xfer_start ep 0 out 7
hw_endpoint_xfer 80 0x0 0
hw_endpoint_xfer_start ep 0 in 0
hw_endpoint_xfer 80 0x0 0
hw_endpoint_xfer_start ep 0 in 0
hw_endpoint_xfer 80 0x0 0
hw_endpoint_xfer_start ep 0 in 0
WARN: starting new transfer on already active ep 0 in
eightycc commented 6 months ago

@tlyu The reason that the trace is light is that full level 3 tracing in TinyUSB affects timing to the extent that MacOS cannot enumerate the device. If you'd like more tracing, let me know what you'd like and I'll add it.

tlyu commented 6 months ago

@eightycc thanks! That is helpful, even if the trace is missing the control request details. It looks like in this instance, there are back-to-back control writes with no data, so they're answered with a ZLP IN transaction (control status stage). The one that panics is because the AVAILABLE flag set by the previous status stage wasn't cleared by the device controller hardware when the transfer was aborted with a new SETUP packet.

eightycc commented 6 months ago

@tlyu Here are the traces of single failure event. The Control Center capture is the zip file at the end.

Thread 1 hit Breakpoint 1, panic (fmt=0x0) at sdk/src/rp2_common/pico_runtime/runtime.c:307
307 void __attribute__((noreturn)) __printflike(1, 0) panic(const char *fmt, ...) {
(gdb) bt
#0  panic (fmt=0x0) at sdk/src/rp2_common/pico_runtime/runtime.c:307
#1  0x20000ba2 in _hw_endpoint_buffer_control_update32 (ep=ep@entry=0x2000d0cc <hw_endpoints+32>, and_mask=and_mask@entry=0, 
    or_mask=or_mask@entry=54272) at ../../lib/tinyusb/src/portable/raspberrypi/rp2040/rp2040_usb.c:115
#2  0x20000c06 in _hw_endpoint_buffer_control_set_value32 (value=54272, ep=0x2000d0cc <hw_endpoints+32>)
    at ../../lib/tinyusb/src/portable/raspberrypi/rp2040/rp2040_usb.h:122
#3  hw_endpoint_start_next_buffer (ep=ep@entry=0x2000d0cc <hw_endpoints+32>)
    at ../../lib/tinyusb/src/portable/raspberrypi/rp2040/rp2040_usb.c:212
#4  0x10032596 in hw_endpoint_xfer_start (ep=ep@entry=0x2000d0cc <hw_endpoints+32>, buffer=buffer@entry=0x0, total_len=total_len@entry=0)
    at ../../lib/tinyusb/src/portable/raspberrypi/rp2040/rp2040_usb.c:245
#5  0x100322ac in hw_endpoint_xfer (ep_addr=128 '\200', buffer=buffer@entry=0x0, total_bytes=total_bytes@entry=0)
    at ../../lib/tinyusb/src/portable/raspberrypi/rp2040/dcd_rp2040.c:188
#6  0x1003245a in dcd_edpt_xfer (rhport=0 '\000', ep_addr=<optimized out>, buffer=buffer@entry=0x0, total_bytes=total_bytes@entry=0)
    at ../../lib/tinyusb/src/portable/raspberrypi/rp2040/dcd_rp2040.c:531
#7  0x10022eb0 in usbd_edpt_xfer (rhport=<optimized out>, ep_addr=<optimized out>, buffer=buffer@entry=0x0, total_bytes=total_bytes@entry=0)
    at ../../lib/tinyusb/src/device/usbd.c:1249
#8  0x100235a4 in _status_stage_xact (rhport=rhport@entry=0 '\000', request=request@entry=0x20040f08)
    at ../../lib/tinyusb/src/device/usbd_control.c:70
#9  0x1002362c in tud_control_status (rhport=0 '\000', request=request@entry=0x20040f08) at ../../lib/tinyusb/src/device/usbd_control.c:81
#10 0x10022660 in cdcd_control_xfer_cb (rhport=<optimized out>, stage=<optimized out>, request=0x20040f08)
    at ../../lib/tinyusb/src/class/cdc/cdc_device.c:383
#11 0x100228e6 in invoke_class_control (rhport=rhport@entry=0 '\000', driver=0x100dbe40 <_usbd_driver>, request=request@entry=0x20040f08)
    at ../../lib/tinyusb/src/device/usbd.c:611
#12 0x100231e0 in process_control_request (rhport=<optimized out>, p_request=p_request@entry=0x20040f08)
    at ../../lib/tinyusb/src/device/usbd.c:774
#13 0x1002341a in tud_task_ext (timeout_ms=timeout_ms@entry=4294967295, in_isr=in_isr@entry=false) at ../../lib/tinyusb/src/device/usbd.c:522
#14 0x10023e10 in tud_task () at ../../lib/tinyusb/src/device/usbd.h:54
#15 usb_background () at ../../supervisor/shared/usb/usb.c:170
#16 0x10023e34 in usb_background_do (unused=<optimized out>) at ../../supervisor/shared/usb/usb.c:190
#17 0x2000704e in background_callback_run_all () at ../../supervisor/shared/background_callback.c:102
#18 0x10020d86 in mp_hal_stdin_rx_chr () at ../../supervisor/shared/micropython.c:49
#19 0x1009082a in readline (line=line@entry=0x20040f88, prompt=prompt@entry=0x100cfa0c ">>> ") at ../../shared/readline/readline.c:616
#20 0x10095c3c in pyexec_friendly_repl () at ../../shared/runtime/pyexec.c:674
#21 0x1001f106 in run_repl (safe_mode=SAFE_MODE_NONE) at ../../main.c:962
#22 0x1001f762 in main () at ../../main.c:1100
Adafruit CircuitPython 9.0.0-beta.2-6-g89437cf031-dirty on 2024-02-22; Raspberry Pi Pico W with rp2040hw_endpoint_xfer 82 0x20009d9c 64
hw_endpoint_xfer_start ep 2 in 64
hw_endpoint_xfer 82 0x20009d9c 38
hw_endpoint_xfer_start ep 2 in 38

hw_endpoint_xfer 82 0x20009d9c 2
hw_endpoint_xfer_start ep 2 in 2
>>> hw_endpoint_xfer 82 0x20009d9c 4
hw_endpoint_xfer_start ep 2 in 4
hw_endpoint_xfer 00 0x2000ac94 7
hw_endpoint_xfer_start ep 0 out 7
hw_endpoint_xfer 80 0x0 0
hw_endpoint_xfer_start ep 0 in 0
hw_endpoint_xfer 00 0x2000ac94 7
hw_endpoint_xfer_start ep 0 out 7
hw_endpoint_xfer 80 0x0 0
hw_endpoint_xfer_start ep 0 in 0
hw_endpoint_xfer 00 0x2000ac94 7
hw_endpoint_xfer_start ep 0 out 7
hw_endpoint_xfer 80 0x0 0
hw_endpoint_xfer_start ep 0 in 0
hw_endpoint_xfer 00 0x2000ac94 7
hw_endpoint_xfer_start ep 0 out 7
hw_endpoint_xfer 80 0x0 0
hw_endpoint_xfer_start ep 0 in 0
hw_endpoint_xfer 80 0x0 0
hw_endpoint_xfer_start ep 0 in 0
hw_endpoint_xfer 80 0x0 0
hw_endpoint_xfer_start ep 0 in 0
WARN: starting new transfer on already active ep 0 in

issue8824.zip

tlyu commented 6 months ago

@eightycc thanks so much for the coordinated traces! The Beagle trace confirms what I've seen before in a previous context: some CDC control requests sent by macOS are aborted. It also confirms that my suspicions are likely correct, that the RP2040 hardware doesn't clear some state for a SETUP packet that it probably should.

There are 4 OUT control transfers of length 7, which are CDC Set Line Coding, followed by 3 no-data control transfers, which are CDC Set Control Line State.

The first 2 Set Control Line State are aborted, and simply doesn't poll the status stage for the third, for some reason.

There are 3 hw_endpoint_xfer_start ep 0 in 0 in the trace, corresponding to the status stage of the Set Line Coding, followed by the status stages of the 2 Set Control Line State. The second one panics due to an existing active transfer.

dhalbert commented 6 months ago

@tlyu Have you discussed these possible issues with the RP2040 folks? I think https://github.com/raspberrypi/pico-feedback is the right place. There are some USB hw issues there already.

I'm guessing that they licensed IP for the USB peripheral and did not design it themselves. But I might be wrong. @hathach would probably know.

tlyu commented 6 months ago

@dhalbert not yet; thanks for the pointer! It only occurred to me that it might be a hardware erratum after seeing the stack traces and doing a close read of the TinyUSB code.

tlyu commented 6 months ago

@eightycc please try hathach/tinyusb#2492 to see if that helps?

eightycc commented 6 months ago

@tlyu Your patch fixes the WARN: starting new transfer on already active ep 0 in failure mode, but the ep 0 in was already available remains. Both failure modes remain. New behavior is that it now occasionally succeeds. I'll follow up with traces shortly.

eightycc commented 6 months ago
Thread 1 hit Breakpoint 1, panic (fmt=0x0) at sdk/src/rp2_common/pico_runtime/runtime.c:307
307 void __attribute__((noreturn)) __printflike(1, 0) panic(const char *fmt, ...) {
(gdb) bt
#0  panic (fmt=0x0) at sdk/src/rp2_common/pico_runtime/runtime.c:307
#1  0x20000bc6 in _hw_endpoint_buffer_control_update32 (ep=ep@entry=0x2000d0ec <hw_endpoints+32>, and_mask=and_mask@entry=0, 
    or_mask=or_mask@entry=54272) at ../../lib/tinyusb/src/portable/raspberrypi/rp2040/rp2040_usb.c:115
#2  0x20000c2a in _hw_endpoint_buffer_control_set_value32 (value=54272, ep=0x2000d0ec <hw_endpoints+32>)
    at ../../lib/tinyusb/src/portable/raspberrypi/rp2040/rp2040_usb.h:122
#3  hw_endpoint_start_next_buffer (ep=ep@entry=0x2000d0ec <hw_endpoints+32>)
    at ../../lib/tinyusb/src/portable/raspberrypi/rp2040/rp2040_usb.c:212
#4  0x10032596 in hw_endpoint_xfer_start (ep=ep@entry=0x2000d0ec <hw_endpoints+32>, buffer=buffer@entry=0x0, total_len=total_len@entry=0)
    at ../../lib/tinyusb/src/portable/raspberrypi/rp2040/rp2040_usb.c:245
#5  0x100322ac in hw_endpoint_xfer (ep_addr=128 '\200', buffer=buffer@entry=0x0, total_bytes=total_bytes@entry=0)
    at ../../lib/tinyusb/src/portable/raspberrypi/rp2040/dcd_rp2040.c:188
#6  0x1003245a in dcd_edpt_xfer (rhport=0 '\000', ep_addr=<optimized out>, buffer=buffer@entry=0x0, total_bytes=total_bytes@entry=0)
    at ../../lib/tinyusb/src/portable/raspberrypi/rp2040/dcd_rp2040.c:538
#7  0x10022eb0 in usbd_edpt_xfer (rhport=<optimized out>, ep_addr=<optimized out>, buffer=buffer@entry=0x0, total_bytes=total_bytes@entry=0)
    at ../../lib/tinyusb/src/device/usbd.c:1249
#8  0x100235a4 in _status_stage_xact (rhport=rhport@entry=0 '\000', request=request@entry=0x20040f08)
    at ../../lib/tinyusb/src/device/usbd_control.c:70
#9  0x1002362c in tud_control_status (rhport=0 '\000', request=request@entry=0x20040f08) at ../../lib/tinyusb/src/device/usbd_control.c:81
#10 0x10022660 in cdcd_control_xfer_cb (rhport=<optimized out>, stage=<optimized out>, request=0x20040f08)
    at ../../lib/tinyusb/src/class/cdc/cdc_device.c:383
#11 0x100228e6 in invoke_class_control (rhport=rhport@entry=0 '\000', driver=0x100dbe40 <_usbd_driver>, request=request@entry=0x20040f08)
    at ../../lib/tinyusb/src/device/usbd.c:611
#12 0x100231e0 in process_control_request (rhport=<optimized out>, p_request=p_request@entry=0x20040f08)
    at ../../lib/tinyusb/src/device/usbd.c:774
#13 0x1002341a in tud_task_ext (timeout_ms=timeout_ms@entry=4294967295, in_isr=in_isr@entry=false) at ../../lib/tinyusb/src/device/usbd.c:522
#14 0x10023e10 in tud_task () at ../../lib/tinyusb/src/device/usbd.h:54
#15 usb_background () at ../../supervisor/shared/usb/usb.c:170
#16 0x10023e34 in usb_background_do (unused=<optimized out>) at ../../supervisor/shared/usb/usb.c:190
#17 0x2000706e in background_callback_run_all () at ../../supervisor/shared/background_callback.c:102
#18 0x10020d86 in mp_hal_stdin_rx_chr () at ../../supervisor/shared/micropython.c:49
#19 0x1009082a in readline (line=line@entry=0x20040f88, prompt=prompt@entry=0x100cfa0c ">>> ") at ../../shared/readline/readline.c:616
#20 0x10095c3c in pyexec_friendly_repl () at ../../shared/runtime/pyexec.c:674
#21 0x1001f106 in run_repl (safe_mode=SAFE_MODE_NONE) at ../../main.c:962
#22 0x1001f762 in main () at ../../main.c:1100
Adafruit CircuitPython 9.0.0-beta.2-6-g89437cf031-dirty on 2024-02-25; Raspberry Pi Pico W with rp2040hw_endpoint_xfer 82 0x20009dbc 64
hw_endpoint_xfer_start ep 2 in 64
hw_endpoint_xfer 82 0x20009dbc 38
hw_endpoint_xfer_start ep 2 in 38

hw_endpoint_xfer 82 0x20009dbc 2
hw_endpoint_xfer_start ep 2 in 2
>>> hw_endpoint_xfer 82 0x20009dbc 4
hw_endpoint_xfer_start ep 2 in 4
hw_endpoint_xfer 00 0x2000acb4 7
hw_endpoint_xfer_start ep 0 out 7
hw_endpoint_xfer 80 0x0 0
hw_endpoint_xfer_start ep 0 in 0
hw_endpoint_xfer 00 0x2000acb4 7
hw_endpoint_xfer_start ep 0 out 7
hw_endpoint_xfer 80 0x0 0
hw_endpoint_xfer_start ep 0 in 0
hw_endpoint_xfer 00 0x2000acb4 7
hw_endpoint_xfer_start ep 0 out 7
hw_endpoint_xfer 80 0x0 0
hw_endpoint_xfer_start ep 0 in 0
hw_endpoint_xfer 00 0x2000acb4 7
hw_endpoint_xfer_start ep 0 out 7
hw_endpoint_xfer 80 0x0 0
hw_endpoint_xfer_start ep 0 in 0
hw_endpoint_xfer 80 0x0 0
hw_endpoint_xfer_start ep 0 in 0
hw_endpoint_xfer 80 0x0 0
hw_endpoint_xfer_start ep 0 in 0
WARN: starting new transfer on already active ep 0 in

issue8824-2.tdc.zip

tlyu commented 6 months ago

@eightycc thanks! ~Which RP2040 board are you using, and what is its CPU clock speed?~ (edit: OK, I see it's a Pico W) I guess it could also be down to a host-side timing issue as to why I'm not longer seeing failures and you are.

There's some TinyUSB-side state that I needed to reset as well, which explains the "already active ep 0". There's also a delay missing from the buffer control setting, which might be necessary to ensure that the AVAILABLE flag is already clear before proceeding with the rest of the request handling.

tlyu commented 6 months ago

@eightycc I updated the patch in the TinyUSB PR; please try it when you get the chance?

eightycc commented 6 months ago

@tlyu Now fails 10 out of 10 attempts.

I've had to add parentheses around the expression usb_hw->abort_done & 0x3 to quiet a gcc 13.2 warning. Doubt this makes any difference.

tlyu commented 6 months ago

Thanks, please try again? I had a precedence oops previously with bitwise operators.

tlyu commented 6 months ago

I might need to add an explicit clear of the ABORT_DONE register. I'm not sure when I'll get to it yet.

tlyu commented 6 months ago

@eightycc what chip revisions do you have? Apparently revisions earlier than RP2040B2 have an erratum (RP2040-E2) that causes endpoint aborts to NAK the endpoint forever.

eightycc commented 6 months ago

@tlyu The Pico W I've been testing with is B2, the Pico that fails to enumerate with your patch is a B1.

One other thing to note is that for testing I build with optimization flag -Og to improve the quality of debug info. This will certainly affect timing.

eightycc commented 6 months ago

Note that hw_endpoint_lock_update() is null, so calls to hw_endpoint_start_next_buffer() are unprotected against concurrent calls from user and IRQ level.

tlyu commented 6 months ago

I think that I might have to accept a larger race condition window on B1 hardware and earlier, by only doing the ABORT for the chip versions where it works.

tlyu commented 6 months ago

@eightycc updated the PR to only use ABORT on hardware B2 and later. This means the workaround should work in some cases on B1 and earlier, if the firmware wins the race condition. Please give it a try when you have a chance.

(The failure to enumerate on B1 is consistent with the description of the erratum RP2040-E2.)

eightycc commented 6 months ago

@tlyu I've done some testing with your PR and find that it works with B2 but B1 still has problems.

I believe that the root cause is the race condition in hw_endpoint_start_next_buffer() noted earlier. Saving the interrupt state and disabling interrupts at the start of hw_endpoint_start_next_buffer() and then restoring interrupt state at its exit resolves the problem for both B1 and B2. I am able to build with -Og and run with TinyUSB level-3 tracing with no difficulties.

tlyu commented 6 months ago

@eightycc thanks for the additional testing! I think you might be right about the user/interrupt context race condition being a component, and that fixing that race can mitigate the problem on B1. I think that without the endpoint aborts, an incorrect transaction can still get dispatched under a different scenario.

In the particular instance we're seeing with macOS and closing of CDC serial devices, the transactions queued for the aborted CDC control transfer are still valid for the duplicated one that follows, but that might always not be the case for other instances of this problem.

eightycc commented 6 months ago

In the particular instance we're seeing with macOS and closing of CDC serial devices, the transactions queued for the aborted CDC control transfer are still valid for the duplicated one that follows, but that might always not be the case for other instances of this problem.

Good point. I think the final PR should incorporate both of our patches.

eightycc commented 6 months ago

@tlyu I've been testing with both patches in place on both B1 and B2 RP2040s and cannot get it to fail.

You can find my patch here: https://github.com/eightycc/tinyusb/tree/issue_8824. You can add it to your PR.

hathach commented 6 months ago

thank you @tlyu and @eightycc for detailed explanation, I have pull out my good-old intel macbook with mac0s 12.6 and it seems to be able to reproduce the issue. Indeed this seems to be macos specific issue (probably related to how its serial driver is implemented). I will do start to investigate and review your PR.

hathach commented 6 months ago

tinyusb pr is merged, cpy PR is on the way https://github.com/adafruit/circuitpython/pull/9042 . @tlyu @eightycc would you mind testing out the cpy PR to confirm if issue is fixed. Thank you.

dhalbert commented 6 months ago

Moving this back to 9.0.0 assuming it's fixed.