adafruit / circuitpython

CircuitPython - a Python implementation for teaching coding with microcontrollers
https://circuitpython.org
Other
4.06k stars 1.2k forks source link

Linux system hangs on USB disconnect - sometimes #3268

Closed jerryneedell closed 2 years ago

jerryneedell commented 4 years ago

For the past few months, I have noticed that sometimes when I physically disconnects CP boards USB plug, the Linux system immediately "locks" up requiring a Power Cycle to recover.

this may be related to #3223 for reference @dhalbert reported this to the Linux folks: here is his correspondence: My original Ubuntu bug report: https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1871143

Correspondence on the linux-usb mailing list: https://marc.info/?l=linux-usb&m=159387610928589&w=2

This is a quick data dump of recent information:

This has been triggered by several MCU boards - fearther_m4_express, feather_m0_express, esp32s2 (saola wrover)

My Linux system is running Ubuntu 20.04 and is an older (~2012 mac-mini) Here is a list of the USB modules

jerryneedell@jerryneedell-ubuntu-macmini:~$ lspci -tv
-[0000:00]-+-00.0  Intel Corporation 3rd Gen Core processor DRAM Controller
           +-01.0-[04-9a]----00.0-[05-6a]--+-00.0-[06]----00.0  Intel Corporation DSL3510 Thunderbolt Controller [Cactus Ridge 4C 2012]
           |                               +-03.0-[07-37]--
           |                               +-04.0-[38-68]--
           |                               +-05.0-[69]--
           |                               \-06.0-[6a]--
           +-02.0  Intel Corporation 3rd Gen Core processor Graphics Controller
           +-14.0  Intel Corporation 7 Series/C210 Series Chipset Family USB xHCI Host Controller
           +-16.0  Intel Corporation 7 Series/C216 Chipset Family MEI Controller #1
           +-1a.0  Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #2
           +-1b.0  Intel Corporation 7 Series/C216 Chipset Family High Definition Audio Controller
           +-1c.0-[01]--+-00.0  Broadcom Inc. and subsidiaries NetXtreme BCM57766 Gigabit Ethernet PCIe
           |            \-00.1  Broadcom Inc. and subsidiaries BCM57765/57785 SDXC/MMC Card Reader
           +-1c.1-[02]----00.0  Broadcom Inc. and subsidiaries BCM4331 802.11a/b/g/n
           +-1c.2-[03]----00.0  LSI Corporation FW643 [TrueFire] PCIe 1394b Controller
           +-1d.0  Intel Corporation 7 Series/C216 Chipset Family USB Enhanced Host Controller #1
           +-1f.0  Intel Corporation HM77 Express Chipset LPC Controller
           +-1f.2  Intel Corporation 7 Series Chipset Family 6-port SATA Controller [AHCI mode]
           \-1f.3  Intel Corporation 7 Series/C216 Chipset Family SMBus Controller

In an attempt to troubleshoot, I tried testing the timyusb example https://github.com/hathach/tinyusb/tree/22100b252fc2eb8f51ed407949645653c4880fd9/examples/device/cdc_msc ( with the cdc_msc_freertos version for esp332s2)
I built it for the esp32s2, feather_m4_express, feather_m0_express

I loaded it onto boards and repeatedly connected/disconnected them from the Linux box with no issues.

I then restored the feather_m4_express to the latest main build of CP and it cause a hang on the first disconnect.

I installed linux-crashdump on my system and tried some connect/disconnect cycles. Note: For this crash, I was actively connected to the REPL -- I do not think that has always been the case. when it crashed, it hung immediately on disconnect as it has in the past, but then automatically rebooted. I assume this is due to linux-crashdump now bing installed... here is the crash dump file: crash.txt

wolfmanjm commented 3 years ago

I too have this issue since using circuit python on an RPI pico, and also a nucleo development board using the newer STLINK-V3 which also provides composite USB devices and an MSD for easy flashing.

I have been using Ubuntu 18.04, but I just upgraded to 20.04 hoping they may have fixed it. It has only been a day and no crashes (yet) but I have turned off auto mount, and I try to safely eject the drive before turning it off. (This seems to help).

A side note is I am the maintainer of some open source 3D printer firmware called smoothieware, and we have a composite MSD device and CDC/ACM which presents itself to the host. I have never had or seen reports of it crashing the host when turned off. (On the other hand I never auto mount the MSD device). Although it did used to blue screen windows when the MSD device was setup to be added dynamically instead of statically at boot.

Last night I discoved this in my dmesg, this usually precedes a crash, and I think is one of the precursors to this crash. There are thousands of these warnings. I rebooted as I have found once these start the eventual result is a crash. These usually happen after the device (my rpipico with CP) has been turned off.

The device I think is the xhci usb controller so this maybe related to usb3, even though I have a USB2 hub which these devices are plugged into.

xhci_hcd 0000:00:14.0: xHCI Host Controller

Mar 22 20:45:56 desktop kernel: [67557.394580] xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
Mar 22 20:45:56 desktop kernel: [67557.407289] xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
Mar 22 20:45:56 desktop kernel: [67557.407289] xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
Mar 22 20:45:56 desktop kernel: [67557.424061] xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
Mar 22 20:45:56 desktop kernel: [67557.424065] xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
Mar 22 20:45:56 desktop kernel: [67557.426164] xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
Mar 22 20:45:56 desktop kernel: [67557.426165] xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
Mar 22 20:45:56 desktop kernel: [67557.441060] xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
Mar 22 20:45:56 desktop kernel: [67557.441063] xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
Mar 22 20:45:56 desktop kernel: [67557.443132] xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
Mar 22 20:45:56 desktop kernel: [67557.443133] xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
Mar 22 20:45:56 desktop kernel: [67557.447506] xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
Mar 22 20:45:56 desktop kernel: [67557.447507] xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
Mar 22 20:45:56 desktop kernel: [67557.449455] xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
Mar 22 20:45:56 desktop kernel: [67557.449457] xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
Mar 22 20:45:56 desktop kernel: [67557.451498] xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
Mar 22 20:45:56 desktop kernel: [67557.451499] xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
Mar 22 20:45:56 desktop kernel: [67557.453492] xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
Mar 22 20:45:56 desktop kernel: [67557.453493] xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
Mar 22 20:45:56 desktop kernel: [67557.455457] xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
Mar 22 20:45:56 desktop kernel: [67557.455458] xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
Mar 22 20:45:56 desktop kernel: [67557.457503] xhci_hcd 0000:00:14.0: WARN Cannot submit Set TR Deq Ptr
Mar 22 20:45:56 desktop kernel: [67557.457507] xhci_hcd 0000:00:14.0: A Set TR Deq Ptr command is pending.
Mar 22 20:45:56 desktop kernel: [67557.468280] usb 1-5.4.4: USB disconnect, device number 39
wolfmanjm commented 3 years ago

We have a consultant helping us with Smoothie, and he has suggested this...

"the correct way to handle it is to report as a removable device, and generate an eject event rather than a disconnect and we want to do that anyway because there's some really annoying W10 bugs related to disappearing MSC devices"

This won't help when the CP device is just turned off or tugged out of the port though, but maybe a control key in the REPL could tell it to do this which we can do before turning it off.

tannewt commented 3 years ago

We do report as a removable device already because on Linux we get the eject command from the host and then report media isn't available. I don't know how that helps when a device is unplugged though.

wolfmanjm commented 3 years ago

I just got a general protection fault (Kubuntu 20.04). 6.2.0-beta.4 version of circuit python, I had unmounted the CIRCUITPYTHON device, and then powered off the PICO. Until now turning off auto mount and unmounting the device has saved me from getting these. This appears to be in the ACM/CDC driver though which I have never seen happen before. Although I got this luckily linux didn;t totally crash and I was able to reboot gracefully.

[684258.426420] general protection fault: 0000 [#1] SMP PTI
[684258.426429] CPU: 7 PID: 484810 Comm: kworker/7:0 Tainted: G           OE     5.4.0-67-generic #75-Ubuntu
[684258.426433] Hardware name: System manufacturer System Product Name/PRIME B250-PRO, BIOS 0304 11/13/2016
[684258.426444] Workqueue: events acm_softint [cdc_acm]
[684258.426455] RIP: 0010:usb_clear_halt+0x1d/0xc0
[684258.426461] Code: 85 f6 75 e8 c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 89 f0 48 89 e5 41 54 49 89 fc 53 89 f3 c1 fb 0f 83 e3 0f 48 83 ec 08 <8b> 37 c1 e6 08 81 ce 00 00 00 80 a8 80 74 44 68 88 13 00 00 80 cb
[684258.426464] RSP: 0018:ffffad66820bfe18 EFLAGS: 00010296
[684258.426469] RAX: 00000000c0414580 RBX: 0000000000000002 RCX: ffffffff8b5be128
[684258.426472] RDX: ffffffff8b5be128 RSI: 00000000c0414580 RDI: 46b3242fd3dd81d7
[684258.426475] RBP: ffffad66820bfe30 R08: 0000000000000000 R09: 0000000000000007
[684258.426478] R10: 000000000000000e R11: ffff914aaebe9aa0 R12: 46b3242fd3dd81d7
[684258.426481] R13: ffff914aaebf0700 R14: ffff9148b99c1000 R15: ffff914a5cacd540
[684258.426485] FS:  0000000000000000(0000) GS:ffff914aaebc0000(0000) knlGS:0000000000000000
[684258.426489] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[684258.426492] CR2: 00007f1383c3d000 CR3: 00000001fda0a006 CR4: 00000000003606e0
[684258.426495] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[684258.426499] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[684258.426501] Call Trace:
[684258.426512]  acm_softint+0x65/0xf0 [cdc_acm]
[684258.426521]  process_one_work+0x1eb/0x3b0
[684258.426529]  worker_thread+0x4d/0x400
[684258.426535]  kthread+0x104/0x140
[684258.426542]  ? process_one_work+0x3b0/0x3b0
[684258.426547]  ? kthread_park+0x90/0x90
[684258.426555]  ret_from_fork+0x35/0x40
[684258.426560] Modules linked in: uvcvideo videobuf2_vmalloc videobuf2_memops videobuf2_v4l2 videobuf2_common videodev ftdi_sio nls_iso8859_1 snd_usb_audio uas snd_usbmidi_lib usb_storage mc cdc_acm cp210x usbserial btrfs xor zstd_compress raid6_pq ufs qnx4 hfsplus hfs minix ntfs msdos jfs xfs libcrc32c cpuid vboxnetadp(OE) vboxnetflt(OE) vboxdrv(OE) binfmt_misc snd_hda_codec_hdmi snd_hda_codec_realtek snd_hda_codec_generic ledtrig_audio snd_hda_intel snd_intel_dspcfg snd_hda_codec snd_hda_core snd_hwdep snd_pcm snd_seq_midi snd_seq_midi_event snd_rawmidi intel_rapl_msr mei_hdcp joydev snd_seq intel_rapl_common x86_pkg_temp_thermal intel_powerclamp snd_seq_device input_leds hid_multitouch snd_timer coretemp kvm_intel kvm snd rapl intel_cstate eeepc_wmi asus_wmi mei_me sparse_keymap wmi_bmof serio_raw soundcore mei mac_hid acpi_pad sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 hid_generic i915 i2c_algo_bit drm_kms_helper syscopyarea crct10dif_pclmul sysfillrect
[684258.426617]  crc32_pclmul ghash_clmulni_intel sysimgblt fb_sys_fops aesni_intel nvme crypto_simd mxm_wmi cryptd i2c_i801 glue_helper usbhid psmouse nvme_core r8169 drm ahci hid libahci realtek video wmi
[684258.426647] ---[ end trace efaab583efd31c8c ]---
[684258.426653] RIP: 0010:usb_clear_halt+0x1d/0xc0
[684258.426657] Code: 85 f6 75 e8 c3 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 55 89 f0 48 89 e5 41 54 49 89 fc 53 89 f3 c1 fb 0f 83 e3 0f 48 83 ec 08 <8b> 37 c1 e6 08 81 ce 00 00 00 80 a8 80 74 44 68 88 13 00 00 80 cb
[684258.426661] RSP: 0018:ffffad66820bfe18 EFLAGS: 00010296
[684258.426664] RAX: 00000000c0414580 RBX: 0000000000000002 RCX: ffffffff8b5be128
[684258.426667] RDX: ffffffff8b5be128 RSI: 00000000c0414580 RDI: 46b3242fd3dd81d7
[684258.426670] RBP: ffffad66820bfe30 R08: 0000000000000000 R09: 0000000000000007
[684258.426673] R10: 000000000000000e R11: ffff914aaebe9aa0 R12: 46b3242fd3dd81d7
[684258.426676] R13: ffff914aaebf0700 R14: ffff9148b99c1000 R15: ffff914a5cacd540
[684258.426679] FS:  0000000000000000(0000) GS:ffff914aaebc0000(0000) knlGS:0000000000000000
[684258.426682] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[684258.426685] CR2: 00007f1383c3d000 CR3: 00000001fda0a006 CR4: 00000000003606e0
[684258.426688] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[684258.426691] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
dhalbert commented 3 years ago

My attempt to get the Linux maintainers interested in this did not go well, basically "if it hurts don't do that". An abrupt device removal is not something they wanted to handle. MY original diagnosis was not quite correct. See the first post's links.

wolfmanjm commented 3 years ago

Removing an ACM/CDC device "abruptly" is normal there is no other way. This latest crash of mine is not MSC but ACM/CDC and appears to be deep in the USB stack. This has to be a linux driver bug I would think.

dhalbert commented 3 years ago

I think there are multiple ways Linux can be upset with connect or diconnect. The original way I was seeing was that an abrupt disconnect and reformat of the CIRCUITPY filesystem confused Linux. That appeared to be on reconnect. However, we have also seen these disconnect bugs. If you are willing to submit this to Ubuntu with kdump info and maybe write to linux-usb, that would be great. I have found one disconnected reference in recent linux-usb emails that sounds similar, but I haven't found the context for it:

https://marc.info/?l=linux-usb&m=161224956615774&w=2

This patch fixes a race condition between acm_disconnect and acm_read_bulk_callback. acm_read_bulk_callback may be called after the acm_disconnect function has been called resulting in kernel crashes

jerryneedell commented 2 years ago

This issue seems to have been resolved -- I have not had any occurrences recently. Should we close and re-open if it comes back?

dhalbert commented 2 years ago

Me neither. I have no idea what they fixed, though. I'll close it for now. Thanks for following up.