Open EastEriq opened 4 months ago
cameras don't appear on last01w dmesg -wH
when power cycled. Will try to reboot the computer.
same after reboot of both computers
$ ssh ocs@10.23.1.1 "lspci| grep Renesas"
48:00.0 USB controller: Renesas Technology Corp. uPD720201 USB 3.0 Host Controller (rev 03)
(only one, not two)
$ ssh ocs@10.23.1.2 "lspci| grep Renesas"
none
anecdotally I remember situations of "what, we've fried the USB card? I'll try other sockets" with last01e/w in the past, which then autosolved themselves. Hm.
last-asocs -c "lspci | grep Rene|wc -l"
: last01e and last03e show only 1 out of 2, last01w none. I vaguely remember that we blamed absence of grounding for such kind of problems. To be checked at the next visit.
overcome by pulling out the cord of both computers for half an hour, and powering them up just before starting the nightly observations
Some internet pointers about resetting pcie cards, to be perhaps tried, maybe also relevant for #56. But there are side conditions apparently. I didn't bother yet to read through. https://unix.stackexchange.com/questions/73908/how-to-reset-cycle-power-to-a-pcie-device https://unix.stackexchange.com/questions/29775/reset-a-pci-device-in-linux
Happened again to Yahel 0n 16/8/24. As predictable, today still
ocs@last01e:~$ last-asocs -c "lspci | grep Rene"
[SECT] last0
[SECT] last01e
48:00.0 USB controller: Renesas Technology Corp. uPD720201 USB 3.0 Host Controller (rev 03)
[SECT] last01w
[SECT] last02e
For the record: on the current set of deployed computers, the Renesas cards are on
On 1e which at this point in time sees the card on :48 but not that on :49 (both trees appear in /sys/bus/pci/0000:4....
though). 'lshw' says for the former:
*-pci:1
description: PCI bridge
product: Starship/Matisse GPP Bridge
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 1.3
bus info: pci@0000:40:01.3
version: 00
width: 32 bits
clock: 33MHz
capabilities: pci pm pciexpress msi ht normal_decode bus_master cap_list
configuration: driver=pcieport
resources: irq:40 memory:d6600000-d66fffff
*-usb
description: USB controller
product: uPD720201 USB 3.0 Host Controller
vendor: Renesas Technology Corp.
physical id: 0
bus info: pci@0000:48:00.0
version: 03
width: 64 bits
clock: 33MHz
capabilities: pm msi msix pciexpress xhci bus_master cap_list
configuration: driver=xhci_hcd latency=0
resources: irq:107 memory:d6600000-d6601fff
*-usbhost:0
product: xHCI Host Controller
vendor: Linux 5.15.0-69-generic xhci-hcd
physical id: 0
bus info: usb@9
logical name: usb9
version: 5.15
capabilities: usb-2.00
configuration: driver=hub slots=4 speed=480Mbit/s
*-usb:0
description: Communication device
product: Celestron Focuser
physical id: 2
bus info: usb@9:2
version: 0.00
serial: MF_v1
capabilities: usb-2.00
configuration: driver=cdc_acm maxpower=100mA speed=12Mbit/s
*-usb:1
description: Generic USB device
product: USB-Serial Controller D
vendor: Prolific Technology Inc.
physical id: 4
bus info: usb@9:4
version: 4.00
capabilities: usb-1.10
configuration: driver=pl2303 maxpower=100mA speed=12Mbit/s
*-usbhost:1
product: xHCI Host Controller
vendor: Linux 5.15.0-69-generic xhci-hcd
physical id: 1
bus info: usb@10
logical name: usb10
version: 5.15
capabilities: usb-3.00
configuration: driver=hub slots=4 speed=5000Mbit/s
whereas the latter
*-pci:2
description: PCI bridge
product: Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 7.1
bus info: pci@0000:40:07.1
version: 00
width: 32 bits
clock: 33MHz
capabilities: pci pm pciexpress msi ht normal_decode bus_master cap_list
configuration: driver=pcieport
resources: irq:42
*-generic UNCLAIMED
description: Non-Essential Instrumentation
product: Starship/Matisse PCIe Dummy Function
vendor: Advanced Micro Devices, Inc. [AMD]
physical id: 0
bus info: pci@0000:49:00.0
version: 00
width: 32 bits
clock: 33MHz
capabilities: pm pciexpress cap_list
I tried (as sudo su
) inconclusively in various order commands like
echo 1 > /sys/devices/pci0000:40:07.1/0000:49:00.0/reset
echo 1 > /sys/devices/pci0000:40:07.1/0000:49:00.0/remove
echo 1 > /sys/bus/pci/rescan
echo 1 > /sys/devices/pci0000:40:07.1/0000:49:00.0/rescan
till I locked the computer with watchdog messages like
Message from syslogd@localhost at Sep 9 08:35:13 ...
kernel:[ 3476.352383] watchdog: BUG: soft lockup - CPU#18 stuck for 21s! [kworker/18:0:3096]
Message from syslogd@localhost at Sep 9 08:35:25 ...
kernel:[ 3488.368140] watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [nxserver.bin:2291]
Message from syslogd@localhost at Sep 9 08:35:29 ...
kernel:[ 3492.373394] watchdog: BUG: soft lockup - CPU#39 stuck for 522s! [bash:2814]
and dump traces in dmesg, before of which I had there
[Sep 9 08:12] pci 0000:49:00.0: Removing from iommu group 52
[Sep 9 08:14] pci 0000:49:00.0: [1022:148a] type 00 class 0x130000
[ +0.000444] pci 0000:49:00.0: Adding to iommu group 52
[Sep 9 08:24] pci 0000:49:00.0: not ready 1023ms after FLR; waiting
[ +2.047872] pci 0000:49:00.0: not ready 2047ms after FLR; waiting
[Sep 9 08:25] pci 0000:49:00.0: not ready 4095ms after FLR; waiting
[ +5.119904] pci 0000:49:00.0: not ready 8191ms after FLR; waiting
[ +9.215614] pci 0000:49:00.0: not ready 16383ms after FLR; waiting
[ +17.663489] pci 0000:49:00.0: not ready 32767ms after FLR; waiting
[Sep 9 08:26] pci 0000:49:00.0: not ready 65535ms after FLR; giving up
[ +26.079410] watchdog: BUG: soft lockup - CPU#39 stuck for 27s! [bash:2814]
[ +0.000010] Modules linked in: xt_multiport iptable_filter bpfilter ip6table_filter ip6_tables nls_iso8859_1 intel_rapl_msr intel_rapl_common nvidia_uvm(POE) nvidia_drm(POE) edac_mce_amd nvidia_modeset(POE) snd_hda_codec_hdmi snd_hda_intel iwlmvm snd_intel_dspcfg nvidia(POE) snd_intel_sdw_acpi kvm snd_hda_codec uvcvideo btusb snd_usb_audio mac80211 videobuf2_vmalloc snd_hda_core btrtl videobuf2_memops snd_usbmidi_lib crct10dif_pclmul ghash_clmulni_intel videobuf2_v4l2 btbcm snd_hwdep aesni_intel snd_seq_midi btintel videobuf2_common libarc4 snd_seq_midi_event crypto_simd cryptd snd_rawmidi drm_kms_helper bluetooth videodev iwlwifi pl2303 snd_seq binfmt_misc usbserial mc joydev rapl cec snd_pcm snd_seq_device ecdh_generic cdc_acm gigabyte_wmi wmi_bmof ecc rc_core fb_sys_fops cfg80211 snd_timer syscopyarea sysfillrect snd sysimgblt ccp soundcore k10temp mxm_wmi mac_hid sch_fq_codel nfsd msr auth_rpcgss parport_pc nfs_acl ppdev lockd lp grace ramoops pstore_blk parport drm reed_solomon
[ +0.000090] pstore_zone efi_pstore sunrpc ip_tables x_tables autofs4 hid_generic usbhid hid crc32_pclmul igb ahci nvme libahci i2c_algo_bit dca xhci_pci nvme_core i2c_piix4 xhci_pci_renesas wmi
[ +0.000022] CPU: 39 PID: 2814 Comm: bash Tainted: P OE 5.15.0-69-generic #76~20.04.1-Ubuntu
[ +0.000005] Hardware name: Gigabyte Technology Co., Ltd. TRX40 AORUS PRO WIFI/TRX40 AORUS PRO WIFI, BIOS FBi 04/12/2021
[ +0.000002] RIP: 0010:pci_mmcfg_read+0xbb/0xf0
[ +0.000008] Code: 63 c5 48 01 d8 66 8b 00 0f b7 c0 41 89 04 24 eb d8 49 63 c5 48 01 d8 8a 00 0f b6 c0 41 89 04 24 eb c7 49 63 c5 48 01 d8 8b 00 <41> 89 04 24 eb b9 e8 6a c9 4a ff 5b 41 c7 04 24 ff ff ff ff b8 ea
[ +0.000003] RSP: 0018:ffffb066c5557b70 EFLAGS: 00000286
[ +0.000003] RAX: 00000000ffffffff RBX: ffffb066d4900000 RCX: 0000000000000ffc
[ +0.000002] RDX: 00000000000000ff RSI: 0000000000000049 RDI: 0000000000000000
[ +0.000002] RBP: ffffb066c5557b98 R08: 0000000000000004 R09: ffffb066c5557bcc
[ +0.000002] R10: 0000000000000000 R11: 0000000000000001 R12: ffffb066c5557bcc
[ +0.000002] R13: 0000000000000ffc R14: 0000000000000000 R15: 0000000000000004
[ +0.000002] FS: 00007f367b121740(0000) GS:ffff9697cd9c0000(0000) knlGS:0000000000000000
[ +0.000002] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ +0.000002] CR2: 00007f87618d1140 CR3: 000000018c388000 CR4: 0000000000350ee0
[ +0.000003] Call Trace:
[ +0.000002] <TASK>
[ +0.000004] raw_pci_read+0x29/0x50
[ +0.000004] pci_read+0x2c/0x40
[ +0.000004] pci_bus_read_config_dword+0x4a/0x70
[ +0.000005] pci_read_config_dword+0x27/0x40
[ +0.000003] pci_find_next_ext_capability.part.0+0x95/0xe0
[ +0.000005] pci_find_ext_capability.part.0+0x12/0x20
[ +0.000002] pci_restore_state.part.0+0x53/0x510
[ +0.000005] pci_dev_restore+0x5c/0x70
[ +0.000003] pci_reset_function+0x4d/0x80
[ +0.000003] reset_store+0x5d/0xb0
[ +0.000004] dev_attr_store+0x17/0x30
[ +0.000004] sysfs_kf_write+0x3e/0x50
[ +0.000005] kernfs_fop_write_iter+0x13c/0x1d0
[ +0.000004] new_sync_write+0x117/0x1b0
[ +0.000005] vfs_write+0x189/0x270
[ +0.000003] ksys_write+0x67/0xf0
[ +0.000003] __x64_sys_write+0x1a/0x20
[ +0.000003] do_syscall_64+0x5c/0xc0
[ +0.000005] ? syscall_exit_to_user_mode+0x27/0x50
[ +0.000004] ? __x64_sys_close+0x12/0x50
[ +0.000004] ? do_syscall_64+0x69/0xc0
[ +0.000003] ? exit_to_user_mode_prepare+0x3d/0x1c0
[ +0.000006] ? syscall_exit_to_user_mode+0x27/0x50
[ +0.000003] ? do_syscall_64+0x69/0xc0
[ +0.000003] ? do_syscall_64+0x69/0xc0
[ +0.000004] ? do_syscall_64+0x69/0xc0
[ +0.000004] entry_SYSCALL_64_after_hwframe+0x61/0xcb
[ +0.000004] RIP: 0033:0x7f367b2320a7
[ +0.000004] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ +0.000003] RSP: 002b:00007ffee5541748 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ +0.000003] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f367b2320a7
[ +0.000003] RDX: 0000000000000002 RSI: 000055a7168206d0 RDI: 0000000000000001
[ +0.000002] RBP: 000055a7168206d0 R08: 000000000000000a R09: 0000000000000001
[ +0.000002] R10: 000055a715339017 R11: 0000000000000246 R12: 0000000000000002
[ +0.000001] R13: 00007f367b3116a0 R14: 00007f367b30d4a0 R15: 00007f367b30c8a0
[ +0.000004] </TASK>
Note: in :48 reset_method
spells bus
, in :49 it spells flr bus
After this hang a simple power cycle of the computer (without long wait) resurrected card :49.
Note Renesas cards are well known, there are (nonrelevant) askubuntu questions about them, and ultimately we could ask support from the company: https://en-support.renesas.com/dashboard
(ETA: submitted ticket https://en-support.renesas.com/ticket/392654)
dmesg | grep 0000:48
where the card is ok:
[ 1.916889] pci 0000:48:00.0: [1912:0014] type 00 class 0x0c0330
[ 1.916911] pci 0000:48:00.0: reg 0x10: [mem 0xd6700000-0xd6701fff 64bit]
[ 1.917016] pci 0000:48:00.0: PME# supported from D0 D3hot
[ 1.948574] pci_bus 0000:48: resource 1 [mem 0xd6700000-0xd67fffff]
[ 1.950520] pci 0000:48:00.0: Adding to iommu group 52
[ 2.288471] xhci_hcd 0000:48:00.0: failed to load firmware renesas_usb_fw.mem, fallback to ROM
[ 2.288613] xhci_hcd 0000:48:00.0: xHCI Host Controller
[ 2.288620] xhci_hcd 0000:48:00.0: new USB bus registered, assigned bus number 9
[ 2.288638] xhci_hcd 0000:48:00.0: Zeroing 64bit base registers, expecting fault
[ 2.326046] xhci_hcd 0000:48:00.0: hcc params 0x014051c7 hci version 0x100 quirks 0x0000001100000410
[ 2.326579] xhci_hcd 0000:48:00.0: xHCI Host Controller
[ 2.326583] xhci_hcd 0000:48:00.0: new USB bus registered, assigned bus number 10
[ 2.326589] xhci_hcd 0000:48:00.0: Host supports USB 3.0 SuperSpeed
[ 2.329674] usb usb9: SerialNumber: 0000:48:00.0
[ 2.329926] usb usb10: SerialNumber: 0000:48:00.0
but e.g. 04e does not complain failed to load firmware renesas_usb_fw.mem, fallback to ROM
. Kernel version? BIOS setting? (suggested is that legacy power saving interferes)
last-asocs -c "dmesg | grep renesas_usb_fw.mem"
:
5.15.0-92-generic
).Tried also at no avail things like
rmmod xhci_pci xhci_pci_renesas
modprobe xhci_pci xhci_pci_renesas
which on 08w unload and rediscover usb ports (trace in dmesg), but not the missing card.
Seen it in the last days on last08w, which has currently kernel 5.15.0-107-generic. According to what I wrote above the missing card should be that at 0000:48:00.0 ( lspci shows 49:00.0), but echo 1 > /sys/bus/pci/devices/0000:48:00.0/remove
causes a lot of messages in dmesg apparently related to a disk...
[Oct31 08:29] blk_update_request: I/O error, dev sda, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
[ +0.190489] blk_update_request: I/O error, dev sda, sector 1617672704 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
[ +0.004908] blk_update_request: I/O error, dev sda, sector 1563167488 op 0x0:(READ) flags 0x80700 phys_seg 64 prio class 0
[ +0.000083] blk_update_request: I/O error, dev sda, sector 1563168000 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
[ +0.000049] blk_update_request: I/O error, dev sda, sector 1563167488 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ +0.000028] blk_update_request: I/O error, dev sda, sector 1563167488 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ +0.000020] blk_update_request: I/O error, dev sda, sector 1563167488 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ +0.081139] blk_update_request: I/O error, dev sda, sector 1509743104 op 0x0:(READ) flags 0x80700 phys_seg 25 prio class 0
[ +0.133442] blk_update_request: I/O error, dev sda, sector 1563167744 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ +0.001404] blk_update_request: I/O error, dev sda, sector 1563168256 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ +4.590364] print_req_error: 16291 callbacks suppressed
[ +0.000004] blk_update_request: I/O error, dev sda, sector 1564967936 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ +0.000024] blk_update_request: I/O error, dev sda, sector 1564967936 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ +0.000583] blk_update_request: I/O error, dev sda, sector 5734992384 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ +0.000026] blk_update_request: I/O error, dev sda, sector 5734992384 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ +0.000207] blk_update_request: I/O error, dev sda, sector 1511229440 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ +0.000016] blk_update_request: I/O error, dev sda, sector 1511229440 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ +0.001587] blk_update_request: I/O error, dev sda, sector 5734992896 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ +0.000009] blk_update_request: I/O error, dev sda, sector 5734992896 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ +0.000005] blk_update_request: I/O error, dev sda, sector 1511229952 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ +0.000007] blk_update_request: I/O error, dev sda, sector 1511229952 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[ +0.454363] Aborting journal on device sda1-8.
[ +0.000015] Buffer I/O error on dev sda1, logical block 1708687360, lost sync page write
[ +0.000007] JBD2: Error -5 detected when updating journal superblock for sda1-8.
[ +0.020253] EXT4-fs error (device sda1): ext4_journal_check_start:83: comm rsync: Detected aborted journal
[ +0.000077] Buffer I/O error on dev sda1, logical block 0, lost sync page write
[ +0.000007] EXT4-fs (sda1): I/O error while writing superblock
[ +0.000003] EXT4-fs (sda1): Remounting filesystem read-only
[ +4.526210] print_req_error: 18821 callbacks suppressed
...
after this a simple reboot did not bring up the missing card.
Turning off and then on 8w for 13min resurrected the card. Which btw is on 4a, not on 48, explaining the errors above.
For the record, at this moment in time M8F4 and M8C3 appear to be connected to that card (and hence F3 and C4 on the other).
Repeated attempts.
checkWholeUnit
report: