blumzi / LAST_issues

A place to discuss and manage LAST issues
0 stars 0 forks source link

M1F2, M1C3 and M1C4 don't connect #55

Open EastEriq opened 3 months ago

EastEriq commented 3 months ago

Repeated attempts. checkWholeUnit report:

{13:37:52.192|obs.unitCS[01]} Checking definitions and connections of unit 01:
{13:37:52.427|obs.unitCS[01]} Slave 1 status: "alive"
{13:37:52.556|obs.unitCS[01]} Slave 2 status: "alive"
{13:37:52.700|obs.unitCS[01]} Slave 3 status: "alive"
{13:37:54.601|obs.unitCS[01]} Slave 4 status: "alive"
{13:37:55.271|obs.unitCS[01]} mount is powered
{13:37:56.008|obs.unitCS[01]} checking status of camera 1
{13:37:56.080|obs.unitCS[01]} camera 1 is idle, good
{13:37:57.793|obs.unitCS[01]} checking status of camera 2
{13:37:57.882|obs.unitCS[01]} camera 2 is idle, good
{13:37:59.249|obs.unitCS[01]} checking status of camera 3
{13:37:59.361|obs.unitCS[01]} camera 3 status is unknown, bad sign
{13:38:00.018|obs.unitCS[01]} camera QHY600M-a51e1c41b5d4b48f7 is not even known registered on the computer
{13:38:00.018|obs.unitCS[01]} check if the camera is physically connected and powered,
{13:38:00.018|obs.unitCS[01]}   or otherwise check that the obs.camera configuration file is correct
{13:38:00.655|obs.unitCS[01]} checking status of camera 4
{13:38:00.725|obs.unitCS[01]} camera 4 status is unknown, bad sign
{13:38:01.431|obs.unitCS[01]} camera QHY600M-14b54b419653bc47e is not even known registered on the computer
{13:38:01.431|obs.unitCS[01]} check if the camera is physically connected and powered,
{13:38:01.431|obs.unitCS[01]}   or otherwise check that the obs.camera configuration file is correct
{13:38:02.114|obs.unitCS[01]} focuser 1 check passed
{13:38:02.305|obs.unitCS[01]} focuser 2 is unknown, try again later
{13:38:03.292|obs.unitCS[01]} focuser 3 check passed
{13:38:04.000|obs.unitCS[01]} focuser 4 check passed
{13:38:04.032|obs.unitCS[01]} check failed!
EastEriq commented 3 months ago

cameras don't appear on last01w dmesg -wH when power cycled. Will try to reboot the computer.

EastEriq commented 3 months ago

same after reboot of both computers

EastEriq commented 3 months ago
$ ssh ocs@10.23.1.1 "lspci| grep Renesas"
48:00.0 USB controller: Renesas Technology Corp. uPD720201 USB 3.0 Host Controller (rev 03)

(only one, not two)

$ ssh ocs@10.23.1.2 "lspci| grep Renesas"

none

EastEriq commented 3 months ago

anecdotally I remember situations of "what, we've fried the USB card? I'll try other sockets" with last01e/w in the past, which then autosolved themselves. Hm.

EastEriq commented 3 months ago

last-asocs -c "lspci | grep Rene|wc -l": last01e and last03e show only 1 out of 2, last01w none. I vaguely remember that we blamed absence of grounding for such kind of problems. To be checked at the next visit.

EastEriq commented 3 months ago

overcome by pulling out the cord of both computers for half an hour, and powering them up just before starting the nightly observations

EastEriq commented 3 months ago

Some internet pointers about resetting pcie cards, to be perhaps tried, maybe also relevant for #56. But there are side conditions apparently. I didn't bother yet to read through. https://unix.stackexchange.com/questions/73908/how-to-reset-cycle-power-to-a-pcie-device https://unix.stackexchange.com/questions/29775/reset-a-pci-device-in-linux

EastEriq commented 2 months ago

Happened again to Yahel 0n 16/8/24. As predictable, today still

ocs@last01e:~$ last-asocs -c "lspci | grep Rene"

[SECT] last0

[SECT] last01e
48:00.0 USB controller: Renesas Technology Corp. uPD720201 USB 3.0 Host Controller (rev 03)

[SECT] last01w

[SECT] last02e
EastEriq commented 2 months ago

For the record: on the current set of deployed computers, the Renesas cards are on

EastEriq commented 2 months ago

On 1e which at this point in time sees the card on :48 but not that on :49 (both trees appear in /sys/bus/pci/0000:4.... though). 'lshw' says for the former:

        *-pci:1
             description: PCI bridge
             product: Starship/Matisse GPP Bridge
             vendor: Advanced Micro Devices, Inc. [AMD]
             physical id: 1.3
             bus info: pci@0000:40:01.3
             version: 00
             width: 32 bits
             clock: 33MHz
             capabilities: pci pm pciexpress msi ht normal_decode bus_master cap_list
             configuration: driver=pcieport
             resources: irq:40 memory:d6600000-d66fffff
           *-usb
                description: USB controller
                product: uPD720201 USB 3.0 Host Controller
                vendor: Renesas Technology Corp.
                physical id: 0
                bus info: pci@0000:48:00.0
                version: 03
                width: 64 bits
                clock: 33MHz
                capabilities: pm msi msix pciexpress xhci bus_master cap_list
                configuration: driver=xhci_hcd latency=0
                resources: irq:107 memory:d6600000-d6601fff
              *-usbhost:0
                   product: xHCI Host Controller
                   vendor: Linux 5.15.0-69-generic xhci-hcd
                   physical id: 0
                   bus info: usb@9
                   logical name: usb9
                   version: 5.15
                   capabilities: usb-2.00
                   configuration: driver=hub slots=4 speed=480Mbit/s
                 *-usb:0
                      description: Communication device
                      product: Celestron Focuser
                      physical id: 2
                      bus info: usb@9:2
                      version: 0.00
                      serial: MF_v1
                      capabilities: usb-2.00
                      configuration: driver=cdc_acm maxpower=100mA speed=12Mbit/s
                 *-usb:1
                      description: Generic USB device
                      product: USB-Serial Controller D
                      vendor: Prolific Technology Inc.
                      physical id: 4
                      bus info: usb@9:4
                      version: 4.00
                      capabilities: usb-1.10
                      configuration: driver=pl2303 maxpower=100mA speed=12Mbit/s
              *-usbhost:1
                   product: xHCI Host Controller
                   vendor: Linux 5.15.0-69-generic xhci-hcd
                   physical id: 1
                   bus info: usb@10
                   logical name: usb10
                   version: 5.15
                   capabilities: usb-3.00
                   configuration: driver=hub slots=4 speed=5000Mbit/s

whereas the latter

        *-pci:2
             description: PCI bridge
             product: Starship/Matisse Internal PCIe GPP Bridge 0 to bus[E:B]
             vendor: Advanced Micro Devices, Inc. [AMD]
             physical id: 7.1
             bus info: pci@0000:40:07.1
             version: 00
             width: 32 bits
             clock: 33MHz
             capabilities: pci pm pciexpress msi ht normal_decode bus_master cap_list
             configuration: driver=pcieport
             resources: irq:42
           *-generic UNCLAIMED
                description: Non-Essential Instrumentation
                product: Starship/Matisse PCIe Dummy Function
                vendor: Advanced Micro Devices, Inc. [AMD]
                physical id: 0
                bus info: pci@0000:49:00.0
                version: 00
                width: 32 bits
                clock: 33MHz
                capabilities: pm pciexpress cap_list
EastEriq commented 2 months ago

I tried (as sudo su) inconclusively in various order commands like

echo 1 > /sys/devices/pci0000:40:07.1/0000:49:00.0/reset
echo 1 > /sys/devices/pci0000:40:07.1/0000:49:00.0/remove
echo 1 > /sys/bus/pci/rescan
echo 1 > /sys/devices/pci0000:40:07.1/0000:49:00.0/rescan

till I locked the computer with watchdog messages like

Message from syslogd@localhost at Sep  9 08:35:13 ...
 kernel:[ 3476.352383] watchdog: BUG: soft lockup - CPU#18 stuck for 21s! [kworker/18:0:3096]

Message from syslogd@localhost at Sep  9 08:35:25 ...
 kernel:[ 3488.368140] watchdog: BUG: soft lockup - CPU#15 stuck for 22s! [nxserver.bin:2291]

Message from syslogd@localhost at Sep  9 08:35:29 ...
 kernel:[ 3492.373394] watchdog: BUG: soft lockup - CPU#39 stuck for 522s! [bash:2814]

and dump traces in dmesg, before of which I had there

[Sep 9 08:12] pci 0000:49:00.0: Removing from iommu group 52
[Sep 9 08:14] pci 0000:49:00.0: [1022:148a] type 00 class 0x130000
[  +0.000444] pci 0000:49:00.0: Adding to iommu group 52
[Sep 9 08:24] pci 0000:49:00.0: not ready 1023ms after FLR; waiting
[  +2.047872] pci 0000:49:00.0: not ready 2047ms after FLR; waiting
[Sep 9 08:25] pci 0000:49:00.0: not ready 4095ms after FLR; waiting
[  +5.119904] pci 0000:49:00.0: not ready 8191ms after FLR; waiting
[  +9.215614] pci 0000:49:00.0: not ready 16383ms after FLR; waiting
[ +17.663489] pci 0000:49:00.0: not ready 32767ms after FLR; waiting
[Sep 9 08:26] pci 0000:49:00.0: not ready 65535ms after FLR; giving up
[ +26.079410] watchdog: BUG: soft lockup - CPU#39 stuck for 27s! [bash:2814]
[  +0.000010] Modules linked in: xt_multiport iptable_filter bpfilter ip6table_filter ip6_tables nls_iso8859_1 intel_rapl_msr intel_rapl_common nvidia_uvm(POE) nvidia_drm(POE) edac_mce_amd nvidia_modeset(POE) snd_hda_codec_hdmi snd_hda_intel iwlmvm snd_intel_dspcfg nvidia(POE) snd_intel_sdw_acpi kvm snd_hda_codec uvcvideo btusb snd_usb_audio mac80211 videobuf2_vmalloc snd_hda_core btrtl videobuf2_memops snd_usbmidi_lib crct10dif_pclmul ghash_clmulni_intel videobuf2_v4l2 btbcm snd_hwdep aesni_intel snd_seq_midi btintel videobuf2_common libarc4 snd_seq_midi_event crypto_simd cryptd snd_rawmidi drm_kms_helper bluetooth videodev iwlwifi pl2303 snd_seq binfmt_misc usbserial mc joydev rapl cec snd_pcm snd_seq_device ecdh_generic cdc_acm gigabyte_wmi wmi_bmof ecc rc_core fb_sys_fops cfg80211 snd_timer syscopyarea sysfillrect snd sysimgblt ccp soundcore k10temp mxm_wmi mac_hid sch_fq_codel nfsd msr auth_rpcgss parport_pc nfs_acl ppdev lockd lp grace ramoops pstore_blk parport drm reed_solomon
[  +0.000090]  pstore_zone efi_pstore sunrpc ip_tables x_tables autofs4 hid_generic usbhid hid crc32_pclmul igb ahci nvme libahci i2c_algo_bit dca xhci_pci nvme_core i2c_piix4 xhci_pci_renesas wmi
[  +0.000022] CPU: 39 PID: 2814 Comm: bash Tainted: P           OE     5.15.0-69-generic #76~20.04.1-Ubuntu
[  +0.000005] Hardware name: Gigabyte Technology Co., Ltd. TRX40 AORUS PRO WIFI/TRX40 AORUS PRO WIFI, BIOS FBi 04/12/2021
[  +0.000002] RIP: 0010:pci_mmcfg_read+0xbb/0xf0
[  +0.000008] Code: 63 c5 48 01 d8 66 8b 00 0f b7 c0 41 89 04 24 eb d8 49 63 c5 48 01 d8 8a 00 0f b6 c0 41 89 04 24 eb c7 49 63 c5 48 01 d8 8b 00 <41> 89 04 24 eb b9 e8 6a c9 4a ff 5b 41 c7 04 24 ff ff ff ff b8 ea
[  +0.000003] RSP: 0018:ffffb066c5557b70 EFLAGS: 00000286
[  +0.000003] RAX: 00000000ffffffff RBX: ffffb066d4900000 RCX: 0000000000000ffc
[  +0.000002] RDX: 00000000000000ff RSI: 0000000000000049 RDI: 0000000000000000
[  +0.000002] RBP: ffffb066c5557b98 R08: 0000000000000004 R09: ffffb066c5557bcc
[  +0.000002] R10: 0000000000000000 R11: 0000000000000001 R12: ffffb066c5557bcc
[  +0.000002] R13: 0000000000000ffc R14: 0000000000000000 R15: 0000000000000004
[  +0.000002] FS:  00007f367b121740(0000) GS:ffff9697cd9c0000(0000) knlGS:0000000000000000
[  +0.000002] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  +0.000002] CR2: 00007f87618d1140 CR3: 000000018c388000 CR4: 0000000000350ee0
[  +0.000003] Call Trace:
[  +0.000002]  <TASK>
[  +0.000004]  raw_pci_read+0x29/0x50
[  +0.000004]  pci_read+0x2c/0x40
[  +0.000004]  pci_bus_read_config_dword+0x4a/0x70
[  +0.000005]  pci_read_config_dword+0x27/0x40
[  +0.000003]  pci_find_next_ext_capability.part.0+0x95/0xe0
[  +0.000005]  pci_find_ext_capability.part.0+0x12/0x20
[  +0.000002]  pci_restore_state.part.0+0x53/0x510
[  +0.000005]  pci_dev_restore+0x5c/0x70
[  +0.000003]  pci_reset_function+0x4d/0x80
[  +0.000003]  reset_store+0x5d/0xb0
[  +0.000004]  dev_attr_store+0x17/0x30
[  +0.000004]  sysfs_kf_write+0x3e/0x50
[  +0.000005]  kernfs_fop_write_iter+0x13c/0x1d0
[  +0.000004]  new_sync_write+0x117/0x1b0
[  +0.000005]  vfs_write+0x189/0x270
[  +0.000003]  ksys_write+0x67/0xf0
[  +0.000003]  __x64_sys_write+0x1a/0x20
[  +0.000003]  do_syscall_64+0x5c/0xc0
[  +0.000005]  ? syscall_exit_to_user_mode+0x27/0x50
[  +0.000004]  ? __x64_sys_close+0x12/0x50
[  +0.000004]  ? do_syscall_64+0x69/0xc0
[  +0.000003]  ? exit_to_user_mode_prepare+0x3d/0x1c0
[  +0.000006]  ? syscall_exit_to_user_mode+0x27/0x50
[  +0.000003]  ? do_syscall_64+0x69/0xc0
[  +0.000003]  ? do_syscall_64+0x69/0xc0
[  +0.000004]  ? do_syscall_64+0x69/0xc0
[  +0.000004]  entry_SYSCALL_64_after_hwframe+0x61/0xcb
[  +0.000004] RIP: 0033:0x7f367b2320a7
[  +0.000004] Code: 64 89 02 48 c7 c0 ff ff ff ff eb bb 0f 1f 80 00 00 00 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[  +0.000003] RSP: 002b:00007ffee5541748 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[  +0.000003] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007f367b2320a7
[  +0.000003] RDX: 0000000000000002 RSI: 000055a7168206d0 RDI: 0000000000000001
[  +0.000002] RBP: 000055a7168206d0 R08: 000000000000000a R09: 0000000000000001
[  +0.000002] R10: 000055a715339017 R11: 0000000000000246 R12: 0000000000000002
[  +0.000001] R13: 00007f367b3116a0 R14: 00007f367b30d4a0 R15: 00007f367b30c8a0
[  +0.000004]  </TASK>

Note: in :48 reset_method spells bus, in :49 it spells flr bus

EastEriq commented 2 months ago

After this hang a simple power cycle of the computer (without long wait) resurrected card :49.

EastEriq commented 2 months ago

Note Renesas cards are well known, there are (nonrelevant) askubuntu questions about them, and ultimately we could ask support from the company: https://en-support.renesas.com/dashboard

(ETA: submitted ticket https://en-support.renesas.com/ticket/392654)

EastEriq commented 2 months ago

dmesg | grep 0000:48 where the card is ok:

[    1.916889] pci 0000:48:00.0: [1912:0014] type 00 class 0x0c0330
[    1.916911] pci 0000:48:00.0: reg 0x10: [mem 0xd6700000-0xd6701fff 64bit]
[    1.917016] pci 0000:48:00.0: PME# supported from D0 D3hot
[    1.948574] pci_bus 0000:48: resource 1 [mem 0xd6700000-0xd67fffff]
[    1.950520] pci 0000:48:00.0: Adding to iommu group 52
[    2.288471] xhci_hcd 0000:48:00.0: failed to load firmware renesas_usb_fw.mem, fallback to ROM
[    2.288613] xhci_hcd 0000:48:00.0: xHCI Host Controller
[    2.288620] xhci_hcd 0000:48:00.0: new USB bus registered, assigned bus number 9
[    2.288638] xhci_hcd 0000:48:00.0: Zeroing 64bit base registers, expecting fault
[    2.326046] xhci_hcd 0000:48:00.0: hcc params 0x014051c7 hci version 0x100 quirks 0x0000001100000410
[    2.326579] xhci_hcd 0000:48:00.0: xHCI Host Controller
[    2.326583] xhci_hcd 0000:48:00.0: new USB bus registered, assigned bus number 10
[    2.326589] xhci_hcd 0000:48:00.0: Host supports USB 3.0 SuperSpeed
[    2.329674] usb usb9: SerialNumber: 0000:48:00.0
[    2.329926] usb usb10: SerialNumber: 0000:48:00.0

but e.g. 04e does not complain failed to load firmware renesas_usb_fw.mem, fallback to ROM. Kernel version? BIOS setting? (suggested is that legacy power saving interferes)

EastEriq commented 2 months ago

last-asocs -c "dmesg | grep renesas_usb_fw.mem":

Tried also at no avail things like

rmmod xhci_pci xhci_pci_renesas
modprobe xhci_pci xhci_pci_renesas

which on 08w unload and rediscover usb ports (trace in dmesg), but not the missing card.

EastEriq commented 1 week ago

Seen it in the last days on last08w, which has currently kernel 5.15.0-107-generic. According to what I wrote above the missing card should be that at 0000:48:00.0 ( lspci shows 49:00.0), but echo 1 > /sys/bus/pci/devices/0000:48:00.0/remove causes a lot of messages in dmesg apparently related to a disk...

[Oct31 08:29] blk_update_request: I/O error, dev sda, sector 0 op 0x1:(WRITE) flags 0x800 phys_seg 0 prio class 0
[  +0.190489] blk_update_request: I/O error, dev sda, sector 1617672704 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
[  +0.004908] blk_update_request: I/O error, dev sda, sector 1563167488 op 0x0:(READ) flags 0x80700 phys_seg 64 prio class 0
[  +0.000083] blk_update_request: I/O error, dev sda, sector 1563168000 op 0x0:(READ) flags 0x80700 phys_seg 32 prio class 0
[  +0.000049] blk_update_request: I/O error, dev sda, sector 1563167488 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  +0.000028] blk_update_request: I/O error, dev sda, sector 1563167488 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  +0.000020] blk_update_request: I/O error, dev sda, sector 1563167488 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  +0.081139] blk_update_request: I/O error, dev sda, sector 1509743104 op 0x0:(READ) flags 0x80700 phys_seg 25 prio class 0
[  +0.133442] blk_update_request: I/O error, dev sda, sector 1563167744 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  +0.001404] blk_update_request: I/O error, dev sda, sector 1563168256 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  +4.590364] print_req_error: 16291 callbacks suppressed
[  +0.000004] blk_update_request: I/O error, dev sda, sector 1564967936 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  +0.000024] blk_update_request: I/O error, dev sda, sector 1564967936 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  +0.000583] blk_update_request: I/O error, dev sda, sector 5734992384 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  +0.000026] blk_update_request: I/O error, dev sda, sector 5734992384 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  +0.000207] blk_update_request: I/O error, dev sda, sector 1511229440 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  +0.000016] blk_update_request: I/O error, dev sda, sector 1511229440 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  +0.001587] blk_update_request: I/O error, dev sda, sector 5734992896 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  +0.000009] blk_update_request: I/O error, dev sda, sector 5734992896 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  +0.000005] blk_update_request: I/O error, dev sda, sector 1511229952 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[  +0.000007] blk_update_request: I/O error, dev sda, sector 1511229952 op 0x0:(READ) flags 0x0 phys_seg 1 prio class 0
[  +0.454363] Aborting journal on device sda1-8.
[  +0.000015] Buffer I/O error on dev sda1, logical block 1708687360, lost sync page write
[  +0.000007] JBD2: Error -5 detected when updating journal superblock for sda1-8.
[  +0.020253] EXT4-fs error (device sda1): ext4_journal_check_start:83: comm rsync: Detected aborted journal
[  +0.000077] Buffer I/O error on dev sda1, logical block 0, lost sync page write
[  +0.000007] EXT4-fs (sda1): I/O error while writing superblock
[  +0.000003] EXT4-fs (sda1): Remounting filesystem read-only
[  +4.526210] print_req_error: 18821 callbacks suppressed
...

after this a simple reboot did not bring up the missing card.

EastEriq commented 1 week ago

Turning off and then on 8w for 13min resurrected the card. Which btw is on 4a, not on 48, explaining the errors above.

For the record, at this moment in time M8F4 and M8C3 appear to be connected to that card (and hence F3 and C4 on the other).