Plebian-Linux / quartz64-images

GitHub Actions Repository for automatically generated images for the Quartz64 family of single board computers
https://plebian.org
GNU General Public License v3.0

HARDWARE ISSUE: Failures with SOQuartz & TuringPi2 #40

Closed by acelinkio 1 year ago

acelinkio commented 1 year ago

EDIT: PROBLEM WAS ISOLATED TO ONE COMPUTE MODULE. IDENTIFIED TO BE HARDWARE ISSUE.

Hey!

I recently got started with Plebian on SOQuartz compute modules hosted in a TuringPi2. Everything appears to work except when using slot3 of the TuringPi2, where the SOQuartz compute module crashes and also brings down the rest of the networking on the TuringPi2.

Able to reproduce with the following:

The compute module crashes and all devices on the TuringPi2 become unreachable. Everything recovers when the module in slot3 is powered off. Powering the compute module in slot3 back on causes another crash within ~15 minutes.

I have no issues with SOQuartz in slots 1, 2, and 4. The major difference between those and slot 3 is that slot 3 has two SATA ports exposed (see https://help.turingpi.com/hc/en-us/articles/8685766680477-Specifications-and-I-O-Ports ). M.2 is not exposed in any port because there is only one PCIe lane on SOQuartz. The adapter at https://turingpi.com/product/cm4-adapter/ is being used to connect the SOQuartz.

Unsure if it is related, but there is a warning during the open-iscsi installation.

W: Possible missing firmware /lib/firmware/rockchip/dptx.bin for module rockchipdrm
update-initramfs: Generating /boot/initrd.img-6.1.0-7-arm64

CounterPillow commented 1 year ago

Do they have schematics available somewhere? The way the CM4 image works might not be compatible with this carrier board, and I'd need to take a closer look at how things are wired up to figure out what's going on.

Though I assume the SATA is done through a PCIe SATA controller, which might make this related to the PCIe ranges bug which I should finally upstream the fix for.

acelinkio commented 1 year ago

Reaching out to folks to see if they can provide schematics.

Also, I forgot to mention this issue, which seems related: https://github.com/wenyi0421/turing-pi/issues/13 although this image appears to be u-boot and is able to load successfully.

daniel-kukiela commented 1 year ago

Hi! To start, I'm not a part of the Turing Machines team, but I know a lot about the Turing Pi 2 board and can answer some questions. I also talked about this problem with @acelinkit over the Turing Pi Discord server to narrow down the possible cause.

Node 3 indeed has a SATA controller hooked up over PCIe. The chip used is the ASM1061. It works fine (and out of the box) with CM4 + Raspberry Pi OS and CM4 + DietPi. Raspberry Pi + Ubuntu needs an additional package (linux-modules-extra-raspi). Nvidia Jetson modules also work out of the box with this controller. In case you wonder how CM4 (and CM4-compatible) modules can be mixed with the Jetson modules: the TPI2 board has Jetson-compatible SODIMM connectors, and to insert a CM4 module you use a so-called adapter board.

The schematics are not available publicly but do not hesitate to ask any follow-up questions if you have any.

CounterPillow commented 1 year ago

Okay, I'll make a patched Plebian devicetree package with the PCIe ranges fix in the coming days and have you try that out. It's a shot in the dark, but it might be related. Basically, right now, the memory ranges set for PCIe are a bit scuffed in the mainline kernel, which wreaks havoc with some PCIe devices.

CounterPillow commented 1 year ago

Okay, here's a devicetree deb for you to try out with fixed PCIe ranges: https://overviewer.org/~pillow/up/75bea78e59/devicetrees-plebian-quartz64-20230601130309-arm64.deb

Install with sudo dpkg -i devicetrees-plebian-quartz64-20230601130309-arm64.deb and then reboot.

Let me know if this improves things in any way.

acelinkio commented 1 year ago

Reimaged each of the SOQuartz modules and applied that package. Kubernetes and Longhorn have been running stably for the last 2 hours.

Will follow up tomorrow and let you know if anything comes up.

acelinkio commented 1 year ago

Had a couple of errors this morning. Decided to swap some of the modules around and noticed the problem following one specific module. Installed it into slot 1 which has HDMI output and captured this.

[Image: 20230602_095331]

From there I also grabbed some of the logs from journalctl -p 3 -x

Jun 02 05:26:06 soquartz3 kernel: bluetooth hci0: firmware: failed to load brcm/BCM.pine64,soquartz-cm4io.hcd (-2)
Jun 02 05:26:06 soquartz3 kernel: bluetooth hci0: firmware: failed to load brcm/BCM.hcd (-2)
Jun 02 05:26:06 soquartz3 kernel: bluetooth hci0: firmware: failed to load brcm/BCM.hcd (-2)
Jun 02 05:26:06 soquartz3 kernel: Bluetooth: hci0: BCM: firmware Patch file not found, tried:
Jun 02 05:26:06 soquartz3 kernel: Bluetooth: hci0: BCM: 'brcm/BCM4345C0.pine64,soquartz-cm4io.hcd'
Jun 02 05:26:06 soquartz3 kernel: Bluetooth: hci0: BCM: 'brcm/BCM4345C0.hcd'
Jun 02 05:26:06 soquartz3 kernel: Bluetooth: hci0: BCM: 'brcm/BCM.pine64,soquartz-cm4io.hcd'
Jun 02 05:26:06 soquartz3 kernel: Bluetooth: hci0: BCM: 'brcm/BCM.hcd'
Jun 02 05:26:06 soquartz3 kernel: brcmfmac: brcmf_sdio_htclk: HT Avail timeout (1000000): clkctl 0x50
Jun 02 05:26:06 soquartz3 bluetoothd[523]: src/plugin.c:plugin_init() Failed to init vcp plugin
Jun 02 05:26:06 soquartz3 bluetoothd[523]: src/plugin.c:plugin_init() Failed to init mcp plugin
Jun 02 05:26:06 soquartz3 bluetoothd[523]: src/plugin.c:plugin_init() Failed to init bap plugin
Jun 02 05:26:07 soquartz3 bluetoothd[523]: profiles/sap/server.c:sap_server_register() Sap driver initialization failed.
Jun 02 05:26:07 soquartz3 bluetoothd[523]: sap-server: Operation not permitted (1)
Jun 02 05:26:07 soquartz3 systemctl[547]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Jun 02 06:24:05 soquartz3 kernel: Unable to handle kernel paging request at virtual address ffff8000015ec028
Jun 02 06:24:05 soquartz3 kernel: Mem abort info:
Jun 02 06:24:05 soquartz3 kernel:   ESR = 0x0000000086000004
Jun 02 06:24:05 soquartz3 kernel:   EC = 0x21: IABT (current EL), IL = 32 bits
Jun 02 06:24:05 soquartz3 kernel:   SET = 0, FnV = 0
Jun 02 06:24:05 soquartz3 kernel:   EA = 0, S1PTW = 0
Jun 02 06:24:05 soquartz3 kernel:   FSC = 0x04: level 0 translation fault
Jun 02 06:24:05 soquartz3 kernel: swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000003c69000
Jun 02 06:24:05 soquartz3 kernel: [ffff8000015ec028] pgd=10000000effff003, p4d=10000000effff003, pud=10000000efffe003, pmd=1000000004ca7003, pte=004000000dd02783
Jun 02 06:24:05 soquartz3 kernel: Internal error: Oops: 0000000086000004 [#1] SMP

and another snippet

-- Boot 8ba7fa9e3d84472b9d57ca56501398f2 --
Feb 28 11:15:47 soquartz3 kernel: arm-scmi firmware:scmi: Failed. SCMI protocol 22 not active.
Feb 28 11:15:47 soquartz3 kernel: arm-scmi firmware:scmi: Failed. SCMI protocol 17 not active.
Feb 28 11:15:47 soquartz3 kernel: rockchip-naneng-combphy fe840000.phy: failed to create combphy
Feb 28 11:15:47 soquartz3 kernel: rockchip-naneng-combphy fe840000.phy: failed to create combphy
Feb 28 11:15:47 soquartz3 kernel: rockchip-naneng-combphy fe840000.phy: failed to create combphy
Feb 28 11:15:47 soquartz3 kernel: rtc-pcf85063 1-0051: RTC chip is not present
Feb 28 11:15:47 soquartz3 kernel: rk_gmac-dwmac fe010000.ethernet: phy regulator is not available yet, deferred probing
Feb 28 11:15:47 soquartz3 kernel: rk_gmac-dwmac fe010000.ethernet: phy regulator is not available yet, deferred probing
Feb 28 11:15:47 soquartz3 kernel: rk_gmac-dwmac fe010000.ethernet: phy regulator is not available yet, deferred probing
Feb 28 11:15:47 soquartz3 kernel: rk_gmac-dwmac fe010000.ethernet: phy regulator is not available yet, deferred probing
Jun 02 08:41:23 soquartz3 kernel: brcmfmac mmc2:0001:1: firmware: failed to load brcm/brcmfmac43455-sdio.pine64,soquartz-cm4io.bin (-2)
Jun 02 08:41:23 soquartz3 kernel: brcmfmac mmc2:0001:1: firmware: failed to load brcm/brcmfmac43455-sdio.pine64,soquartz-cm4io.txt (-2)
Jun 02 08:41:23 soquartz3 kernel: brcmfmac mmc2:0001:1: firmware: failed to load brcm/brcmfmac43455-sdio.pine64,soquartz-cm4io.txt (-2)
Jun 02 08:41:23 soquartz3 kernel: brcmfmac mmc2:0001:1: firmware: failed to load brcm/brcmfmac43455-sdio.txt (-2)
Jun 02 08:41:23 soquartz3 kernel: brcmfmac mmc2:0001:1: firmware: failed to load brcm/brcmfmac43455-sdio.txt (-2)
Jun 02 08:41:24 soquartz3 kernel: of_dma_request_slave_channel: dma-names property of node '/serial@fe650000' missing or empty
Jun 02 08:41:24 soquartz3 kernel: brcmfmac: brcmf_sdio_htclk: HT Avail timeout (1000000): clkctl 0x50
Jun 02 08:41:25 soquartz3 bluetoothd[554]: src/plugin.c:plugin_init() Failed to init vcp plugin
Jun 02 08:41:25 soquartz3 bluetoothd[554]: src/plugin.c:plugin_init() Failed to init mcp plugin
Jun 02 08:41:25 soquartz3 bluetoothd[554]: src/plugin.c:plugin_init() Failed to init bap plugin
Jun 02 08:41:25 soquartz3 systemctl[567]: Failed to get unit file state for nm-cloud-setup.service: No such file or directory
Jun 02 08:41:26 soquartz3 kernel: Bluetooth: hci0: command 0x0c03 tx timeout
Jun 02 08:41:34 soquartz3 kernel: Bluetooth: hci0: BCM: Reset failed (-110)
Jun 02 14:15:47 soquartz3 kernel: Unable to handle kernel paging request at virtual address ffff8000015d8028
Jun 02 14:15:47 soquartz3 kernel: Mem abort info:
Jun 02 14:15:47 soquartz3 kernel:   ESR = 0x0000000086000004
Jun 02 14:15:47 soquartz3 kernel:   EC = 0x21: IABT (current EL), IL = 32 bits
Jun 02 14:15:47 soquartz3 kernel:   SET = 0, FnV = 0
Jun 02 14:15:47 soquartz3 kernel:   EA = 0, S1PTW = 0
Jun 02 14:15:47 soquartz3 kernel:   FSC = 0x04: level 0 translation fault
Jun 02 14:15:47 soquartz3 kernel: swapper pgtable: 4k pages, 48-bit VAs, pgdp=0000000003c69000
Jun 02 14:15:47 soquartz3 kernel: [ffff8000015d8028] pgd=10000000effff003, p4d=10000000effff003, pud=10000000efffe003, pmd=1000000006080003, pte=0040000004bd6783
Jun 02 14:15:47 soquartz3 kernel: Internal error: Oops: 0000000086000004 [#1] SMP
-- Boot c3d9b0fdc1c84c34a853c6f364c89122 --

Currently running some memtester commands to test the memory.
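For anyone reading the oops traces above, here is a minimal Python sketch of how the "Mem abort info" fields follow from the raw ESR value. The bit positions are the standard ESR_ELx layout from the Arm architecture; the function names (esr_values, decode_esr, describe_fsc) are made up for illustration, and only the instruction/data abort classes and translation faults are named, so treat it as a reading aid rather than a complete decoder.

```python
import re

# Names for the abort exception classes (ESR_ELx.EC). Anything else is
# reported numerically; this table is deliberately incomplete.
EC_NAMES = {
    0x20: "IABT (lower EL)",
    0x21: "IABT (current EL)",
    0x24: "DABT (lower EL)",
    0x25: "DABT (current EL)",
}

# Matches lines such as "kernel:   ESR = 0x0000000086000004".
ESR_LINE = re.compile(r"ESR = (0x[0-9a-fA-F]+)")


def esr_values(journal_text):
    """Return every ESR value found in journalctl text, as integers."""
    return [int(m.group(1), 16) for m in ESR_LINE.finditer(journal_text)]


def decode_esr(esr):
    """Split an ESR_ELx value into the fields shown under 'Mem abort info'."""
    return {
        "EC": (esr >> 26) & 0x3F,             # exception class, bits [31:26]
        "IL": 32 if (esr >> 25) & 1 else 16,  # instruction length, bit 25
        "SET": (esr >> 11) & 0x3,             # bits [12:11]
        "FnV": (esr >> 10) & 1,               # bit 10
        "EA": (esr >> 9) & 1,                 # bit 9
        "S1PTW": (esr >> 7) & 1,              # bit 7
        "FSC": esr & 0x3F,                    # fault status code, bits [5:0]
    }


def describe_fsc(fsc):
    """Name translation faults (levels 0-3); other codes stay numeric."""
    if 0x04 <= fsc <= 0x07:
        return f"level {fsc & 0x3} translation fault"
    return f"fault status 0x{fsc:02x}"


if __name__ == "__main__":
    sample = "Jun 02 06:24:05 soquartz3 kernel:   ESR = 0x0000000086000004"
    for esr in esr_values(sample):
        f = decode_esr(esr)
        ec_name = EC_NAMES.get(f["EC"], "unknown class")
        print(f"EC = 0x{f['EC']:02x}: {ec_name}, IL = {f['IL']} bits")
        print(f"FSC = 0x{f['FSC']:02x}: {describe_fsc(f['FSC'])}")
```

Both oops traces carry the same ESR (0x86000004), which decodes to an instruction abort at the current exception level with a level 0 translation fault, matching the lines the kernel printed.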

acelinkio commented 1 year ago

Closing ticket. The problem is a hardware failure on one SOQuartz module. The same module fails no matter which slot it is in. Running memtester, I was able to make the node hang/crash.

Rotated the other 2 modules I have through slot 3 in the TuringPi and each one ran 3+ hours without any issue.

Appreciate your help troubleshooting this issue!