Open niekberenschot opened 8 months ago
Please provide the OS version and the kernel dmesg/serial console output.
What is /usr/share/test.img
- that is not a writable path, can you explain more in detail what you did?
The OS version is 6.1.77. Unfortunately, I don't have direct access to dmesg or the serial console output as pstore is disabled, hence I cannot provide logs directly.
Regarding /usr/share/test.img, apologies for the confusion. It should have been a writable path for testing purposes. Essentially, I ran a dd command inside a container deployed in Kubernetes. This container is configured to write data through Ceph to the USB device connected to a VM in Proxmox. The aim was to simulate a large data write operation to reproduce the issue.
I've attached a screenshot from the console showing the error encountered. The USB device is backed by Ceph Bluestore. If there are specific logs or details you need that I haven't provided, please let me know, and I'll do my best to assist further.
This looks like insufficient power to the USB device. This kind of stuff manifests as usb level resets (like you're seeing) under load.
Could you please guide me on how to change the power supply? I used smartctl in the toolbox but don't see anything particular:
core@worker-1 ~ $ toolbox --bind=/dev/sdb:/dev/sdb smartctl -a /dev/sdb
Spawning container core-docker.iolibraryfedora-latest on /var/lib/toolbox/core-docker.io_library_fedora-latest.
Press ^] three times within 1s to kill container.
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.77-flatcar] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Number: Samsung Portable SSD T7
Serial Number: [HIDDEN]
Firmware Version: FXG42P2Q
PCI Vendor/Subsystem ID: 0x144d
IEEE OUI Identifier: 0x002538
Total NVM Capacity: 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity: 0
Controller ID: 5
NVMe Version: 1.3
Number of Namespaces: 1
Namespace 1 Size/Capacity: 2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization: 600,020,086,784 [600 GB]
Namespace 1 Formatted LBA Size: 512
Local Time is: Wed Feb 28 11:29:56 2024 UTC
Firmware Updates (0x16): 3 Slots, no Reset required
Optional Admin Commands (0x0007): Security Format Frmw_DL
Optional NVM Commands (0x001f): Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03): S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size: 512 Pages
Warning Comp. Temp. Threshold: 52 Celsius
Critical Comp. Temp. Threshold: 56 Celsius
Supported Power States
St Op Max Active Idle RL RT WL WT Ent_Lat Ex_Lat
0 + 4.83W - - 0 0 0 0 0 0
1 + 3.54W - - 1 1 1 1 0 0
2 + 3.04W - - 2 2 2 2 0 0
3 - 0.0500W - - 3 3 3 3 26000 1000
Supported LBA Sizes (NSID 0x1)
Id Fmt Data Metadt Rel_Perf
0 + 512 0 0
=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
SMART/Health Information (NVMe Log 0x02)
Critical Warning: 0x00
Temperature: 33 Celsius
Available Spare: 100%
Available Spare Threshold: 10%
Percentage Used: 1%
Data Units Read: 9,996,684 [5.11 TB]
Data Units Written: 33,014,252 [16.9 TB]
Host Read Commands: 86,549,310
Host Write Commands: 486,670,083
Controller Busy Time: 9,927
Power Cycles: 130
Power On Hours: 8,011
Unsafe Shutdowns: 104
Media and Data Integrity Errors: 0
Error Information Log Entries: 247
Warning Comp. Temperature Time: 838
Critical Comp. Temperature Time: 0
Temperature Sensor 1: 33 Celsius
Temperature Sensor 2: 34 Celsius
Warning: NVMe Get Log truncated to 0x200 bytes, 0x200 bytes zero filled
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged
Self-tests not supported
Container core-docker.iolibraryfedora-latest exited successfully.
And this is what I see on the host directly:
core@worker-1 ~ $ udevadm info /dev/sdb
P: /devices/pci0000:00/0000:00:1e.0/0000:02:1b.0/usb3/3-1/3-1:1.0/host3/target3:0:0/3:0:0:0/block/sdb
M: sdb
U: block
T: disk
D: b 8:16
N: sdb
L: 0
S: disk/by-id/usb-Samsung_PSSD_T7_[HIDDEN]-0:0
S: disk/by-diskseq/3
S: disk/by-path/pci-0000:02:1b.0-usb-0:1:1.0-scsi-0:0:0:0
Q: 3
E: DEVPATH=/devices/pci0000:00/0000:00:1e.0/0000:02:1b.0/usb3/3-1/3-1:1.0/host3/target3:0:0/3:0:0:0/block/sdb
E: DEVNAME=/dev/sdb
E: DEVTYPE=disk
E: DISKSEQ=3
E: MAJOR=8
E: MINOR=16
E: SUBSYSTEM=block
E: USEC_INITIALIZED=4748926
E: MPATH_SBIN_PATH=/sbin
E: DM_MULTIPATH_DEVICE_PATH=0
E: ID_BUS=usb
E: ID_MODEL=PSSD_T7
E: ID_MODEL_ENC=PSSD\x20T7\x20\x20\x20\x20\x20\x20\x20\x20\x20
E: ID_MODEL_ID=4001
E: ID_SERIAL=Samsung_PSSD_T7_[HIDDEN]-0:0
E: ID_SERIAL_SHORT=[HIDDEN]
E: ID_VENDOR=Samsung
E: ID_VENDOR_ENC=Samsung\x20
E: ID_VENDOR_ID=04e8
E: ID_REVISION=0
E: ID_TYPE=disk
E: ID_INSTANCE=0:0
E: ID_USB_MODEL=PSSD_T7
E: ID_USB_MODEL_ENC=PSSD\x20T7\x20\x20\x20\x20\x20\x20\x20\x20\x20
E: ID_USB_MODEL_ID=4001
E: ID_USB_SERIAL=Samsung_PSSD_T7_[HIDDEN]-0:0
E: ID_USB_SERIAL_SHORT=[HIDDEN]
E: ID_USB_VENDOR=Samsung
E: ID_USB_VENDOR_ENC=Samsung\x20
E: ID_USB_VENDOR_ID=04e8
E: ID_USB_REVISION=0
E: ID_USB_TYPE=disk
E: ID_USB_INSTANCE=0:0
E: ID_USB_INTERFACES=:080650:080662:
E: ID_USB_INTERFACE_NUM=00
E: ID_USB_DRIVER=uas
E: ID_PATH=pci-0000:02:1b.0-usb-0:1:1.0-scsi-0:0:0:0
E: ID_PATH_TAG=pci-0000_02_1b_0-usb-0_1_1_0-scsi-0_0_0_0
E: ID_FS_TYPE=ceph_bluestore
E: ID_FS_USAGE=other
E: NVME_HOST_IFACE=none
E: DEVLINKS=/dev/disk/by-id/usb-Samsung_PSSD_T7_[HIDDEN]-0:0 /dev/disk/by-diskseq/3 /dev/disk/by-path/pci-0000:02:1b.0-usb-0:1:1.0-scsi-0:0:0:0
E: TAGS=:systemd:
E: CURRENT_TAGS=:systemd:
I meant the physical USB power connection. I would only attach it through a powered USB hub, instead of powering the drive only from the USB port.
You can try one thing: add usbcore.autosuspend=-1
to the kernel commandline (through /usr/share/oem/grub.cfg). See if that helps.
I will proceed with adding usbcore.autosuspend=-1
.
Regarding the USB hub, I need to order one first before testing that option. It does seem less likely to be the solution, especially considering that other operating systems don't exhibit the same behaviour when writing large amounts of data.
I've catched one error without a shutdown:
[ 49.850116] IPv6: ADDRCONF(NETDEV_CHANGE): lxc1063cc0ebe37: link becomes ready
[ 49.871248] eth0: renamed from tmp75a75
[ 49.884181] IPv6: ADDRCONF(NETDEV_CHANGE): lxcb5b924bd7cda: link becomes ready
[ 51.921562] BTRFS warning: duplicate device /dev/sda4 devid 1 generation 31 scanned by (udev-worker) (3434)
[ 52.444141] BTRFS warning: duplicate device /dev/sda4 devid 1 generation 31 scanned by (udev-worker) (3434)
[ 53.005101] BTRFS warning: duplicate device /dev/sda4 devid 1 generation 31 scanned by (udev-worker) (3434)
[ 69.588868] hrtimer: interrupt took 3626908 ns
[ 81.849958] Key type ceph registered
[ 81.850759] libceph: loaded (mon/osd proto 15/24)
[ 81.864061] rbd: loaded (major 253)
[ 139.458723] libceph: mon0 (1)10.107.15.168:6789 session established
[ 139.461394] libceph: client139023 fsid 6d33d41d-1b76-44e8-8029-b09ee54e3e8b
[ 139.587906] rbd: rbd0: capacity 32212254720 features 0x1
[ 141.739099] EXT4-fs (rbd0): mounted filesystem with ordered data mode. Quota mode: none.
[ 158.290177] eth0: renamed from tmpbcce0
[ 158.333963] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[ 158.339077] IPv6: ADDRCONF(NETDEV_CHANGE): lxc62a804eafadc: link becomes ready
[ 162.655814] sd 3:0:0:0: [sdb] tag#29 uas_eh_abort_handler 0 uas-tag 9 inflight: CMD OUT
[ 162.656973] sd 3:0:0:0: [sdb] tag#29 CDB: Write(10) 2a 00 00 5e 40 38 00 00 08 00
[ 162.660498] sd 3:0:0:0: [sdb] tag#27 uas_eh_abort_handler 0 uas-tag 7 inflight: CMD OUT
[ 162.661612] sd 3:0:0:0: [sdb] tag#27 CDB: Write(10) 2a 00 00 5e 3c 30 00 00 08 00
[ 162.665129] sd 3:0:0:0: [sdb] tag#26 uas_eh_abort_handler 0 uas-tag 8 inflight: CMD OUT
[ 162.667071] sd 3:0:0:0: [sdb] tag#26 CDB: Write(10) 2a 00 00 5e 3c 38 00 04 00 00
[ 162.669810] sd 3:0:0:0: [sdb] tag#25 uas_eh_abort_handler 0 uas-tag 6 inflight: CMD OUT
[ 162.671013] sd 3:0:0:0: [sdb] tag#25 CDB: Write(10) 2a 00 00 5e 38 28 00 00 08 00
[ 162.675267] sd 3:0:0:0: [sdb] tag#24 uas_eh_abort_handler 0 uas-tag 5 inflight: CMD OUT
[ 162.677964] sd 3:0:0:0: [sdb] tag#24 CDB: Write(10) 2a 00 00 5e 38 30 00 04 00 00
[ 162.692564] sd 3:0:0:0: [sdb] tag#23 uas_eh_abort_handler 0 uas-tag 4 inflight: CMD OUT
[ 162.693937] sd 3:0:0:0: [sdb] tag#23 CDB: Write(10) 2a 00 00 5e 34 20 00 00 08 00
[ 162.699795] sd 3:0:0:0: [sdb] tag#22 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT
[ 162.701112] sd 3:0:0:0: [sdb] tag#22 CDB: Write(10) 2a 00 00 5e 34 28 00 04 00 00
[ 162.703248] sd 3:0:0:0: [sdb] tag#21 uas_eh_abort_handler 0 uas-tag 2 inflight: CMD OUT
[ 162.704588] sd 3:0:0:0: [sdb] tag#21 CDB: Write(10) 2a 00 00 5e 30 20 00 04 00 00
[ 162.710120] sd 3:0:0:0: [sdb] tag#20 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD IN
[ 162.711411] sd 3:0:0:0: [sdb] tag#20 CDB: Mode Sense(6) 1a 00 3f 00 04 00
[ 162.721862] scsi host3: uas_eh_device_reset_handler start
[ 163.003256] usb 3-1: reset SuperSpeed USB device number 2 using xhci_hcd
[ 163.023982] scsi host3: uas_eh_device_reset_handler success
@niekberenschot I have one suggestion. Some of these usb-scsi converters are really bad at scsi compatibility, and there are cases where it's better to disable uas and rely on the usb-storage driver. May I suggested you follow these instructions https://smitchell.github.io/how-to-bind-to-the-right-usb-storage-driver and add a quirk for your device.
If i'm not mistaken it should require adding this to the kernel command line:
usb-storage.quirks=04e8:4001:u
.
Let me know if that helps.
Thank you for the suggestion! It appears that adding the quirk to the kernel does indeed resolve the crashing issue. However, the maximum speed is significantly reduced to around 40MB/s from the previous 1000MB/s with UAS enabled. Unfortunately, this is too slow for me.
Description
The bug arises when attempting to write data larger than approximately 50MB. Consistently, I receive the error message "tag data cmplt err -71 uas-tag 28 inflight," leading to a crash and shutdown of Flatcar.
Impact
It's worth noting that the data write sometimes succeeds:
Environment and steps to reproduce
The OS is equipped with a Samsung SSD T7 USB drive. While I suspect the problem may be related to these USB devices, I'm not entirely certain.
Expected behavior
Flatcar OS shouldn't crash / shutdown.