flatcar / Flatcar

Flatcar project repository for issue tracking, project documentation, etc.
https://www.flatcar.org/
Apache License 2.0
654 stars 27 forks source link

USB Storage crash: tag data cmplt err -71 uas-tag 28 inflight #1378

Open niekberenschot opened 4 months ago

niekberenschot commented 4 months ago

Description

The bug arises when attempting to write data larger than approximately 50MB. Consistently, I receive the error message "tag data cmplt err -71 uas-tag 28 inflight," leading to a crash and shutdown of Flatcar.

Impact

It's worth noting that the data write sometimes succeeds:

# dd if=/dev/zero of=/usr/share/test.img bs=1G count=1
1+0 records in
1+0 records out
1073741824 bytes (1.1 GB, 1.0 GiB) copied, 0.609573 s, 1.8 GB/s

Environment and steps to reproduce

The OS is equipped with a Samsung SSD T7 USB drive. While I suspect the problem may be related to these USB devices, I'm not entirely certain.

Expected behavior

Flatcar OS shouldn't crash / shutdown.

pothos commented 4 months ago

Please provide the OS version and the kernel dmesg/serial console output. What is /usr/share/test.img - that is not a writable path, can you explain more in detail what you did?

niekberenschot commented 4 months ago

The OS version is 6.1.77. Unfortunately, I don't have direct access to dmesg or the serial console output as pstore is disabled, hence I cannot provide logs directly.

Regarding /usr/share/test.img, apologies for the confusion. It should have been a writable path for testing purposes. Essentially, I ran a dd command inside a container deployed in Kubernetes. This container is configured to write data through Ceph to the USB device connected to a VM in Proxmox. The aim was to simulate a large data write operation to reproduce the issue.

I've attached a screenshot from the console showing the error encountered. The USB device is backed by Ceph Bluestore. If there are specific logs or details you need that I haven't provided, please let me know, and I'll do my best to assist further.

Screenshot 2024-02-28 at 11 53 18
jepio commented 4 months ago

This looks like insufficient power to the USB device. This kind of stuff manifests as usb level resets (like you're seeing) under load.

niekberenschot commented 4 months ago

Could you please guide me on how to change the power supply? I used smartctl in the toolbox but don't see anything particular:

core@worker-1 ~ $ toolbox --bind=/dev/sdb:/dev/sdb smartctl -a /dev/sdb
Spawning container core-docker.iolibraryfedora-latest on /var/lib/toolbox/core-docker.io_library_fedora-latest.
Press ^] three times within 1s to kill container.
smartctl 7.4 2023-08-01 r5530 [x86_64-linux-6.1.77-flatcar] (local build)
Copyright (C) 2002-23, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Number:                       Samsung Portable SSD T7
Serial Number:                     [HIDDEN]
Firmware Version:                   FXG42P2Q
PCI Vendor/Subsystem ID:            0x144d
IEEE OUI Identifier:                0x002538
Total NVM Capacity:                 2,000,398,934,016 [2.00 TB]
Unallocated NVM Capacity:           0
Controller ID:                      5
NVMe Version:                       1.3
Number of Namespaces:               1
Namespace 1 Size/Capacity:          2,000,398,934,016 [2.00 TB]
Namespace 1 Utilization:            600,020,086,784 [600 GB]
Namespace 1 Formatted LBA Size:     512
Local Time is:                      Wed Feb 28 11:29:56 2024 UTC
Firmware Updates (0x16):            3 Slots, no Reset required
Optional Admin Commands (0x0007):   Security Format Frmw_DL
Optional NVM Commands (0x001f):     Comp Wr_Unc DS_Mngmt Wr_Zero Sav/Sel_Feat
Log Page Attributes (0x03):         S/H_per_NS Cmd_Eff_Lg
Maximum Data Transfer Size:         512 Pages
Warning  Comp. Temp. Threshold:     52 Celsius
Critical Comp. Temp. Threshold:     56 Celsius

Supported Power States
St Op     Max   Active     Idle   RL RT WL WT  Ent_Lat  Ex_Lat
 0 +     4.83W       -        -    0  0  0  0        0       0
 1 +     3.54W       -        -    1  1  1  1        0       0
 2 +     3.04W       -        -    2  2  2  2        0       0
 3 -   0.0500W       -        -    3  3  3  3    26000    1000

Supported LBA Sizes (NSID 0x1)
Id Fmt  Data  Metadt  Rel_Perf
 0 +     512       0         0

=== START OF SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

SMART/Health Information (NVMe Log 0x02)
Critical Warning:                   0x00
Temperature:                        33 Celsius
Available Spare:                    100%
Available Spare Threshold:          10%
Percentage Used:                    1%
Data Units Read:                    9,996,684 [5.11 TB]
Data Units Written:                 33,014,252 [16.9 TB]
Host Read Commands:                 86,549,310
Host Write Commands:                486,670,083
Controller Busy Time:               9,927
Power Cycles:                       130
Power On Hours:                     8,011
Unsafe Shutdowns:                   104
Media and Data Integrity Errors:    0
Error Information Log Entries:      247
Warning  Comp. Temperature Time:    838
Critical Comp. Temperature Time:    0
Temperature Sensor 1:               33 Celsius
Temperature Sensor 2:               34 Celsius

Warning: NVMe Get Log truncated to 0x200 bytes, 0x200 bytes zero filled
Error Information (NVMe Log 0x01, 16 of 64 entries)
No Errors Logged

Self-tests not supported

Container core-docker.iolibraryfedora-latest exited successfully.
niekberenschot commented 4 months ago

And this is what I see on the host directly:

core@worker-1 ~ $ udevadm info /dev/sdb
P: /devices/pci0000:00/0000:00:1e.0/0000:02:1b.0/usb3/3-1/3-1:1.0/host3/target3:0:0/3:0:0:0/block/sdb
M: sdb
U: block
T: disk
D: b 8:16
N: sdb
L: 0
S: disk/by-id/usb-Samsung_PSSD_T7_[HIDDEN]-0:0
S: disk/by-diskseq/3
S: disk/by-path/pci-0000:02:1b.0-usb-0:1:1.0-scsi-0:0:0:0
Q: 3
E: DEVPATH=/devices/pci0000:00/0000:00:1e.0/0000:02:1b.0/usb3/3-1/3-1:1.0/host3/target3:0:0/3:0:0:0/block/sdb
E: DEVNAME=/dev/sdb
E: DEVTYPE=disk
E: DISKSEQ=3
E: MAJOR=8
E: MINOR=16
E: SUBSYSTEM=block
E: USEC_INITIALIZED=4748926
E: MPATH_SBIN_PATH=/sbin
E: DM_MULTIPATH_DEVICE_PATH=0
E: ID_BUS=usb
E: ID_MODEL=PSSD_T7
E: ID_MODEL_ENC=PSSD\x20T7\x20\x20\x20\x20\x20\x20\x20\x20\x20
E: ID_MODEL_ID=4001
E: ID_SERIAL=Samsung_PSSD_T7_[HIDDEN]-0:0
E: ID_SERIAL_SHORT=[HIDDEN]
E: ID_VENDOR=Samsung
E: ID_VENDOR_ENC=Samsung\x20
E: ID_VENDOR_ID=04e8
E: ID_REVISION=0
E: ID_TYPE=disk
E: ID_INSTANCE=0:0
E: ID_USB_MODEL=PSSD_T7
E: ID_USB_MODEL_ENC=PSSD\x20T7\x20\x20\x20\x20\x20\x20\x20\x20\x20
E: ID_USB_MODEL_ID=4001
E: ID_USB_SERIAL=Samsung_PSSD_T7_[HIDDEN]-0:0
E: ID_USB_SERIAL_SHORT=[HIDDEN]
E: ID_USB_VENDOR=Samsung
E: ID_USB_VENDOR_ENC=Samsung\x20
E: ID_USB_VENDOR_ID=04e8
E: ID_USB_REVISION=0
E: ID_USB_TYPE=disk
E: ID_USB_INSTANCE=0:0
E: ID_USB_INTERFACES=:080650:080662:
E: ID_USB_INTERFACE_NUM=00
E: ID_USB_DRIVER=uas
E: ID_PATH=pci-0000:02:1b.0-usb-0:1:1.0-scsi-0:0:0:0
E: ID_PATH_TAG=pci-0000_02_1b_0-usb-0_1_1_0-scsi-0_0_0_0
E: ID_FS_TYPE=ceph_bluestore
E: ID_FS_USAGE=other
E: NVME_HOST_IFACE=none
E: DEVLINKS=/dev/disk/by-id/usb-Samsung_PSSD_T7_[HIDDEN]-0:0 /dev/disk/by-diskseq/3 /dev/disk/by-path/pci-0000:02:1b.0-usb-0:1:1.0-scsi-0:0:0:0
E: TAGS=:systemd:
E: CURRENT_TAGS=:systemd:
jepio commented 4 months ago

I meant the physical USB power connection. I would only attach it through a powered USB hub, instead of powering the drive only from the USB port.

You can try one thing: add usbcore.autosuspend=-1 to the kernel commandline (through /usr/share/oem/grub.cfg). See if that helps.

niekberenschot commented 4 months ago

I will proceed with adding usbcore.autosuspend=-1. Regarding the USB hub, I need to order one first before testing that option. It does seem less likely to be the solution, especially considering that other operating systems don't exhibit the same behaviour when writing large amounts of data.

niekberenschot commented 4 months ago

I've catched one error without a shutdown:

[   49.850116] IPv6: ADDRCONF(NETDEV_CHANGE): lxc1063cc0ebe37: link becomes ready
[   49.871248] eth0: renamed from tmp75a75
[   49.884181] IPv6: ADDRCONF(NETDEV_CHANGE): lxcb5b924bd7cda: link becomes ready
[   51.921562] BTRFS warning: duplicate device /dev/sda4 devid 1 generation 31 scanned by (udev-worker) (3434)
[   52.444141] BTRFS warning: duplicate device /dev/sda4 devid 1 generation 31 scanned by (udev-worker) (3434)
[   53.005101] BTRFS warning: duplicate device /dev/sda4 devid 1 generation 31 scanned by (udev-worker) (3434)
[   69.588868] hrtimer: interrupt took 3626908 ns
[   81.849958] Key type ceph registered
[   81.850759] libceph: loaded (mon/osd proto 15/24)
[   81.864061] rbd: loaded (major 253)
[  139.458723] libceph: mon0 (1)10.107.15.168:6789 session established
[  139.461394] libceph: client139023 fsid 6d33d41d-1b76-44e8-8029-b09ee54e3e8b
[  139.587906] rbd: rbd0: capacity 32212254720 features 0x1
[  141.739099] EXT4-fs (rbd0): mounted filesystem with ordered data mode. Quota mode: none.
[  158.290177] eth0: renamed from tmpbcce0
[  158.333963] IPv6: ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
[  158.339077] IPv6: ADDRCONF(NETDEV_CHANGE): lxc62a804eafadc: link becomes ready
[  162.655814] sd 3:0:0:0: [sdb] tag#29 uas_eh_abort_handler 0 uas-tag 9 inflight: CMD OUT 
[  162.656973] sd 3:0:0:0: [sdb] tag#29 CDB: Write(10) 2a 00 00 5e 40 38 00 00 08 00
[  162.660498] sd 3:0:0:0: [sdb] tag#27 uas_eh_abort_handler 0 uas-tag 7 inflight: CMD OUT 
[  162.661612] sd 3:0:0:0: [sdb] tag#27 CDB: Write(10) 2a 00 00 5e 3c 30 00 00 08 00
[  162.665129] sd 3:0:0:0: [sdb] tag#26 uas_eh_abort_handler 0 uas-tag 8 inflight: CMD OUT 
[  162.667071] sd 3:0:0:0: [sdb] tag#26 CDB: Write(10) 2a 00 00 5e 3c 38 00 04 00 00
[  162.669810] sd 3:0:0:0: [sdb] tag#25 uas_eh_abort_handler 0 uas-tag 6 inflight: CMD OUT 
[  162.671013] sd 3:0:0:0: [sdb] tag#25 CDB: Write(10) 2a 00 00 5e 38 28 00 00 08 00
[  162.675267] sd 3:0:0:0: [sdb] tag#24 uas_eh_abort_handler 0 uas-tag 5 inflight: CMD OUT 
[  162.677964] sd 3:0:0:0: [sdb] tag#24 CDB: Write(10) 2a 00 00 5e 38 30 00 04 00 00
[  162.692564] sd 3:0:0:0: [sdb] tag#23 uas_eh_abort_handler 0 uas-tag 4 inflight: CMD OUT 
[  162.693937] sd 3:0:0:0: [sdb] tag#23 CDB: Write(10) 2a 00 00 5e 34 20 00 00 08 00
[  162.699795] sd 3:0:0:0: [sdb] tag#22 uas_eh_abort_handler 0 uas-tag 3 inflight: CMD OUT 
[  162.701112] sd 3:0:0:0: [sdb] tag#22 CDB: Write(10) 2a 00 00 5e 34 28 00 04 00 00
[  162.703248] sd 3:0:0:0: [sdb] tag#21 uas_eh_abort_handler 0 uas-tag 2 inflight: CMD OUT 
[  162.704588] sd 3:0:0:0: [sdb] tag#21 CDB: Write(10) 2a 00 00 5e 30 20 00 04 00 00
[  162.710120] sd 3:0:0:0: [sdb] tag#20 uas_eh_abort_handler 0 uas-tag 1 inflight: CMD IN 
[  162.711411] sd 3:0:0:0: [sdb] tag#20 CDB: Mode Sense(6) 1a 00 3f 00 04 00
[  162.721862] scsi host3: uas_eh_device_reset_handler start
[  163.003256] usb 3-1: reset SuperSpeed USB device number 2 using xhci_hcd
[  163.023982] scsi host3: uas_eh_device_reset_handler success
jepio commented 4 months ago

@niekberenschot I have one suggestion. Some of these usb-scsi converters are really bad at scsi compatibility, and there are cases where it's better to disable uas and rely on the usb-storage driver. May I suggested you follow these instructions https://smitchell.github.io/how-to-bind-to-the-right-usb-storage-driver and add a quirk for your device.

If i'm not mistaken it should require adding this to the kernel command line: usb-storage.quirks=04e8:4001:u.

Let me know if that helps.

niekberenschot commented 4 months ago

Thank you for the suggestion! It appears that adding the quirk to the kernel does indeed resolve the crashing issue. However, the maximum speed is significantly reduced to around 40MB/s from the previous 1000MB/s with UAS enabled. Unfortunately, this is too slow for me.