clearlinux / distribution

Placeholder repository to allow filing of general bugs/issues/etc. against the Clear Linux OS for Intel Architecture Linux distribution

NUC10i7FNK - nvme abort error on 33700 with kernel 5.8 #2121

Open · amonforstmann opened this issue 3 years ago

amonforstmann commented 3 years ago

Hi there,

Every time I boot version 33700 with kernel 5.8.8-984 on my NUC10i7FNK, I get NVMe abort errors and the system freezes completely.

SSD: Intel SSDPEKNW020T8

There are no problems with kernel 5.7.19-982.native

[Sep12 13:16] nvme nvme0: I/O 0 QID 4 timeout, aborting
[  +0,000009] nvme nvme0: I/O 1 QID 4 timeout, aborting
[  +0,000002] nvme nvme0: I/O 2 QID 4 timeout, aborting
[  +0,000003] nvme nvme0: I/O 3 QID 4 timeout, aborting
[  +0,000060] nvme nvme0: Abort status: 0x0

Steps to reproduce: sudo swupd diagnose

miguelinux commented 3 years ago

Hi @amonforstmann

Could you try updating the NUC BIOS firmware and then try again, please?

amonforstmann commented 3 years ago

Hi @miguelinux

I updated from FNCML357.0039 to FNCML357.0045 but the error remains.

Even worse... my working kernel is now gone from the boot menu...

ZVNexus commented 3 years ago

@miguelinux I can confirm this issue on my "SB-ROCKET-NVMe4-1TB" as well.

amonforstmann commented 3 years ago

To fix the issue I tried to add the kernel-lts bundle.

After multiple clr-boot-manager set-kernel ... and clr-boot-manager update calls, the only kernel that boots is 5.8. The loader.conf value is ignored...
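
For anyone retracing this, the usual clr-boot-manager sequence for pinning a kernel looks roughly like the sketch below; the kernel name shown is only an example and has to be copied verbatim from the list-kernels output on your own machine.

# Show the kernels clr-boot-manager knows about
sudo clr-boot-manager list-kernels

# Pin the desired kernel; the name below is illustrative, copy yours from the list above
sudo clr-boot-manager set-kernel org.clearlinux.native.5.7.19-982

# Regenerate the boot entries so the selection takes effect, then reboot
sudo clr-boot-manager update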

amonforstmann commented 3 years ago

I can at least confirm that version 33700 with kernel 5.8 works flawlessly on my Threadripper machine with Gigabyte GP-ASM2NE6100TTTD SSDs

Polish-Civil commented 3 years ago

Hello, I can confirm that I started experiencing issues after I upgraded the system. The issues started happening on kernel 5.8.9; the previous one I had was 5.6.8, and it was working just fine on that version.

I think it might be related to that specific kernel version; AFAIK they have been doing work on NVMe devices recently. My drive is an INTEL SSDPEKNW512G8 (002C). My machine is an ASUS ZenBook UX433FC (with the latest BIOS).

I'm not sure about the specific error, as I only glanced over it, but it went like this: I was doing something, the system froze, and the terminal started printing "Input/output error". In dmesg I saw some errors regarding the filesystem and blocks, something like read_inode. It usually reaches this error state after about 5 minutes of normal work: browsing the internet, coding, etc.

amonforstmann commented 3 years ago

> I'm not sure about the specific error, as I only glanced over it, but it went like this: I was doing something, the system froze, and the terminal started printing "Input/output error". In dmesg I saw some errors regarding the filesystem and blocks, something like read_inode. It usually reaches this error state after about 5 minutes of normal work: browsing the internet, coding, etc.

Yes, this is exactly what happens! There was no way to save the output, so I took a picture in a hurry (image attached).
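
A side note on capturing this next time: if the systemd journal is made persistent, the kernel messages from the crashed boot can usually be pulled after rebooting (a sketch; a hard freeze may still lose the last few lines before they are flushed to disk).

# Enable persistent journal storage (journald falls back to volatile storage without this directory)
sudo mkdir -p /var/log/journal
sudo systemctl restart systemd-journald

# After the next freeze and reboot, read the kernel log of the previous boot
journalctl -k -b -1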

Polish-Civil commented 3 years ago

> Yes, this is exactly what happens! There was no way to save the output, so I took a picture in a hurry (image attached).

Ah yes, the APST feature.

I was googling for a bit to learn a little more about these issues, and it looks like there is some bug with power management of NVMe devices, or something like that.

https://wiki.archlinux.org/index.php/Solid_state_drive/NVMe#Samsung_drive_errors_on_Linux_4.10

The Arch Linux wiki mentions it, and the effect of that bug seems to produce the same errors, but I've tried setting that parameter and it still happens. I'm really not sure what is causing it; maybe I should use different values for nvme_core.default_ps_max_latency_us=?
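
If anyone wants to experiment with that parameter on Clear Linux, the usual way to add kernel command-line options is a drop-in file plus a boot-manager update. A sketch follows; the file name and the latency value are just examples.

# Add the parameter via a kernel command-line drop-in (the file name is arbitrary)
sudo mkdir -p /etc/kernel/cmdline.d
echo "nvme_core.default_ps_max_latency_us=0" | sudo tee /etc/kernel/cmdline.d/nvme-apst.conf

# Regenerate the boot entries and reboot for the change to take effect
sudo clr-boot-manager update

# After the reboot, confirm the parameter is active
cat /proc/cmdline
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us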

amonforstmann commented 3 years ago

This one seems to be resolved with 5.8.11-988.native on 33760

amonforstmann commented 3 years ago

Issue remains :-/

slootjes commented 3 years ago

I have similar issues with an Intel NUC10i5FNK. It boots normally and commands work fine, but when trying to install a package using "sudo swupd bundle-add docker-compose" it locks up while copying files at 50%. This is the last bit of output:

Error: failed to rename staged 8b3c83c5a076c81d106c422eb10178b168f63e7fbac184f1e0c462483de73d78 to final: Input/output error
Error: failed to rename staged 602f793cd03a40bc13f8be7be9dfa2953fe7cb96185d33678e2018bb9cdf2a61 to final: Input/output error
Error: failed to rename staged 8f8866a20b41b1f51451706a49ffedfb6d008449e24a4e3064cce386cb66baa1 to final: Input/output error
Error: failed to rename staged c556015912f0459f03240ce49453677505d0236c7d3883ed35dcbdb1208dae6b to final: Input/output error
[100%]

Error: Failed to install required files
Failed to install 1 of 1 bundles
Error: Failed to create error report for (null)

After that, every command results in "dmesg: command not found" and I need to reboot. After a reboot everything is normal again until I try to install the package again.

My drive is a "Kingston A2000 1000GB M.2 SSD"; fsck does not show any errors, and memtest86 does not show any errors either.

Update: I've decided to use a different distro which runs without issues.

hsehdar commented 3 years ago

Facing this issue.

Environment

  1. Gigabyte board with Intel® H110 Chipset
  2. Intel® Core™ i5-7400 Processor
  3. Apacer DDR4 2400 MHz 16GB RAM
  4. A cheap adapter: PCI Express 3.0 x16 to NVME SSD
  5. Intel® SSD 600p Series

What was the activity?

  1. Install Clear Linux 33590 using live desktop ISO from a USB drive.
  2. Boot and login.
  3. Upgrade to the latest 33780
  4. Keep using the computer.

What was observed?

The OS hangs and slowly deteriorates until a forced manual reboot is required.

Other observations

  1. Deepin Linux works flawlessly with kernel 5.7
  2. Elementary OS works flawlessly with kernel 5.4

If the community has any feedback (like a specific version to revert to), do let me know.

hsehdar commented 3 years ago

Clear Linux kernel-lts is working normally. I had this issue with kernel-native.

During a fresh installation, I chose kernel-lts as the default.
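
For anyone else falling back to the LTS kernel on an existing install, the steps would look roughly like the sketch below; the name passed to set-kernel is only a placeholder and has to be copied from the list-kernels output.

# Install the LTS kernel alongside the native one
sudo swupd bundle-add kernel-lts

# Rebuild the boot entries so the new kernel shows up
sudo clr-boot-manager update

# Pick the LTS entry and make it the default, then reboot
sudo clr-boot-manager list-kernels
sudo clr-boot-manager set-kernel <lts-entry-from-list-kernels>
sudo clr-boot-manager update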

insilications commented 3 years ago

  1. MSI MEG-Z490-ACE
  2. Intel® Core™ i7-10700K Processor
  3. Corsair 32GB Dominator
  4. 970 EVO Plus NVMe M.2 SSD 500GB
  5. Had Clear Linux 336xx with kernel 5.7.x
  6. Upgraded to Clear Linux 33760 (clr 5.8.11-988.native)

What was observed?

Performing some major I/O operations (fstrim, search files) results in performance degradation until the NVMe controller resets:

[ 3420.531510] nvme nvme1: I/O 0 QID 16 timeout, aborting
[ 3420.531512] nvme nvme1: I/O 1 QID 16 timeout, aborting
[ 3420.531514] nvme nvme1: I/O 2 QID 16 timeout, aborting
[ 3420.531516] nvme nvme1: I/O 4 QID 16 timeout, aborting
[ 3420.531517] nvme nvme1: I/O 8 QID 16 timeout, aborting
[ 3420.531519] nvme nvme1: I/O 9 QID 16 timeout, aborting
[ 3420.531521] nvme nvme1: I/O 10 QID 16 timeout, aborting
[ 3420.531523] nvme nvme1: I/O 12 QID 16 timeout, aborting
[ 3420.531645] nvme nvme1: Abort status: 0x0
[ 3420.531650] nvme nvme1: I/O 1017 QID 16 timeout, aborting
[ 3420.532123] nvme nvme1: Abort status: 0x0
[ 3420.532124] nvme nvme1: Abort status: 0x0
[ 3420.532125] nvme nvme1: Abort status: 0x0
[ 3420.532125] nvme nvme1: Abort status: 0x0
[ 3420.532125] nvme nvme1: Abort status: 0x0
[ 3420.532126] nvme nvme1: Abort status: 0x0
[ 3420.532126] nvme nvme1: Abort status: 0x0
[ 3420.532126] nvme nvme1: Abort status: 0x0
[ 3450.547466] nvme nvme1: I/O 0 QID 16 timeout, reset controller
[ 3450.599631] blk_update_request: I/O error, dev nvme1n1, sector 6001272 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ 3450.610956] blk_update_request: I/O error, dev nvme1n1, sector 8061160 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ 3450.622266] blk_update_request: I/O error, dev nvme1n1, sector 5999864 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ 3450.633521] blk_update_request: I/O error, dev nvme1n1, sector 5989704 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ 3450.644784] blk_update_request: I/O error, dev nvme1n1, sector 6000072 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ 3450.656160] blk_update_request: I/O error, dev nvme1n1, sector 5988808 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ 3450.667632] blk_update_request: I/O error, dev nvme1n1, sector 5988760 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ 3450.679077] blk_update_request: I/O error, dev nvme1n1, sector 6001024 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ 3450.690371] blk_update_request: I/O error, dev nvme1n1, sector 8062312 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ 3450.701612] blk_update_request: I/O error, dev nvme1n1, sector 5989632 op 0x0:(READ) flags 0x80700 phys_seg 1 prio class 0
[ 3450.715784] nvme nvme1: Shutdown timeout set to 8 seconds
[ 3450.731991] nvme nvme1: 16/0/0 default/read/poll queues

What seemed to solve the problem SO FAR

1) Disabled clr-power.service (clr_power tweaks).

2) After some googling, people with a similar problem recommended disabling all kinds of power management settings for the nvme_core module, plus PCIe power settings, in the kernel parameters:

pcie_aspm.policy=performance
pcie_aspm=off
pcie_port_pm=off
nvme_core.default_ps_max_latency_us=0
nvme_core.io_timeout=255
nvme_core.max_retries=10
nvme_core.shutdown_timeout=10

3) Disabled Autonomous Power State Transition Enable (APSTE) on the NVMe device: sudo nvme set-feature -f 0x0c -v 0 /dev/nvme1 (a quick verification sketch follows after this list).

https://wiki.archlinux.org/index.php/Solid_state_drive/NVMe

4) Also ran fsck.f2fs on that drive and found no errors.

5) Still using these fixes under Clear Linux 33760 (clr 5.8.11-988.native) and no problems (so far).
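
For reference, a way to double-check that the workaround actually took effect (a sketch; it assumes nvme-cli from the storage-utils bundle and uses /dev/nvme1 as in the command above):

# Confirm the kernel command-line parameters are active
cat /proc/cmdline
cat /sys/module/nvme_core/parameters/default_ps_max_latency_us

# Read back the APST feature (feature id 0x0c); a current value of 0 means APST is disabled
sudo nvme get-feature -f 0x0c -H /dev/nvme1

# Check whether the controller advertises APST support at all (apsta field)
sudo nvme id-ctrl /dev/nvme1 | grep -i apsta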

Other observations

  1. The same NVMe works flawlessly in Windows 10 in another partition. Unlikely to be hardware related.
  2. It worked perfectly with kernel 5.7.x

hsehdar commented 3 years ago

Now on 5.8.14-991.native, and Clear Linux OS 33820 with the NVMe drive is working normally.

Let me monitor it for a few days.

hsehdar commented 3 years ago

Oops! The machine hung once this morning. I powered it off forcefully and turned it on with the LTS2019 kernel, and it is working normally.

jkeli commented 3 years ago

I updated a system to 33930 this morning and am still seeing this issue as a reproducible system crash. Bundle-adding just about any bundle will crash the entire system when the NVMe storage fails. I don't think I can use the LTS kernel, as the NUC I am on has an Intel wired Ethernet adapter that is too new / not supported by the LTS kernel, and using a different Ethernet adapter is simply not an option.

Frustratingly enough, bundle-adding storage-utils to get the nvme command failed repeatedly at 50% while installing files, for dozens of attempts this morning, but eventually I got it installed. I'm currently running the kernel parameters listed above to try to work around the show-stoppingly broken NVMe support, along with the set-feature command to try to disable the autonomous power state "feature".

This is a really frustrating problem, mostly because from reading around, other distributions don't have this issue.

sebastiencs commented 3 years ago

I have the same issue on my XPS 15 9500 laptop with kernel 5.9.6-998.native. The logs are the same as those posted by @insilications, but often the system becomes unusable and I have to restart it.

Polish-Civil commented 3 years ago

Hello again, I've recently been checking out the new kernel updates and the day has come: I think the issue, at least for me, is fixed on 5.9.9-1001.native.

Haven't noticed issues so far.

hsehdar commented 3 years ago

Thanks @Polish-Civil for the update. I will be trying the native kernel soon.

hsehdar commented 3 years ago

This is to confirm that, with Google Chrome running, the 5.9.9-1001.native kernel is working fine.

Also, the Gnome suspend or sleep mode works fine.

RevAngel7 commented 1 year ago

Hello. This still seems to be an issue on later kernels. I am on the 5.19 series and have had this issue from 5.19.1 up to the current 5.19.11, which I am using from the Ubuntu 22.04 mainline builds.

nvme nvme0: Abort status: 0x0
nvme nvme0: I/O 14 QID 2 timeout, aborting
nvme nvme0: Abort status: 0x0
nvme nvme0: I/O 62 QID 2 timeout, aborting
...and so on.

I have had these issues since I changed from an AMD 2400G on an AM4 board (X370 chipset) with an AGESA before 1.0.0.6, to an AMD 5600G on an AM4 board (A520 chipset) with AGESA 1.2.0.7. The NVMe drive is the same.

I also get an error message at bootup about the NVMe drive: "Device: /dev/nvme0, number of Error Log entries increased from 203 to 206". This counter rises by 1 on every power-off (the 203 to 206 range comes from an image backup I did with 3 manual power-offs, so it seems every power-off counts as one error).

The NVMe drive has never breached the high-temperature threshold and shows zero errors and very little wear in SMART tests, since I only use this drive for a simple "daily use" multimedia system.
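
In case it helps anyone compare numbers, these are the commands I would use to read the error-log counter and the SMART data mentioned above (a sketch; it assumes smartmontools and nvme-cli are installed):

# SMART overview, including the "Error Information Log Entries" counter and temperature
sudo smartctl -a /dev/nvme0

# Health/SMART log straight from the controller
sudo nvme smart-log /dev/nvme0

# Dump the most recent entries from the controller's error log
sudo nvme error-log /dev/nvme0 -e 16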

When the "nvme nvme0: Abort status: 0x0" / "nvme nvme0: I/O 14 QID 2 timeout, aborting" errors occur, the system hangs for a while; no I/O operations get processed for up to 30 seconds.

Since this issue dates from 2020 and is still present in late 2022, is there a general fix, a planned fix for users, or planned kernel changes?

Thank you for any reply.

fenrus75 commented 1 year ago

> Hello. This still seems to be an issue on later kernels. I am on the 5.19 series and have had this issue from 5.19.1 up to the current 5.19.11, which I am using from the Ubuntu 22.04 mainline builds.

Since this is the tracker for the Clear Linux distribution, it might be better to report the issue with the Ubuntu kernel to the Ubuntu bug reporting forum....

RevAngel7 commented 1 year ago

> Since this is the tracker for the Clear Linux distribution, it might be better to report the issue with the Ubuntu kernel to the Ubuntu bug reporting forum...

I totally get why my report is out of place. I really do. And since I consider myself more of a user than a tech-savvy person, I also understand the reluctance to consider my comment a real issue.

This bug is still open, if I am reading it right.

The same issue on https://github.com/vmware/open-vm-tools/issues/579, also unsolved.

The same issue on https://github.com/clearlinux/distribution/issues/2121, also unsolved.

And there is my issue on ubuntu.

Three different kernels and Linux distributions, same issue. I thought bringing together the people who actually have the technical knowledge to get behind this issue might be helpful (but that's just me).

RevAngel7 commented 1 year ago

Posted it at https://bugs.launchpad.net/launchpad/+bug/1991291, FYI.