Closed aaronlevin closed 8 years ago
Are you using ext4? I probably had the same issue in April, until I moved to F2FS.
I had the same issue on 2 ssd in raid 0.
@jagajaga yes, ext4. How stable have you found F2FS?
@aaronlevin pretty stable, 0 issues since April. It's running on my laptop (Intel SSD) and on my desktop with 2 SSDs (OCZ) in RAID 1 nowadays.
I'm typing this from an X250 with LUKS, LVM and ext4 on SSD. NixOS since day 1, current uptime 29 days. Do you have allowDiscards = true in boot.initrd.luks.devices?
@mbakke I have experienced the issue with allowDiscards = true; set. It's currently turned off, so I'll turn it back on again and we'll see how quickly the issue surfaces.
@mbakke what version of the firmware on the ssd are you using? Also, anything special in the kernel modules you're loading or any extra modprobe config?
Thanks!
Hm, I have a different SSD altogether (Toshiba). Nothing else hardware-related in nixos configuration.
Device Model: TOSHIBA THNSNJ512GCSU
Firmware Version: JULA0101
User Capacity: 512,110,190,592 bytes [512 GB]
Samsung has a pretty poor track record when it comes to Linux SSD firmware. Can you check if the same happens if you boot with libata.force=noncq, as per this bug report?
@mbakke I had libata.force=noncq in my extraModprobeConfig and still had the issue.
@mbakke interesting that you have a different SSD. However, we have Debian (Jessie) installed on a similar model to mine (same SSD, same firmware) with no issues.
@aaronlevin you need to set that in boot.kernelParams, not extraModprobeConfig. I suspect Debian may have blacklisted NCQ on that device.
@mbakke does that go in my configuration.nix or my hardware-configuration.nix?
Either :)
@mbakke ok, perhaps that was my issue: having libata.force=noncq in my extraModprobeConfig and not kernelParams. I've added it and I'll see how stable my SSD is now.
Is there a policy around closing and re-opening? Because I'm happy to close this now and then re-open it if this does not fix the issue.
Additionally, should we consider generating that kernel param on detection of this SSD, similar to Debian? Or is it out of scope for NixOS to determine such a setting?
If that indeed solves the issue, we can probably add the NCQ TRIM blacklist patch from above, at least if other distros are doing the same. Although it arguably should be added upstream...
(disclaimer: I don't actually know what NixOS' stance on adding kernel patches is, nor am I an official dev)
@mbakke hmm, good question. We may not even need to apply the patch. For example, it might even be easier just to generate:
{ config, pkgs, ... }:
{
  # We have detected an SSD with NCQ TRIM blacklisted.
  boot.kernelParams = [ "libata.force=noncq" ];
}
Is there a policy around closing and re-opening? Because I'm happy to close this now and then re-open it if this does not fix the issue.
I don't think there's such a policy. I would do the same as you.
@vcunat thanks.
After 2 days of stability (longest so far), I just hit the issue again. Ugh. I was really hoping that would resolve my issue.
To update: I tried putting libata.force=noncq in my kernelParams. This brought about some stability but I hit the issue after 48 hours.
Can anyone think of any other settings that a distro like Debian might set for these SSDs that NixOS is not setting?
I don't suppose you were able to take a screendump this time? The errors should be somewhat different. You could perhaps try mounting /var/log/journal on a USB stick, or send it over the network. You can also disable NCQ at runtime with echo 1 > /sys/block/sda/device/queue_depth.
Remove discard from fstab too, to make sure we don't hit multiple bugs. I'm on BIOS 1.15 FWIW, although we should both upgrade to 1.17. Debian follows kernel development closely and may apply all kinds of workarounds for problems that are fixed in newer firmware.
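The runtime NCQ toggle mentioned above can be wrapped in a small shell helper. This is only a sketch: set_queue_depth and its optional third argument (an alternate sysfs root, useful for a dry run on a scratch directory) are hypothetical names, not part of any existing tool. Writing to the real /sys needs root, and the setting does not survive a reboot.

```shell
# Hypothetical helper (not an existing tool): set the queue depth of a block device.
# Usage: set_queue_depth DEVICE DEPTH [SYSFS_ROOT]
# A depth of 1 effectively disables NCQ for that device until the next reboot.
set_queue_depth() {
  dev=$1
  depth=$2
  root=${3:-/sys}   # assumption: overridable only so the helper can be tried on a scratch directory
  path="$root/block/$dev/device/queue_depth"
  if [ ! -w "$path" ]; then
    echo "cannot write $path (missing device or insufficient privileges)" >&2
    return 1
  fi
  echo "$depth" > "$path"
  # Read the value back so the caller can see what is now stored.
  cat "$path"
}
```

For example, running set_queue_depth sda 1 as root would correspond to the echo command above, followed by a read-back of the same file.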
I'm also seeing this issue on a SuperMicro SuperServer 5017R-WRF with Samsung 840 EVO SSDs, both a 250GB and a 500GB model. I'm running NixOS 15.09 and ext4. The same hardware runs Ubuntu 14.04 with no issues.
In my case, the controller resets the drive enough times that it ends up in UDMA/133 mode and appears to be stable, but I've only just managed to get nixos-install to finish, so I haven't taxed it much yet.
I added libata.force=noncq on the NixOS installer grub command line and got fewer errors than without that boot parameter, but still got them.
(BTW, I am not running luks in my configuration -- it's just straight ext4.)
@mbakke I just want to preface my answers by thanking you for all your help!!
NCQ looks properly disabled and I don't have discard in my fstab (listed below). Is there a simple way to upgrade the firmware?
☭ cat /sys/block/sda/device/queue_depth
1
Here is my hardware-configuration.nix:
{ config, lib, pkgs, ... }:
{
  imports =
    [ <nixpkgs/nixos/modules/installer/scan/not-detected.nix>
    ];
  boot.initrd.availableKernelModules = [ "xhci_pci" "ehci_pci" "ahci" "usbhid" "usb_storage" ];
  boot.kernelModules = [ "kvm-intel" ];
  boot.extraModulePackages = [ ];
  fileSystems."/" =
    { device = "/dev/disk/by-uuid/bb6a6acb-055e-4d1f-9812-13c9d183bb6c";
      fsType = "ext4";
      options = "rw,relatime,nobarrier,data=ordered";
    };
  fileSystems."/boot" =
    { device = "/dev/disk/by-uuid/9cbf3855-bb29-4123-abd1-e08de2e39a36";
      fsType = "ext2";
    };
  swapDevices =
    [ { device = "/dev/disk/by-uuid/73bfe4e8-b4a6-433b-b152-73fd5702fcd8"; }
    ];
  nix.maxJobs = 4;
}
@dhess can you run smartctl -a on your drive? I'm curious what firmware you have running.
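To pull just the firmware line out of the smartctl -a report, something like the following may help. It is a minimal sketch: smart_field is a hypothetical helper name, and it assumes the "Name: value" layout seen in the smartctl output quoted in this thread.

```shell
# Hypothetical helper: print the value of one "Name: value" line
# from a smartctl -a report supplied on stdin.
# Usage: smartctl -a /dev/sda | smart_field 'Firmware Version'
smart_field() {
  # $1 is used as a literal field name; avoid regex metacharacters in it.
  sed -n "s/^$1: *//p"
}
```

The same helper can be pointed at 'Device Model' to compare drives across machines.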
@dhess: Your device is explicitly blacklisted in kernel 4.1 or newer. Try setting boot.kernelPackages = pkgs.linuxPackages_4_1 and see if the problem persists. It may be easier to switch to the unstable channel if you need to compile stuff: nix-channel --add https://nixos.org/channels/nixos-unstable nixos
@aaronlevin if you still have an X250 with Debian around, check if the backports kernel has the same issue (apt-get -t jessie-backports install linux-image-amd64). Jessie and 14.04 are both kernel 3.16, while 15.09 is 3.18. Could you also try booting 4.2 or newer with libata.force=noncqtrim instead of libata.force=noncq?
If you are still on the stock BIOS, I recommend upgrading. These are first-generation Broadwells after all. Download the (Windows) ISO-based installer (n10ur08w.exe IIRC) from Lenovo, extract the .iso with innoextract, convert it to an image with geteltorito.pl and dd it onto a USB stick. Instructions here.
Yes, that appears to be exactly what's happening with the newest Samsung firmware.
Based on Samsung's responses to the bug, I think that, rather than upgrading my kernel, I'll just take my business elsewhere!
Thanks for the help.
@mbakke the Debian system in question was on kernel 4.1, but I will try booting with libata.force=noncqtrim instead and see if the problem persists. I hit the issue this morning, so it's happening with some regular frequency.
I'll also try upgrading the firmware but that might not happen until later this evening.
Thanks again!
PS - I am on the 4.3 kernel.
@mbakke do I need to have "libata" in my boot.initrd.kernelModules? It is not there currently.
@aaronlevin the libata module is loaded automatically when needed. I assumed it was compiled in, but since it's built as a module the libata.force options should work in extraModprobeConfig too.
@mbakke just to make sure I have everything correct:
In configuration.nix I have: boot.kernelParams = [ "libata.force=noncqtrim" ];
In hardware-configuration.nix I have:
boot.initrd.availableKernelModules = [ "xhci_pci" "ehci_pci" "ahci" "usbhid" "usb_storage" ];
boot.kernelModules = [ "kvm-intel" ];
boot.extraModulePackages = [ ];
So, I'm not explicitly specifying that the libata module needs to be loaded, though it appears in my kernelParams. I would assume I should have to specify it in boot.initrd.kernelModules?
You can put libata.force either on the kernel command line or in a module parameter; it should be picked up either way. The libata module is loaded automatically (but it's arguably more nix-y to specify it).
Hmm, I'm now getting this same error using an Intel 535 120GB SSD, which is not on the blacklist.
(edit: yep, I can't even get through the nixos-install process with this Intel 535. Setting libata.force=noncq on the Grub boot line for the USB boot disk makes no difference.)
I just hit the issue again right now. I've added "libata" to my boot.initrd.kernelModules and rebooted. We'll see.
I'm back on the Samsung 840 EVO 250GB SSD again. I managed to get the 4.1 kernel installed by putting the SATA controller in IDE mode in the system BIOS. I then rebooted and set the controller back to AHCI mode. Unfortunately, even in 4.1, I'm getting NCQ errors:
Dec 02 10:09:52 nix01 kernel: ata2.00: exception Emask 0x10 SAct 0x3fc0 SErr 0x4040000 action 0xe frozen
Dec 02 10:09:57 nix01 kernel: ata2.00: irq_stat 0x00000040, connection status changed
Dec 02 10:09:57 nix01 kernel: ata2: SError: { CommWake DevExch }
Dec 02 10:09:57 nix01 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Dec 02 10:09:57 nix01 kernel: ata2.00: cmd 61/08:30:00:00:20/00:00:08:00:00/40 tag 6 ncq 4096 out
res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Dec 02 10:09:57 nix01 kernel: ata2.00: status: { DRDY }
Dec 02 10:09:57 nix01 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Dec 02 10:09:57 nix01 kernel: ata2.00: cmd 61/08:38:18:00:20/00:00:08:00:00/40 tag 7 ncq 4096 out
res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Dec 02 10:09:57 nix01 kernel: ata2.00: status: { DRDY }
Dec 02 10:09:57 nix01 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Dec 02 10:09:57 nix01 kernel: ata2.00: cmd 61/08:40:80:00:20/00:00:0c:00:00/40 tag 8 ncq 4096 out
res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Dec 02 10:09:57 nix01 kernel: ata2.00: status: { DRDY }
Dec 02 10:09:57 nix01 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Dec 02 10:09:57 nix01 kernel: ata2.00: cmd 61/08:48:f0:05:20/00:00:0c:00:00/40 tag 9 ncq 4096 out
res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Dec 02 10:09:57 nix01 kernel: ata2.00: status: { DRDY }
Dec 02 10:09:57 nix01 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Dec 02 10:09:57 nix01 kernel: ata2.00: cmd 61/08:50:98:06:20/00:00:0c:00:00/40 tag 10 ncq 4096 out
res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Dec 02 10:09:57 nix01 kernel: ata2.00: status: { DRDY }
Dec 02 10:09:57 nix01 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Dec 02 10:09:57 nix01 kernel: ata2.00: cmd 61/08:58:f8:13:21/00:00:0c:00:00/40 tag 11 ncq 4096 out
res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Dec 02 10:09:57 nix01 kernel: ata2.00: status: { DRDY }
Dec 02 10:09:57 nix01 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Dec 02 10:09:57 nix01 kernel: ata2.00: cmd 61/08:60:08:01:a0/00:00:19:00:00/40 tag 12 ncq 4096 out
res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Dec 02 10:09:57 nix01 kernel: ata2.00: status: { DRDY }
Dec 02 10:09:57 nix01 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Dec 02 10:09:57 nix01 kernel: ata2.00: cmd 61/08:68:28:00:a4/00:00:19:00:00/40 tag 13 ncq 4096 out
res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Dec 02 10:09:57 nix01 kernel: ata2.00: status: { DRDY }
Dec 02 10:09:57 nix01 kernel: ata2: hard resetting link
Dec 02 10:09:57 nix01 kernel: ata2: SATA link down (SStatus 1 SControl 300)
Dec 02 10:09:57 nix01 kernel: ata2: hard resetting link
Dec 02 10:09:57 nix01 kernel: ata2: SATA link down (SStatus 1 SControl 300)
Dec 02 10:09:57 nix01 kernel: ata2: limiting SATA link speed to 1.5 Gbps
Dec 02 10:09:57 nix01 kernel: ata2: hard resetting link
Dec 02 10:09:57 nix01 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Dec 02 10:09:57 nix01 kernel: ata2.00: supports DRM functions and may not be fully accessible
Dec 02 10:09:57 nix01 kernel: ata2.00: disabling queued TRIM support
Dec 02 10:09:57 nix01 kernel: ata2.00: supports DRM functions and may not be fully accessible
Dec 02 10:09:57 nix01 kernel: ata2.00: disabling queued TRIM support
Dec 02 10:09:57 nix01 kernel: ata2.00: configured for UDMA/133
Dec 02 10:09:57 nix01 kernel: ata2: EH complete
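When comparing logs like the one above across kernels or settings, it can help to tally which ATA commands failed. The following is a rough sketch, not an established tool: failed_command_counts is a hypothetical name, and it relies only on the "failed command: ..." wording that appears in the kernel messages quoted in this thread.

```shell
# Hypothetical helper: count "failed command" kernel messages by command type.
# Reads kernel log text on stdin and prints "<count> <command>" lines,
# sorted by command name.
# Usage: dmesg | failed_command_counts
failed_command_counts() {
  sed -n 's/.*failed command: //p' |
    awk '{ count[$0]++ } END { for (cmd in count) print count[cmd], cmd }' |
    sort -k2
}
```

A run dominated by WRITE FPDMA QUEUED or READ FPDMA QUEUED points at queued (NCQ) commands, while READ DMA EXT or WRITE DMA EXT means the errors are also occurring without NCQ.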
Bizarre. Can't believe this is NixOS-specific. @dhess make sure you have the latest firmware on both drives. Intel have had similar problems in the past.
I actually have a remote machine that "died" with similar disk errors some time during/after 3.18. It may well have been the same issue, will check on it tomorrow. Hopefully there are some earlier generations left.
At this point I would try different combinations of [no]ncq and [no]ncqtrim (the latter requires kernel 4.2+) in libata.force. Note that you can pass them from grub/gummiboot rather than rebuilding all the time. Also verify that the options are actually picked up (dmesg?).
I'll have a look through the Debian kernel sources and try to find anything remotely related.
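One way to verify from dmesg whether NCQ ended up enabled is to look at the per-device line that reports "NCQ (depth 31/32)" or "NCQ (not used)". A minimal sketch, assuming that message format; ncq_status is a hypothetical helper name:

```shell
# Hypothetical helper: report the NCQ state the kernel logged for each ATA device.
# Reads kernel log text on stdin and prints "<device> <state>", where <state>
# is whatever the kernel put in parentheses after "NCQ",
# e.g. "depth 31/32" or "not used".
# Usage: dmesg | ncq_status
ncq_status() {
  sed -n 's/.*\(ata[0-9][0-9]*\.[0-9][0-9]*\): .* NCQ (\([^)]*\)).*/\1 \2/p'
}
```

Lines such as the ahci controller "flags: 64bit ncq ..." entry are ignored, since they describe controller capabilities rather than what a given drive is using.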
I did verify that 'noncq' is reflected in dmesg when specified on the Grub command line.
I do have the latest firmware on the Samsung drive as I just flashed it last night before my most recent attempts to use it. It didn't make any difference.
I've reinstalled Ubuntu 14.04 and will stress-test the machine for a few hours to see if I get any NCQ errors. I believe that version of Ubuntu is running 3.13, so I may upgrade to whichever version of Ubuntu (or maybe Jessie) has a 3.18 or later kernel so the comparison is more relevant.
As a test, I compiled GHC 7.10.2 from source using make -j12 on the Samsung 840 EVO (latest (EXT0DB6Q) firmware), on 3 different versions of Ubuntu: 14.04, 15.04, and 15.10. I didn't add any kernel command-line options or otherwise try to manually disable NCQ or TRIM support.
GHC built successfully on 14.04 and 15.04 with no SATA/NCQ issues. It's in the process of building on 15.10 as I write this, but I don't feel like more evidence is needed at this point. The system also handled the 14.04 install and 2 full distro upgrades (14.04 -> 15.04, 15.04 -> 15.10) perfectly. I'm now convinced this is not a hardware issue. In NixOS 15.09 with the same hardware, I get NCQ errors as soon as I do anything as simple as editing a file.
Here are the results of dmesg | grep -i ncq on the two most recent Ubuntu versions:
Ubuntu 15.04 (linux-image-generic 3.19.0.37.36):
[ 2.347661] ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part ems apst
[ 2.707964] ata2.00: 488397168 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
Ubuntu 15.10 (linux-image-generic 4.2.0.19.21):
[ 2.459704] ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part ems apst
[ 2.820158] ata2.00: 488397168 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
No mention of "horkage" in any logs or in dmesg.
I will probably set the SATA mode to IDE in BIOS and use NixOS that way until this is resolved.
I don't suppose there's any easy way to use a Debian/Ubuntu kernel with NixOS, is there? I Googled a bit but came up empty.
For reference, here are the outputs of lsmod and smartctl -a /dev/sda in Ubuntu 15.10:
Module Size Used by
binfmt_misc 20480 1
ipmi_ssif 24576 0
intel_rapl 20480 0
iosf_mbi 16384 1 intel_rapl
x86_pkg_temp_thermal 16384 0
intel_powerclamp 16384 0
coretemp 16384 0
kvm_intel 167936 0
kvm 512000 1 kvm_intel
crct10dif_pclmul 16384 0
crc32_pclmul 16384 0
aesni_intel 167936 0
aes_x86_64 20480 1 aesni_intel
lrw 16384 1 aesni_intel
gf128mul 16384 1 lrw
glue_helper 16384 1 aesni_intel
ablk_helper 16384 1 aesni_intel
cryptd 20480 2 aesni_intel,ablk_helper
sb_edac 28672 0
edac_core 53248 1 sb_edac
input_leds 16384 0
joydev 20480 0
mei_me 32768 0
shpchp 36864 0
mei 98304 1 mei_me
lpc_ich 24576 0
ioatdma 65536 0
ipmi_si 57344 0
8250_fintek 16384 0
ipmi_msghandler 49152 2 ipmi_ssif,ipmi_si
mac_hid 16384 0
lp 20480 0
parport 49152 1 lp
autofs4 40960 2
hid_generic 16384 0
usbhid 49152 0
hid 118784 2 hid_generic,usbhid
igb 188416 0
dca 16384 2 igb,ioatdma
ahci 36864 4
ptp 20480 1 igb
libahci 32768 1 ahci
pps_core 20480 1 ptp
i2c_algo_bit 16384 1 igb
wmi 20480 0
smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.2.0-19-generic] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org
=== START OF INFORMATION SECTION ===
Model Family: Samsung based SSDs
Device Model: Samsung SSD 840 EVO 250GB
Serial Number: S1DBNEAD714949K
LU WWN Device Id: 5 002538 8500158f2
Firmware Version: EXT0DB6Q
User Capacity: 250,059,350,016 bytes [250 GB]
Sector Size: 512 bytes logical/physical
Rotation Rate: Solid State Device
Device is: In smartctl database [for details use: -P show]
ATA Version is: ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is: SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is: Wed Dec 2 22:24:52 2015 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled
=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED
General SMART Values:
Offline data collection status: (0x00) Offline data collection activity
was never started.
Auto Offline Data Collection: Disabled.
Self-test execution status: ( 0) The previous self-test routine completed
without error or no self-test has ever
been run.
Total time to complete Offline
data collection: ( 4800) seconds.
Offline data collection
capabilities: (0x53) SMART execute Offline immediate.
Auto Offline data collection on/off support.
Suspend Offline collection upon new
command.
No Offline surface scan supported.
Self-test supported.
No Conveyance Self-test supported.
Selective Self-test supported.
SMART capabilities: (0x0003) Saves SMART data before entering
power-saving mode.
Supports SMART auto save timer.
Error logging capability: (0x01) Error logging supported.
General Purpose Logging supported.
Short self-test routine
recommended polling time: ( 2) minutes.
Extended self-test routine
recommended polling time: ( 80) minutes.
SCT capabilities: (0x003d) SCT Status supported.
SCT Error Recovery Control supported.
SCT Feature Control supported.
SCT Data Table supported.
SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
5 Reallocated_Sector_Ct 0x0033 100 100 010 Pre-fail Always - 0
9 Power_On_Hours 0x0032 096 096 000 Old_age Always - 17820
12 Power_Cycle_Count 0x0032 099 099 000 Old_age Always - 27
177 Wear_Leveling_Count 0x0013 098 098 000 Pre-fail Always - 13
179 Used_Rsvd_Blk_Cnt_Tot 0x0013 100 100 010 Pre-fail Always - 0
181 Program_Fail_Cnt_Total 0x0032 100 100 010 Old_age Always - 0
182 Erase_Fail_Count_Total 0x0032 100 100 010 Old_age Always - 0
183 Runtime_Bad_Block 0x0013 100 100 010 Pre-fail Always - 0
187 Uncorrectable_Error_Cnt 0x0032 100 100 000 Old_age Always - 0
190 Airflow_Temperature_Cel 0x0032 062 062 000 Old_age Always - 38
195 ECC_Error_Rate 0x001a 200 200 000 Old_age Always - 0
199 CRC_Error_Count 0x003e 099 099 000 Old_age Always - 421
235 POR_Recovery_Count 0x0012 099 099 000 Old_age Always - 21
241 Total_LBAs_Written 0x0032 099 099 000 Old_age Always - 9225534273
SMART Error Log Version: 1
No Errors Logged
SMART Self-test log structure revision number 1
No self-tests have been logged. [To run self-tests, use: smartctl -t]
SMART Selective self-test log data structure revision number 1
SPAN MIN_LBA MAX_LBA CURRENT_TEST_STATUS
1 0 0 Not_testing
2 0 0 Not_testing
3 0 0 Not_testing
4 0 0 Not_testing
5 0 0 Not_testing
Selective self-test flags (0x0):
After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
Hit the issue again, despite forcing the presence of the libata module.
[38621.166800] ata1.00: exception Emask 0x0 SAct 0xe00000 SErr 0x50000 action 0x6 frozen
[38621.166803] ata1: SError: { PHYRdyChg CommWake }
[38621.166805] ata1.00: failed command: WRITE FPDMA QUEUED
[38621.166808] ata1.00: cmd 61/08:a8:c0:78:e5/00:00:12:00:00/40 tag 21 ncq 4096 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[38621.166809] ata1.00: status: { DRDY }
[38621.166810] ata1.00: failed command: WRITE FPDMA QUEUED
[38621.166813] ata1.00: cmd 61/08:b0:f8:78:e5/00:00:12:00:00/40 tag 22 ncq 4096 out
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[38621.166814] ata1.00: status: { DRDY }
[38621.166815] ata1.00: failed command: READ FPDMA QUEUED
[38621.166817] ata1.00: cmd 60/08:b8:08:28:60/00:00:02:00:00/40 tag 23 ncq 4096 in
res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[38621.166818] ata1.00: status: { DRDY }
It looks like the kernel param is successfully passed during boot. However, it seems like noncqtrim is not being respected? Full output below, but there are two suspicious lines:
[ 0.289695] ahci 0000:00:1f.2: flags: 64bit ncq pm led clo only pio slum part deso sadm sds apst
+
[ 0.597066] ata1.00: 500118192 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
Is this expected?
full output:
☭ sudo dmesg -k | grep -i cq
[ 0.000000] Command line: BOOT_IMAGE=(hd0,gpt2)//kernels/9xfh1qyj52ibmpgb5lngx5w3248lq7wz-linux-4.3-bzImage systemConfig=/nix/store/m0d0rf0f6malq33dv09azwcs438k4c4s-nixos-15.09.706.45128de init=/nix/store/m0d0rf0f6malq33dv09azwcs438k4c4s-nixos-15.09.706.45128de/init loglevel=4 libata.force=noncqtrim
[ 0.000000] Kernel command line: BOOT_IMAGE=(hd0,gpt2)//kernels/9xfh1qyj52ibmpgb5lngx5w3248lq7wz-linux-4.3-bzImage systemConfig=/nix/store/m0d0rf0f6malq33dv09azwcs438k4c4s-nixos-15.09.706.45128de init=/nix/store/m0d0rf0f6malq33dv09azwcs438k4c4s-nixos-15.09.706.45128de/init loglevel=4 libata.force=noncqtrim
[ 0.017667] ACPI: 12 ACPI AML tables successfully acquired and loaded
[ 0.289695] ahci 0000:00:1f.2: flags: 64bit ncq pm led clo only pio slum part deso sadm sds apst
[ 0.596111] ata1.00: FORCE: horkage modified (noncqtrim)
[ 0.597066] ata1.00: 500118192 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
[ 14.158037] rtsx_pci 0000:02:00.0: rtsx_pci_acquire_irq: pcr->msi_en = 1, pci->irq = 44
I added libata.force=noncqtrim,noncq to my kernelParams and now I'm seeing NCQ disabled:
☭ dmesg -k | grep -i cq
[ 0.000000] Command line: BOOT_IMAGE=(hd0,gpt2)//kernels/9xfh1qyj52ibmpgb5lngx5w3248lq7wz-linux-4.3-bzImage systemConfig=/nix/store/lyi1hbkhh6vnq0rg0lw0fcnrwk1ylmps-nixos-15.09.706.45128de init=/nix/store/lyi1hbkhh6vnq0rg0lw0fcnrwk1ylmps-nixos-15.09.706.45128de/init loglevel=4 libata.force=noncqtrim,noncq
[ 0.000000] Kernel command line: BOOT_IMAGE=(hd0,gpt2)//kernels/9xfh1qyj52ibmpgb5lngx5w3248lq7wz-linux-4.3-bzImage systemConfig=/nix/store/lyi1hbkhh6vnq0rg0lw0fcnrwk1ylmps-nixos-15.09.706.45128de init=/nix/store/lyi1hbkhh6vnq0rg0lw0fcnrwk1ylmps-nixos-15.09.706.45128de/init loglevel=4 libata.force=noncqtrim,noncq
[ 0.017666] ACPI: 12 ACPI AML tables successfully acquired and loaded
[ 0.297387] ahci 0000:00:1f.2: flags: 64bit ncq pm led clo only pio slum part deso sadm sds apst
[ 0.604122] ata1.00: FORCE: horkage modified (noncqtrim)
[ 0.604125] ata1.00: FORCE: horkage modified (noncq)
[ 0.604651] ata1.00: 500118192 sectors, multi 1: LBA48 NCQ (not used)
[ 9.653970] rtsx_pci 0000:02:00.0: rtsx_pci_acquire_irq: pcr->msi_en = 1, pci->irq = 45
We'll see how long this is stable for.
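To double-check which libata.force flags a running system was actually booted with, the kernel command line can be split apart. This is only a sketch; libata_force_opts is a hypothetical helper name, and it assumes the parameter is written as libata.force=flag1,flag2 as in the boot lines above.

```shell
# Hypothetical helper: list the libata.force flags found on a kernel command line.
# Reads the command line on stdin and prints one flag per line.
# Usage: libata_force_opts < /proc/cmdline
libata_force_opts() {
  tr ' ' '\n' | sed -n 's/^libata\.force=//p' | tr ',' '\n'
}
```

In the logs above, noncqtrim on its own left the drive at "NCQ (depth 31/32)", and NCQ was only reported as "not used" once noncq was listed as well.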
I've tried the same with AHCI mode turned back on in the BIOS, using a Linux 4.2 kernel:
[ 0.000000] Command line: BOOT_IMAGE=(hd0,msdos2)/nix/store/f77jdmsx27a81qkrfvmz7hjh5c83cwkm-linux-4.2.5/bzImage systemConfig=/nix/store/nr83md689m2zlf0byas7y460228p2sy4-nixos-15.09.706.45128de init=/nix/store/nr83md689m2zlf0byas7y460228p2sy4-nixos-15.09.706.45128de/init libata.force=noncqtrim,noncq loglevel=4
[ 0.000000] Kernel command line: BOOT_IMAGE=(hd0,msdos2)/nix/store/f77jdmsx27a81qkrfvmz7hjh5c83cwkm-linux-4.2.5/bzImage systemConfig=/nix/store/nr83md689m2zlf0byas7y460228p2sy4-nixos-15.09.706.45128de init=/nix/store/nr83md689m2zlf0byas7y460228p2sy4-nixos-15.09.706.45128de/init libata.force=noncqtrim,noncq loglevel=4
[ 0.058802] ACPI: All ACPI Tables successfully acquired
[ 0.828099] ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part ems apst
[ 1.145191] ata2.00: FORCE: horkage modified (noncq)
[ 1.145215] ata2.00: 488397168 sectors, multi 1: LBA48 NCQ (not used)
Unfortunately, this causes a different error:
[ 31.413338] ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
[ 31.413368] ata2.00: irq_stat 0x00000040, connection status changed
[ 31.413390] ata2: SError: { CommWake DevExch }
[ 31.413407] ata2.00: failed command: READ DMA EXT
[ 31.413426] ata2.00: cmd 25/00:08:d0:ba:24/00:00:1b:00:00/e0 tag 7 dma 4096 in
res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
[ 31.413475] ata2.00: status: { DRDY }
[ 31.413491] ata2: hard resetting link
[ 33.618097] ata2: SATA link down (SStatus 1 SControl 300)
[ 33.973001] ata2: hard resetting link
[ 36.179834] ata2: SATA link down (SStatus 1 SControl 300)
[ 36.179842] ata2: limiting SATA link speed to 1.5 Gbps
[ 36.474890] ata2: hard resetting link
[ 36.779773] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[ 36.781716] ata2.00: supports DRM functions and may not be fully accessible
[ 36.781889] ata2.00: supports DRM functions and may not be fully accessible
[ 36.781892] ata2.00: configured for UDMA/133
[ 36.792796] ata2: EH complete
Tried the 4.3 kernel, same result, this time with a WRITE DMA EXT error forcing the controller into UDMA/133 mode.
@dhess how are you forcing these issues to happen so quickly? I only hit this after several hours (and occasionally days)
@aaronlevin For me it reliably happens only a few seconds after login. Just lucky I guess :\
@dhess :(
@dhess to run your system with the SSD in IDE mode, did you have to re-generate hardware-configuration.nix?
No. I just made the BIOS change. Everything is working great now; it's just a shame I had to cripple the SSD performance to get here.
@dhess your last error is closer to what I had on the mentioned remote system (also Supermicro). Didn't get to look at it yet, but can you post lspci -nv? Curious which disk controller you have.
Can we remove the needs: feedback tag on this?
I've been experiencing very frequent file system / SSD failures on my Lenovo X250. I've experienced this problem:
on kernels 3.18, 4.0, 4.1, 4.2, and 4.3
with TRIM enabled (and all the fstab incantations: noatime, nodiratime, discard, etc.)
I am fairly certain this is a configuration issue. Specifically, NixOS likely isn't installing the right modules, setting the right configuration, or providing an option to grab the right firmware?
I'm really desperate to fix this problem so I can use NixOS at work.
The Issue
It is hard to re-create and hard to debug. There is a kernel failure, the file system hangs and gets remounted in read-only mode (so I can't grab logs). The error happens intermittently. The longest period I've gone without experiencing the issue is a day and a half, but I've experienced it immediately on reboot a few times. It seems to happen more frequently after suspend, but not exclusively (I hit it once right after a reboot, and I also hit it once during systemctl rescue).
I have taken photos of the failure on my cell phone (sometimes I can catch logs while tailing dmesg when the error happens).
From dmesg (I'm writing out what I see on my photo):
My Setup
luks encrypted drive (set up following these instructions)