File System / SSD Failures on Lenovo X250 (not kernel, not hardware)

aaronlevin commented 8 years ago

I've been experiencing very frequent file system / ssd failures on my Lenovo X250. I've experienced this problem:

on two separate, brand-new Lenovo X250s
while a Lenovo X250 running Debian (Jessie) worked with no issues for months (same SSD, same firmware)
across several kernels: 3.18, 4.0, 4.1, 4.2, and 4.3
with or without TRIM enabled (and all the fstab incantations: noatime, nodiratime, discard, etc.)

I am fairly certain this is a configuration issue: perhaps NixOS isn't installing the right modules, isn't setting the right configuration, or is missing an option to grab the right firmware.

I'm really desperate to fix this problem so I can use NixOS at work.

The Issue

It is hard to reproduce and hard to debug. There is a kernel failure, the file system hangs, and it gets remounted read-only (so I can't grab logs). The error happens intermittently. The longest period I've gone without experiencing the issue is a day and a half, but I've also experienced it immediately after a reboot a few times. It seems to happen more frequently after suspend, but not exclusively (I hit it once right after a reboot, and once during systemctl rescue).

I have taken photos of the failure on my cell phone (sometimes I can catch logs while tailing dmesg when the error happens).

from dmesg (I'm writing out what I see on my photo):

ata1.00: exception Emask 0x0 SAct 0x1e0000 SErr 0x50000 action 0x6 frozen
ata1: SError: { PHYRdyChg CommWake }
ata1.00: failed command: WRITE FPDMA QUEUED
ata1.00: cmd 61/08:00:18:ee:7a/00:00:02:00:00/40 tag 0 ncq .... (and on)
ata1.00: status: { DRDY }
...
(and repeat)
...
ata1: hard resetting link
ata1: SATA link down (SStatus 0 SControl 300)
...
(repeat)
...
ata1.00: disabled
ata1.00: device reported invalid CHS sector 0
sd 0:0:0:0: [sda] tag#21 FAILED Result: hostbyte=DID_OK driverbyte=DRIVER_SENSE
sd 0:0:0:0: [sda] tag#21 Sense Key : Illegal Request [current] [descriptor]
sd 0:0:0:0: [sda] tag#21 Add. Sense: Unaligned write command
sd 0:0:0:0: [sda] tag#21 CDB: Write(10) 2a 00 0f 64 4c 78 00 00 10 00
blk_update_request: I/O error, dev sda, sector 258231416
...
(and repeat with increasing tag # and new blk_update sector errors)
...
EXT4-fs warning (device dm-2): ext4_end_bio:332: I/O error -5 writing to inode 13389888 (offset 0 size 28672)
Buffer I/O Error on device dm-2, logical block 58770479
...
(repeat buffer i/o errors)
...
blk_update_request (as above)
sd 0:0:0:0: rejecting I/O to offline device
...
ata1: EH complete
...
Aborting journal on device dm-2-8
EXT4-fs (dm-2): Delayed block allocation failed for inode 14600222 at inode ...
EXT4-fs (dm-2): This should not happen!! Data will be lost
...
(and on)
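As a cross-check on the transcription above: in a SCSI Write(10) command, bytes 2-5 carry the big-endian logical block address and bytes 7-8 carry the transfer length in blocks, so the CDB line and the blk_update_request line should point at the same place. A minimal shell sketch of that arithmetic follows. The function names are made up for illustration and are not part of any tool.

```shell
#!/bin/sh
# Decode the LBA from a Write(10) CDB given as ten space-separated hex bytes.
cdb_write10_lba() {
    # Split the string into positional parameters, one hex byte each.
    set -- $1
    [ "$#" -eq 10 ] || { echo "expected 10 CDB bytes" >&2; return 1; }
    # Bytes 2-5 (0-indexed), i.e. $3..$6 here, form the big-endian LBA.
    echo "$((0x$3$4$5$6))"
}

# Decode the transfer length (in logical blocks) from the same CDB.
cdb_write10_blocks() {
    set -- $1
    [ "$#" -eq 10 ] || { echo "expected 10 CDB bytes" >&2; return 1; }
    # Bytes 7-8 (0-indexed), i.e. $8 and $9 here, form the big-endian length.
    echo "$((0x$8$9))"
}
```

For the CDB in the log (2a 00 0f 64 4c 78 00 00 10 00) this gives LBA 258231416, which matches the sector in the I/O error, and a length of 16 blocks (8 KiB at 512 bytes per sector). The LBA is divisible by 8, so the request itself looks 4 KiB aligned despite the "Unaligned write command" sense text.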

My Setup

Lenovo X250
LUKS-encrypted drive (set up following these instructions)

☭ uname -a
Linux nixos 4.3.0 #1-NixOS SMP Thu Jan 1 00:00:01 UTC 1970 x86_64 GNU/Linux

☭ nixos-version 
15.09.706.45128de (Dingo)

☭ lspci
00:00.0 Host bridge: Intel Corporation Broadwell-U Host Bridge -OPI (rev 09)
00:02.0 VGA compatible controller: Intel Corporation Broadwell-U Integrated Graphics (rev 09)
00:03.0 Audio device: Intel Corporation Broadwell-U Audio Controller (rev 09)
00:14.0 USB controller: Intel Corporation Wildcat Point-LP USB xHCI Controller (rev 03)
00:16.0 Communication controller: Intel Corporation Wildcat Point-LP MEI Controller #1 (rev 03)
00:19.0 Ethernet controller: Intel Corporation Ethernet Connection (3) I218-LM (rev 03)
00:1b.0 Audio device: Intel Corporation Wildcat Point-LP High Definition Audio Controller (rev 03)
00:1c.0 PCI bridge: Intel Corporation Wildcat Point-LP PCI Express Root Port #6 (rev e3)
00:1c.1 PCI bridge: Intel Corporation Wildcat Point-LP PCI Express Root Port #3 (rev e3)
00:1d.0 USB controller: Intel Corporation Wildcat Point-LP USB EHCI Controller (rev 03)
00:1f.0 ISA bridge: Intel Corporation Wildcat Point-LP LPC Controller (rev 03)
00:1f.2 SATA controller: Intel Corporation Wildcat Point-LP SATA Controller [AHCI Mode] (rev 03)
00:1f.3 SMBus: Intel Corporation Wildcat Point-LP SMBus Controller (rev 03)
00:1f.6 Signal processing controller: Intel Corporation Wildcat Point-LP Thermal Management Controller (rev 03)
02:00.0 Unassigned class [ff00]: Realtek Semiconductor Co., Ltd. RTS5227 PCI Express Card Reader (rev 01)
03:00.0 Network controller: Intel Corporation Wireless 7265 (rev 59)

☭ sudo smartctl -a /dev/sda
smartctl 6.3 2014-07-26 r3976 [x86_64-linux-4.3.0] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Device Model:     SAMSUNG MZ7LN256HCHP-000L7
Serial Number:    S20HNXBG772065
LU WWN Device Id: 5 002538 d00000000
Firmware Version: EMT03L6Q
User Capacity:    256,060,514,304 bytes [256 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        Not in smartctl database [for details use: -P showall]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Thu Nov 26 11:02:36 2015 CET
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        (    0) seconds.
Offline data collection
capabilities:            (0x53) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    ( 133) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   099   099   000    Old_age   Always       -       76
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       151
170 Unknown_Attribute       0x0032   100   100   010    Old_age   Always       -       0
171 Unknown_Attribute       0x0032   100   100   010    Old_age   Always       -       0
172 Unknown_Attribute       0x0032   100   100   010    Old_age   Always       -       0
173 Unknown_Attribute       0x0033   099   099   005    Pre-fail  Always       -       1
174 Unknown_Attribute       0x0032   099   099   000    Old_age   Always       -       38
178 Used_Rsvd_Blk_Cnt_Chip  0x0013   100   100   010    Pre-fail  Always       -       0
180 Unused_Rsvd_Blk_Cnt_Tot 0x0013   100   100   010    Pre-fail  Always       -       773
184 End-to-End_Error        0x0033   100   100   097    Pre-fail  Always       -       0
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
194 Temperature_Celsius     0x0032   077   056   000    Old_age   Always       -       23
199 UDMA_CRC_Error_Count    0x003e   100   100   000    Old_age   Always       -       0
233 Media_Wearout_Indicator 0x0013   099   099   000    Pre-fail  Always       -       16768826
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       194
242 Total_LBAs_Read         0x0032   099   099   000    Old_age   Always       -       823
249 Unknown_Attribute       0x0032   099   099   000    Old_age   Always       -       256

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

Warning! SMART Selective Self-Test Log Structure error: invalid SMART checksum.
SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
  255        0    65535  Read_scanning was never started
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.

jagajaga commented 8 years ago

You use ext4? I probably had the same issue in April, until I moved to F2FS. I had the same issue on 2 SSDs in RAID 0.

aaronlevin commented 8 years ago

@jagajaga yes, ext4. how stable have you found F2FS?

jagajaga commented 8 years ago

@aaronlevin pretty stable, 0 issues since April. Running on my laptop (Intel SSD) and, nowadays, a desktop with 2 SSDs (OCZ) in RAID 1.

mbakke commented 8 years ago

I'm typing this from an X250 with LUKS, LVM and ext4 on SSD. NixOS since day 1, current uptime 29 days. Do you have allowDiscards = true in boot.initrd.luks.devices?

aaronlevin commented 8 years ago

@mbakke I have experienced the issue with allowDiscards = true;. It's currently turned off, so I'll turn it back on again and we'll see how quickly the issue surfaces.

@mbakke what version of the firmware on the ssd are you using? Also, anything special in the kernel modules you're loading or any extra modprobe config?

Thanks!

mbakke commented 8 years ago

Hm, I have a different SSD altogether (Toshiba). Nothing else hardware-related in nixos configuration.

Device Model:     TOSHIBA THNSNJ512GCSU
Firmware Version: JULA0101
User Capacity:    512,110,190,592 bytes [512 GB]

Samsung has a pretty poor track record when it comes to Linux SSD firmware. Can you check if the same happens if you boot with libata.force=noncq as per this bug report?

aaronlevin commented 8 years ago

@mbakke I had the libata.force=noncq in my extraModprobeConfig and still had the issue.

aaronlevin commented 8 years ago

@mbakke interesting that you have a different SSD. However, we have Debian (Jessie) installed on a similar model to mine (same SSD, same firmware) with no issues.

mbakke commented 8 years ago

@aaronlevin you need to set that in boot.kernelParams, not extraModprobeConfig. I suspect Debian may have blacklisted NCQ on that device.

aaronlevin commented 8 years ago

@mbakke does that go in my configuration.nix or my hardware-configuration.nix?

mbakke commented 8 years ago

Either :)

aaronlevin commented 8 years ago

@mbakke ok, perhaps that was my issue: having libata.force=noncq in my extraModprobeConfig and not in kernelParams. I've added it and I'll see how stable my SSD is now.

Is there a policy around closing and re-opening? Because I'm happy to close this now and then re-open it if this does not fix the issue.

Additionally, should we consider generating that kernel param when this SSD is detected, similar to Debian? Or is determining such a setting out of scope for NixOS?

mbakke commented 8 years ago

If that indeed solves the issue, we can probably add the NCQ TRIM blacklist patch from above, at least if other distros are doing the same. Although it arguably should be added upstream...

(disclaimer: I don't actually know what NixOS' stance on adding kernel patches is, nor am I an official dev)

aaronlevin commented 8 years ago

@mbakke hmm, good question. We may not even need to apply the patch. For example, it might even be easier just to generate:

{ config, pkgs, ... }:

{
    # We have detected an SSD with NCQ TRIM blacklisted. 
    boot.kernelParams = [ "libata.force=noncq" ];
}

vcunat commented 8 years ago

Is there a policy around closing and re-opening? Because I'm happy to close this now and then re-open it if this does not fix the issue.

I don't think there's such a policy. I would do the same as you.

aaronlevin commented 8 years ago

@vcunat thanks.

aaronlevin commented 8 years ago

After 2 days of stability (longest so far), I just hit the issue again. Ugh. I was really hoping that would resolve my issue.

To update: I tried putting libata.force=noncq in my kernelParams. This brought about some stability but I hit the issue after 48 hours.

aaronlevin commented 8 years ago

Can anyone think of any other settings that a distro like debian might set for these SSDs that NixOS is not setting?

mbakke commented 8 years ago

I don't suppose you were able to take a screendump this time? The errors should be somewhat different. You could perhaps try mounting /var/log/journal on a USB stick, or sending it over the network. You can also disable NCQ at runtime with echo 1 > /sys/block/sda/device/queue_depth.

Remove discard from fstab too to make sure we don't hit multiple bugs. I'm on BIOS 1.15 FWIW, although we should both upgrade to 1.17. Debian follows kernel development closely and may apply all kinds of workarounds that are fixed in newer firmware.
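The runtime toggle mentioned above can be confirmed by reading the same sysfs file back: a queue depth of 1 means commands are issued one at a time, so NCQ is effectively off. Below is a small sketch of such a check. The helper name ncq_state is hypothetical, and the default path assumes the disk is sda.

```shell
#!/bin/sh
# Report whether NCQ is effectively disabled for a disk, based on its queue depth.
# $1 is the queue_depth file to read (defaults to sda, which is an assumption).
ncq_state() {
    depth_file="${1:-/sys/block/sda/device/queue_depth}"
    depth="$(cat "$depth_file")" || return 1
    if [ "$depth" -eq 1 ]; then
        echo "NCQ disabled (queue_depth=$depth)"
    else
        echo "NCQ enabled (queue_depth=$depth)"
    fi
}
```

Note that with sudo the redirection in `sudo echo 1 > ...` is performed by the unprivileged shell and fails, so something like `echo 1 | sudo tee /sys/block/sda/device/queue_depth` is needed, followed by `ncq_state` to confirm. The setting does not survive a reboot.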

dhess commented 8 years ago

I'm also seeing this issue on a SuperMicro SuperServer 5017R-WRF with Samsung 840 EVO SSDs, both a 250GB and a 500GB model. I'm running NixOS 15.09 and ext4. The same hardware runs Ubuntu 14.04 with no issues.

In my case, the controller resets the drive enough times that it ends up in UDMA/133 mode and appears to be stable, but I've only just managed to get nixos-install to finish, so I haven't taxed it much yet.

I added libata.force=noncq on the NixOS installer grub command line and got fewer errors than without that boot parameter, but still got them.

(BTW, I am not running luks in my configuration -- it's just straight ext4.)

aaronlevin commented 8 years ago

@mbakke I just want to preface my answers by thanking you for all your help!!

NCQ looks properly disabled and I don't have discards in my fstab (listed below). Is there a simple way to upgrade the firmware?

☭ cat /sys/block/sda/device/queue_depth 
1

here is my hardware-configuration.nix:

{ config, lib, pkgs, ... }:

{
  imports =
    [ <nixpkgs/nixos/modules/installer/scan/not-detected.nix>
    ];

  boot.initrd.availableKernelModules = [ "xhci_pci" "ehci_pci" "ahci" "usbhid" "usb_storage" ];
  boot.kernelModules = [ "kvm-intel" ];
  boot.extraModulePackages = [ ];

  fileSystems."/" =
    { device = "/dev/disk/by-uuid/bb6a6acb-055e-4d1f-9812-13c9d183bb6c";
      fsType = "ext4";
      options = "rw,relatime,nobarrier,data=ordered";
    };

  fileSystems."/boot" =
    { device = "/dev/disk/by-uuid/9cbf3855-bb29-4123-abd1-e08de2e39a36";
      fsType = "ext2";
    };

  swapDevices =
    [ { device = "/dev/disk/by-uuid/73bfe4e8-b4a6-433b-b152-73fd5702fcd8"; }
    ];

  nix.maxJobs = 4;
}

aaronlevin commented 8 years ago

@dhess can you run smartctl -a on your drive? Curious what firmware you have running.

mbakke commented 8 years ago

@dhess: Your device is explicitly blacklisted in kernel 4.1 or newer. Try setting boot.kernelPackages = pkgs.linuxPackages_4_1 and see if the problem persists. It may be easier to switch to the unstable channel if you need to compile stuff: nix-channel --add https://nixos.org/channels/nixos-unstable nixos.

@aaronlevin if you still have an X250 with Debian around, check if the backports kernel has the same issue (apt-get -t jessie-backports install linux-image-amd64). Jessie and 14.04 are both kernel 3.16, while 15.09 is 3.18. Could you also try booting 4.2 or newer with libata.force=noncqtrim instead of libata.force=noncq?

If you are on stock BIOS still, I recommend upgrading. These are first generation Broadwells after all. Download the (Windows) ISO-based installer (n10ur08w.exe IIRC) from Lenovo, extract the .iso with innoextract, convert to image with geteltorito.pl and dd onto USB stick. Instructions here.
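The three steps above might look roughly like the following. This is only a sketch: the function name is hypothetical, the file names and the target device are placeholders supplied by the caller, and it deliberately prints the commands instead of running them, because dd overwrites whatever device it is pointed at.

```shell
#!/bin/sh
# Print (do not run) the commands for turning the BIOS update into a bootable USB image.
# Arguments (all placeholders supplied by the caller):
#   $1  installer .exe downloaded from Lenovo
#   $2  path of the .iso that innoextract unpacked from it
#   $3  target USB block device (its contents will be overwritten by dd)
bios_usb_steps() {
    [ "$#" -eq 3 ] || { echo "usage: bios_usb_steps EXE ISO DEVICE" >&2; return 1; }
    # Step 1: unpack the Inno Setup installer to get at the .iso inside.
    echo "innoextract $1"
    # Step 2: extract the El Torito boot image from the .iso.
    echo "geteltorito.pl -o bios.img $2"
    # Step 3: write the boot image to the USB stick.
    echo "dd if=bios.img of=$3 bs=512K"
}
```

Review the printed commands, and double-check the device name with lsblk, before running any of them by hand.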

dhess commented 8 years ago

Yes, that appears to be exactly what's happening with the newest Samsung firmware.

Based on Samsung's responses to the bug, I think that, rather than upgrading my kernel, I'll just take my business elsewhere!

Thanks for the help.

aaronlevin commented 8 years ago

@mbakke the Debian system in question was on kernel 4.1, but I will try booting with libata.force=noncqtrim instead and see if the problem persists. I hit the issue this morning, so it's happening with some regular frequency.

I'll also try upgrading the firmware but that might not happen until later this evening.

Thanks again!

aaronlevin commented 8 years ago

PS - I am on the 4.3 kernel.

aaronlevin commented 8 years ago

@mbakke do I need to have "libata" in my boot.initrd.kernelModules? It is not there currently.

mbakke commented 8 years ago

@aaronlevin the libata module is loaded automatically when needed. I assumed it was compiled-in, but since it's built as a module the libata.force options should work in extraModprobeConfig too.

aaronlevin commented 8 years ago

@mbakke just to make sure I have everything correct:

In configuration.nix I have:

  boot.kernelParams = [ "libata.force=noncqtrim" ];

In hardware-configuration.nix I have:

  boot.initrd.availableKernelModules = [ "xhci_pci" "ehci_pci" "ahci" "usbhid" "usb_storage" ];
  boot.kernelModules = [ "kvm-intel" ];
  boot.extraModulePackages = [ ];

So, I'm not explicitly specifying the libata module needs to be loaded, though it appears in my kernelParams. I would assume I should have to specify it in boot.initrd.kernelModules?

mbakke commented 8 years ago

You can put libata.force either on the kernel command line or in a module parameter; it should be picked up either way. The libata module is loaded automatically (but it's arguably more nix-y to specify it).

dhess commented 8 years ago

Hmm, I'm now getting this same error using an Intel 535 120GB SSD, which is not on the blacklist.

(edit: yep, I can't even get through the nixos-install process with this Intel 535. Setting libata.force=noncq on the Grub boot line for the USB boot disk makes no difference.)

aaronlevin commented 8 years ago

I just hit the issue right now again. I've added "libata" to my boot.initrd.kernelModules and rebooted. we'll see.

dhess commented 8 years ago

I'm back on the Samsung 840 EVO 250GB SSD again. I managed to get the 4.1 kernel installed by putting the SATA controller in IDE mode in the system BIOS. I then rebooted and set the controller back to AHCI mode. Unfortunately, even in 4.1, I'm getting NCQ errors:

Dec 02 10:09:52 nix01 kernel: ata2.00: exception Emask 0x10 SAct 0x3fc0 SErr 0x4040000 action 0xe frozen
Dec 02 10:09:57 nix01 kernel: ata2.00: irq_stat 0x00000040, connection status changed
Dec 02 10:09:57 nix01 kernel: ata2: SError: { CommWake DevExch }
Dec 02 10:09:57 nix01 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Dec 02 10:09:57 nix01 kernel: ata2.00: cmd 61/08:30:00:00:20/00:00:08:00:00/40 tag 6 ncq 4096 out
                                       res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Dec 02 10:09:57 nix01 kernel: ata2.00: status: { DRDY }
Dec 02 10:09:57 nix01 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Dec 02 10:09:57 nix01 kernel: ata2.00: cmd 61/08:38:18:00:20/00:00:08:00:00/40 tag 7 ncq 4096 out
                                       res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Dec 02 10:09:57 nix01 kernel: ata2.00: status: { DRDY }
Dec 02 10:09:57 nix01 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Dec 02 10:09:57 nix01 kernel: ata2.00: cmd 61/08:40:80:00:20/00:00:0c:00:00/40 tag 8 ncq 4096 out
                                       res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Dec 02 10:09:57 nix01 kernel: ata2.00: status: { DRDY }
Dec 02 10:09:57 nix01 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Dec 02 10:09:57 nix01 kernel: ata2.00: cmd 61/08:48:f0:05:20/00:00:0c:00:00/40 tag 9 ncq 4096 out
                                       res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Dec 02 10:09:57 nix01 kernel: ata2.00: status: { DRDY }
Dec 02 10:09:57 nix01 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Dec 02 10:09:57 nix01 kernel: ata2.00: cmd 61/08:50:98:06:20/00:00:0c:00:00/40 tag 10 ncq 4096 out
                                       res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Dec 02 10:09:57 nix01 kernel: ata2.00: status: { DRDY }
Dec 02 10:09:57 nix01 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Dec 02 10:09:57 nix01 kernel: ata2.00: cmd 61/08:58:f8:13:21/00:00:0c:00:00/40 tag 11 ncq 4096 out
                                       res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Dec 02 10:09:57 nix01 kernel: ata2.00: status: { DRDY }
Dec 02 10:09:57 nix01 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Dec 02 10:09:57 nix01 kernel: ata2.00: cmd 61/08:60:08:01:a0/00:00:19:00:00/40 tag 12 ncq 4096 out
                                       res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Dec 02 10:09:57 nix01 kernel: ata2.00: status: { DRDY }
Dec 02 10:09:57 nix01 kernel: ata2.00: failed command: WRITE FPDMA QUEUED
Dec 02 10:09:57 nix01 kernel: ata2.00: cmd 61/08:68:28:00:a4/00:00:19:00:00/40 tag 13 ncq 4096 out
                                       res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
Dec 02 10:09:57 nix01 kernel: ata2.00: status: { DRDY }
Dec 02 10:09:57 nix01 kernel: ata2: hard resetting link
Dec 02 10:09:57 nix01 kernel: ata2: SATA link down (SStatus 1 SControl 300)
Dec 02 10:09:57 nix01 kernel: ata2: hard resetting link
Dec 02 10:09:57 nix01 kernel: ata2: SATA link down (SStatus 1 SControl 300)
Dec 02 10:09:57 nix01 kernel: ata2: limiting SATA link speed to 1.5 Gbps
Dec 02 10:09:57 nix01 kernel: ata2: hard resetting link
Dec 02 10:09:57 nix01 kernel: ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
Dec 02 10:09:57 nix01 kernel: ata2.00: supports DRM functions and may not be fully accessible
Dec 02 10:09:57 nix01 kernel: ata2.00: disabling queued TRIM support
Dec 02 10:09:57 nix01 kernel: ata2.00: supports DRM functions and may not be fully accessible
Dec 02 10:09:57 nix01 kernel: ata2.00: disabling queued TRIM support
Dec 02 10:09:57 nix01 kernel: ata2.00: configured for UDMA/133
Dec 02 10:09:57 nix01 kernel: ata2: EH complete

mbakke commented 8 years ago

Bizarre. Can't believe this is NixOS-specific. @dhess make sure you have the latest firmware on both drives. Intel have had similar problems in the past.

I actually have a remote machine that "died" with similar disk errors some time during/after 3.18. It may well have been the same issue, will check on it tomorrow. Hopefully there are some earlier generations left.

At this point I would try different combinations of [no]ncq and [no]ncqtrim (the latter requires kernel 4.2+) in libata.force. Note that you can pass them from grub/gummiboot rather than rebuilding every time. Also verify that the options are actually picked up (dmesg?).

I'll have a look through the Debian kernel sources and try to find anything remotely related.
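One way to confirm that a libata.force setting actually reached the running kernel is to look for it on the kernel command line. The sketch below does that. The helper name libata_force_value is hypothetical, and it only covers the command-line route, not a setting passed as a module option.

```shell
#!/bin/sh
# Print the value of every libata.force= token found on a kernel command line.
# $1 is the file to read; it defaults to the running kernel's /proc/cmdline.
libata_force_value() {
    cmdline_file="${1:-/proc/cmdline}"
    # Put each space-separated token on its own line, then keep only the
    # part after "libata.force=".
    tr ' ' '\n' < "$cmdline_file" | sed -n 's/^libata\.force=//p'
}
```

An empty result means the parameter is not on the command line at all. A broad filter such as `dmesg | grep -i -e ncq -e force` is a reasonable next step to see what libata reported for each device.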

dhess commented 8 years ago

I did verify that 'noncq' is reflected in dmesg when specified on the Grub command line.

I do have the latest firmware on the Samsung drive as I just flashed it last night before my most recent attempts to use it. It didn't make any difference.

I've reinstalled Ubuntu 14.04 and will stress-test the machine for a few hours to see if I get any NCQ errors. I believe that version of Ubuntu is running 3.13, so I may upgrade to whichever version of Ubuntu (or maybe Jessie) has a 3.18 or later kernel so the comparison is more relevant.

dhess commented 8 years ago

As a test, I compiled GHC 7.10.2 from source using make -j12 on the Samsung 840 EVO (latest (EXT0DB6Q) firmware), on 3 different versions of Ubuntu: 14.04, 15.04, and 15.10. I didn't add any kernel command-line options or otherwise try to manually disable NCQ or TRIM support.

GHC built successfully on 14.04 and 15.04 with no SATA/NCQ issues. It's in the process of building on 15.10 as I write this, but I don't feel like more evidence is needed at this point. The system also handled the 14.04 install and 2 full distro upgrades (14.04 -> 15.04, 15.04 -> 15.10) perfectly. I'm now convinced this is not a hardware issue. In NixOS 15.09 with the same hardware, I get NCQ errors as soon as I do anything as simple as editing a file.

Here are the results of dmesg | grep -i ncq on the two most recent Ubuntu versions:

Ubuntu 15.04 (linux-image-generic 3.19.0.37.36):

[    2.347661] ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part ems apst 
[    2.707964] ata2.00: 488397168 sectors, multi 1: LBA48 NCQ (depth 31/32), AA

Ubuntu 15.10 (linux-image-generic 4.2.0.19.21):

[    2.459704] ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part ems apst 
[    2.820158] ata2.00: 488397168 sectors, multi 1: LBA48 NCQ (depth 31/32), AA

No mention of "horkage" in any logs or in dmesg.

I will probably set the SATA mode to IDE in BIOS and use NixOS that way until this is resolved.

I don't suppose there's any easy way to use a Debian/Ubuntu kernel with NixOS, is there? I Googled a bit but came up empty.

dhess commented 8 years ago

For reference, here are the outputs of lsmod and smartctl -a /dev/sda in Ubuntu 15.10:

Module                  Size  Used by
binfmt_misc            20480  1
ipmi_ssif              24576  0
intel_rapl             20480  0
iosf_mbi               16384  1 intel_rapl
x86_pkg_temp_thermal    16384  0
intel_powerclamp       16384  0
coretemp               16384  0
kvm_intel             167936  0
kvm                   512000  1 kvm_intel
crct10dif_pclmul       16384  0
crc32_pclmul           16384  0
aesni_intel           167936  0
aes_x86_64             20480  1 aesni_intel
lrw                    16384  1 aesni_intel
gf128mul               16384  1 lrw
glue_helper            16384  1 aesni_intel
ablk_helper            16384  1 aesni_intel
cryptd                 20480  2 aesni_intel,ablk_helper
sb_edac                28672  0
edac_core              53248  1 sb_edac
input_leds             16384  0
joydev                 20480  0
mei_me                 32768  0
shpchp                 36864  0
mei                    98304  1 mei_me
lpc_ich                24576  0
ioatdma                65536  0
ipmi_si                57344  0
8250_fintek            16384  0
ipmi_msghandler        49152  2 ipmi_ssif,ipmi_si
mac_hid                16384  0
lp                     20480  0
parport                49152  1 lp
autofs4                40960  2
hid_generic            16384  0
usbhid                 49152  0
hid                   118784  2 hid_generic,usbhid
igb                   188416  0
dca                    16384  2 igb,ioatdma
ahci                   36864  4
ptp                    20480  1 igb
libahci                32768  1 ahci
pps_core               20480  1 ptp
i2c_algo_bit           16384  1 igb
wmi                    20480  0

smartctl 6.4 2014-10-07 r4002 [x86_64-linux-4.2.0-19-generic] (local build)
Copyright (C) 2002-14, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF INFORMATION SECTION ===
Model Family:     Samsung based SSDs
Device Model:     Samsung SSD 840 EVO 250GB
Serial Number:    S1DBNEAD714949K
LU WWN Device Id: 5 002538 8500158f2
Firmware Version: EXT0DB6Q
User Capacity:    250,059,350,016 bytes [250 GB]
Sector Size:      512 bytes logical/physical
Rotation Rate:    Solid State Device
Device is:        In smartctl database [for details use: -P show]
ATA Version is:   ACS-2, ATA8-ACS T13/1699-D revision 4c
SATA Version is:  SATA 3.1, 6.0 Gb/s (current: 6.0 Gb/s)
Local Time is:    Wed Dec  2 22:24:52 2015 UTC
SMART support is: Available - device has SMART capability.
SMART support is: Enabled

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

General SMART Values:
Offline data collection status:  (0x00) Offline data collection activity
                    was never started.
                    Auto Offline Data Collection: Disabled.
Self-test execution status:      (   0) The previous self-test routine completed
                    without error or no self-test has ever 
                    been run.
Total time to complete Offline 
data collection:        ( 4800) seconds.
Offline data collection
capabilities:            (0x53) SMART execute Offline immediate.
                    Auto Offline data collection on/off support.
                    Suspend Offline collection upon new
                    command.
                    No Offline surface scan supported.
                    Self-test supported.
                    No Conveyance Self-test supported.
                    Selective Self-test supported.
SMART capabilities:            (0x0003) Saves SMART data before entering
                    power-saving mode.
                    Supports SMART auto save timer.
Error logging capability:        (0x01) Error logging supported.
                    General Purpose Logging supported.
Short self-test routine 
recommended polling time:    (   2) minutes.
Extended self-test routine
recommended polling time:    (  80) minutes.
SCT capabilities:          (0x003d) SCT Status supported.
                    SCT Error Recovery Control supported.
                    SCT Feature Control supported.
                    SCT Data Table supported.

SMART Attributes Data Structure revision number: 1
Vendor Specific SMART Attributes with Thresholds:
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       17820
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       27
177 Wear_Leveling_Count     0x0013   098   098   000    Pre-fail  Always       -       13
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   062   062   000    Old_age   Always       -       38
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       421
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       21
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       9225534273

SMART Error Log Version: 1
No Errors Logged

SMART Self-test log structure revision number 1
No self-tests have been logged.  [To run self-tests, use: smartctl -t]

SMART Selective self-test log data structure revision number 1
 SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
    1        0        0  Not_testing
    2        0        0  Not_testing
    3        0        0  Not_testing
    4        0        0  Not_testing
    5        0        0  Not_testing
Selective self-test flags (0x0):
  After scanning selected spans, do NOT read-scan remainder of disk.
If Selective self-test is pending on power-up, resume after 0 minute delay.
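
One way to read the attribute table above is to pull out the raw values and flag any error counter that is not zero. This is only a sketch: the rows are copied from the output above, and the `ERROR_COUNTERS` set (which attributes to treat as error counters) is an assumption made for illustration, not something smartctl defines. On this drive the only counter that fires is CRC_Error_Count, which is generally understood to count transfer errors on the SATA link rather than faults in the flash itself.

```python
# Attribute rows copied from the smartctl output above.
SMART_ROWS = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   096   096   000    Old_age   Always       -       17820
 12 Power_Cycle_Count       0x0032   099   099   000    Old_age   Always       -       27
177 Wear_Leveling_Count     0x0013   098   098   000    Pre-fail  Always       -       13
179 Used_Rsvd_Blk_Cnt_Tot   0x0013   100   100   010    Pre-fail  Always       -       0
181 Program_Fail_Cnt_Total  0x0032   100   100   010    Old_age   Always       -       0
182 Erase_Fail_Count_Total  0x0032   100   100   010    Old_age   Always       -       0
183 Runtime_Bad_Block       0x0013   100   100   010    Pre-fail  Always       -       0
187 Uncorrectable_Error_Cnt 0x0032   100   100   000    Old_age   Always       -       0
190 Airflow_Temperature_Cel 0x0032   062   062   000    Old_age   Always       -       38
195 ECC_Error_Rate          0x001a   200   200   000    Old_age   Always       -       0
199 CRC_Error_Count         0x003e   099   099   000    Old_age   Always       -       421
235 POR_Recovery_Count      0x0012   099   099   000    Old_age   Always       -       21
241 Total_LBAs_Written      0x0032   099   099   000    Old_age   Always       -       9225534273
"""

# Assumption: these attribute names are treated as "error counters" here.
# The selection is illustrative and not part of smartctl.
ERROR_COUNTERS = {
    "Reallocated_Sector_Ct",
    "Program_Fail_Cnt_Total",
    "Erase_Fail_Count_Total",
    "Runtime_Bad_Block",
    "Uncorrectable_Error_Cnt",
    "ECC_Error_Rate",
    "CRC_Error_Count",
}


def parse_smart_rows(text):
    """Return {attribute_name: raw_value} for each attribute row in text."""
    attrs = {}
    for line in text.splitlines():
        fields = line.split()
        # Columns: ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW
        if len(fields) >= 10 and fields[0].isdigit():
            attrs[fields[1]] = int(fields[9])
    return attrs


def nonzero_error_counters(attrs):
    """Keep only the error counters whose raw value is above zero."""
    return {name: raw for name, raw in attrs.items()
            if name in ERROR_COUNTERS and raw > 0}


print(nonzero_error_counters(parse_smart_rows(SMART_ROWS)))
# prints {'CRC_Error_Count': 421}
```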

aaronlevin commented 8 years ago

Hit the issue again, despite forcing the libata module to be present.

[38621.166800] ata1.00: exception Emask 0x0 SAct 0xe00000 SErr 0x50000 action 0x6 frozen
[38621.166803] ata1: SError: { PHYRdyChg CommWake }
[38621.166805] ata1.00: failed command: WRITE FPDMA QUEUED
[38621.166808] ata1.00: cmd 61/08:a8:c0:78:e5/00:00:12:00:00/40 tag 21 ncq 4096 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[38621.166809] ata1.00: status: { DRDY }
[38621.166810] ata1.00: failed command: WRITE FPDMA QUEUED
[38621.166813] ata1.00: cmd 61/08:b0:f8:78:e5/00:00:12:00:00/40 tag 22 ncq 4096 out
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[38621.166814] ata1.00: status: { DRDY }
[38621.166815] ata1.00: failed command: READ FPDMA QUEUED
[38621.166817] ata1.00: cmd 60/08:b8:08:28:60/00:00:02:00:00/40 tag 23 ncq 4096 in
                        res 40/00:00:00:00:00/00:00:00:00:00/00 Emask 0x4 (timeout)
[38621.166818] ata1.00: status: { DRDY }
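
For anyone reading along, the `SErr 0x50000` value in the exception line is a bitmask, and the `SError: { PHYRdyChg CommWake }` line is the kernel's decoding of it. The sketch below reproduces that decoding. The bit positions follow the SATA SError register layout as I understand it; only PHYRdyChg, CommWake and DevExch are confirmed by logs in this thread, and the other short names are written from memory and may differ slightly from what libata prints. `decode_serror` is a hypothetical helper, not a kernel or library function.

```python
# (bit position, short name) pairs for the SATA SError register.
# Assumption: names other than PHYRdyChg, CommWake and DevExch are
# approximations of the labels libata prints.
SERROR_BITS = [
    (0, "RecovData"),
    (1, "RecovComm"),
    (8, "UnrecovData"),
    (9, "Persist"),
    (10, "Proto"),
    (11, "HostInt"),
    (16, "PHYRdyChg"),
    (17, "PHYInt"),
    (18, "CommWake"),
    (19, "10B8BErr"),
    (20, "Dispar"),
    (21, "BadCRC"),
    (22, "Handshk"),
    (23, "LinkSeq"),
    (24, "TrStaTrns"),
    (25, "UnrecFIS"),
    (26, "DevExch"),
]


def decode_serror(mask):
    """Return the names of all SError bits set in mask, lowest bit first."""
    return [name for bit, name in SERROR_BITS if mask & (1 << bit)]


print(decode_serror(0x50000))
# prints ['PHYRdyChg', 'CommWake']
```

Both bits describe link-level events (the PHY ready state changed, a wake signal was seen) rather than a media error, which fits the timeouts followed by link resets.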

aaronlevin commented 8 years ago

It looks like the kernel param is successfully passed during boot. However, it seems like noncqtrim is not being respected? Full output below, but there are two suspicious lines:

[    0.289695] ahci 0000:00:1f.2: flags: 64bit ncq pm led clo only pio slum part deso sadm sds apst 
and
[    0.597066] ata1.00: 500118192 sectors, multi 1: LBA48 NCQ (depth 31/32), AA

Is this expected?

full output:

☭ sudo dmesg -k | grep -i cq
[    0.000000] Command line: BOOT_IMAGE=(hd0,gpt2)//kernels/9xfh1qyj52ibmpgb5lngx5w3248lq7wz-linux-4.3-bzImage systemConfig=/nix/store/m0d0rf0f6malq33dv09azwcs438k4c4s-nixos-15.09.706.45128de init=/nix/store/m0d0rf0f6malq33dv09azwcs438k4c4s-nixos-15.09.706.45128de/init loglevel=4 libata.force=noncqtrim
[    0.000000] Kernel command line: BOOT_IMAGE=(hd0,gpt2)//kernels/9xfh1qyj52ibmpgb5lngx5w3248lq7wz-linux-4.3-bzImage systemConfig=/nix/store/m0d0rf0f6malq33dv09azwcs438k4c4s-nixos-15.09.706.45128de init=/nix/store/m0d0rf0f6malq33dv09azwcs438k4c4s-nixos-15.09.706.45128de/init loglevel=4 libata.force=noncqtrim
[    0.017667] ACPI: 12 ACPI AML tables successfully acquired and loaded
[    0.289695] ahci 0000:00:1f.2: flags: 64bit ncq pm led clo only pio slum part deso sadm sds apst 
[    0.596111] ata1.00: FORCE: horkage modified (noncqtrim)
[    0.597066] ata1.00: 500118192 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
[   14.158037] rtsx_pci 0000:02:00.0: rtsx_pci_acquire_irq: pcr->msi_en = 1, pci->irq = 44

aaronlevin commented 8 years ago

I added libata.force=noncqtrim,noncq to my kernelParams and now I'm seeing NCQ disabled:

☭ dmesg -k | grep -i cq
[    0.000000] Command line: BOOT_IMAGE=(hd0,gpt2)//kernels/9xfh1qyj52ibmpgb5lngx5w3248lq7wz-linux-4.3-bzImage systemConfig=/nix/store/lyi1hbkhh6vnq0rg0lw0fcnrwk1ylmps-nixos-15.09.706.45128de init=/nix/store/lyi1hbkhh6vnq0rg0lw0fcnrwk1ylmps-nixos-15.09.706.45128de/init loglevel=4 libata.force=noncqtrim,noncq
[    0.000000] Kernel command line: BOOT_IMAGE=(hd0,gpt2)//kernels/9xfh1qyj52ibmpgb5lngx5w3248lq7wz-linux-4.3-bzImage systemConfig=/nix/store/lyi1hbkhh6vnq0rg0lw0fcnrwk1ylmps-nixos-15.09.706.45128de init=/nix/store/lyi1hbkhh6vnq0rg0lw0fcnrwk1ylmps-nixos-15.09.706.45128de/init loglevel=4 libata.force=noncqtrim,noncq
[    0.017666] ACPI: 12 ACPI AML tables successfully acquired and loaded
[    0.297387] ahci 0000:00:1f.2: flags: 64bit ncq pm led clo only pio slum part deso sadm sds apst 
[    0.604122] ata1.00: FORCE: horkage modified (noncqtrim)
[    0.604125] ata1.00: FORCE: horkage modified (noncq)
[    0.604651] ata1.00: 500118192 sectors, multi 1: LBA48 NCQ (not used)
[    9.653970] rtsx_pci 0000:02:00.0: rtsx_pci_acquire_irq: pcr->msi_en = 1, pci->irq = 45

We'll see how long this is stable for.
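
A note on the two boots above: as far as I know, `noncqtrim` only turns off queued TRIM, so seeing `NCQ (depth 31/32)` with that flag alone is expected, and it takes `noncq` to get `NCQ (not used)`. A small sketch for checking this from saved dmesg output follows. The two sample lines are copied from the logs above; `ncq_status` is a hypothetical helper name, and the regular expression only assumes the line format seen in this thread.

```python
import re

# Sample lines copied from the two dmesg outputs above.
BEFORE = "[    0.597066] ata1.00: 500118192 sectors, multi 1: LBA48 NCQ (depth 31/32), AA"
AFTER = "[    0.604651] ata1.00: 500118192 sectors, multi 1: LBA48 NCQ (not used)"

# Matches the device id and, when NCQ is active, the queue depth in use.
NCQ_LINE = re.compile(
    r"(ata\d+\.\d+): \d+ sectors, .*?NCQ \((?:depth (\d+)/\d+|not used)\)"
)


def ncq_status(dmesg_text):
    """Map each ATA device to its NCQ depth, or None if NCQ is not used."""
    status = {}
    for line in dmesg_text.splitlines():
        match = NCQ_LINE.search(line)
        if match:
            depth = match.group(2)
            status[match.group(1)] = int(depth) if depth is not None else None
    return status


print(ncq_status(BEFORE))
print(ncq_status(AFTER))
```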

dhess commented 8 years ago

I've tried the same with AHCI mode turned back on in the BIOS, using a Linux 4.2 kernel:

[    0.000000] Command line: BOOT_IMAGE=(hd0,msdos2)/nix/store/f77jdmsx27a81qkrfvmz7hjh5c83cwkm-linux-4.2.5/bzImage systemConfig=/nix/store/nr83md689m2zlf0byas7y460228p2sy4-nixos-15.09.706.45128de init=/nix/store/nr83md689m2zlf0byas7y460228p2sy4-nixos-15.09.706.45128de/init libata.force=noncqtrim,noncq loglevel=4
[    0.000000] Kernel command line: BOOT_IMAGE=(hd0,msdos2)/nix/store/f77jdmsx27a81qkrfvmz7hjh5c83cwkm-linux-4.2.5/bzImage systemConfig=/nix/store/nr83md689m2zlf0byas7y460228p2sy4-nixos-15.09.706.45128de init=/nix/store/nr83md689m2zlf0byas7y460228p2sy4-nixos-15.09.706.45128de/init libata.force=noncqtrim,noncq loglevel=4
[    0.058802] ACPI: All ACPI Tables successfully acquired
[    0.828099] ahci 0000:00:1f.2: flags: 64bit ncq sntf pm led clo pio slum part ems apst 
[    1.145191] ata2.00: FORCE: horkage modified (noncq)
[    1.145215] ata2.00: 488397168 sectors, multi 1: LBA48 NCQ (not used)

Unfortunately, this causes a different error:

[   31.413338] ata2.00: exception Emask 0x10 SAct 0x0 SErr 0x4040000 action 0xe frozen
[   31.413368] ata2.00: irq_stat 0x00000040, connection status changed
[   31.413390] ata2: SError: { CommWake DevExch }
[   31.413407] ata2.00: failed command: READ DMA EXT
[   31.413426] ata2.00: cmd 25/00:08:d0:ba:24/00:00:1b:00:00/e0 tag 7 dma 4096 in
                        res 50/00:00:00:00:00/00:00:00:00:00/a0 Emask 0x10 (ATA bus error)
[   31.413475] ata2.00: status: { DRDY }
[   31.413491] ata2: hard resetting link
[   33.618097] ata2: SATA link down (SStatus 1 SControl 300)
[   33.973001] ata2: hard resetting link
[   36.179834] ata2: SATA link down (SStatus 1 SControl 300)
[   36.179842] ata2: limiting SATA link speed to 1.5 Gbps
[   36.474890] ata2: hard resetting link
[   36.779773] ata2: SATA link up 1.5 Gbps (SStatus 113 SControl 310)
[   36.781716] ata2.00: supports DRM functions and may not be fully accessible
[   36.781889] ata2.00: supports DRM functions and may not be fully accessible
[   36.781892] ata2.00: configured for UDMA/133
[   36.792796] ata2: EH complete
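
The SStatus and SControl numbers in the link messages above are hex register values, and splitting them into fields shows what the reset sequence did. This sketch assumes the usual SATA layout (DET in bits 0-3, SPD in bits 4-7, IPM in bits 8-11); `decode_scr` and the `SPEEDS` table are names made up for illustration.

```python
# Assumption: SPD field meanings per the SATA spec as I understand it.
# In SStatus, 0 means no speed negotiated; in SControl, 0 means no limit.
SPEEDS = {0: "none", 1: "1.5 Gbps", 2: "3.0 Gbps", 3: "6.0 Gbps"}


def decode_scr(value):
    """Split an SStatus or SControl value into its DET, SPD and IPM fields."""
    return {
        "det": value & 0xF,          # device detection state
        "spd": (value >> 4) & 0xF,   # negotiated speed, or speed limit
        "ipm": (value >> 8) & 0xF,   # interface power management
    }


# Values taken from the log above (dmesg prints them in hex).
for label, raw in [("SStatus", "1"), ("SControl", "300"),
                   ("SStatus", "113"), ("SControl", "310")]:
    fields = decode_scr(int(raw, 16))
    print(label, raw, fields, SPEEDS.get(fields["spd"], "unknown"))
```

Read this way, `SStatus 1` is a device being detected without the PHY link coming up, while `SStatus 113 SControl 310` is an established link running at 1.5 Gbps with the speed limit set, matching the "limiting SATA link speed to 1.5 Gbps" message.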

dhess commented 8 years ago

Tried the 4.3 kernel, same result, this time with a WRITE DMA EXT error forcing the controller into UDMA/133 mode.

aaronlevin commented 8 years ago

@dhess how are you getting these failures to happen so quickly? I only hit this after several hours (and occasionally days).

dhess commented 8 years ago

@aaronlevin For me it reliably happens only a few seconds after login. Just lucky I guess :\

aaronlevin commented 8 years ago

@dhess :(

aaronlevin commented 8 years ago

@dhess to run your system with the SSD in IDE mode, did you have to re-generate hardware-configuration.nix?

dhess commented 8 years ago

No. I just made the BIOS change. Everything is working great now; it's just a shame I had to cripple the SSD performance to get here.

mbakke commented 8 years ago

@dhess your last error is closer to what I had on the mentioned remote system (also Supermicro). Didn't get to look at it yet, but can you post lspci -nv? Curious which disk controller you have.

dhess commented 8 years ago

Here you go:

lspci.txt

aaronlevin commented 8 years ago

Can we remove the needs: feedback tag on this?
