MichaIng / DietPi

Lightweight justice for your single-board computer!
https://dietpi.com/
GNU General Public License v2.0

External USB3 hard drive reset #487

Closed joaofl closed 8 years ago

joaofl commented 8 years ago

I'm using an XU4 with USB3, connected to a 2TB WD Elements external HD. Whenever the drive gets stressed, it sporadically resets, together with an audible "tick" coming from it (which causes bad blocks, a shorter life span, and consequently data loss).

Searching around, the best clue I found was this: http://forums.debian.net/viewtopic.php?f=7&t=117061 which points to a bug in the driver for the drive's SATA -> USB3 converter: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/commit/?id=a9c54caa456dccba938005f6479892b589975e6a

They claim to have fixed the issue on kernel version 3.17. I wonder if there is a fix for that, or means to upgrade the kernel to get that fixed. I'm afraid I'll lose data one of these days.

Maybe this is one more reason for: https://github.com/Fourdee/DietPi/issues/414

Thanks

Below is the error.

[ 1394.012800] [c0] usb 4-1.2: reset SuperSpeed USB device number 3 using xhci-hcd
[ 1394.027355] [c0] usb 4-1.2: Parent hub missing LPM exit latency info.  Power management will be impacted.
[ 1394.027708] [c0] xhci-hcd xhci-hcd.2.auto: xHCI xhci_drop_endpoint called with disabled ep dd9fad80
[ 1394.027735] [c0] xhci-hcd xhci-hcd.2.auto: xHCI xhci_drop_endpoint called with disabled ep dd9fadac
[ 1479.052776] [c0] usb 4-1.2: reset SuperSpeed USB device number 3 using xhci-hcd
[ 1479.067414] [c0] usb 4-1.2: Parent hub missing LPM exit latency info.  Power management will be impacted.
[ 1479.067868] [c0] xhci-hcd xhci-hcd.2.auto: xHCI xhci_drop_endpoint called with disabled ep dd9fad80
[ 1479.067896] [c0] xhci-hcd xhci-hcd.2.auto: xHCI xhci_drop_endpoint called with disabled ep dd9fadac
[ 1561.133040] [c0] usb 4-1.2: reset SuperSpeed USB device number 3 using xhci-hcd
[ 1561.147257] [c1] usb 4-1.2: Parent hub missing LPM exit latency info.  Power management will be impacted.
[ 1561.147625] [c1] xhci-hcd xhci-hcd.2.auto: xHCI xhci_drop_endpoint called with disabled ep dd9fad80
[ 1561.147656] [c1] xhci-hcd xhci-hcd.2.auto: xHCI xhci_drop_endpoint called with disabled ep dd9fadac
[ 2452.092765] [c0] usb 4-1.2: reset SuperSpeed USB device number 3 using xhci-hcd
[ 2452.107151] [c0] usb 4-1.2: Parent hub missing LPM exit latency info.  Power management will be impacted.
[ 2452.107576] [c0] xhci-hcd xhci-hcd.2.auto: xHCI xhci_drop_endpoint called with disabled ep dd9fad80
[ 2452.107598] [c0] xhci-hcd xhci-hcd.2.auto: xHCI xhci_drop_endpoint called with disabled ep dd9fadac
[ 2572.099757] [c0] usb 4-1.2: reset SuperSpeed USB device number 3 using xhci-hcd
[ 2572.113902] [c0] usb 4-1.2: Parent hub missing LPM exit latency info.  Power management will be impacted.
[ 2572.114456] [c0] xhci-hcd xhci-hcd.2.auto: xHCI xhci_drop_endpoint called with disabled ep dd9fad80
[ 2572.114489] [c0] xhci-hcd xhci-hcd.2.auto: xHCI xhci_drop_endpoint called with disabled ep dd9fadac
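
To see how often the drive resets, the timestamps of the "reset SuperSpeed USB device" lines can be pulled out of a saved kernel log. The following is a minimal sketch, not part of the original report: the names `RESET_RE`, `reset_times` and `intervals` are made up for illustration, and the sample text is a shortened copy of the log above.

```python
import re

# Matches kernel lines such as:
# [ 1394.012800] [c0] usb 4-1.2: reset SuperSpeed USB device number 3 using xhci-hcd
# The "[c0]" CPU tag is treated as optional because not every kernel prints it.
RESET_RE = re.compile(
    r"\[\s*(?P<ts>\d+\.\d+)\]\s+(?:\[c\d+\]\s+)?"
    r"usb (?P<port>[\d.-]+): reset SuperSpeed USB device"
)

def reset_times(log_text):
    """Return a list of (timestamp_seconds, port) for every reset line."""
    return [(float(m.group("ts")), m.group("port"))
            for m in RESET_RE.finditer(log_text)]

def intervals(times):
    """Return the gaps in seconds between consecutive timestamps."""
    return [round(b - a, 3) for a, b in zip(times, times[1:])]

# Shortened copy of the log above (hypothetical file contents).
SAMPLE = """\
[ 1394.012800] [c0] usb 4-1.2: reset SuperSpeed USB device number 3 using xhci-hcd
[ 1394.027355] [c0] usb 4-1.2: Parent hub missing LPM exit latency info.  Power management will be impacted.
[ 1479.052776] [c0] usb 4-1.2: reset SuperSpeed USB device number 3 using xhci-hcd
[ 1561.133040] [c0] usb 4-1.2: reset SuperSpeed USB device number 3 using xhci-hcd
"""

events = reset_times(SAMPLE)
print(len(events), "resets on ports", sorted({port for _, port in events}))
print("seconds between resets:", intervals([ts for ts, _ in events]))
```

Feeding it the full output of `dmesg` instead of `SAMPLE` would show whether the resets cluster around periods of heavy IO.
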
joaofl commented 8 years ago

Sometimes I get what seems to be some bad blocks:

[ 5004.907560] [c4] sd 0:0:0:0: [sda] Unhandled sense code
[ 5004.907581] [c4] sd 0:0:0:0: [sda]  
[ 5004.907594] Result: hostbyte=0x00 driverbyte=0x08
[ 5004.907610] [c4] sd 0:0:0:0: [sda]  
[ 5004.907622] Sense Key : 0x3 [current] 
[ 5004.907650] [c4] sd 0:0:0:0: [sda]  
[ 5004.907661] ASC=0x11 ASCQ=0x0
[ 5004.907681] [c4] sd 0:0:0:0: [sda] CDB: 
[ 5004.907692] cdb[0]=0x28: 28 00 0f 7d 00 00 00 00 80 00
[ 5004.907784] [c4] end_request: critical target error, dev sda, sector 259850240
[ 5152.019254] [c0] usb 4-1.2: reset SuperSpeed USB device number 3 using xhci-hcd
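
The raw bytes in this log can be decoded by hand. Opcode 0x28 is the SCSI READ(10) command, whose bytes 2 to 5 hold the big-endian logical block address and bytes 7 to 8 the transfer length. Sense key 0x3 is MEDIUM ERROR, and ASC 0x11 / ASCQ 0x00 is "unrecovered read error". A small sketch of the decoding follows; the function and table names are made up for illustration, and the tables contain only the codes that appear in this log.

```python
# Lookup tables limited to the values seen in the log above.
SENSE_KEYS = {0x3: "MEDIUM ERROR"}
ASC_ASCQ = {(0x11, 0x00): "UNRECOVERED READ ERROR"}

def decode_read10(cdb_hex):
    """Decode a READ(10) CDB given as a hex string, return (lba, block_count)."""
    cdb = bytes.fromhex(cdb_hex.replace(" ", ""))
    if len(cdb) != 10 or cdb[0] != 0x28:
        raise ValueError("not a 10-byte READ(10) CDB")
    lba = int.from_bytes(cdb[2:6], "big")      # bytes 2..5: logical block address
    blocks = int.from_bytes(cdb[7:9], "big")   # bytes 7..8: transfer length
    return lba, blocks

lba, blocks = decode_read10("28 00 0f 7d 00 00 00 00 80 00")
print("read of", blocks, "blocks starting at LBA", lba)
print("sense key:", SENSE_KEYS[0x3], "/", ASC_ASCQ[(0x11, 0x00)])
```

The decoded address matches the "sector 259850240" in the `end_request` line, so the failed request was a 128-block read starting at that sector, which the drive reported as unreadable. That fits the "bad blocks" reading, although it says nothing about what damaged the sector.
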
Fourdee commented 8 years ago

@joaofl

whenever it gets stressed, sporadically resets, together with an audible "tick"

Sounds like power requirements for the USB drive are not being met at load.

They claim to have fixed the issue on kernel version 3.17. I wonder if there is a fix for that, or means to upgrade the kernel to get that fixed. I'm afraid I'll lose data one of these days.

I'm compiling the 4.7 tobetter kernel now: http://odroid.com/dokuwiki/doku.php?id=en:xu4_building_kernel https://github.com/tobetter/linux/tree/odroidxu4-v4.7.

As you mentioned, I've been meaning to test 4.x for improved network throughput (30mb/s currently): https://github.com/Fourdee/DietPi/issues/414. So hopefully, 2 birds, 1 stone :+1:

Will post a download link when it's ready.

She's one hot board lol: image

Fourdee commented 8 years ago

@joaofl I've compiled and hosted the tobetter 4.7.0 kernel (instructions https://github.com/Fourdee/DietPi/issues/414#issuecomment-243736813). Looks good here image

joaofl commented 8 years ago

@Fourdee thanks for the responsive and efficient support.

What's the amp rating of the drive (e.g. 750mA)?

from "lsusb -v" I get MaxPower 224mA

2.5 inch or 3.5 inch external drive?

2.5 inch, host powered, with the original 3A power supply from hardkernel. On some forums, I saw people powering their external HD from the host USB with no problem. There are some known issues related to dirt on the USB3 connector pins, but I have cleaned them to make sure. I thought it was a current limitation in the beginning, but even the Raspberry Pi with its 1.2A limit managed to power it up OK. Anyway, I can perform some more tests with a 5A power supply and see.

Or do you think this error is clearly lack of power? Anyway, it happens less often than the other reset warning.

[ 5004.907622] Sense Key : 0x3 [current]
[ 5004.907650] [c4] sd 0:0:0:0: [sda]

Otherwise I can upgrade to kernel 4.7.0 and check whether it still happens.

I'll keep this post up to date.

Thanks again :+1:

Fourdee commented 8 years ago

@joaofl

from "lsusb -v" I get MaxPower 224mA

I just tested my 750mA drive powered down, idle, then with a file copy. I get the same results each time:

root@DietPi-XU4:~# lsusb -v | grep MaxPower
    MaxPower                0mA
    MaxPower              180mA
    MaxPower              180mA
    MaxPower                0mA
    MaxPower               24mA
    MaxPower                0mA
    MaxPower                0mA
    MaxPower              100mA
    MaxPower                0mA
    MaxPower                0mA
    MaxPower                0mA
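
The identical readings for powered down, idle and file copy are expected: `MaxPower` is the maximum current each device declares in its USB configuration descriptor, not a measurement of what it is drawing at that moment. A minimal sketch for summarising such output follows; the function name is made up for illustration, and the two limit constants are the per-device maximums for configured USB 2.0 and USB 3.0 devices.

```python
import re

USB2_LIMIT_MA = 500  # most a configured USB 2.0 device may declare
USB3_LIMIT_MA = 900  # most a configured USB 3.0 (SuperSpeed) device may declare

def declared_max_power(lsusb_text):
    """Return every MaxPower value (in mA) found in `lsusb -v` output."""
    return [int(v) for v in re.findall(r"MaxPower\s+(\d+)\s*mA", lsusb_text)]

# Copy of the output above.
SAMPLE = """\
    MaxPower                0mA
    MaxPower              180mA
    MaxPower              180mA
    MaxPower                0mA
    MaxPower               24mA
    MaxPower                0mA
    MaxPower                0mA
    MaxPower              100mA
    MaxPower                0mA
    MaxPower                0mA
    MaxPower                0mA
"""

values = declared_max_power(SAMPLE)
print("non-zero declarations (mA):", [v for v in values if v > 0])
print("largest single declaration:", max(values), "mA")
print("within USB 2.0 budget:", all(v <= USB2_LIMIT_MA for v in values))
```

Because these are declared figures, measuring the real draw of a spinning drive under load needs an inline USB power meter or a similar tool.
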

https://www.amazon.co.uk/Elements-Portable-External-Drive-WDBU6Y0020BBK-EESN/dp/B00D0L5BH8 They claim it's USB2.0 compatible, so it's <=500mA, but I'd imagine it's higher than 224mA when active.

original 3A power supply from hardkernel

The current HK PSU is 5V/4A? I had stability issues with that PSU and had to change it to a 5V/8A one; it has now run 24/7 for over 6 months without problems.

Or do you think this error is clearly lack of power?

I'm pretty confident this is an insufficient or unstable power issue with the PSU. But let's see if the 4.7 kernel helps.

ghost commented 8 years ago

In general, if the hard drive is physically failing, it can pull more power, causing the symptoms described. Most likely it is an insufficient power supply feeding the board, as already stated.

joaofl commented 8 years ago

You guys were right, it was caused by lack of current. But what is strange is that it had been working smoothly for almost a year... It may be an aging issue.

I had one of these 10A power supplies (cheap but good enough) lying around here, and decided to test it.

It has a small knob for minor voltage adjustments. I initially set it to 5.00V (5 sharp according to my cheap multimeter, plus or minus its error), but the HD had trouble spinning up. I set it to 5.2V and the HD worked fine, with no errors for almost one hour, but then the XU4 suddenly shut down due to overheating, something that had never happened before. I dropped it to around 5.1V and noticed it running a bit cooler, but I still limited the CPU clock to 1600MHz.

It has now been running for 12h with no problems at all. So I would say the problem has been solved, although, @Fourdee, I will perform those tests with the new kernel as well.

Thanks guys

joaofl commented 6 years ago

@Fourdee After a long time, I believe I have some bad news regarding this.

A few weeks ago I finally tried to migrate to kernel 4.14 (from my current setup running kernel 3.xxx) using the exact same hardware, with the latest DietPi. The setup is:

SD card for boot
SSD on a USB3 powered adapter with the filesystem, for better performance than the SD card
HD on USB3, externally powered (where all the junk is)

It started well. The first stress tests were all OK: some intense IO on the HD simultaneously with some network transfer, something like copying a big file to and from the HD from another computer over the gigabit network.
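
The disk half of a stress test like this can be approximated locally. The sketch below is an assumption about how one might do it, not the test that was actually run: it writes random data to a directory on the drive under test, forces it to disk, reads it back and compares checksums. The function name and file name are made up for illustration.

```python
import hashlib
import os
import tempfile

def write_and_verify(directory, size_mib=64, chunk_mib=1):
    """Write random data into `directory`, read it back, return True if it matches."""
    chunk = chunk_mib * 1024 * 1024
    path = os.path.join(directory, "stress_test.bin")  # hypothetical file name
    written = hashlib.sha256()
    try:
        with open(path, "wb") as f:
            for _ in range(size_mib // chunk_mib):
                data = os.urandom(chunk)
                written.update(data)
                f.write(data)
            f.flush()
            os.fsync(f.fileno())  # push the data out to the device
            # Ask the kernel to drop cached pages so the read-back is more
            # likely to hit the drive (advisory only, Linux/POSIX specific).
            if hasattr(os, "posix_fadvise"):
                os.posix_fadvise(f.fileno(), 0, 0, os.POSIX_FADV_DONTNEED)
        read_back = hashlib.sha256()
        with open(path, "rb") as f:
            for block in iter(lambda: f.read(chunk), b""):
                read_back.update(block)
        return written.hexdigest() == read_back.hexdigest()
    finally:
        if os.path.exists(path):
            os.remove(path)

# Small run against the temp directory; point it at the drive's mount point instead.
print("data intact:", write_and_verify(tempfile.gettempdir(), size_mib=4))
```

Running it in a loop while a large network copy is in progress, and watching `dmesg` for new reset lines, would come closer to the combined load described above.
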

Then, as the first step before installing apps, I moved the filesystem to my external USB3 SSD using the DietPi script, as I had done before. Then I ran the same "stress test" again, and this is when the problem arose: usb 4-1.2: reset SuperSpeed USB device number 3 using xhci-hcd

The exact same issue: the needle ticking and the reset. Then I downgraded to the 3.x kernel, and the exact same tests worked fine.

One way of avoiding the problem is to not migrate the filesystem to USB. But if you do, you may have the same issue. I tested it on both of my Odroid XU4s.

I also discussed this issue extensively on the odroid forums, but the folks there have their filesystem on the SD card or eMMC. They are not necessarily using the means you provide to transfer the filesystem to USB storage. I believe it is reproducible with any pair of USB3 devices.

Do you have any thoughts on that? Sorry to bother you with this issue again, but I'm really feeling left behind without the latest updates. Cheers

Fourdee commented 6 years ago

@joaofl

The exact same issue: the needle ticking and the reset. Then I downgraded to the 3.x kernel, and the exact same tests worked fine.

Sounds like 4.14 isn't providing enough power to the USB port, or the XU4 isn't getting enough power from the PSU, or the external HDDs are not getting enough power, or 4.14 is incompatible with the USB controller in the external HDD caddy. Nightmare basically lol.

OK, even though it works on 3.x, the XU4 needs at least a 5V/6A PSU in general, especially when using external HDDs, regardless of whether they are bus powered or not (even with externally powered USB drives, you'd need to verify whether the bus is providing the actual power).

A quick but costly solution would be to purchase the HC1/2, which has much better support for attached SATA drives. I've been running my HC1 for months with a 64GB SATA SSD and the rootFS transferred to it.

Aside from that, it literally could be anything (kernel/hardware), and a long road to debug.

joaofl commented 6 years ago

@Fourdee I really believe it is not power related, as I have tested every cable/power supply combination possible. Believe me, I have loads of them. Even with a computer power supply, with some 10A+ available.

Moreover, if the filesystem is on the SD card and I run the stress test with kernel 4.14, doing intensive IO on the SSD, HD and network simultaneously, I get no errors. The issue only arises after the filesystem gets copied to the SSD, which is why I believe it is a kernel issue. Hard to tell.

PS: I was tempted to buy the HC1/2, but since they started the development of that even cooler board, I decided to hold off. You are more lucky than I, since you get those for free :)

I posted this on the odroid forums as well. Let's see if anybody else can reproduce it.

Fourdee commented 6 years ago

@joaofl

You are more lucky than I, since you get those for free

😄 + import tax £30 😉