LSCHLv2: ata driver is hard resetting the SATA-Link on boot

cYrAx157 commented 7 months ago

Hi there, I have a LinkStation Live (LS-CHLv2) and I installed Debian Bookworm using the debootstrap script. When the device boots, I can hear the hard disk resetting a few times, and after several tries (sometimes 20 or more) the boot succeeds. In dmesg, you can see that the ata driver is hard resetting the link. The problem doesn't occur on older Debian versions (e.g. Squeeze). The HDD is a 4TB WD Red "WD40EFRX".
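For anyone who wants to quantify the resets rather than listen for them, here is a minimal sketch that counts them in dmesg text. It assumes the usual libata wording ("hard resetting link", "SATA link up N Gbps"); the function name `summarize_ata` is made up for illustration and is not part of any tool mentioned in this thread.

```python
import re

# Patterns assume libata's usual message wording; adjust if your log differs.
RESET_RE = re.compile(r"(ata\d+): hard resetting link")
LINK_UP_RE = re.compile(r"(ata\d+): SATA link up ([\d.]+) Gbps")


def summarize_ata(dmesg_text):
    """Count hard resets per port and collect the negotiated link speeds."""
    resets = {}
    speeds = {}
    for line in dmesg_text.splitlines():
        m = RESET_RE.search(line)
        if m:
            resets[m.group(1)] = resets.get(m.group(1), 0) + 1
            continue
        m = LINK_UP_RE.search(line)
        if m:
            speeds.setdefault(m.group(1), []).append(float(m.group(2)))
    return resets, speeds
```

Feeding it the contents of a saved dmesg file gives one reset count per port, which makes it easier to compare boots across kernel versions.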

dmesg.log

thanks in advance

1000001101000 commented 7 months ago

I have a theory, though I wouldn't have expected it to apply to this combination of device/drive.

Somewhere between Bullseye (5.10) and Bookworm (6.1) the EXT4 filesystem started enabling TRIM by default. This can cause problems for SATA controllers that don't support TRIM; the port expander on the TS-XEL is the main one I'm familiar with.

Normally you would only encounter this issue with SSDs, but it looks like some WD Red drives report TRIM support. I wouldn't expect this device to have that problem since it doesn't have a port expander, but I can't rule it out.

To protect against this issue I added "nodiscard" to the mount parameters for the rootfs. It looks like you are using EXT4 for other partitions on the disk as well. Try adding "nodiscard" to the options for all EXT4 filesystems in your /etc/fstab and see if that helps.
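As a rough sketch of that fstab change, the helper below appends "nodiscard" to the options of every EXT4 entry in fstab-style text. The function name `add_nodiscard` is hypothetical, and it only transforms text; writing the result back to /etc/fstab (after making a backup) is left to you.

```python
def add_nodiscard(fstab_text):
    """Return fstab text with 'nodiscard' added to every ext4 entry that lacks it."""
    out = []
    for line in fstab_text.splitlines():
        fields = line.split()
        # Leave blank lines, comments, short lines and non-ext4 entries untouched.
        if len(fields) < 4 or fields[0].startswith("#") or fields[2] != "ext4":
            out.append(line)
            continue
        opts = fields[3].split(",")
        if "nodiscard" in opts:
            out.append(line)
            continue
        # Drop an explicit 'discard' so the two options do not contradict each other.
        opts = [o for o in opts if o != "discard"] + ["nodiscard"]
        fields[3] = ",".join(opts)
        out.append("\t".join(fields))  # rewritten lines become tab-separated
    return "\n".join(out) + "\n"
```

Only lines that actually change are re-joined with tabs; everything else is passed through as-is.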

cYrAx157 commented 7 months ago

Good theory, but no luck :-( I tried a few things, and the only thing I can say is that the last working distro is "stretch" with the 4.9 kernel. Bullseye doesn't work either, same issue. Any other ideas? Maybe it's time for the ls-chlv2 to rest in peace....

1000001101000 commented 7 months ago

Could you install smartmontools and then post the output of smartctl -a /dev/sda?

cYrAx157 commented 7 months ago

here you go:

smartctl.log

I don't think it's a hardware issue on the HDD side, but I don't know if it's on the LinkStation side. Yesterday I tried a different HDD (Samsung), same problem. The only thing I can say is that I get a stable Debian if I use "stretch" (4.9). Because of that, I really think the problem is software related.

Would it help if I install debian-stretch and post a dmesg or something?

1000001101000 commented 7 months ago

I should probably start out by saying that I believe you when you say it’s likely not the hard drive. There are a bunch of similar issues out there for various SATA controllers, and a lot of those threads end in dismissive comments similar to “hard drives wear out bro”... drives me crazy.

The SMART data helps me understand a little more about the drive. I might even have the same model here somewhere. It doesn’t look like the drive is reporting massive CRC errors etc., which could have hinted at some things.
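If you want to pull the CRC counter out of the smartctl output without reading the whole table, something like this works as a minimal sketch. It assumes the standard ten-column ATA attribute table that smartctl prints; the helper name `smart_raw_values` is made up here.

```python
def smart_raw_values(smartctl_output):
    """Map SMART attribute names to their raw value (as text) from `smartctl -a` output."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows look like:
        # ID# ATTRIBUTE_NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
        if len(fields) >= 10 and fields[0].isdigit() and fields[2].startswith("0x"):
            values[fields[1]] = fields[9]  # first token of the raw value
    return values
```

A non-zero and growing UDMA_CRC_Error_Count would point at the link between drive and controller rather than at the platters.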

I would have expected most of the kirkwood devices to have the same issue, since I would have thought they all have the same SATA controller... but then I would have expected to hear about this from a lot of people, unless this is somehow specific to this model or possibly to certain types of drives.

Are you able to confirm whether the issue happens/happened on buster or bullseye? Narrowing down the kernel version that first had the issue might point us at what the issue might be.

You could also try EXT3, and possibly not mounting the data partition to see if there’s a filesystem component to the issue. The TRIM thing was specific to EXT4 on pretty recent kernels but there could be other such things that recently became default.

rems28 commented 7 months ago

Hello, on my LS-XHL with 256 MB of memory, the only way to keep the device usable is to stay on the 4.19 kernel from Debian Buster (10). With all newer kernels, it hangs like yours.

cYrAx157 commented 7 months ago

@1000001101000 okay, I will try a few things and report back. @rems28 yeah, I remember getting stable behaviour with a 4.x kernel, so I will try buster first.

rems28 commented 7 months ago

I compiled a new 4.19.301 kernel on this device with the Debian .config from buster. After the reboot, it continues to work like a charm, and uname reports that I am now on the new kernel.

1000001101000 commented 7 months ago

compiled on the device? I bet that took a while! That’s a good step forward.

ideally you could now try 5.10 and confirm that is broken…. then start trying kernels in-between to narrow down what kernel the issue started with. Once you’ve narrowed it down sufficiently we can look at the changes to relevant sata/fs stuff and try to determine what caused the problem.

you can probably save a lot of time by grabbing armel “marvell” kernel packages from Debian’s archive instead of building each one.

https://snapshot.debian.org/

rems28 commented 7 months ago

Yeah, it took approximately 30 hours on the device, but that's not important for me. I built the kernel from the kernel.org source and did not apply any Debian patches. Maybe one in the long list causes a problem on kirkwood CPUs. This one? https://sources.debian.org/src/linux/6.5.13-1/debian/patches/bugfix/arm/arm-dts-kirkwood-fix-sata-pinmux-ing-for-ts419.patch/ As far as I can see, the hard drive has not reset even once at boot since I built the new kernel, which is much better behaviour for me. For testing, I will try with another hard drive.

1000001101000 commented 7 months ago

That patch is in a device tree for a different device, it wouldn’t have any effect on yours.

If you wanted to determine if the problem was with my kernel or Debian’s specifically you’d need to build the same kernel version to compare.

rems28 commented 7 months ago

I tried the 6.1 kernel today and it hangs at boot.

1000001101000 commented 6 months ago

Excellent, it sounds like you’ve now confirmed 4.19 works and 6.1 doesn’t. If bullseye didn't work, 5.10 probably doesn’t either, but I’d check that next. From that point there are relatively few versions to check between 4.19 and 5.10.

if you can narrow it down to that point we might be able to figure out what changes to the sata driver, filesystems, etc might be the cause and start working on a fix
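The narrowing-down can be done as a plain binary search, which keeps the number of test boots small. Here is a minimal sketch; the version list and the `is_bad` callback are placeholders (in practice `is_bad` means "install that kernel, boot, and see whether the drive resets"), and it assumes the problem appeared at one version and stayed.

```python
def first_bad(versions, is_bad):
    """Return the first bad entry in `versions` (ordered oldest to newest).

    Assumes versions[0] is known good and versions[-1] is known bad,
    so neither endpoint is tested again.
    """
    lo, hi = 0, len(versions) - 1  # lo: known good, hi: known bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_bad(versions[mid]):
            hi = mid
        else:
            lo = mid
    return versions[hi]


# Hypothetical candidate list; use whichever versions are actually available.
candidates = ["4.19", "5.2", "5.3", "5.4", "5.5", "5.6", "5.7", "5.8", "5.9", "5.10"]
```

With ten candidates and both ends already known, that is at most three or four test boots instead of eight.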

pjt-15e commented 1 month ago

I think I might have found the cause, though it took me a few days of "research".

As above, my hard drive would hard reset several times during the boot process. Examining dmesg output showed an error involving MPP pin 10 being assigned to power-hdd when already assigned to serial 0 (aka UART0). Or words to that effect. The dmesg log also showed that the hard drive was often being connected at lower than expected speeds.

After checking through the kirkwood-88f6281 hardware documentation, kirkwood.dtsi, and kirkwood-6281.dtsi, I edited the Bookworm device tree kirkwood-lschlv2.dts file, changing "serial@12000" to "serial@12100" so that UART1 using MPP pins 13 and 14 would be active, instead of UART0 which previously used MPP pins 10 and 11. (Though serial output might be useful to help debug what's going on, the connection points on the PCB aren't known to me.)
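To make the conflict concrete, here is a small illustrative sketch that models only the pin assignments described above (UART0 on MPP 10 and 11, UART1 on MPP 13 and 14, power-hdd on MPP 10). It is not how the kernel's pinctrl code works, and the function name is made up.

```python
def find_pin_conflicts(assignments):
    """Return {pin: [functions]} for every pin claimed by more than one function."""
    claimed = {}
    for function, pins in assignments.items():
        for pin in pins:
            claimed.setdefault(pin, []).append(function)
    return {pin: funcs for pin, funcs in claimed.items() if len(funcs) > 1}


# Pin numbers as described in this comment.
before = {"uart0": [10, 11], "power-hdd": [10]}  # serial@12000 enabled
after = {"uart1": [13, 14], "power-hdd": [10]}   # serial@12100 enabled

print(find_pin_conflicts(before))
print(find_pin_conflicts(after))
```

With UART0 enabled, MPP 10 is claimed twice; switching to UART1 leaves power-hdd as the only user of that pin.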

After making this change, then generating and installing a new "debian_bookworm_armel.img", the boot process no longer has the hard drive resetting, and dmesg output no longer contains any warnings. The NAS reliably boots from cold in about 65 seconds. The dmesg logs also show the hard drive consistently using UDMA/133.

The above is probably not the best way to resolve the problem, but I'm happy with it so far. I've learnt a lot about the Linux boot process, device trees, and Marvell Kirkwood processors, which has kept me entertained.

Anyway, thanks for Debian_on_Buffalo.

(I would do a pull request, but don't really know how!)

1000001101000 commented 1 month ago

Sounds like solid work to me.

Might explain why I’ve not seen that with mine since I typically test with really low power ssds.

I’ll see if I can repeat your findings.

1000001101000 commented 1 month ago

I was able to confirm making that change for the UART made the error about MPP10 go away. I went ahead and updated the repo version right away.

The new dtb can be installed as follows:

copy it to /etc/flash-kernel/dtbs/
run flash-kernel to generate new boot files
reboot
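The three steps can be scripted roughly like this. It is only a sketch: the dtb file name in the usage note is an assumption derived from the .dts name mentioned earlier, the function name is made up, and it defaults to a dry run that just returns the commands. Running it for real needs root.

```python
import subprocess
from pathlib import Path

DTB_DIR = Path("/etc/flash-kernel/dtbs")  # directory named in the steps above


def install_dtb(dtb_path, dry_run=True):
    """Copy a dtb into flash-kernel's dtbs directory, regenerate boot files, reboot."""
    dtb_path = Path(dtb_path)
    steps = [
        ["cp", str(dtb_path), str(DTB_DIR / dtb_path.name)],
        ["flash-kernel"],
        ["reboot"],
    ]
    if not dry_run:
        DTB_DIR.mkdir(parents=True, exist_ok=True)
        for cmd in steps:
            subprocess.run(cmd, check=True)  # stop at the first failing step
    return steps
```

Calling `install_dtb("kirkwood-lschlv2.dtb")` only lists the commands; pass `dry_run=False` once you are happy with them.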

I haven't verified the serial console works, trying that will be a task for another day. https://web.archive.org/web/20160829014742/http://buffalo.nas-central.org/wiki/Serial_and_JTAG_port_LS-XHL

rems28 commented 1 month ago

Hello, do you think this applies to the ls-xhl too?

1000001101000 commented 1 month ago

Almost certainly.

Could you verify if you're getting that same MPP10 message in dmesg?

rems28 commented 1 month ago

I do not have a cable for serial debugging, but on a blank hard drive I've tested the change in the ls-xhl.dts file from the kernel.org source, and the drive works like a charm on bullseye now. Before, it was impossible to use that Debian version after an upgrade, and impossible to install from scratch with the Debian installer. So I think that pjt-15e has certainly found the solution to the issue. I will now try an upgrade to bookworm and report the status tomorrow.

cYrAx157 commented 1 month ago

Wow, nice find!! It works like a charm on my lschl-v2 too! Thanks for that. That patch should be pushed to Debian's repo too!

rems28 commented 1 month ago

The update to Debian 12 went well. I do not see any problems at the moment. Would it be a better idea to send a patch directly to the kernel team?

1000001101000 commented 1 month ago

I’ve updated the ls-xhl dtb with the same change and generated new installer images.

1000001101000 / Debian_on_Buffalo

LSCHLv2: ata driver is hard resetting the SATA-Link on boot #203