HelmSecure / armbian-images

GNU General Public License v2.0

armbian hangs under heavy disk io. #10

Closed ivahos closed 1 year ago

ivahos commented 1 year ago

Hi, I have reflashed my V2A Helm, and I have been able to reformat the internal NVMe disk as well. However, the system seems to hang after some time of heavy disk I/O activity with an SMP exception. I have moved my email to a Docker setup, and the system hangs every time I try to copy the emails back from the VM that has been running my email since I migrated it off the Helm server.

For me, it’s straightforward to reproduce the problem. Run the following command:

dd if=/dev/zero of=/nvme_mountpoint/data.dd bs=1M count=8000

This command never completes for me. The helm just goes unresponsive after around 3GB written.
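A sketch of the same reproduction that makes the hang point easier to see (the target path is the mount point from the report above; watching /proc/meminfo in a second shell is my addition, not part of the original report):

```shell
# Same dd reproduction, with a configurable target and progress output:
TARGET="${TARGET:-/nvme_mountpoint/data.dd}"
dd if=/dev/zero of="$TARGET" bs=1M count=8000 status=progress
# In a second shell, watch the page cache; on an affected unit the Dirty:
# figure climbs steadily until the box stops responding:
grep -E '^(Dirty|Writeback):' /proc/meminfo
```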

dsigurds commented 1 year ago

Thanks for reporting. I'll investigate and see if I can reproduce the issue.

mongobit commented 1 year ago

mind if i ask where you got the instructions for that? i can see the 7gigs of emmc memory but not seeing my 1TB M.2 yet.

ivahos commented 1 year ago

> mind if i ask where you got the instructions for that? i can see the 7gigs of emmc memory but not seeing my 1TB M.2 yet.

The NVMe is /dev/nvme0n1. Just run fdisk --wipe always /dev/nvme0n1 to create a partition on it.

prescottk commented 1 year ago

I can totally reproduce the issue simply by following the same instructions in your email using dd. This essentially makes the 1T disk unusable.

dsigurds commented 1 year ago

> I can totally reproduce the issue simply by following the same instructions in your email using dd. This essentially makes the 1T disk unusable.

Is your Helm a v2a as well?

prescottk commented 1 year ago

Yes it is.


rpooley commented 1 year ago

I too have experienced this issue when trying to clone mmcblk2 to mmcblk1 using dd. The system hangs and needs to be rebooted; attempting a new SSH connection fails.

l3nticular commented 1 year ago

Tangentially, you probably want an LVM volume on your NVMe. The Helm software used one, so you can just clear it out instead of using fdisk:

apt install lvm2
lvdisplay   (then lvremove the listed logical volumes)
vgdisplay   (then vgremove the listed volume groups)
pvremove /dev/nvme0n1

Then you can do pvcreate, vgcreate, lvcreate.
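A sketch of that pvcreate/vgcreate/lvcreate sequence — the volume group and logical volume names here are made up for illustration, not Helm's originals:

```shell
pvcreate /dev/nvme0n1                       # mark the disk as an LVM physical volume
vgcreate data_vg /dev/nvme0n1               # volume group on top of it
lvcreate -l 100%FREE -n data_lv data_vg     # one logical volume using all space
mkfs.ext4 /dev/data_vg/data_lv
mount /dev/data_vg/data_lv /nvme_mountpoint
```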

jimminycreeket commented 1 year ago

How did you get Docker functioning?

gomsec commented 1 year ago

Are there any extra steps I need to do for armbian to see my 1TB drive?

gomsec commented 1 year ago

I ended up reseating the drive and now my system is able to see the drive.

elyk53 commented 1 year ago

I've got another instance of this issue, also Helm v2a, 512GB SSD. I've left shell windows open monitoring journalctl -f and dmesg -w and nothing stands out in the final log lines as being related in any way to the cause of the hang.

If there's any other troubleshooting that I can perform to help get to the bottom of this, let me know.

elyk53 commented 1 year ago

In an attempt to identify workarounds, I ran the dd experiment with two additional targets:

  1. A different SSD in an external case, connected to the right-most (when looking at the device from the back) USB-C port (the drive did not enumerate when plugged in to the USB-C port closest to the ethernet port).
  2. A network share hosted on a different computer.

Both triggered the same hanging behavior.

This seems to indicate that the problem is related to the existence of high I/O load in any form, rather than being linked to something about the particular interface used by the internal SSD.

ivahos commented 1 year ago

> This seems to indicate that the problem is related to the existence of high I/O load in any form, rather than being linked to something about the particular interface used by the internal SSD.

I suspect this is the root cause of the strange hangs we had while Helm was solvent and running their own image. Whenever the machine did something like indexing a large folder, or anything else that generated large amounts of I/O, it had a chance of hanging.

With the staff having gone missing, this leaves our Helms in a bad place. As it stands, the machine is only useful for CPU-intensive tasks that do little I/O.

l3nticular commented 1 year ago

Agreed. One thought floating through my head: is there a FUSE file system that can virtualize and artificially slow down I/O operations?

dsigurds commented 1 year ago

The original Helm OS created an LVM volume on the SSD and then overlaid it with dm-crypt to create an encrypted volume. We configured it as an LVM volume because we anticipated expanding the /data partition with external drives connected over USB. This definitely slowed down I/O. However, I've gone back to our original OS, removed those multiple layers of indirection to see if I can reproduce the crash, and am unable to. This most likely means there is an issue with the Armbian kernel build. The kernel source used for the original Helm OS diverges from the source used by Armbian Rockchip devices. It will likely be difficult to track down, but I'm working on it. Sorry for the issue.

kkanatcit commented 1 year ago

To at least be able to use the SSDs in our Helms and get around the I/O hang: I took my SSD out and formatted it on another Linux box. After putting it back in, I've been able to mount it and use it normally just fine.

l3nticular commented 1 year ago

@kkanatcit you will probably hit the hang at some point. Just be ready to power cycle it.

@dsigurds it also hangs with a dm-crypt volume on LVM. So it seems likely to be related to the armbian image in general since it also fails on a network share.

l3nticular commented 1 year ago

(LVM instructions are here, btw: https://reddit.com/r/HelmServerMods/comments/10pfcdf/setting_up_ssd/)

I didn’t have any issues setting up LVM or formatting; the hang only happened when doing huge data copies like that.

l3nticular commented 1 year ago

It’s interesting that it always fails around 3GB, when the free RAM on a base install is also about 3GB. Perhaps the file system caching can’t keep up and exhausts the RAM?
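One way to test that cache-exhaustion theory (the thresholds below are guesses of mine, and setting them needs root): cap how much dirty data the kernel will buffer before forcing writeback, then rerun the dd reproduction.

```shell
# Force writeback after 64 MiB of dirty pages, start background flushing at 16 MiB:
sysctl vm.dirty_bytes=$((64 * 1024 * 1024))
sysctl vm.dirty_background_bytes=$((16 * 1024 * 1024))
# If the hang is cache-related, the dd run should survive much longer
# (or complete) with these caps in place.
```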

l3nticular commented 1 year ago

It’s definitely related to the file system cache. Adding a flag to dd to bypass the cache works:

root@helm:~# dd if=/dev/zero of=/mnt/test/data4.dd bs=1M count=8000 oflag=direct
8000+0 records in
8000+0 records out
8388608000 bytes (8.4 GB, 7.8 GiB) copied, 23.5282 s, 357 MB/s

Note the oflag=direct.

l3nticular commented 1 year ago

Setting the file system mount as “sync” does not fix the problem. The cache still grows during the copy without the direct flag.

[screenshots attached: bad vs. good cache growth]

dsigurds commented 1 year ago

The issue corresponds to the version of u-boot used. I'll be releasing a new build with a fix relatively soon. Thanks for all the reporting on it and work in trying to diagnose the issue.

dsigurds commented 1 year ago

This issue has been resolved with this image: https://github.com/HelmSecure/armbian-images/releases/download/v22.11.2-build-48/Armbian_22.11.2-build-48_Helm-v2a_bullseye_legacy_4.4.213_minimal.img

Unfortunately, the package update doesn't update u-boot, so a full image reflash is required to get the fix. Make sure to back up any data so you can restore it after the reflash.
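A minimal backup-and-restore sketch around the reflash — the backup host name and paths are placeholders, not from this thread:

```shell
# Before the reflash, copy the data partition off the Helm:
rsync -aHx /nvme_mountpoint/ backuphost:/backups/helm-data/
# ...reflash with the linked image, repartition and mount the SSD, then:
rsync -aHx backuphost:/backups/helm-data/ /nvme_mountpoint/
```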

krioso commented 1 year ago

Is there a reason the 4.4 kernel is being used? Is there, or will there be, a way to use a newer kernel? The 4.4 kernel is past end of life.

elyk53 commented 1 year ago

Reflashed and confirmed that this fixes the disk issues. Thanks for the fix!