Chia-Network / chia-blockchain

Chia blockchain python implementation (full node, farmer, harvester, timelord, and wallet)
Apache License 2.0

[BUG] Persistent Hard Crashing with XFS Temp Drive in MDADM RAID0 #6350

Closed andrewseid closed 3 years ago

andrewseid commented 3 years ago

Describe the bug
XFS is known to be the fastest filesystem for plotting, and testing bears this out. However, it also seems to result in reliable hard crashes on Ubuntu, usually within the first day of plotting. I have experienced this issue about 15 times, on a mix of Ubuntu GUI 20.04, Ubuntu GUI 21.04, and Ubuntu Server 21.04, across three different systems: two AMD builds (3960X and 3990X) and one Intel build (i7-11700K). All systems have been using between two and four Samsung 980 Pro NVMe drives in MDADM RAID0.

The issue seems to go away when I format the temp drive RAID0 array with ext4.

To Reproduce

  1. Create an XFS MDADM RAID0 array on Ubuntu 20.04 or 21.04 (GUI or server, doesn't matter), using 2-4 NVMe drives (in my case, Samsung 980 Pro 2TB, running on PCIe Gen 4); see the command sketch below this list.
  2. Start 10+ plotting queues with -n 5 -r, depending on system specs.
  3. Let system plot for 24-48 hours.
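
For step 1, a minimal command sketch; the device names, mount point, and plotting parameters below are examples rather than values taken from the report:

# Build the MDADM RAID0 array (example devices; add more as available)
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 /dev/nvme0n1 /dev/nvme1n1
sudo mkfs.xfs /dev/md0 -L plot-temp
sudo mkdir -p /mnt/plot-temp
sudo mount -L plot-temp /mnt/plot-temp

# One plotting queue as in step 2 (-r thread count is system-dependent; destination is an example)
chia plots create -k 32 -n 5 -r 4 -t /mnt/plot-temp -d /path/to/farm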

Expected behavior
Observe eventual hard crash.

Screenshots
On Ubuntu GUI, the desktop just completely freezes wherever it is. On Ubuntu Server, I got this: IMG_8562

Desktop:

Additional context
Random theory that you can feel free to ignore: since this is an extremely high-performance setup, maybe it's hitting some kind of performance threshold or race condition during plotting? Or maybe it's something else entirely XD Thank you!

Bigman397 commented 3 years ago

Try EXT4 instead of xfs... I personally switched as XFS was performing oddly for me on mdraid.

edit: and of course you tried that and I missed it, sorry =)

andrewseid commented 3 years ago

> Try EXT4 instead of xfs... I personally switched as XFS was performing oddly for me on mdraid.
>
> edit: and of course you tried that and I missed it, sorry =)

No problem! I'm indeed now using ext4, and it's proven quite stable. Just a shame to miss out on the ~7% performance improvement of XFS.

Others (JM/Quindor) seem to be using XFS MDRAID0 without issue, so I'm not sure what's going on with our systems. Cheers!

xorinox commented 3 years ago

I know LVM raid-0 uses MD under the hood, but I would still give it a try: using LVM might configure the array differently, and XFS reading the RAID geometry via LVM might also lead to a different configuration. Assuming 2 drives:

wipefs -a /dev/nvme0n1
wipefs -a /dev/nvme1n1
vgcreate nvme_scratch /dev/nvme0n1 /dev/nvme1n1
lvcreate --type raid0 --stripes 2 --stripesize 256 -l 100%free -n plots nvme_scratch
mkfs.xfs /dev/nvme_scratch/plots -L scratch01
mount -L scratch01 /mnt

A more extreme option: try the latest kernel.
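
For anyone trying this, note that an LVM raid0 LV is managed through device-mapper, not /dev/mdX, so it will not show up in cat /proc/mdstat. A quick way to sanity-check the resulting layout (using the volume names from the example above) is:

# Show segment type, stripe count and underlying devices of the new LV
lvs -a -o +segtype,stripes,devices nvme_scratch

# Confirm both physical volumes joined the volume group
pvs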

andrewseid commented 3 years ago

@xorinox Thank you for the detailed instructions!

I will give this a try and let you know how it goes :D

dyn4mic commented 3 years ago

@andrewseid did you have success with LVM RAID 0? I ran into the same issue: my madmax RAID 0 with XFS runs into a kernel panic ("not syncing"). Same disks with hpool or ext4: no issues. I run Ubuntu Server 21.04 as well.

pwntr commented 3 years ago

Same issue here with xfs after just a few minutes of plotting (full freeze, needed to hard-reset, but no entries in the logs/journal). A switch to ext4 seemed promising at first, but after ~23 hours of non-stop plotting with the madmax plotter, the same issue appeared. Same with the base chia plotter, albeit at a later stage.

What provided final salvation was disabling continuous trim on my raid 0 temp drives! I'm currently running the lvm based raid with xfs setup proposed by @xorinox, and after over 36 hours without a single crash/freeze, I'm somewhat confident to say that continuous trim might be the culprit rather than the filesystem (as his steps mount the array without continuous trim enabled, and I did not explicitly enable continuous trim in /etc/fstab via the discard option or in the lvm config).

To everyone who experiences the freezes/crashes with a raid 0 setup for the temp drives: did you try to run your plotting without continuous trim enabled on the array yet (e.g. no discard mount option in /etc/fstab)?
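
For reference, a minimal sketch of a no-continuous-trim setup, relying on periodic trim instead; the label, mount point, and options below are examples:

# /etc/fstab entry without the "discard" option
LABEL=scratch01  /mnt/plot-temp  xfs  defaults,noatime  0  0

# Ubuntu ships a weekly fstrim timer for periodic trim
sudo systemctl enable --now fstrim.timer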

I'll try a vanilla mdadm raid 0 with xfs and without the lvm layer again soon just to cross-check and further isolate the issue. My best bet is still on disabling continuous trim though. There's a bunch of issues around this documented in the kernel mailing lists and even in the source itself for different drives and configs, so it wouldn't surprise me if this were the culprit of the iffiness we see with a raid 0 and hammering IO to those poor drives.

Also just for completeness: plotting on single drives without raid 0 was stable at all times, regardless of whether I was using xfs or ext4 as the fs, and regardless of whether continuous trim was enabled on them or not. The single drive setup was just a fair bit slower than the raid 0 for my particular drives.

System config:

DrShotsPHD commented 3 years ago

@pwntr, I tried xfs on mdadm with discard unset and the system still crashed FYI.

dyn4mic commented 3 years ago

I can confirm that @xorinox's solution worked fine; whatever Ubuntu's LVM does under the hood, it allowed me to have an 8-disk RAID 0 with XFS.

ramin-afshar commented 3 years ago

I have exactly the same problem. Crashes usually happen after a couple of minutes into a plot. I used to have the drives mounted without the discard option and my raid still crashed. Might give ext4 a go. Or maybe btrfs and see how that goes.

Update:

I tried the commands provided by @xorinox to see if I could create a stable XFS filesystem this way, but unfortunately the mkfs.xfs command never completes after creating the volumes with LVM; the process just hangs there. I even waited an hour to be sure. I can format the volume as ext4 or btrfs without any issues, and I can format a single non-RAID drive as XFS. So there must be some kind of bug, I guess. I'm using Ubuntu 20.04 with kernel 5.8.0-55-generic.

Update 2:

I managed to format the volume by adding the following options, and I've already completed a couple of plots without any crashes:

sudo mkfs.xfs -f -d su=512k,sw=2 /dev/mp510_raid/raid0

xorinox commented 3 years ago

@ramin-afshar is it hanging there discarding blocks? Run it in a screen session or in the background until it's finished; it can take some time (up to hours).
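
If it is the initial discard pass that hangs, mkfs.xfs can be told to skip discarding blocks at format time with -K; a sketch, assuming the LV path from the earlier example:

# -K: do not attempt to discard blocks at mkfs time
mkfs.xfs -K /dev/nvme_scratch/plots -L scratch01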

Spektre99 commented 3 years ago

How does one break apart a RAID created with vgcreate and lvcreate as given above?

Nothing shows when looking at cat /proc/mdstat despite having made a RAID0 array using xorinox's post as a guide above.

ramin-afshar commented 3 years ago

How does one break apart a RAID created with vgcreate and lvcreate as given above?

lvremove

https://linux.die.net/man/8/lvremove

ramin-afshar commented 3 years ago

> @ramin-afshar is it hanging there discarding blocks? Run it in a screen session or in the background until it's finished; it can take some time (up to hours).

I see this sample output when I execute the mkfs command in the terminal, but it never finishes. When I use the options I mentioned in my previous message, I don't have this problem.

meta-data=/dev/xvdf              isize=512    agcount=4, agsize=26214400 blks
         =                       sectsz=512   attr=2, projid32bit=1
         =                       crc=1        finobt=1, sparse=1, rmapbt=0
         =                       reflink=0
data     =                       bsize=4096   blocks=104857600, imaxpct=25
         =                       sunit=0      swidth=0 blks
naming   =version 2              bsize=4096   ascii-ci=0, ftype=1
log      =internal log           bsize=4096   blocks=51200, version=2
         =                       sectsz=512   sunit=0 blks, lazy-count=1
realtime =none                   extsz=4096   blocks=0, rtextents=0

andrewseid commented 3 years ago

@xorinox Thank you again for the detailed instructions on trying LVM. I did eventually get around to trying RAID0 + XFS + LVM as you suggested, but unfortunately, the crashing issue persisted.

I've also tried several different configurations (Discard On/Off, Clear Linux, MadMax), and those crash as well.

It seems likely the issue is more fundamental, as @pwntr noted:

> There's a bunch of issues around this in the kernel mailing lists...

My conclusion is that we must currently choose between:

1) Single drive XFS (highest performance)
2) RAID0 w/ BTRFS or EXT4 (slightly lower performance, but more drives = more capacity)

The only thing I didn't try was using a kernel newer than the default kernels in Ubuntu Server 21.04 and Clear Linux Server 34700.

Would still love to hear from anyone with additional and/or different conclusions!

Cheers!

Jericon commented 3 years ago

I’m having the same issue. Today I converted my NVME raid 1 that I was using for plotting to raid 0. Almost immediately I started getting crashes.

Probably going to switch to single drives and just run two plotters against them.

I wonder, is everyone who has this problem running NVME drives in raid 0? I feel like that might be the issue.

xorinox commented 3 years ago

> How does one break apart a RAID created with vgcreate and lvcreate as given above?
>
> Nothing shows when looking at cat /proc/mdstat despite having made a RAID0 array using xorinox's post as a guide above.

As for my brief example:

umount /mnt
vgremove nvme_scratch
wipefs -a /dev/nvme0n1
wipefs -a /dev/nvme1n1

xorinox commented 3 years ago

> I’m having the same issue. Today I converted my NVME raid 1 that I was using for plotting to raid 0. Almost immediately I started getting crashes.
>
> Probably going to switch to single drives and just run two plotters against them.
>
> I wonder, is everyone who has this problem running NVME drives in raid 0? I feel like that might be the issue.

It would be interesting to know whether, with my brief example, you still experience these crashes if you use EXT4 instead of XFS.

Jericon commented 3 years ago

> It would be interesting to know whether, with my brief example, you still experience these crashes if you use EXT4 instead of XFS.

Machine 1 has been running MD RAID 0 with 2x NVMe and XFS: experiences random crashes.
Machine 2 has been running MD RAID 1 with 2x NVMe and XFS: no crashes.
Converted Machine 2 to MD RAID 0 and XFS: immediately began crashing.
Changed Machine 2 to MD RAID 0 and EXT4: no crashes since.

Also converted Machine 1 to EXT4, no crashes since.

ShortRouter commented 3 years ago

For me the solution from @xorinox fixed the kernel panics on Ubuntu Server 20.04: MDADM RAID0 --> LVM RAID0. Still XFS and still mounted with the discard option, so it was MDADM. Many thanks.

malventano commented 3 years ago

Was anyone able to capture the kernel panic details? This is likely something that should be forwarded on to the MDADM and/or XFS maintainers, as one of the two is triggering these panics under the heavy IO load issued by madmax.
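
For whoever hits the next freeze, a sketch of how a panic could be captured on Ubuntu (stock packages; kdump needs a reboot to reserve crashkernel memory before it is active):

# Enable crash dumps via Ubuntu's kdump tooling
sudo apt install linux-crashdump
sudo kdump-config show   # should report "ready to kdump" after the reboot

# Keep the journal persistent so messages survive a hard reset
sudo mkdir -p /var/log/journal
sudo systemctl restart systemd-journald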

ShortRouter commented 3 years ago

@malventano As I pointed out in my comment, the only variable I changed was MDADM. Using LVM + XFS is rock solid, no issues. So it must be a problem with MDADM.

malventano commented 3 years ago

> @malventano As I pointed out in my comment, the only variable I changed was MDADM. Using LVM + XFS is rock solid, no issues. So it must be a problem with MDADM.

Understood, but it could still be an issue on the XFS side that is not playing nicely with MDADM. As it is, XFS doesn't like MDADM's default chunk size; I've tried custom chunk sizes that XFS agrees with better, but the issue persists.
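
For anyone who wants to compare, a sketch of an MD RAID0 with an explicit chunk size and a matching XFS stripe geometry; the device names and the 512 KiB chunk are examples:

# RAID0 with an explicit 512 KiB chunk
sudo mdadm --create /dev/md0 --level=0 --raid-devices=2 --chunk=512 /dev/nvme0n1 /dev/nvme1n1

# Match the geometry in XFS: stripe unit = chunk size, stripe width = number of drives
sudo mkfs.xfs -f -d su=512k,sw=2 /dev/md0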

malventano commented 3 years ago

Hmm:

root@chia:/home/chia# wipefs /dev/nvme[0,2]n1 -a
/dev/nvme0n1: 8 bytes were erased at offset 0x00000218 (LVM2_member): 4c 56 4d 32 20 30 30 31
/dev/nvme2n1: 8 bytes were erased at offset 0x00000218 (LVM2_member): 4c 56 4d 32 20 30 30 31
root@chia:/home/chia# vgcreate scratch /dev/nvme0n1 /dev/nvme2n1
  Physical volume "/dev/nvme0n1" successfully created.
  Physical volume "/dev/nvme2n1" successfully created.
  Volume group "scratch" successfully created
root@chia:/home/chia# lvcreate --type raid0 --stripes 2 --stripesize 256 -l 100%free -n temp scratch
  Volume group "scratch" not found
  Cannot process volume group scratch
root@chia:/home/chia# vgcreate scratch /dev/nvme0n1 /dev/nvme2n1
  Physical volume '/dev/nvme0n1' is already in volume group 'scratch'
  Unable to add physical volume '/dev/nvme0n1' to volume group 'scratch'
  /dev/nvme0n1: physical volume not initialized.
  Physical volume '/dev/nvme2n1' is already in volume group 'scratch'
  Unable to add physical volume '/dev/nvme2n1' to volume group 'scratch'
  /dev/nvme2n1: physical volume not initialized.
root@chia:/home/chia#

...well that's a new one.

edit 1 - Tried all sorts of clearing of superblocks and wipefs, but ultimately it would only work after a reboot.

edit 2 - Perf appears identical to MD, except that even when mounted with discard set, iostat shows deletions are not being discarded (but fstrim does pass). I then reverted back to the MD array without discard set and reproduced the hang. As another data point, the same MD array setup on a slower (18 core) system does not hang, but a 38 core system does, so it appears somewhat IO-loading sensitive. Confirmed mount options were identical between the stable LVM config and the unstable MDADM config. Todo: will reconfirm mkfs.xfs output is identical on both configs.

edit 3 - Confirmed mkfs.xfs stats are identical for both configs (minus a few more available blocks for the LVM config).
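
A couple of commands that may help cross-check the discard behaviour described above (device paths are examples; substitute your array and LV):

# Does the device advertise discard support? (non-zero DISC-GRAN/DISC-MAX columns)
lsblk -D /dev/md0 /dev/nvme_scratch/plots

# Which options is the temp filesystem actually mounted with?
findmnt -t xfs -o TARGET,SOURCE,OPTIONS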