I'm going to try out the IO Crest 4-port SATA adapter.
It has arrived!
And... I just realized I have no SATA power supply cable, just the data cable. So I'll have to wait for one of those to come in before I can actually test one of my SATA drives.
First light is good:
$ lspci
01:00.0 SATA controller: Marvell Technology Group Ltd. Device 9215 (rev 11) (prog-if 01 [AHCI 1.0])
Subsystem: Marvell Technology Group Ltd. Device 9215
Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Interrupt: pin A routed to IRQ 0
Region 0: I/O ports at 0000
Region 1: I/O ports at 0000
Region 2: I/O ports at 0000
Region 3: I/O ports at 0000
Region 4: I/O ports at 0000
Region 5: Memory at 600040000 (32-bit, non-prefetchable) [size=2K]
Expansion ROM at 600000000 [size=256K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit-
Address: 00000000 Data: 0000
Capabilities: [70] Express (v2) Legacy Endpoint, MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 <8us
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 128 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-, OBFF Not Supported
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance De-emphasis: -6dB
LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
Capabilities: [e0] SATA HBA v0.0 BAR4 Offset=00000004
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
Though dmesg shows that it's hitting the default BAR address space limits again:
[ 0.925795] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[ 0.925818] brcm-pcie fd500000.pcie: No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[ 0.925884] brcm-pcie fd500000.pcie: MEM 0x0600000000..0x0603ffffff -> 0x00f8000000
[ 0.925948] brcm-pcie fd500000.pcie: IB MEM 0x0000000000..0x00ffffffff -> 0x0100000000
[ 0.953526] brcm-pcie fd500000.pcie: link up, 5 GT/s x1 (SSC)
[ 0.953827] brcm-pcie fd500000.pcie: PCI host bridge to bus 0000:00
[ 0.953844] pci_bus 0000:00: root bus resource [bus 00-ff]
[ 0.953866] pci_bus 0000:00: root bus resource [mem 0x600000000-0x603ffffff] (bus address [0xf8000000-0xfbffffff])
[ 0.953933] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
[ 0.954172] pci 0000:00:00.0: PME# supported from D0 D3hot
[ 0.957560] PCI: bus0: Fast back to back transfers disabled
[ 0.957582] pci 0000:00:00.0: bridge configuration invalid ([bus ff-ff]), reconfiguring
[ 0.957802] pci 0000:01:00.0: [1b4b:9215] type 00 class 0x010601
[ 0.957874] pci 0000:01:00.0: reg 0x10: [io 0x8000-0x8007]
[ 0.957911] pci 0000:01:00.0: reg 0x14: [io 0x8040-0x8043]
[ 0.957947] pci 0000:01:00.0: reg 0x18: [io 0x8100-0x8107]
[ 0.957984] pci 0000:01:00.0: reg 0x1c: [io 0x8140-0x8143]
[ 0.958021] pci 0000:01:00.0: reg 0x20: [io 0x800000-0x80001f]
[ 0.958058] pci 0000:01:00.0: reg 0x24: [mem 0x00900000-0x009007ff]
[ 0.958095] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0003ffff pref]
[ 0.958262] pci 0000:01:00.0: PME# supported from D3hot
[ 0.961586] PCI: bus1: Fast back to back transfers disabled
[ 0.961605] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[ 0.961674] pci 0000:00:00.0: BAR 8: assigned [mem 0x600000000-0x6000fffff]
[ 0.961698] pci 0000:01:00.0: BAR 6: assigned [mem 0x600000000-0x60003ffff pref]
[ 0.961722] pci 0000:01:00.0: BAR 5: assigned [mem 0x600040000-0x6000407ff]
[ 0.961744] pci 0000:01:00.0: BAR 4: no space for [io size 0x0020]
[ 0.961759] pci 0000:01:00.0: BAR 4: failed to assign [io size 0x0020]
[ 0.961774] pci 0000:01:00.0: BAR 0: no space for [io size 0x0008]
[ 0.961788] pci 0000:01:00.0: BAR 0: failed to assign [io size 0x0008]
[ 0.961803] pci 0000:01:00.0: BAR 2: no space for [io size 0x0008]
[ 0.961817] pci 0000:01:00.0: BAR 2: failed to assign [io size 0x0008]
[ 0.961831] pci 0000:01:00.0: BAR 1: no space for [io size 0x0004]
[ 0.961845] pci 0000:01:00.0: BAR 1: failed to assign [io size 0x0004]
[ 0.961860] pci 0000:01:00.0: BAR 3: no space for [io size 0x0004]
[ 0.961873] pci 0000:01:00.0: BAR 3: failed to assign [io size 0x0004]
[ 0.961891] pci 0000:00:00.0: PCI bridge to [bus 01]
[ 0.961914] pci 0000:00:00.0: bridge window [mem 0x600000000-0x6000fffff]
[ 0.962217] pcieport 0000:00:00.0: enabling device (0140 -> 0142)
[ 0.962439] pcieport 0000:00:00.0: PME: Signaling with IRQ 55
[ 0.962813] pcieport 0000:00:00.0: AER: enabled with IRQ 55
I just increased the BAR allocation following the directions in this Gist, but when I rebooted (without the card in), I got:
[ 0.926161] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[ 0.926184] brcm-pcie fd500000.pcie: No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[ 0.926247] brcm-pcie fd500000.pcie: MEM 0x0600000000..0x063fffffff -> 0x00c0000000
[ 0.926312] brcm-pcie fd500000.pcie: IB MEM 0x0000000000..0x00ffffffff -> 0x0100000000
[ 1.521386] brcm-pcie fd500000.pcie: link down
Powering off completely, then booting again, it works. So note to self: if you get a link down, try a hard power reset instead of reboot.
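For reference, the BAR increase from that Gist boils down to expanding the PCIe outbound memory window in the device tree. A rough sketch of the procedure; the filename and the example values are assumptions inferred from the dmesg output above, so double-check against the Gist itself:
# Decompile the device tree, edit it, and recompile (CM4 filename assumed).
cd /boot
sudo dtc -I dtb -O dts bcm2711-rpi-cm4.dtb -o bcm2711-rpi-cm4.dts
# In the pcie@7d500000 node, expand the outbound `ranges` window, e.g. to 1 GB at 0xc0000000:
#   ranges = <0x02000000 0x0 0xc0000000 0x6 0x00000000 0x0 0x40000000>;
sudo nano bcm2711-rpi-cm4.dts
sudo dtc -I dts -O dtb bcm2711-rpi-cm4.dts -o bcm2711-rpi-cm4.dtb
# Then reboot (or, per the note above, do a full power cycle).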
Ah... looking closer, those 'failed to assign' errors are for IO BARs, which are unsupported on the Pi.
So... I posted in the BAR space thread on the Pi Forums asking 6by9 whether they've seen the same logs and whether they can safely be ignored. Still waiting on a way to power my drive so I can do an end-to-end test :)
Something else that may be interesting is whether you can get a SAS adapter/RAID card working. I was looking into SBCs with PCIe a while back for the purpose of building a low-power/low-heat host for some SAS drives I have. (I ended up just throwing them in a computer and not running it 24/7.)
That would be an interesting thing to test, though it'll have to wait a bit as I'm trying to get through some other cards and might also test 2.5 Gbps or 5 Gbps networking if I am able to!
Without the kernel modules enabled, lsblk shows no device:
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
mmcblk0 179:0 0 29.8G 0 disk
├─mmcblk0p1 179:1 0 256M 0 part /boot
└─mmcblk0p2 179:2 0 29.6G 0 part /
Going to try adding those modules and see what happens!
# Install dependencies
sudo apt install -y git bc bison flex libssl-dev make libncurses5-dev
# Clone source
git clone --depth=1 https://github.com/raspberrypi/linux
# Apply default configuration
cd linux
export KERNEL=kernel7l # use kernel8 for 64-bit, or kernel7l for 32-bit
make bcm2711_defconfig
# Customize the .config further with menuconfig
make menuconfig
# Enable the following:
# Device Drivers:
# -> Serial ATA and Parallel ATA drivers (libata)
# -> AHCI SATA support
# -> Marvell SATA support
#
# Alternatively add the following in .config manually:
# CONFIG_ATA=m
# CONFIG_ATA_VERBOSE_ERROR=y
# CONFIG_SATA_PMP=y
# CONFIG_SATA_AHCI=m
# CONFIG_SATA_MOBILE_LPM_POLICY=0
# CONFIG_ATA_SFF=y
# CONFIG_ATA_BMDMA=y
# CONFIG_SATA_MV=m
nano .config
# (edit CONFIG_LOCALVERSION and add a suffix that helps you identify your build)
# Build the kernel and copy everything into place
make -j4 zImage modules dtbs # 'Image' on 64-bit
sudo make modules_install
sudo cp arch/arm/boot/dts/*.dtb /boot/
sudo cp arch/arm/boot/dts/overlays/*.dtb* /boot/overlays/
sudo cp arch/arm/boot/dts/overlays/README /boot/overlays/
sudo cp arch/arm/boot/zImage /boot/$KERNEL.img
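After rebooting into the new kernel, a quick sanity check (assuming the AHCI and Marvell drivers were built as modules per the .config above):
# Confirm the new kernel is running (look for your CONFIG_LOCALVERSION suffix).
uname -a
# Confirm the SATA modules loaded and the controller was detected.
lsmod | grep -E 'ahci|sata_mv'
dmesg | grep -i ahci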
Yahoo, it worked!
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 1 223.6G 0 disk
├─sda1 8:1 1 256M 0 part /media/pi/boot
└─sda2 8:2 1 223.3G 0 part /media/pi/rootfs
mmcblk0 179:0 0 29.8G 0 disk
├─mmcblk0p1 179:1 0 256M 0 part /boot
└─mmcblk0p2 179:2 0 29.6G 0 part /
Repartitioning the drive:
sudo fdisk /dev/sda
d 1 # delete partition 1
d 2 # delete partition 2
n # create new partition
p # primary (default)
1 # partition 1 (default)
2048 # First sector (default)
468862127 # Last sector (default)
w # write new partition table
Got the following:
The partition table has been altered.
Failed to remove partition 1 from system: Device or resource busy
Failed to remove partition 2 from system: Device or resource busy
Failed to add partition 1 to system: Device or resource busy
The kernel still uses the old partitions. The new table will be used at the next reboot.
Syncing disks.
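(If you'd rather not walk through fdisk interactively, here's a non-interactive sketch using parted; the device name is just an example, so triple-check it before running, since this wipes the drive:)
# Wipe any existing filesystem/partition signatures, then create one partition spanning the disk.
sudo wipefs -a /dev/sda
sudo parted -s /dev/sda mklabel msdos mkpart primary ext4 2048s 100%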
Rebooted the Pi, then:
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 1 223.6G 0 disk
└─sda1 8:1 1 223.6G 0 part
mmcblk0 179:0 0 29.8G 0 disk
├─mmcblk0p1 179:1 0 256M 0 part /boot
└─mmcblk0p2 179:2 0 29.6G 0 part /
To format the device, use mkfs:
$ sudo mkfs.ext4 /dev/sda1
mke2fs 1.44.5 (15-Dec-2018)
Discarding device blocks: done
Creating filesystem with 58607510 4k blocks and 14655488 inodes
Filesystem UUID: dd4fa95d-edbf-4696-a9e1-ddf1f17da580
Superblock backups stored on blocks:
32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208,
4096000, 7962624, 11239424, 20480000, 23887872
Allocating group tables: done
Writing inode tables: done
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done
Then mount it somewhere:
$ sudo mkdir /mnt/sata-sda
$ sudo mount /dev/sda1 /mnt/sata-sda
$ mount
...
/dev/sda1 on /mnt/sata-sda type ext4 (rw,relatime)
$ df -h
Filesystem Size Used Avail Use% Mounted on
...
/dev/sda1 220G 61M 208G 1% /mnt/sata-sda
Performance testing of the Kingston SA400S37/240G drive:
Test | Result |
---|---|
hdparm | 314.79 MB/s |
dd | 189.00 MB/s |
random 4K read | 22.98 MB/s |
random 4K write | 55.02 MB/s |
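(For context, these are the same hdparm / dd / iozone tests I've been running on other cards; roughly, they look like the sketch below, though the exact flags and file sizes here are approximations rather than my exact script.)
# Sequential read (buffered) and sequential write:
sudo hdparm -t /dev/sda1
sudo dd if=/dev/zero of=/mnt/sata-sda/test bs=1M count=1024 oflag=direct
# Random 4K read/write (iozone3 package):
sudo apt install -y iozone3
iozone -e -I -a -s 100M -r 4k -i 0 -i 2 -f /mnt/sata-sda/iozone.tmp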
Compare that to the same drive over USB 3.0 using a USB to SATA adapter:
Test | Result |
---|---|
hdparm | 296.71 MB/s |
dd | 149.00 MB/s |
random 4K read | 20.59 MB/s |
random 4K write | 28.54 MB/s |
So not a night-and-day difference like with the NVMe drives, but definitely and noticeably faster. I'm now waiting on another SSD and a power splitter to arrive so I can test multiple SATA SSDs on this card.
And someone just mentioned they have some RAID cards they'd be willing to send me. Might have to pony up for a bunch of hard drives and have my desk turn into some sort of franken-monster NAS-of-many-drives soon!
I'm curious about other OSes. Obviously, Raspbian is a good basis, but as I recall, the 64-bit Fedora build for the Pi uses its own custom kernel. I'd be interested in seeing what they've "left in" from the standard kernel config.
I'm looking forward to picking one of these up in a month or so when they become available to the public, then I'll give it a try!
Side note for your list page: could you include PCI IDs as well as just the brand names of the cards? It would help avoid confusion where cards have multiple revisions, and help non-US users identify comparable cards in their own markets.
Great work in the meantime! :+1:
It would be great to test a RAID card based on the Marvell 88SE9128 chipset, because it is used by many vendors.
Trying again today (but cross-compiling this time, since it's oh-so-much faster) now that I have two drives and the appropriate power adapters. I'm planning on just testing a file copy between the drives for now; I'll get into other tests later.
Hmm... putting this on pause. My cross compilation is not dropping in the AHCI module for some reason, probably a bad .config :/
Also, the adapter gets hot after prolonged use.
(For anyone interested in testing on an LSI/IBM SAS card, check out https://github.com/geerlingguy/raspberry-pi-pcie-devices/issues/18)
My desk is becoming a war zone:
Plan is to set up a RAID (probably either 0 if I feel more YOLO-y or 1/10 if I'm more stable-minded) with either 2 or 4 drives, using mdadm.
I was having trouble with the SAS card, not sure if the cards are bad or they just don't work at all with the Pi :(
Testing also with an NVMe using the IO Crest PCIe switch:
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 1 223.6G 0 disk
sdb 8:16 1 223.6G 0 disk
└─sdb1 8:17 1 223.6G 0 part
mmcblk0 179:0 0 29.8G 0 disk
├─mmcblk0p1 179:1 0 256M 0 part /boot
└─mmcblk0p2 179:2 0 29.6G 0 part /
nvme0n1 259:0 0 232.9G 0 disk
I'll post some benchmarks copying files between one of the SSDs and the NVMe; will be interesting to see how many MB/sec they can pump through the switch.
For a direct file copy from one drive to another:
# fallocate -l 10G /mnt/nvme/test.img
# pv /mnt/nvme/test.img > /mnt/sata-sda/test.img
I got an average of 190 MiB/sec, or about 1.52 Gbps. So two-way, that's 3.04 Gbps (under the 3.2 Gbps I was hoping for, but that's maybe down to PCIe switching overhead?).
It looks like CPU usage goes to 99%, with sda alone taking more than 50%; see the atop results during a copy:
Also comparing raw disk speeds through the PCIe switch:
Test | Result |
---|---|
hdparm | 364.23 MB/s |
dd | 148.00 MB/s |
random 4K read | 28.89 MB/s |
random 4K write | 58.01 MB/s |
Test | Result |
---|---|
hdparm | 363.81 MB/s |
dd | 166.00 MB/s |
random 4K read | 46.50 MB/s |
random 4K write | 75.41 MB/s |
These were on 64-bit Pi OS... so the numbers are a little higher than the 32-bit Pi OS results from earlier in the thread. But the good news is the PCIe switching seems to not cause any major performance penalty.
Software RAID0 testing using mdadm
:
# Install mdadm.
sudo apt install -y mdadm
# Create a RAID0 array using sda1 and sdb1.
sudo mdadm --create --verbose /dev/md0 --level=0 --raid-devices=2 /dev/sd[a-b]1
# Create a mount point for the new RAID device.
sudo mkdir /mnt/raid0
# Format the RAID device.
sudo mkfs.ext4 /dev/md0
# Mount the RAID device.
sudo mount /dev/md0 /mnt/raid0
Benchmarking the device:
Test | Result |
---|---|
hdparm | 293.35 MB/s |
dd | 168.00 MB/s |
random 4K read | 24.96 MB/s |
random 4K write | 52.26 MB/s |
And during the 4K tests in iozone, I can see the sda/sdb devices are basically getting the same bottlenecks, except with a tiny bit of extra overhead from software-based RAID control:
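(I was watching this in atop; if you want something more scriptable, iostat from the sysstat package gives a similar per-device view. A sketch:)
sudo apt install -y sysstat
# Extended per-device stats in MB, refreshed every 2 seconds.
iostat -xm 2 sda sdb md0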
Then to stop and remove the RAID0 array:
sudo umount /mnt/raid0
sudo mdadm --stop /dev/md0
sudo mdadm --zero-superblock /dev/sd[a-b]1
sudo mdadm --remove /dev/md0
Software RAID1 (mirrored) testing using mdadm
:
# Install mdadm.
sudo apt install -y mdadm
# Create a RAID1 array using sda1 and sdb1.
sudo mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/sd[a-b]1
# Create a mount point for the new RAID device.
sudo mkdir /mnt/raid1
# Format the RAID device.
sudo mkfs.ext4 /dev/md0
# Mount the RAID device.
sudo mount /dev/md0 /mnt/raid1
And if you want the RAID device to be persistent:
# Add the following line to the bottom of /etc/fstab:
/dev/md0 /mnt/raid1/ ext4 defaults,noatime 0 1
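(Optionally, the fstab entry can reference the filesystem UUID instead of /dev/md0; a sketch, with the UUID being a placeholder you'd get from blkid:)
# Find the filesystem UUID:
sudo blkid /dev/md0
# Then use it in /etc/fstab instead of the device path:
# UUID=<uuid-from-blkid> /mnt/raid1 ext4 defaults,noatime 0 1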
Configure mdadm to start the RAID at boot:
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf
And check on the health of the array:
sudo mdadm --detail /dev/md0
Thanks to The MagPi for their article Build a Raspberry Pi NAS.
Benchmarking the device:
Test | Result |
---|---|
hdparm | 304.63 MB/s |
dd | 114.00 MB/s |
random 4K read | 4.83 MB/s |
random 4K write | 8.43 MB/s |
While it was doing the 4K testing on the software RAID1 array, IO ran a bit slower (both sda/sdb were ~100% the whole time or thereabouts):
The md0_resync process seemed to be the main culprit. Mirroring drives in software RAID seems to be a fairly heavyweight operation when you're writing tons of small files. For large files it didn't seem to be nearly as much of a burden. I ran iozone with a 1024K block size and got 253.63 MB/sec read, 125.70 MB/sec write.
Even at a 128K block size, I got over 100 MB/sec read and write. It really started to slow down around 8K and even 16K block sizes (to ~20 MB/sec), before falling apart at 4K (4-8 MB/sec, as slow as a microSD card!).
Hmm... I'm seeing md0_resync continue to run for a long while after the test. So how are they getting out of sync in the first place? Maybe it is trying to sync data that was already on the drive? I thought I had reformatted them though...
Also seeing a lot in dmesg:
[ 3390.917579] cpu cpu0: dev_pm_opp_set_rate: failed to find current OPP for freq 18446744073709551604 (-34)
[ 3390.917596] raspberrypi-clk soc:firmware:clocks: Failed to change fw-clk-arm frequency: -12
And it looks like the resync is almost complete. I'll run the benchmark again afterwards.
sudo mdadm --detail /dev/md0
/dev/md0:
Version : 1.2
Creation Time : Tue Nov 10 16:25:37 2020
Raid Level : raid1
Array Size : 234297920 (223.44 GiB 239.92 GB)
Used Dev Size : 234297920 (223.44 GiB 239.92 GB)
Raid Devices : 2
Total Devices : 2
Persistence : Superblock is persistent
Intent Bitmap : Internal
Update Time : Tue Nov 10 16:45:10 2020
State : clean, resyncing
Active Devices : 2
Working Devices : 2
Failed Devices : 0
Spare Devices : 0
Consistency Policy : bitmap
Resync Status : 95% complete
Name : raspberrypi:0 (local to host raspberrypi)
UUID : 19fd4119:91925607:9b4f77f9:56c91824
Events : 494
Number Major Minor RaidDevice State
0 8 1 0 active sync /dev/sda1
1 8 17 1 active sync /dev/sdb1
It looks like the resync was the major issue—now that it's complete, numbers are looking much better:
Test | Result |
---|---|
hdparm | 351.38 MB/s |
dd | 114.00 MB/s |
random 4K read | 27.95 MB/s |
random 4K write | 43.21 MB/s |
What I'd like to test with my 4 spinning disks once I get the rest of my SATA cables in the mail today:
For the spinning disks (500GB WD5000AVDS), I partitioned, formatted, and mounted them, then I ran my benchmarking tests against them:
Test | Result |
---|---|
hdparm | 72.43 MB/s |
dd | 67.30 MB/s |
random 4K read | 0.48 MB/s |
random 4K write | 0.60 MB/s |
Sometimes you forget just how good we have it with flash memory nowadays. These drives are not a great option as boot volumes for the Pi :P
I then put two of them in a RAID0 stripe with mdadm, and ran the same test:
Test | Result |
---|---|
hdparm | 154.33 MB/s |
dd | 109.00 MB/s |
random 4K read | 0.71 MB/s |
random 4K write | 1.60 MB/s |
I also set up SMB:
# Install Samba.
sudo apt install -y samba samba-common-bin
# Create a shared directory.
sudo mkdir /mnt/raid0/shared
sudo chmod -R 777 /mnt/raid0/shared
# Add the text below to the bottom of the Samba config.
sudo nano /etc/samba/smb.conf
[shared]
path=/mnt/raid0/shared
writeable=Yes
create mask=0777
directory mask=0777
public=no
# Restart Samba daemon.
pi@raspberrypi:~ $ sudo systemctl restart smbd
# Create a Samba password for the Pi user.
pi@raspberrypi:~ $ sudo smbpasswd -a pi
# (On another computer, connect to smb://[pi ip address])
I averaged 75 MB/sec copy performance over the Pi's built-in Gigabit interface for a single large file, 55 MB/sec using rsync with a directory of medium-sized video clips.
Ouch, the initial resync is even slower on these spinny disk drives than it was on the SSDs (which, of course, are half the size in the first place, in addition to being twice as fast). 1% per minute on the sync.
Apparently you could skip the initial resync entirely with --assume-clean... but there are many caveats, and that's not really intended unless you're in a disaster recovery scenario and don't want anything to touch the drives when you initialize the RAID device.
So good to know that you should probably plan on letting your array sync up the first time you get it running.
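For completeness, the flag goes on the create command; this is a sketch only, since skipping the sync is rarely what you want:
# Example only: creates the mirror without the initial resync.
sudo mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 --assume-clean /dev/sd[a-b]1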
Hmm... now trying all four drives:
$ sudo mdadm --create --verbose /dev/md0 --level=0 --raid-devices=4 /dev/sd[a-d]1
mdadm: chunk size defaults to 512K
mdadm: Defaulting to version 1.2 metadata
mdadm: RUN_ARRAY failed: Unknown error 524
I then zeroed out the superblock:
sudo mdadm --zero-superblock /dev/sd[a-d]1
But then when I tried to create again, I got:
mdadm: super1.x cannot open /dev/sdd1: Device or resource busy
mdadm: /dev/sdd1 is not suitable for this array.
mdadm: create aborted
So I'm going to reboot and try again. Maybe I have a bad drive 😢
Debugging:
$ cat /proc/mdstat
Personalities :
md0 : inactive sdd1[3](S)
488253464 blocks super 1.2
unused devices: <none>
Trying to format it again with fdisk, I got Failed to add partition 1 to system: Invalid argument. Very odd behavior, but I'm thinking there's a good chance this drive is toast. That's what you get for buying refurbished!
No matter what I try, I keep getting mdadm: RUN_ARRAY failed: Unknown error 524 in the end.
Weird. After finding this question on Stack Exchange, I tried:
# echo 1 > /sys/module/raid0/parameters/default_layout
And this time, it works:
$ sudo mdadm --create --verbose /dev/md0 --level=0 --raid-devices=4 /dev/sd[a-d]1
mdadm: chunk size defaults to 512K
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.
We'll see how much further I can go.
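If you want that setting to survive a reboot, here's a hedged sketch, assuming raid0 is built as a module (if it were built into the kernel instead, raid0.default_layout=1 in /boot/cmdline.txt would do the same thing):
# Persist the raid0 default_layout parameter (value matches the echo above).
echo "options raid0 default_layout=1" | sudo tee /etc/modprobe.d/raid0-layout.conf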
All four drives in RAID0:
Test | Result |
---|---|
hdparm | 327.32 MB/s |
dd | 155.00 MB/s |
random 4K read | 4.46 MB/s |
random 4K write | 4.71 MB/s |
Note: The card is getting HOT:
Another fun thing I just noticed: ext4lazyinit is still running and making it so I can't unmount the volume without forcing it. If I'm going to repartition and reformat anyway, what's the point of letting it finish?
Resetting the array:
sudo umount /mnt/raid0
sudo mdadm --stop /dev/md0
sudo mdadm --zero-superblock /dev/sd[a-d]1
sudo mdadm --remove /dev/md0
Then set it to RAID 10:
# Install mdadm.
sudo apt install -y mdadm
# Create a RAID10 array using four drives.
sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/sd[a-d]1
# Create a mount point for the new RAID device.
sudo mkdir -p /mnt/raid10
# Format the RAID device.
sudo mkfs.ext4 /dev/md0
# Mount the RAID device.
sudo mount /dev/md0 /mnt/raid10
Confirm the RAID 10 array gives me roughly 1 TB of mirrored/striped storage:
$ df -h
Filesystem Size Used Avail Use% Mounted on
/dev/md0 915G 77M 869G 1% /mnt/raid1
$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 1 465.3G 0 disk
└─sda1 8:1 1 465.3G 0 part
└─md0 9:0 0 930.3G 0 raid10 /mnt/raid1
sdb 8:16 1 465.3G 0 disk
└─sdb1 8:17 1 465.3G 0 part
└─md0 9:0 0 930.3G 0 raid10 /mnt/raid1
sdc 8:32 1 465.3G 0 disk
└─sdc1 8:33 1 465.3G 0 part
└─md0 9:0 0 930.3G 0 raid10 /mnt/raid1
sdd 8:48 1 465.8G 0 disk
└─sdd1 8:49 1 465.8G 0 part
└─md0 9:0 0 930.3G 0 raid10 /mnt/raid1
mmcblk0 179:0 0 29.8G 0 disk
├─mmcblk0p1 179:1 0 256M 0 part /boot
└─mmcblk0p2 179:2 0 29.6G 0 part /
And now the great wait for the resync, watching sudo mdadm --detail /dev/md0.
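(I'm just leaving mdadm --detail running under watch; cat /proc/mdstat gives a more compact progress readout.)
watch -n 2 sudo mdadm --detail /dev/md0
cat /proc/mdstat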
Every 2.0s: sudo mdadm --detail /dev/md0 raspberrypi: Tue Nov 10 23:54:32 2020
/dev/md0:
Version : 1.2
Creation Time : Tue Nov 10 23:47:10 2020
Raid Level : raid10
Array Size : 975458304 (930.27 GiB 998.87 GB)
Used Dev Size : 487729152 (465.13 GiB 499.43 GB)
Raid Devices : 4
Total Devices : 4
Persistence : Superblock is persistent
...
Update Time : Tue Nov 10 23:54:31 2020
State : clean, resyncing
...
Resync Status : 1% complete
It took about 5 hours to do the initial resync (sheesh!), and once that was done, I ran the benchmarks again:
Test | Result |
---|---|
hdparm | 167.72 MB/s |
dd | 97.4 MB/s |
random 4K read | 0.85 MB/s |
random 4K write | 1.52 MB/s |
It bears repeating:
I'm reminded of https://www.youtube.com/watch?v=gSrnXgAmK8k
Have you tested if you can boot from a drive attached through PCIe?
EDIT: It appears that as of now the Raspberry Pi firmware only supports SD card, USB, and network boot. However, you could potentially boot a U-Boot shell from the SD card, load an EFI driver for NVMe drives, then load the OS EFI bootloader from the drive. This appears to be completely untested on the Raspberry Pi, although it has been found to work on the Rock Pi (Rockchip ARM, not Broadcom). TianoCore has a more "finished" UEFI implementation on the Raspberry Pi. Unfortunately, the project's NVMe EFI driver cannot be built for ARM, though TianoCore's UEFI shell may be able to load a driver binary from another project.
@geerlingguy You should seriously look at using ZFS raidz over mdadm RAID.
"Calculator": https://calomel.org/zfs_raid_speed_capacity.html
The official OpenZFS guide now includes installation on the Raspberry Pi.
Another quick note, just to make sure I point it out: the fastest way to reset the unmounted drives is to run sudo wipefs -a /dev/sd[a-d]. Don't, uh... do that when you're not certain you want to wipe all the drives though :D
Now this is weird... I kept trying to create an array with 4 SSDs, but kept getting results like:
$ sudo mdadm --create --verbose /dev/md0 --level=0 --raid-devices=4 /dev/sd[a-d]1
mdadm: super1.x cannot open /dev/sda1: Device or resource busy
mdadm: ddf: Cannot use /dev/sda1: Device or resource busy
mdadm: Cannot use /dev/sda1: It is busy
mdadm: cannot open /dev/sda1: Device or resource busy
But sometimes (after doing a reset where I stopped md0, zeroed the drives, and removed md0), it would be sdb. Sometimes sdc. Sometimes sdd. Sometimes more than one, but never the same.
So it looked like a race condition. Searching around, I found this post from 2012, mdadm: device or resource busy, which suggests disabling udev events during creation:
$ sudo udevadm control --stop-exec-queue
$ sudo mdadm --create ...
$ sudo udevadm control --start-exec-queue
Lo and behold... that worked!
Some benchmarks for 4 Kingston SSDs (2x 120 GB, 2x 240 GB) below:
Test | Result |
---|---|
hdparm | 296.21 MB/s |
dd | 169.67 MB/s |
random 4K read | 28.33 MB/s |
random 4K write | 61.85 MB/s |
Test | Result |
---|---|
hdparm | 277.14 MB/s |
dd | 116.33 MB/s |
random 4K read | 26.61 MB/s |
random 4K write | 41.82 MB/s |
Note: In RAID 10, I ended up getting a total array size of 240 GB, effectively wasting 120 GB of space that could've been used had I gone with four 240 GB drives. In a real-world NAS setup, I would likely go with 1 or 2 TB drives (heck, maybe even more!), and especially in RAID 1 or 10, always use the same-sized (and ideally exact same model) drives.
Note 2: While monitoring with atop and sudo mdadm --detail /dev/md0, I noticed the four drives, while doing their initial sync, were each getting almost identical write speeds of ~100.4 MB/sec, with ~4 ms latency. That equates to around 396.8 MB/sec of total bus throughput, or almost exactly 3.2 Gbps. So the maximum throughput of any RAID array is definitely going to be limited by the Pi's PCIe x1 lane (just like networking).
Note 3: The resync of the four SSDs is WAAAAAY faster than it was with the HDDs. It helps that they're also spanning a smaller volume (224 GB instead of 930 GB), but I believe the raw IO for the sync is 3-4x faster.
Note 4: The IO Crest card is also WAAAAY toastier, hitting up to 121°C on parts of the PCB (without active ventilation... I'm rectifying that situation now). Yowza! With a fan, it stayed under 90°C (still hot though).
This video will (hopefully) be epic, and still, sadly, won't cover probably more than 50% of what I've learned testing this card. Working on the final script now, hopefully I'll be able to start recording either late tomorrow or early in the week, once I get my notes finished for my Kubernetes 101 series episode!
iperf3 measured 942 Mbps between the Pi's 1 Gbps port and my MacBook Pro through a CalDigit TB3 hub, so the maximum possible transfer rate I could achieve is 118 MB/sec on this connection:
Configuration | Large file copy | Folder copy |
---|---|---|
SMB RAID 10 Kingston SSD x4 | 93.30 MB/sec | 24.56 MB/sec |
NFS RAID 10 Kingston SSD x4 | 106.20 MB/sec | 36.47 MB/sec |
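For reference, the 942 Mbps baseline above came from a plain iperf3 client/server run; a sketch (the IP is the Pi's address used for the NFS mount later):
# On the Pi:
iperf3 -s
# On the Mac:
iperf3 -c 10.0.100.119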
Note: During some of the later NFS file copies, I was hitting 100% busy on one or two of the SSDs (measured via atop), and the network interface was also maxing out, with ksoftirqd queueing some packets. It happened only in short bursts, but enough to impact longer file copies, and I could also see the system RAM (4 GB in this case) filling up. I'm guessing data is buffered in RAM to be written to disk, and that the entire operation can't sustain 1 Gbps full-tilt over long periods. Measuring the temperature of the IOCrest board, it was showing 111°C in the bottom corner, even with my 12V fan at full blast over the board. The temperature didn't seem to affect the queueing though, as it happened even after a shutdown and cooldown cycle (a couple, in fact).
Note 2: It seems like nfs is multithreaded by default, and this allows it to saturate the network bandwidth more efficiently. smbd, on the other hand, seems to run one thread that maxes out one CPU core (at least by default), and that is the primary bottleneck preventing the full network bandwidth from being used in bursts, at least on the Pi, which has some IRQ limitations.
# Install Samba.
sudo apt install -y samba samba-common-bin
# Create a shared directory.
sudo mkdir /mnt/raid10/shared-smb
sudo chmod -R 777 /mnt/raid10/shared-smb
# Add the text below to the bottom of the Samba config.
sudo nano /etc/samba/smb.conf
[shared]
path=/mnt/raid10/shared-smb
writeable=Yes
create mask=0777
directory mask=0777
public=no
# Restart Samba daemon.
pi@raspberrypi:~ $ sudo systemctl restart smbd
# Create a Samba password for the Pi user.
pi@raspberrypi:~ $ sudo smbpasswd -a pi
# (On another computer, connect to smb://[pi ip address])
Example atop output during the peak of a file copy using SMB:
# Install NFS.
sudo apt-get install -y nfs-kernel-server
# Create a shared directory.
sudo mkdir /mnt/raid10/shared-nfs
sudo chmod -R 777 /mnt/raid10/shared-nfs
# Add the line below to the bottom of the /etc/exports file
sudo nano /etc/exports
/mnt/raid10/shared-nfs *(rw,all_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000)
# Update NFS exports after saving the file.
sudo exportfs -ra
# Connect to server from Mac (⌘-K in Finder):
nfs://10.0.100.119/mnt/raid10/shared-nfs
Example atop output during the peak of a file copy using NFS:
Each benchmark was run three times, and the result averaged.
Using a 7.35 GB .img file:
pv 2020-08-20-raspios-buster-armhf-full.img > /Volumes/shared-[type]/2020-08-20-raspios-buster-armhf-full.img
Using folder with 1,478 images and video clips totaling 1.93 GB:
time cp -R old-sd-card-backup /Volumes/shared-[type]
Two last things I want to test:
For NFS threads:
# Change RPCNFSDCOUNT from 8 to 1.
sudo nano /etc/default/nfs-kernel-server
# Restart nfsd.
sudo systemctl restart nfs-kernel-server
# Confirm there's now one thread.
ps aux | grep nfsd
And the result? Even with only one thread, I was able to hit 900+ Mbps and sustain 105+ MB/sec with NFS (though the single thread was hitting 75-100% CPU usage on one core now).
So something about the NFS protocol seems to be slightly more efficient than Samba—at least on Linux—in general, regardless of the threading model.
Energy consumption (4x Kingston SSD via dedicated AC adapter + IO Board, CM4, IOCrest card via AC adapter):
One more thing I was wondering—is there a technical reason to partition the drives before adding them to the array (vs. just using sda/sdb/etc.)? This SO answer about creating an array using partitions vs. the whole disk seemed to have a few good arguments in favor of pre-partitioning.
6by9 on the Pi Forums mentioned:
I bought this I/O Crest 4 Port SATA III PCIe card and would like to see if I can get a 4-drive RAID array going:
Relevant Links: