geerlingguy / raspberry-pi-pcie-devices

Raspberry Pi PCI Express device compatibility database
http://pipci.jeffgeerling.com
GNU General Public License v3.0

Test SATA adapter (I/O Crest 4 port Marvell 9215) #1

Closed · geerlingguy closed this issue 3 years ago

geerlingguy commented 4 years ago

6by9 on the Pi Forums mentioned:

For those wanting to know about PCI-e compatibility, I have one here with a Pericom PI7C9X 1 to 3 way PCI-e bridge, and Marvell 9215 4 port SATA card connected to that. (My VL805 USB3 card is still to be delivered). With a couple of extra kernel modules enabled (mainly CONFIG_ATA, and CONFIG_SATA_AHCI) it's the basis of my next NAS.

<24W with a pair of 8TB SATA drives spinning and a 240GB SSD. <10W with the spinning rust in standby.

I bought this I/O Crest 4 Port SATA III PCIe card and would like to see if I can get a 4-drive RAID array going:

DSC_2840

Relevant Links:

geerlingguy commented 4 years ago

I'm going to try out the IO Crest 4-port SATA adapter.

geerlingguy commented 4 years ago

It has arrived!

geerlingguy commented 4 years ago

And... I just realized I have no SATA power supply cable, just the data cable. So I'll have to wait for one of those to come in before I can actually test one of my SATA drives.

geerlingguy commented 4 years ago

First light is good:

$ lspci

01:00.0 SATA controller: Marvell Technology Group Ltd. Device 9215 (rev 11) (prog-if 01 [AHCI 1.0])
    Subsystem: Marvell Technology Group Ltd. Device 9215
    Control: I/O+ Mem+ BusMaster- SpecCycle- MemWINV- VGASnoop- ParErr+ Stepping- SERR+ FastB2B- DisINTx-
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Interrupt: pin A routed to IRQ 0
    Region 0: I/O ports at 0000
    Region 1: I/O ports at 0000
    Region 2: I/O ports at 0000
    Region 3: I/O ports at 0000
    Region 4: I/O ports at 0000
    Region 5: Memory at 600040000 (32-bit, non-prefetchable) [size=2K]
    Expansion ROM at 600000000 [size=256K]
    Capabilities: [40] Power Management version 3
        Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0-,D1-,D2-,D3hot+,D3cold-)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=0 PME-
    Capabilities: [50] MSI: Enable- Count=1/1 Maskable- 64bit-
        Address: 00000000  Data: 0000
    Capabilities: [70] Express (v2) Legacy Endpoint, MSI 00
        DevCap: MaxPayload 512 bytes, PhantFunc 0, Latency L0s <1us, L1 <8us
            ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset-
        DevCtl: Report errors: Correctable- Non-Fatal- Fatal- Unsupported-
            RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
            MaxPayload 128 bytes, MaxReadReq 512 bytes
        DevSta: CorrErr- UncorrErr- FatalErr- UnsuppReq- AuxPwr- TransPend-
        LnkCap: Port #0, Speed 5GT/s, Width x1, ASPM L0s L1, Exit Latency L0s <512ns, L1 <64us
            ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp-
        LnkCtl: ASPM Disabled; RCB 64 bytes Disabled- CommClk-
            ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
        LnkSta: Speed 5GT/s, Width x1, TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
        DevCap2: Completion Timeout: Not Supported, TimeoutDis+, LTR-, OBFF Not Supported
        DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis-, LTR-, OBFF Disabled
        LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
             Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
             Compliance De-emphasis: -6dB
        LnkSta2: Current De-emphasis Level: -3.5dB, EqualizationComplete-, EqualizationPhase1-
             EqualizationPhase2-, EqualizationPhase3-, LinkEqualizationRequest-
    Capabilities: [e0] SATA HBA v0.0 BAR4 Offset=00000004
    Capabilities: [100 v1] Advanced Error Reporting
        UESta:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UEMsk:  DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
        UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
        CESta:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr-
        CEMsk:  RxErr- BadTLP- BadDLLP- Rollover- Timeout- NonFatalErr+
        AERCap: First Error Pointer: 00, GenCap- CGenEn- ChkCap- ChkEn-
geerlingguy commented 4 years ago

Though dmesg shows that it's hitting BAR default address space limits again:

[    0.925795] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    0.925818] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    0.925884] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x0603ffffff -> 0x00f8000000
[    0.925948] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0100000000
[    0.953526] brcm-pcie fd500000.pcie: link up, 5 GT/s x1 (SSC)
[    0.953827] brcm-pcie fd500000.pcie: PCI host bridge to bus 0000:00
[    0.953844] pci_bus 0000:00: root bus resource [bus 00-ff]
[    0.953866] pci_bus 0000:00: root bus resource [mem 0x600000000-0x603ffffff] (bus address [0xf8000000-0xfbffffff])
[    0.953933] pci 0000:00:00.0: [14e4:2711] type 01 class 0x060400
[    0.954172] pci 0000:00:00.0: PME# supported from D0 D3hot
[    0.957560] PCI: bus0: Fast back to back transfers disabled
[    0.957582] pci 0000:00:00.0: bridge configuration invalid ([bus ff-ff]), reconfiguring
[    0.957802] pci 0000:01:00.0: [1b4b:9215] type 00 class 0x010601
[    0.957874] pci 0000:01:00.0: reg 0x10: [io  0x8000-0x8007]
[    0.957911] pci 0000:01:00.0: reg 0x14: [io  0x8040-0x8043]
[    0.957947] pci 0000:01:00.0: reg 0x18: [io  0x8100-0x8107]
[    0.957984] pci 0000:01:00.0: reg 0x1c: [io  0x8140-0x8143]
[    0.958021] pci 0000:01:00.0: reg 0x20: [io  0x800000-0x80001f]
[    0.958058] pci 0000:01:00.0: reg 0x24: [mem 0x00900000-0x009007ff]
[    0.958095] pci 0000:01:00.0: reg 0x30: [mem 0x00000000-0x0003ffff pref]
[    0.958262] pci 0000:01:00.0: PME# supported from D3hot
[    0.961586] PCI: bus1: Fast back to back transfers disabled
[    0.961605] pci_bus 0000:01: busn_res: [bus 01-ff] end is updated to 01
[    0.961674] pci 0000:00:00.0: BAR 8: assigned [mem 0x600000000-0x6000fffff]
[    0.961698] pci 0000:01:00.0: BAR 6: assigned [mem 0x600000000-0x60003ffff pref]
[    0.961722] pci 0000:01:00.0: BAR 5: assigned [mem 0x600040000-0x6000407ff]
[    0.961744] pci 0000:01:00.0: BAR 4: no space for [io  size 0x0020]
[    0.961759] pci 0000:01:00.0: BAR 4: failed to assign [io  size 0x0020]
[    0.961774] pci 0000:01:00.0: BAR 0: no space for [io  size 0x0008]
[    0.961788] pci 0000:01:00.0: BAR 0: failed to assign [io  size 0x0008]
[    0.961803] pci 0000:01:00.0: BAR 2: no space for [io  size 0x0008]
[    0.961817] pci 0000:01:00.0: BAR 2: failed to assign [io  size 0x0008]
[    0.961831] pci 0000:01:00.0: BAR 1: no space for [io  size 0x0004]
[    0.961845] pci 0000:01:00.0: BAR 1: failed to assign [io  size 0x0004]
[    0.961860] pci 0000:01:00.0: BAR 3: no space for [io  size 0x0004]
[    0.961873] pci 0000:01:00.0: BAR 3: failed to assign [io  size 0x0004]
[    0.961891] pci 0000:00:00.0: PCI bridge to [bus 01]
[    0.961914] pci 0000:00:00.0:   bridge window [mem 0x600000000-0x6000fffff]
[    0.962217] pcieport 0000:00:00.0: enabling device (0140 -> 0142)
[    0.962439] pcieport 0000:00:00.0: PME: Signaling with IRQ 55
[    0.962813] pcieport 0000:00:00.0: AER: enabled with IRQ 55
geerlingguy commented 4 years ago

I just increased the BAR allocation following the directions in this Gist, but when I rebooted (without the card in), I got:

[    0.926161] brcm-pcie fd500000.pcie: host bridge /scb/pcie@7d500000 ranges:
[    0.926184] brcm-pcie fd500000.pcie:   No bus range found for /scb/pcie@7d500000, using [bus 00-ff]
[    0.926247] brcm-pcie fd500000.pcie:      MEM 0x0600000000..0x063fffffff -> 0x00c0000000
[    0.926312] brcm-pcie fd500000.pcie:   IB MEM 0x0000000000..0x00ffffffff -> 0x0100000000
[    1.521386] brcm-pcie fd500000.pcie: link down

Powering off completely, then booting again, it works. So, note to self: if you get a link down, try a hard power cycle instead of a reboot.

geerlingguy commented 4 years ago

Ah... looking closer, those 'failed to assign' errors are for IO BARs, which are unsupported on the Pi.

So... I posted in the BAR space thread on the Pi Forums asking 6by9 whether they've seen the same log messages and whether they can be safely ignored. Still waiting on a way to power my drive so I can do an end-to-end test :)

kitlith commented 4 years ago

Something else that may be interesting: whether you can get a SAS adapter/RAID card working. I was looking into SBCs with PCIe a while back for the purpose of building a low power/low heat host for some SAS drives I have. (I ended up just throwing them in a computer and not running it 24/7.)

geerlingguy commented 4 years ago

That would be an interesting thing to test, though it'll have to wait a bit as I'm trying to get through some other cards and might also test 2.5 Gbps or 5 Gbps networking if I am able to!

geerlingguy commented 4 years ago

Without the kernel modules enabled, lsblk shows no device:

$ lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
mmcblk0     179:0    0 29.8G  0 disk 
├─mmcblk0p1 179:1    0  256M  0 part /boot
└─mmcblk0p2 179:2    0 29.6G  0 part /

Going to try adding those modules and see what happens!

geerlingguy commented 4 years ago
# Install dependencies
sudo apt install -y git bc bison flex libssl-dev make libncurses5-dev

# Clone source
git clone --depth=1 https://github.com/raspberrypi/linux

# Apply default configuration
cd linux
export KERNEL=kernel7l # use kernel8 for 64-bit, or kernel7l for 32-bit
make bcm2711_defconfig

# Customize the .config further with menuconfig
make menuconfig
# Enable the following:
# Device Drivers:
#   -> Serial ATA and Parallel ATA drivers (libata)
#     -> AHCI SATA support
#     -> Marvell SATA support
#
# Alternatively add the following in .config manually:
# CONFIG_ATA=m
# CONFIG_ATA_VERBOSE_ERROR=y
# CONFIG_SATA_PMP=y
# CONFIG_SATA_AHCI=m
# CONFIG_SATA_MOBILE_LPM_POLICY=0
# CONFIG_ATA_SFF=y
# CONFIG_ATA_BMDMA=y
# CONFIG_SATA_MV=m

nano .config
# (edit CONFIG_LOCALVERSION and add a suffix that helps you identify your build)

# Build the kernel and copy everything into place
make -j4 zImage modules dtbs # 'Image' on 64-bit
sudo make modules_install
sudo cp arch/arm/boot/dts/*.dtb /boot/
sudo cp arch/arm/boot/dts/overlays/*.dtb* /boot/overlays/
sudo cp arch/arm/boot/dts/overlays/README /boot/overlays/
sudo cp arch/arm/boot/zImage /boot/$KERNEL.img
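
After rebooting into the new kernel, it's worth a quick check that AHCI support actually made it in. A minimal sketch, assuming the options above were built as modules as in the .config snippet:

uname -r                          # should show your CONFIG_LOCALVERSION suffix
lsmod | grep -i ahci              # ahci / libahci should be listed
sudo modprobe ahci                # load it manually if it didn't auto-load
dmesg | grep -i -e ahci -e sata   # the controller and attached drives should show up
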
geerlingguy commented 4 years ago

Yahoo, it worked!

$ lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda           8:0    1 223.6G  0 disk 
├─sda1        8:1    1   256M  0 part /media/pi/boot
└─sda2        8:2    1 223.3G  0 part /media/pi/rootfs
mmcblk0     179:0    0  29.8G  0 disk 
├─mmcblk0p1 179:1    0   256M  0 part /boot
└─mmcblk0p2 179:2    0  29.6G  0 part /
geerlingguy commented 4 years ago

Repartitioning the drive:

sudo fdisk /dev/sda
d 1    # delete partition 1
d 2    # delete partition 2
n    # create new partition
p    # primary (default)
1    # partition 1 (default)
2048    # First sector (default)
468862127    # Last sector (default)
w    # write new partition table

Got the following:

The partition table has been altered.
Failed to remove partition 1 from system: Device or resource busy
Failed to remove partition 2 from system: Device or resource busy
Failed to add partition 1 to system: Device or resource busy

The kernel still uses the old partitions. The new table will be used at the next reboot. 
Syncing disks.

Rebooted the Pi, then:

$ lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda           8:0    1 223.6G  0 disk 
└─sda1        8:1    1 223.6G  0 part 
mmcblk0     179:0    0  29.8G  0 disk 
├─mmcblk0p1 179:1    0   256M  0 part /boot
└─mmcblk0p2 179:2    0  29.6G  0 part /

To format the device, use mkfs:

$ sudo mkfs.ext4 /dev/sda1
mke2fs 1.44.5 (15-Dec-2018)
Discarding device blocks: done                            
Creating filesystem with 58607510 4k blocks and 14655488 inodes
Filesystem UUID: dd4fa95d-edbf-4696-a9e1-ddf1f17da580
Superblock backups stored on blocks: 
    32768, 98304, 163840, 229376, 294912, 819200, 884736, 1605632, 2654208, 
    4096000, 7962624, 11239424, 20480000, 23887872

Allocating group tables: done                            
Writing inode tables: done                            
Creating journal (262144 blocks): done
Writing superblocks and filesystem accounting information: done 

Then mount it somewhere:

$ sudo mkdir /mnt/sata-sda
$ sudo mount /dev/sda1 /mnt/sata-sda
$ mount
...
/dev/sda1 on /mnt/sata-sda type ext4 (rw,relatime)

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
...
/dev/sda1       220G   61M  208G   1% /mnt/sata-sda
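
To make that mount persistent across reboots, an /etc/fstab entry should do it. A minimal sketch, reusing the filesystem UUID that mkfs.ext4 printed above (substitute your own):

# Confirm the filesystem UUID.
sudo blkid /dev/sda1

# Add a line like this to /etc/fstab:
UUID=dd4fa95d-edbf-4696-a9e1-ddf1f17da580 /mnt/sata-sda ext4 defaults,noatime 0 2

# Test the entry without rebooting.
sudo mount -a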
geerlingguy commented 4 years ago

Performance testing of the Kingston SA400S37/240G drive:

| Test | Result |
| --- | --- |
| hdparm | 314.79 MB/s |
| dd | 189.00 MB/s |
| random 4K read | 22.98 MB/s |
| random 4K write | 55.02 MB/s |

Compare that to the same drive over USB 3.0 using a USB to SATA adapter:

| Test | Result |
| --- | --- |
| hdparm | 296.71 MB/s |
| dd | 149.00 MB/s |
| random 4K read | 20.59 MB/s |
| random 4K write | 28.54 MB/s |

So not a night-and-day difference like with the NVMe drives, but noticeably faster. I'm now waiting on another SSD and a power splitter to arrive so I can test multiple SATA SSDs on this card.

And someone just mentioned they have some RAID cards they'd be willing to send me. Might have to pony up for a bunch of hard drives and have my desk turn into some sort of frankenmonster NAS-of-many-drives soon!
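
For anyone wanting to reproduce the numbers above, this is roughly the kind of commands behind each row; a sketch rather than the exact benchmark script, and iozone has to be installed separately:

# hdparm: buffered sequential read from the raw device.
sudo hdparm -t /dev/sda

# dd: large sequential write to the mounted filesystem.
sudo dd if=/dev/zero of=/mnt/sata-sda/test bs=1M count=1024 conv=fsync

# iozone: 4K random read/write with direct I/O.
sudo iozone -e -I -a -s 100M -r 4k -i 0 -i 1 -i 2 -f /mnt/sata-sda/iozone.tmp

# Clean up the test files afterwards.
sudo rm /mnt/sata-sda/test /mnt/sata-sda/iozone.tmp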

mo-g commented 4 years ago

I'm curious about other OS's. Obviously, Raspbian is a good basis - but as I recall, Fedora Pi 64-bit uses their own custom kernel. I'd be interested in seeing what they've "left in" from the standard kernel config.

I'm looking forward to picking one of these up in a month or so when they become available to the public, then I'll give it a try!

Side note for your list page - could you include PCI IDs as well as just the brand names of the cards? It would help avoid confusion where cards have multiple revisions, and help non-US users identify comparable cards in their own markets.

Great work in the meantime! :+1:

mi-hol commented 4 years ago

> And someone just mentioned they have some RAID cards they'd be willing to send me. Might have to pony up for a bunch of hard drives and have my desk turn into some sort of frankenmonster NAS-of-many-drives soon!

It would be great to test a RAID card based on the Marvell 88SE9128 chipset, because it is used by many suppliers.

geerlingguy commented 4 years ago

Trying again today (but cross-compiling this time since it's oh-so-much faster) now that I have two drives and the appropriate power adapters. I'm planning on just testing a file copy between the drives for now; I'll get into other tests later.

geerlingguy commented 4 years ago

Hmm... putting this on pause. My cross compilation is not dropping in the AHCI module for some reason, probably a bad .config :/

geerlingguy commented 4 years ago

Also, the adapter gets hot after prolonged use.

geerlingguy commented 3 years ago

(For anyone interested in testing on an LSI/IBM SAS card, check out https://github.com/geerlingguy/raspberry-pi-pcie-devices/issues/18)

geerlingguy commented 3 years ago

My desk is becoming a war zone:

IMG_2720

Plan is to set up a RAID (probably either 0 if I feel more YOLO-y or 1/10 if I'm more stable-minded) with either 2 or 4 drives, using mdadm.

I was having trouble with the SAS card, not sure if the cards are bad or they just don't work at all with the Pi :(

geerlingguy commented 3 years ago

Testing also with an NVMe using the IO Crest PCIe switch:

$ lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda           8:0    1 223.6G  0 disk 
sdb           8:16   1 223.6G  0 disk 
└─sdb1        8:17   1 223.6G  0 part 
mmcblk0     179:0    0  29.8G  0 disk 
├─mmcblk0p1 179:1    0   256M  0 part /boot
└─mmcblk0p2 179:2    0  29.6G  0 part /
nvme0n1     259:0    0 232.9G  0 disk

I'll post some benchmarks copying files between one of the SSDs and the NVMe; will be interesting to see how many MB/sec they can pump through the switch.

geerlingguy commented 3 years ago

For a direct file copy from one drive to another:

# fallocate -l 10G /mnt/nvme/test.img
# pv /mnt/nvme/test.img > /mnt/sata-sda/test.img

I got an average of 190 MiB/sec, or about 1.52 Gbps. So two-way, that's 3.04 Gbps (under the 3.2 Gbps I was hoping for, but that's maybe down to PCIe switching overhead).

It looks like CPU goes to 99% as SDA takes more than 50% of the CPU—see atop results during a copy:

Screen Shot 2020-11-10 at 9 57 52 AM
geerlingguy commented 3 years ago

Also comparing raw disk speeds through the PCIe switch:

Kingston SSD

| Test | Result |
| --- | --- |
| hdparm | 364.23 MB/s |
| dd | 148.00 MB/s |
| random 4K read | 28.89 MB/s |
| random 4K write | 58.01 MB/s |

Samsung EVO 970 NVMe

| Test | Result |
| --- | --- |
| hdparm | 363.81 MB/s |
| dd | 166.00 MB/s |
| random 4K read | 46.50 MB/s |
| random 4K write | 75.41 MB/s |

These were on 64-bit Pi OS... so the numbers are a little higher than the 32-bit Pi OS results from earlier in the thread. But the good news is the PCIe switching seems to not cause any major performance penalty.

geerlingguy commented 3 years ago

Software RAID0 testing using mdadm:

# Install mdadm.
sudo apt install -y mdadm

# Create a RAID0 array using sda1 and sdb1.
sudo mdadm --create --verbose /dev/md0 --level=0 --raid-devices=2 /dev/sd[a-b]1

# Create a mount point for the new RAID device.
sudo mkdir /mnt/raid0

# Format the RAID device.
sudo mkfs.ext4 /dev/md0

# Mount the RAID device.
sudo mount /dev/md0 /mnt/raid0

Benchmarking the device:

| Test | Result |
| --- | --- |
| hdparm | 293.35 MB/s |
| dd | 168.00 MB/s |
| random 4K read | 24.96 MB/s |
| random 4K write | 52.26 MB/s |

And during the 4K tests in iozone, I can see the sda/sdb devices are basically getting the same bottlenecks, except with a tiny bit of extra overhead from software-based RAID control:

Screen Shot 2020-11-10 at 10 18 00 AM

Then to stop and remove the RAID0 array:

sudo umount /mnt/raid0
sudo mdadm --stop /dev/md0
sudo mdadm --zero-superblock /dev/sd[a-b]1
sudo mdadm --remove /dev/md0
geerlingguy commented 3 years ago

Software RAID1 (mirrored) testing using mdadm:

# Install mdadm.
sudo apt install -y mdadm

# Create a RAID1 array using sda1 and sdb1.
sudo mdadm --create --verbose /dev/md0 --level=1 --raid-devices=2 /dev/sd[a-b]1

# Create a mount point for the new RAID device.
sudo mkdir /mnt/raid1

# Format the RAID device.
sudo mkfs.ext4 /dev/md0

# Mount the RAID device.
sudo mount /dev/md0 /mnt/raid1

And if you want the RAID device to be persistent:

# Add the following line to the bottom of /etc/fstab:
/dev/md0 /mnt/raid1/ ext4 defaults,noatime 0 1

Configure mdadm to start the RAID at boot:
sudo mdadm --detail --scan | sudo tee -a /etc/mdadm/mdadm.conf

And check on the health of the array:

sudo mdadm --detail /dev/md0

Thanks to The MagPi for their article Build a Raspberry Pi NAS.

Benchmarking the device:

| Test | Result |
| --- | --- |
| hdparm | 304.63 MB/s |
| dd | 114.00 MB/s |
| random 4K read | 4.83 MB/s |
| random 4K write | 8.43 MB/s |

While it was doing the 4K testing on the software RAID1 array, IO ran a bit slower (both sda/sdb were ~100% the whole time or thereabouts):

Screen Shot 2020-11-10 at 10 27 36 AM

The md0_resync process seemed to be the main culprit. Mirroring drives in software RAID seems to be a fairly heavyweight operation when you're writing tons of small files. For large files it didn't seem to be nearly as much of a burden. I ran iozone with a 1024K block size and got 253.63 MB/sec read, 125.70 MB/sec write.

Even at a 128K block size, I got over 100 MB/sec read and write. It really started to slow down around 8K and even 16K block sizes (to ~20 MB/sec), before falling apart at 4K (4-8 MB/sec, as slow as a microSD card!).
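
For reference, that block-size sweep can be approximated with a single iozone run by setting minimum and maximum record sizes; a sketch against the RAID1 mount:

# Sweep record sizes from 4K to 1024K (write, read, random read/write).
sudo iozone -e -I -a -s 100M -y 4k -q 1024k -i 0 -i 1 -i 2 -f /mnt/raid1/iozone.tmp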

geerlingguy commented 3 years ago

Hmm... I'm seeing md0_resync continue to run for a long while after the test. So how are they getting out of sync in the first place? Maybe it is trying to sync data that was already on the drive? I thought I had reformatted them though...

Also seeing a lot in dmesg:

[ 3390.917579] cpu cpu0: dev_pm_opp_set_rate: failed to find current OPP for freq 18446744073709551604 (-34)
[ 3390.917596] raspberrypi-clk soc:firmware:clocks: Failed to change fw-clk-arm frequency: -12

And it looks like the resync is almost complete. I'll run the benchmark again afterwards.

sudo mdadm --detail /dev/md0
/dev/md0:
           Version : 1.2
     Creation Time : Tue Nov 10 16:25:37 2020
        Raid Level : raid1
        Array Size : 234297920 (223.44 GiB 239.92 GB)
     Used Dev Size : 234297920 (223.44 GiB 239.92 GB)
      Raid Devices : 2
     Total Devices : 2
       Persistence : Superblock is persistent

     Intent Bitmap : Internal

       Update Time : Tue Nov 10 16:45:10 2020
             State : clean, resyncing 
    Active Devices : 2
   Working Devices : 2
    Failed Devices : 0
     Spare Devices : 0

Consistency Policy : bitmap

     Resync Status : 95% complete

              Name : raspberrypi:0  (local to host raspberrypi)
              UUID : 19fd4119:91925607:9b4f77f9:56c91824
            Events : 494

    Number   Major   Minor   RaidDevice State
       0       8        1        0      active sync   /dev/sda1
       1       8       17        1      active sync   /dev/sdb1
geerlingguy commented 3 years ago

It looks like the resync was the major issue—now that it's complete, numbers are looking much better:

| Test | Result |
| --- | --- |
| hdparm | 351.38 MB/s |
| dd | 114.00 MB/s |
| random 4K read | 27.95 MB/s |
| random 4K write | 43.21 MB/s |
geerlingguy commented 3 years ago

What I'd like to test with my 4 spinning disks once I get the rest of my SATA cables in the mail today:

geerlingguy commented 3 years ago

For the spinning disks (500GB WD5000AVDS), I partitioned, formatted, and mounted them, then I ran my benchmarking tests against them:

| Test | Result |
| --- | --- |
| hdparm | 72.43 MB/s |
| dd | 67.30 MB/s |
| random 4K read | 0.48 MB/s |
| random 4K write | 0.60 MB/s |

Sometimes you forget just how good we have it with flash memory nowadays. These drives are not a great option as boot volumes for the Pi :P

I then put two of them in a RAID0 stripe with mdadm, and ran the same test:

| Test | Result |
| --- | --- |
| hdparm | 154.33 MB/s |
| dd | 109.00 MB/s |
| random 4K read | 0.71 MB/s |
| random 4K write | 1.60 MB/s |
geerlingguy commented 3 years ago

I also set up SMB:

# Install Samba.
sudo apt install -y samba samba-common-bin

# Create a shared directory.
sudo mkdir /mnt/raid0/shared
sudo chmod -R 777 /mnt/raid0/shared

# Add the text below to the bottom of the Samba config.
sudo nano /etc/samba/smb.conf

[shared]
path=/mnt/raid0/shared
writeable=Yes
create mask=0777
directory mask=0777
public=no

# Restart Samba daemon.
pi@raspberrypi:~ $ sudo systemctl restart smbd

# Create a Samba password for the Pi user.
pi@raspberrypi:~ $ sudo smbpasswd -a pi

# (On another computer, connect to smb://[pi ip address])

I averaged 75 MB/sec copy performance over the Pi's built-in Gigabit interface for a single large file, 55 MB/sec using rsync with a directory of medium-sized video clips.

geerlingguy commented 3 years ago

Ouch, the initial resync is even slower on these spinny disk drives than it was on the SSDs (which, of course, are half the size in the first place, in addition to being twice as fast). 1% per minute on the sync.

Apparently you can skip the initial sync entirely with --assume-clean... but there are many caveats, and it's really only intended for disaster-recovery scenarios where you don't want anything touching the drives when you initialize the RAID device.

So good to know that you should probably plan on letting your array sync up the first time you get it running.
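
If you're stuck waiting on that initial sync, you can at least watch it, and nudge the md resync speed limits up or down; a sketch using the standard md sysctls:

# Watch resync progress.
watch -n 5 cat /proc/mdstat

# Check the current per-device resync speed limits (KB/s).
sudo sysctl dev.raid.speed_limit_min dev.raid.speed_limit_max

# Raise the minimum to push the resync harder (or lower the max to throttle it).
sudo sysctl -w dev.raid.speed_limit_min=50000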

geerlingguy commented 3 years ago

Hmm... now trying all four drives:

$ sudo mdadm --create --verbose /dev/md0 --level=0 --raid-devices=4 /dev/sd[a-d]1
mdadm: chunk size defaults to 512K
mdadm: Defaulting to version 1.2 metadata
mdadm: RUN_ARRAY failed: Unknown error 524

I then zeroed out the superblock:

sudo mdadm --zero-superblock /dev/sd[a-d]1

But then when I tried to create again, I got:

mdadm: super1.x cannot open /dev/sdd1: Device or resource busy
mdadm: /dev/sdd1 is not suitable for this array.
mdadm: create aborted

So I'm going to reboot and try again. Maybe I have a bad drive 😢

Debugging:

$ cat /proc/mdstat
Personalities : 
md0 : inactive sdd1[3](S)
      488253464 blocks super 1.2

unused devices: <none>

Trying to format it again with fdisk, I got Failed to add partition 1 to system: Invalid argument. Very odd behavior, but I'm thinking there's a good chance this drive is toast. That's what you get for buying refurbished!

geerlingguy commented 3 years ago

No matter what I try, I keep getting mdadm: RUN_ARRAY failed: Unknown error 524 in the end.

geerlingguy commented 3 years ago

Weird. After finding this question on Stack Exchange, I tried:

# echo 1 > /sys/module/raid0/parameters/default_layout

And this time, it works:

$ sudo mdadm --create --verbose /dev/md0 --level=0 --raid-devices=4 /dev/sd[a-d]1
mdadm: chunk size defaults to 512K
mdadm: Defaulting to version 1.2 metadata
mdadm: array /dev/md0 started.

We'll see how much further I can go.
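
Since that echo only lasts until the next reboot, it can be made persistent with a modprobe option; a sketch, assuming raid0 is built as a module (if it's built in, the equivalent is adding raid0.default_layout=1 to the kernel command line):

# Persist the raid0 default_layout setting across reboots.
echo "options raid0 default_layout=1" | sudo tee /etc/modprobe.d/raid0.conf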

geerlingguy commented 3 years ago

All four drives in RAID0:

| Test | Result |
| --- | --- |
| hdparm | 327.32 MB/s |
| dd | 155.00 MB/s |
| random 4K read | 4.46 MB/s |
| random 4K write | 4.71 MB/s |

Note: The card is getting HOT:

IMG_0004

geerlingguy commented 3 years ago

Another fun thing I just noticed—ext4lazyinit is still running, which means I can't unmount the volume without forcing it. If I'm going to repartition and reformat anyway, what's the point of letting it finish?
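
For what it's worth, lazy initialization can be skipped at format time so ext4lazyinit never runs afterwards; a sketch (it does make mkfs itself take longer):

# Initialize the inode tables and journal up front instead of lazily.
sudo mkfs.ext4 -E lazy_itable_init=0,lazy_journal_init=0 /dev/md0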

geerlingguy commented 3 years ago

Resetting the array:

sudo umount /mnt/raid0
sudo mdadm --stop /dev/md0
sudo mdadm --zero-superblock /dev/sd[a-d]1
sudo mdadm --remove /dev/md0

Then set it to RAID 10:

# Install mdadm.
sudo apt install -y mdadm

# Create a RAID10 array using four drives.
sudo mdadm --create --verbose /dev/md0 --level=10 --raid-devices=4 /dev/sd[a-d]1

# Create a mount point for the new RAID device.
sudo mkdir -p /mnt/raid10

# Format the RAID device.
sudo mkfs.ext4 /dev/md0

# Mount the RAID device.
sudo mount /dev/md0 /mnt/raid10

Confirm the RAID 10 device gives me roughly 1 TB of mirrored/striped storage:

$ df -h
Filesystem      Size  Used Avail Use% Mounted on
/dev/md0        915G   77M  869G   1% /mnt/raid1

$ lsblk
NAME        MAJ:MIN RM   SIZE RO TYPE   MOUNTPOINT
sda           8:0    1 465.3G  0 disk   
└─sda1        8:1    1 465.3G  0 part   
  └─md0       9:0    0 930.3G  0 raid10 /mnt/raid1
sdb           8:16   1 465.3G  0 disk   
└─sdb1        8:17   1 465.3G  0 part   
  └─md0       9:0    0 930.3G  0 raid10 /mnt/raid1
sdc           8:32   1 465.3G  0 disk   
└─sdc1        8:33   1 465.3G  0 part   
  └─md0       9:0    0 930.3G  0 raid10 /mnt/raid1
sdd           8:48   1 465.8G  0 disk   
└─sdd1        8:49   1 465.8G  0 part   
  └─md0       9:0    0 930.3G  0 raid10 /mnt/raid1
mmcblk0     179:0    0  29.8G  0 disk   
├─mmcblk0p1 179:1    0   256M  0 part   /boot
└─mmcblk0p2 179:2    0  29.6G  0 part   /

And now the great resync wait, watching sudo mdadm --detail /dev/md0:

Every 2.0s: sudo mdadm --detail /dev/md0                   raspberrypi: Tue Nov 10 23:54:32 2020

/dev/md0:
           Version : 1.2
     Creation Time : Tue Nov 10 23:47:10 2020
        Raid Level : raid10
        Array Size : 975458304 (930.27 GiB 998.87 GB)
     Used Dev Size : 487729152 (465.13 GiB 499.43 GB)
      Raid Devices : 4
     Total Devices : 4
       Persistence : Superblock is persistent
...
       Update Time : Tue Nov 10 23:54:31 2020
             State : clean, resyncing
...
     Resync Status : 1% complete

It took about 5 hours to do the initial resync (sheesh!), and once that was done, I ran the benchmarks again:

| Test | Result |
| --- | --- |
| hdparm | 167.72 MB/s |
| dd | 97.4 MB/s |
| random 4K read | 0.85 MB/s |
| random 4K write | 1.52 MB/s |
geerlingguy commented 3 years ago

It bears repeating:

4lwbjx

I'm reminded of https://www.youtube.com/watch?v=gSrnXgAmK8k

PixlRainbow commented 3 years ago

Have you tested if you can boot from a drive attached through PCIE?

EDIT: It appears that as of now the Raspberry Pi firmware only supports SD card, USB, and network boot. However, you could potentially boot a U-Boot shell from the SD card, load an EFI driver for NVMe drives, then load the OS EFI bootloader from the drive. But this appears to be completely untested on the Raspberry Pi, although it has been found to work on the Rock Pi (Rockchip ARM, not Broadcom). TianoCore has a more "finished" UEFI implementation on the Raspberry Pi. Unfortunately, the project's NVMe EFI driver cannot be built for ARM, though TianoCore's UEFI shell may be able to load a driver binary from another project.

markbirss commented 3 years ago

@geerlingguy

You should seriously look at using ZFS raidz instead of mdadm RAID.

"calculator" https://calomel.org/zfs_raid_speed_capacity.html

The official OpenZFS guide now includes installation instructions for the Raspberry Pi:

https://openzfs.github.io/openzfs-docs/Getting%20Started/Ubuntu/Ubuntu%2020.04%20Root%20on%20ZFS%20for%20Raspberry%20Pi.html

geerlingguy commented 3 years ago

Another quick note, just to make sure I point it out: the fastest way to reset the unmounted drives is to run sudo wipefs -a /dev/sd[a-d]. Don't, uh... do that when you're not certain you want to wipe all the drives though :D

geerlingguy commented 3 years ago

Now this is weird... I kept trying to create an array with 4 SSDs, but kept getting results like:

$ sudo mdadm --create --verbose /dev/md0 --level=0 --raid-devices=4 /dev/sd[a-d]1
mdadm: super1.x cannot open /dev/sda1: Device or resource busy
mdadm: ddf: Cannot use /dev/sda1: Device or resource busy
mdadm: Cannot use /dev/sda1: It is busy
mdadm: cannot open /dev/sda1: Device or resource busy

But sometimes (after doing a reset where I stopped md0, zeroed the drives, and removed md0), it would be sdb. Sometimes sdc. Sometimes sdd. Sometimes more than one, but never the same.

So it looked like a race condition, and lo and behold, searching around, I found this post from 2012: mdadm: device or resource busy. In it, the suggestion is to disable udev events while creating the array:

$ sudo udevadm control --stop-exec-queue
$ sudo mdadm --create ...
$ sudo udevadm control --start-exec-queue

Lo and behold... that worked!

geerlingguy commented 3 years ago

Some benchmarks for 4 Kingston SSDs (2x 120 GB, 2x 240 GB) below:

RAID 0

| Test | Result |
| --- | --- |
| hdparm | 296.21 MB/s |
| dd | 169.67 MB/s |
| random 4K read | 28.33 MB/s |
| random 4K write | 61.85 MB/s |

RAID 10

| Test | Result |
| --- | --- |
| hdparm | 277.14 MB/s |
| dd | 116.33 MB/s |
| random 4K read | 26.61 MB/s |
| random 4K write | 41.82 MB/s |

Note: In RAID 10, I ended up getting a total array size of 240 GB, effectively wasting 120 GB of space that could've been used had I gone with four 240 GB drives. In a real-world NAS setup, I would likely go with 1 or 2 TB drives (heck, maybe even more!), and especially in RAID 1 or 10, always use the same-sized (and ideally exact same model) drives.

Note 2: While monitoring with atop and sudo mdadm --detail /dev/md0, I noticed the four drives, while doing their initial sync, were each getting almost identical write speeds of ~100.4 MB/sec, with ~4ms latency. That equates to around 396.8 MB/sec total bus speed... or almost exactly 3.2 Gbps. So the maximum throughput of any RAID array is definitely going to be limited by the Pi's PCIe 1x lane (just like networking).

Note 3: The resync of the four SSDs is WAAAAAY faster than the HDDs. It helps that they're also spanning a smaller volume (224 GB instead of 930 GB), but the raw IO for the sync I believe is 3-4x faster.

Note 4: The IO Crest card is also WAAAAY toastier, hitting up to 121°C on parts of the PCB (without active ventilation... I'm rectifying that situation now). Yowza! With a fan, it stayed under 90°C (still hot though).

geerlingguy commented 3 years ago

This video will (hopefully) be epic, and still, sadly, won't cover probably more than 50% of what I've learned testing this card. Working on the final script now, hopefully I'll be able to start recording either late tomorrow or early in the week, once I get my notes finished for my Kubernetes 101 series episode!

geerlingguy commented 3 years ago

iperf3 measured 942 Mbps between the Pi's 1 Gbps port and my MacBook Pro through a CalDigit TB3 hub, so the maximum possible transfer rate I could achieve is 118 MB/sec on this connection:

| Configuration | Large file copy | Folder copy |
| --- | --- | --- |
| SMB RAID 10 Kingston SSD x4 | 93.30 MB/sec | 24.56 MB/sec |
| NFS RAID 10 Kingston SSD x4 | 106.20 MB/sec | 36.47 MB/sec |

Note: During some of the later NFS file copies, I was hitting 100% busy on one or two of the SSDs (measured via atop), and the network interface was also maxing out and getting ksoftirqd queueing some packets. It happened only for short bursts, but enough to impact longer file copies, and I could also see the system RAM (4 GB in this case) getting full. I'm guessing data is buffered in RAM to be written to disk, and that entire operation can't sustain 1 Gbps full-tilt over long periods.

Measuring the temperature of the IOcrest board, it was showing 111°C in the bottom corner, even with my 12V fan at full blast over the board. The temperature didn't seem to affect the queueing though, as it happened even after a shutdown and cooldown cycle (a couple, in fact).

Note 2: It seems like NFS is multithreaded by default, which lets it saturate the network bandwidth more efficiently. smbd, on the other hand, seems to run a single thread that maxes out one CPU core (at least by default), and that is the primary bottleneck preventing the full network bandwidth from being used in bursts, at least on the Pi, which has some IRQ limitations.

SMB Setup

# Install Samba.
sudo apt install -y samba samba-common-bin

# Create a shared directory.
sudo mkdir /mnt/raid10/shared-smb
sudo chmod -R 777 /mnt/raid10/shared-smb

# Add the text below to the bottom of the Samba config.
sudo nano /etc/samba/smb.conf

[shared]
path=/mnt/raid10/shared-smb
writeable=Yes
create mask=0777
directory mask=0777
public=no

# Restart Samba daemon.
pi@raspberrypi:~ $ sudo systemctl restart smbd

# Create a Samba password for the Pi user.
pi@raspberrypi:~ $ sudo smbpasswd -a pi

# (On another computer, connect to smb://[pi ip address])

Example atop output during peak of file copy using SMB:

atop-smb-large-file-copy

NFS Setup

# Install NFS.
sudo apt-get install -y nfs-kernel-server

# Create a shared directory.
sudo mkdir /mnt/raid10/shared-nfs
sudo chmod -R 777 /mnt/raid10/shared-nfs

# Add the line below to the bottom of the /etc/exports file
sudo nano /etc/exports

/mnt/raid10/shared-nfs *(rw,all_squash,insecure,async,no_subtree_check,anonuid=1000,anongid=1000)

# Update NFS exports after saving the file.
sudo exportfs -ra

# Connect to server from Mac (⌘-K in Finder):
nfs://10.0.100.119/mnt/raid10/shared-nfs
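
# Or, from another Linux client (a sketch, using the Pi's IP from above):
sudo mkdir -p /mnt/pi-nfs
sudo mount -t nfs 10.0.100.119:/mnt/raid10/shared-nfs /mnt/pi-nfs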

Example atop output during peak of file copy using NFS:

atop-nfs-large-file-copy

Benchmark setup

Each benchmark was run three times, and the result averaged.

Large file benchmark

Using a 7.35 GB .img file:

pv 2020-08-20-raspios-buster-armhf-full.img > /Volumes/shared-[type]/2020-08-20-raspios-buster-armhf-full.img

Folder with many files benchmark

Using a folder with 1,478 images and video clips totaling 1.93 GB:

time cp -R old-sd-card-backup /Volumes/shared-[type]
geerlingguy commented 3 years ago

Two last things I want to test:

geerlingguy commented 3 years ago

For NFS threads:

# Change RPCNFSDCOUNT from 8 to 1.
sudo nano /etc/default/nfs-kernel-server

# Restart nfsd.
sudo systemctl restart nfs-kernel-server

# Confirm there's now one thread.
ps aux | grep nfsd

And the result? Even with only one thread, I was able to hit 900+ Mbps and sustain 105+ MB/sec with NFS (though the single thread was hitting 75-100% CPU usage on one core now).

So something about the NFS protocol seems to be slightly more efficient than Samba—at least on Linux—in general, regardless of the threading model.

geerlingguy commented 3 years ago

Energy consumption (4x Kingston SSD via dedicated AC adapter + IO Board, CM4, IOCrest card via AC adapter):

Screen Shot 2020-11-30 at 11 12 54 AM
geerlingguy commented 3 years ago

One more thing I was wondering—is there a technical reason to partition the drives before adding them to the array (vs. just using sda/sdb/etc.)? This SO answer about creating an array using partitions vs. the whole disk seemed to have a few good arguments in favor of pre-partitioning.