geerlingguy / raspberry-pi-pcie-devices

Raspberry Pi PCI Express device compatibility database
http://pipci.jeffgeerling.com
GNU General Public License v3.0

NAS Comparison - ASUSTOR Drivestor 4 Pro vs Pi CM4 #162

Closed geerlingguy closed 2 years ago

geerlingguy commented 3 years ago

So this one should be a bit more interesting...

After seeing my earlier ASUSTOR vs Pi CM4 NAS videos, ASUSTOR sent me their Drivestor 4 Pro - AS3304T, which is even more directly comparable to the CM4-based NAS I built:

[image: banner_model]

A quick specs comparison:

| Part | ASUSTOR Drivestor 4 Pro | Pi CM4-based NAS |
| --- | --- | --- |
| CPU | Realtek RTD1296 Quad-Core 1.4GHz | Broadcom BCM2711 Quad-core 1.5GHz |
| RAM | 2 GB DDR4 2400MHz | 2/4/8 GB DDR4 3200 MHz |
| Ethernet | 2.5 GbE (Realtek RTL8125 via PCIe 2.0) | 1 GbE (+1x 2.5 GbE via PCIe 2.0) |
| WiFi | N/A | Optional |
| I/O | 3x USB 3.0 | 2x USB 3.0, 2x USB 2.0, 2x HDMI 2.0, A/V out |
| Wake-on-LAN? | Yes | No |
| Enclosure + hot-swap bays | Yes | No |
| Tool-free install | Yes | No |
| OS | ADM 4.x | Raspberry Pi OS + OMV |
| Price | $329.99 | $35 (+ IO Board, + enclosure, + SATA controller, etc.) |

Both use ARM architecture CPUs, unlike the Lockerstor 4 I tested previously (that one was AMD64)—and this brings a few wrinkles: the Drivestor can't do things like run VirtualBox VMs, and some software that worked on other models before might not work on this model due to the aarch64 platform.

A few other notable differences with ADM 4.0 (I am still on 3.x on my Lockerstor):

A few questions I'd like to answer:

  1. How do both compare on a 1 Gbps network (e.g. Wiretrustee SATA vs this unit)? Power consumption?
  2. How fast can I get them with a set of SSDs? (I have a bunch of 2 TB SATA SSDs since I retired my old Mac mini...)
  3. Compare to AS4004T — seems like the primary difference is a 10 GbE port... but is that Marvell CPU even capable of saturating 10 GbE? Especially if paired up with two 1 Gbps connections too?
  4. How are the drive trays compared to the AS4004T? (Screwless for 3.5" drives, but the brackets are a little finicky and I wouldn't count on them lasting forever).
geerlingguy commented 3 years ago

And this isn't something to put directly on the site, but I'm planning on doing more testing with some PCI Express gear in building another version of the Pi NAS... and might also take a peek at seeing if I could fit the Pi board into the NAS enclosure (which is nice and small, and would be perfect for a Raspberry Pi board!)

mi-hol commented 3 years ago

With this setup we should get an apples-to-apples comparison. I like the approach :)

geerlingguy commented 2 years ago

Setting up RAID 5 on the QVO 8TB SSDs, it looks like the performance here is right in line with what I got on the Radxa Taco—a sync at around 95 MB/sec, which means it's hitting the throughput limit on the PCIe x1 lane in that RTD1296 chip.


(Noting that on the Lockerstor 4, the sync is going at 194 MB/sec, which seems to indicate double the throughput vs the Drivestor 4.)
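
(For reference: md throttles resyncs by default, so you can watch and tune the speed. These are the standard kernel knobs, nothing Pi- or NAS-specific:)

# Watch resync progress and current speed:
watch -n 5 cat /proc/mdstat

# Per-device resync floor/ceiling, in KB/sec:
sudo sysctl -w dev.raid.speed_limit_min=50000
sudo sysctl -w dev.raid.speed_limit_max=500000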

geerlingguy commented 2 years ago

One observation about the fan: it seems to hover around 960 rpm at 'low'. I noticed the magnetic front panel covers up all the direct ventilation—it stands off a tiny bit, but it still blocks basically all the direct airflow.

If I pull it off, I feel a lot more air going between the drives. Probably wouldn't leave that panel on, especially if using large, hot HDDs (I think it looks cooler with it off, too).

But the Lockerstor 4 idles around 500 rpm... it's a bit quieter, but I think that's mostly down to the lower 'low' rpm.

High-speed fan mode is quite loud and pulls quite a bit of air :)

I'm considering opening up the case and looking into a nice Noctua replacement fan. Maybe.

geerlingguy commented 2 years ago

Contender #2 is going to be the Radxa Taco (which I previously tested in my 48 TB Pi NAS build). I have 4x 4TB Seagate IronWolf NAS drives in it. I'm going to install OMV and see how it fares.


Kill-a-Watt

First thing to note: all four drives spun up simultaneously (there was no staggered spinup), so I'm sure the initial power surge is kinda hefty... but they all spun up, so at least the board can supply the power needed. With my Kill-A-Watt, I'm seeing:

OMV and NAS setup

pi@omv:~ $ lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE MOUNTPOINT
sda           8:0    0  3.6T  0 disk 
sdb           8:16   0  3.6T  0 disk 
sdc           8:32   0  3.6T  0 disk 
sdd           8:48   0  3.6T  0 disk 
mmcblk0     179:0    0 14.8G  0 disk 
├─mmcblk0p1 179:1    0  256M  0 part /boot
└─mmcblk0p2 179:2    0 14.6G  0 part /
nvme0n1     259:0    0  7.3T  0 disk 

Heh... don't look too closely at that NVMe drive size. I'm going to see about using it as a cache.

Process for bringing up OMV:

  1. Flash 64-bit Pi OS to microSD card.
  2. Boot CM4 4GB Lite module with said card.
  3. Run sudo apt-get update && sudo apt-get upgrade, then sudo reboot
  4. Install OMV: wget -O - https://github.com/OpenMediaVault-Plugin-Developers/installScript/raw/master/install | sudo bash
  5. After install completes, the Pi will automatically reboot.
  6. Visit the OMV UI in the browser (e.g. http://omv.local/), and login with default credentials admin / openmediavault.
  7. Set up a RAID 5 volume in the UI with three drives (yay, the tradition continues of 1/4 of the IronWolf NAS drives being DOA...).
  8. Wait for resync.
  9. Set up bcache — see next comment for progress

Aside: OMV's UI has undergone quite a refresh, though the "you have changes to apply" banner that pops under the top of the screen is still annoying. Just save the changes when I click save! :P Also, there's still no way to restart an md RAID resync after a reboot in the UI; I have to log in and run sudo mdadm --readwrite /dev/md0 via SSH...

Initial thought for the storage array is either to use openmediavault-zfs and create a RAIDZ1 pool with the NVMe SSD as cache, or to do straight-up RAID 5 plus bcache (not sure if either is fully supported in OMV's UI).
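
(If I go the ZFS route, the pool creation would be something like the following; a sketch using the device names from lsblk above, not commands I've run yet:)

# RAIDZ1 across the three good drives, NVMe as L2ARC read cache:
sudo zpool create tank raidz1 /dev/sda /dev/sdb /dev/sdc
sudo zpool add tank cache /dev/nvme0n1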

Wish I could try out TrueNAS, but that's still x86-only (maybe that will change?), for silly reasons like "everyone on the market currently runs on x86 hardware", pfft.

geerlingguy commented 2 years ago

Setting up bcache:

pi@omv:~ $ sudo apt-get install bcache-tools
...

pi@omv:~ $ sudo make-bcache -B /dev/md0
UUID:           eb360a2d-4c62-451d-8549-a68621c633e5
Set UUID:       c8b5c63c-0a44-49f3-bb65-cd4df9b751a0
version:        1
block_size:     1
data_offset:        16

pi@omv:~ $ sudo make-bcache -C /dev/nvme0n1
UUID:           15bf54e9-be21-4478-b676-a08dad937963
Set UUID:       dea419ba-d795-4566-b01f-bb57fa96eb21
version:        0
nbuckets:       15261770
block_size:     1
bucket_size:        1024
nr_in_set:      1
nr_this_dev:        0
first_bucket:       1

So I could run make-bcache using bcache-tools, but while I'd hoped the bcache module might actually be enabled in the Pi OS kernel, it looks like it's not. That would need a kernel recompile.

My idea is:

  1. Set up RAID 5 + NVMe cache by creating a bcache volume.
  2. Benchmark with that volume (write through? Or just read cache?) — see the sysfs sketch after this list.
  3. Detach the cache drive
  4. Benchmark without that volume.
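
(For steps 2-4, bcache exposes the relevant knobs via sysfs; a rough sketch, with the cache set UUID as a placeholder:)

# Show cache modes; the active one is shown in [brackets]:
cat /sys/block/bcache0/bcache/cache_mode

# Switch between writeback and writethrough:
echo writeback | sudo tee /sys/block/bcache0/bcache/cache_mode
echo writethrough | sudo tee /sys/block/bcache0/bcache/cache_mode

# Detach the cache set for the uncached run:
echo <cache-set-uuid> | sudo tee /sys/block/bcache0/bcache/detach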
geerlingguy commented 2 years ago

Time to recompile the kernel! Here's my menuconfig changes:

# First, enable bcache.
> Device Drivers
  > Multiple devices driver support (RAID and LVM)
    > Block device as cache (BCACHE)

# Second, add in RTL8125 2.5G Ethernet support.
> Device Drivers
  > Network device support
    > Ethernet driver support
      > Realtek devices
        > Realtek 8169/8168/8101/8125 ethernet support

Note: The RTL8125 support is already enabled upstream in the latest Pi OS kernel. Just hasn't made its way down to the default Pi OS distro/image kernel yet :( — you can also install the driver manually instead of recompiling the kernel, if you just need 2.5G support.

Recompiling now...
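
(The build itself is roughly the standard native 64-bit process from the Pi documentation; the branch and paths here are my assumptions:)

git clone --depth=1 --branch rpi-5.10.y https://github.com/raspberrypi/linux
cd linux
make bcm2711_defconfig
make menuconfig   # enable BCACHE and the Realtek 8125 option as above
make -j4 Image modules dtbs
sudo make modules_install
sudo cp arch/arm64/boot/dts/broadcom/*.dtb /boot/
sudo cp arch/arm64/boot/dts/overlays/*.dtb* /boot/overlays/
sudo cp arch/arm64/boot/Image /boot/kernel8.img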

geerlingguy commented 2 years ago

(Aside: I just noticed OMV must take over control of /etc/ssh/sshd_config—it was wiped clean and now has some rather insecure defaults, like PermitRootLogin yes instead of prohibit-password, along with PasswordAuthentication yes.)
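
(If you hit the same thing, restoring saner values is quick; these are standard sshd_config options:)

# In /etc/ssh/sshd_config, then: sudo systemctl restart ssh
PermitRootLogin prohibit-password
PasswordAuthentication no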

geerlingguy commented 2 years ago
pi@omv:~ $ sudo mkfs.ext4 /dev/bcache0
...

pi@omv:~ $ sudo mount /dev/bcache0 /mnt

pi@omv:~ $ lsblk
NAME        MAJ:MIN RM  SIZE RO TYPE  MOUNTPOINT
sda           8:0    0  3.6T  0 disk  
└─md0         9:0    0  7.3T  0 raid5 
  └─bcache0 254:0    0  7.3T  0 disk  /mnt
sdb           8:16   0  3.6T  0 disk  
└─md0         9:0    0  7.3T  0 raid5 
  └─bcache0 254:0    0  7.3T  0 disk  /mnt
sdc           8:32   0  3.6T  0 disk  
└─md0         9:0    0  7.3T  0 raid5 
  └─bcache0 254:0    0  7.3T  0 disk  /mnt
mmcblk0     179:0    0 14.8G  0 disk  
├─mmcblk0p1 179:1    0  256M  0 part  /boot
└─mmcblk0p2 179:2    0 14.6G  0 part  /
nvme0n1     259:0    0  7.3T  0 disk  

pi@omv:~ $ cat /sys/block/bcache0/bcache/state
no cache

When I try to attach the NVMe drive, I get:

pi@omv:~ $ sudo make-bcache -C /dev/nvme0n1
Can't open dev /dev/nvme0n1: Device or resource busy
geerlingguy commented 2 years ago

I had to unregister nvme0n1 following the directions in the bcache documentation under "Remove or replace a caching device".
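
(Per those docs, the sequence is roughly the following; substitute the real cache set UUID:)

# Detach the cache from the backing device, then unregister the cache set:
echo <cache-set-uuid> > /sys/block/bcache0/bcache/detach
echo 1 > /sys/fs/bcache/<cache-set-uuid>/unregister

# Wipe the old superblock before re-running make-bcache:
sudo wipefs -a /dev/nvme0n1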

Then I reattached it:

# umount /mnt

# make-bcache -C /dev/nvme0n1
UUID:           6d9e32ad-498a-4fe5-a0b7-86a66d01aaa6
Set UUID:       9e59a381-40cb-43d9-bd66-c77586977759
version:        0
nbuckets:       15261770
block_size:     1
bucket_size:        1024
nr_in_set:      1
nr_this_dev:        0
first_bucket:       1

# cd /sys/block/md0/bcache/
# echo 9e59a381-40cb-43d9-bd66-c77586977759 > attach
# cat state 
clean

# mount /dev/bcache0 /mnt

(I think I just had never attached the cache device properly in the first place...).

Bcache tips:

# Get stats
tail /sys/block/bcache0/bcache/stats_total/*
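
(The individual counters in there, with names per the bcache sysfs docs, are handy for computing a hit rate:)

cat /sys/block/bcache0/bcache/stats_total/cache_hits
cat /sys/block/bcache0/bcache/stats_total/cache_misses
cat /sys/block/bcache0/bcache/stats_total/cache_hit_ratio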
geerlingguy commented 2 years ago

Since I don't want to forget any of this, I wrote up a guide: Use bcache for SSD caching on a Raspberry Pi.

geerlingguy commented 2 years ago

Apparently if you set up a volume the way I did via the CLI, OMV won't see it, and you can't manage it via the UI. Oopsie! Going to set up a Samba share via the CLI instead.
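
(A minimal CLI share definition; the share name and path here are placeholders, not necessarily what I used:)

# Append to /etc/samba/smb.conf, then: sudo systemctl restart smbd
[pinas]
   path = /mnt
   read only = no
   browseable = yes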

geerlingguy commented 2 years ago

I wonder if OMV puts the network controller into some strange state—it seemed to take over interface management, and I had to add the 2.5G interface in OMV's UI:

[screenshot: Screen Shot 2021-12-17 at 10 16 48 AM]

And unlike my testing on Pi OS directly, I seemed to be hitting IRQ handling maxing out a CPU core on the 2.5G connection, limiting the bandwidth to 1.88 Gbps:

pi@omv:~ $ iperf3 -c 10.0.100.100
Connecting to host 10.0.100.100, port 5201
[  5] local 10.0.100.199 port 42720 connected to 10.0.100.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   220 MBytes  1.84 Gbits/sec    0    501 KBytes       
[  5]   1.00-2.00   sec   225 MBytes  1.89 Gbits/sec    0    501 KBytes       
[  5]   2.00-3.00   sec   225 MBytes  1.89 Gbits/sec    0    523 KBytes       
[  5]   3.00-4.00   sec   225 MBytes  1.89 Gbits/sec    0    549 KBytes       
[  5]   4.00-5.00   sec   226 MBytes  1.89 Gbits/sec    0    549 KBytes       
[  5]   5.00-6.00   sec   225 MBytes  1.88 Gbits/sec    0    576 KBytes       
[  5]   6.00-7.00   sec   225 MBytes  1.89 Gbits/sec    0    576 KBytes       
[  5]   7.00-8.00   sec   225 MBytes  1.89 Gbits/sec    0    576 KBytes       
[  5]   8.00-9.00   sec   226 MBytes  1.90 Gbits/sec    0    576 KBytes       
[  5]   9.00-10.00  sec   225 MBytes  1.89 Gbits/sec    0    576 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.19 GBytes  1.88 Gbits/sec    0             sender
[  5]   0.00-10.00  sec  2.19 GBytes  1.88 Gbits/sec                  receiver

I wonder if I should just ditch OMV for the testing :/

geerlingguy commented 2 years ago

Disk Benchmarks

Using my disk-benchmark.sh script.

With bcache enabled, writeback mode:

| Benchmark | Result |
| --- | --- |
| fio 1M sequential read | 345 MB/s |
| iozone 1M random read | 366.80 MB/s |
| iozone 1M random write | 373.88 MB/s |
| iozone 4K random read | 40.00 MB/s |
| iozone 4K random write | 53.11 MB/s |

With bcache disabled, none mode:

| Benchmark | Result |
| --- | --- |
| fio 1M sequential read | 354 MB/s |
| iozone 1M random read | 75.02 MB/s |
| iozone 1M random write | 30.87 MB/s |
| iozone 4K random read | 1.35 MB/s |
| iozone 4K random write | 0.36 MB/s |

SMB Network Copy Tests

Using my rsync network copy test.

With bcache enabled, writeback mode:

# Mac to Pi (write)
sent 8.59G bytes  received 35 bytes  70.72M bytes/sec
total size is 8.59G  speedup is 1.00

# Pi to Mac (read)
sent 8.59G bytes  received 35 bytes  110.86M bytes/sec
total size is 8.59G  speedup is 1.00

With bcache disabled, none mode:

# Mac to Pi (write)
sent 8.59G bytes  received 35 bytes  70.14M bytes/sec
total size is 8.59G  speedup is 1.00

# Pi to Mac (read)
sent 8.59G bytes  received 35 bytes  105.42M bytes/sec
total size is 8.59G  speedup is 1.00

PCIe Bus / Networking Benchmark

(Used iperf3 -c 10.0.100.100 -tinf and sudo fio --filename=/dev/bcache0 --direct=1 --rw=read --bs=1024k --ioengine=libaio --iodepth=64 --size=16G --runtime=120 --numjobs=4 --group_reporting --name=fio-rand-read-sequential --eta-newline=1 --readonly).

geerlingguy commented 2 years ago

I think I have the data I want from the Taco, on to the ASUSTOR!

Since it seems like some of the apps like iperf3 in App Central aren't available on aarch64/ARM64, I installed docker-ce via App Central and ran tests through there.

Kill-A-Watt

Disk Benchmarks

These benchmarks were run inside a Docker container:

docker run -it -v /dev/md1:/dev/md1 -v /volume1/home/admin:/volume1/home/admin --privileged debian:bullseye /bin/bash

Then I downloaded my disk-benchmark.sh script (modified to remove sudo references) and ran it with:

DEVICE_UNDER_TEST=/dev/md1 DEVICE_MOUNT_PATH=/volume1/home/admin ./disk-benchmark.sh

| Benchmark | Result |
| --- | --- |
| fio 1M sequential read | 251 MB/s |
| iozone 1M random read | 62.55 MB/s |
| iozone 1M random write | 110.13 MB/s |
| iozone 4K random read | 1.47 MB/s |
| iozone 4K random write | 3.72 MB/s |

SMB Network Copy Tests

Using my rsync network copy test.

# Mac to ASUSTOR (write)
sent 8.59G bytes  received 35 bytes  89.97M bytes/sec
total size is 8.59G  speedup is 1.00

# ASUSTOR to Mac (read)
sent 8.59G bytes  received 35 bytes  144.40M bytes/sec
total size is 8.59G  speedup is 1.00

Note: writes on the ASUSTOR were more consistent, with little fluctuation or 'dead times' when it seemed interrupts were stacked up and queues/caches were clearing. Also, I re-tested a couple giant copies with large video folders to confirm the speeds, and they seemed consistent with rsync measurements in the Terminal.

PCIe Bus / Networking Benchmark

For fio via Docker, I used (after confirming md1 was the proper volume with mdadm -D /dev/md1):

docker run -it -v /dev/md1:/dev/md1 --privileged manjo8/fio fio --filename=/dev/md1 --direct=1 --rw=read --bs=1024k --ioengine=libaio --iodepth=64 --size=8G --runtime=60 --numjobs=4 --group_reporting --name=fio-rand-read-sequential --eta-newline=1 --readonly

For iperf3 via Docker, I used: docker run -it ajoergensen/iperf3 -c 10.0.100.100

geerlingguy commented 2 years ago

Annoyances benchmarking the ASUSTOR:

geerlingguy commented 2 years ago

Hmm... looking at my performance numbers and comparing everything back to the Taco: https://github.com/geerlingguy/raspberry-pi-pcie-devices/issues/268#issuecomment-965555364 — in that thread I used the RTL driver from Realtek's website, instead of the kernel module that I compiled by hand from the Pi linux tree... and I got 2.35 Gbps...

So maybe I need to do a little re-testing using Realtek's driver instead of the in-kernel one. Maybe Realtek's driver has optimizations that are in the 5.11/12/13/14/15 sources but aren't in 5.10?

I was also getting 80 MB/s writes and 125.43 MB/s reads on the Taco with the RTL driver instead of the in-kernel driver, which is faster than the 70/110 I got here. All these numbers seem to be 15-25% better with Realtek's driver :/

geerlingguy commented 2 years ago

Getting the Realtek 2.5G NIC working using its driver instead of the one in the 5.10 kernel source:

  1. Download the 2.5G Ethernet LINUX driver r8125 for kernel up to 5.6 driver version 9.007.01 (had to solve an annoying math captcha first).

  2. Install kernel headers: sudo apt-get install -y raspberrypi-kernel-headers

  3. Run:

     tar vjxf r8125-9.007.01.tar.bz2
     cd r8125-9.007.01/
     sudo ./autorun.sh
  4. Install iperf3 again: sudo apt install -y iperf3

  5. Run test:

pi@taco:~ $ iperf3 -c 10.0.100.100
Connecting to host 10.0.100.100, port 5201
[  5] local 10.0.100.49 port 53116 connected to 10.0.100.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   274 MBytes  2.30 Gbits/sec    0    782 KBytes       
[  5]   1.00-2.00   sec   281 MBytes  2.36 Gbits/sec    0    782 KBytes       
[  5]   2.00-3.00   sec   280 MBytes  2.35 Gbits/sec    0    782 KBytes       
[  5]   3.00-4.00   sec   281 MBytes  2.35 Gbits/sec    0    782 KBytes       
[  5]   4.00-5.00   sec   280 MBytes  2.35 Gbits/sec    0    865 KBytes       
[  5]   5.00-6.00   sec   281 MBytes  2.35 Gbits/sec    0    865 KBytes       
[  5]   6.00-7.00   sec   280 MBytes  2.35 Gbits/sec    0    913 KBytes       
[  5]   7.00-8.00   sec   281 MBytes  2.35 Gbits/sec    0    913 KBytes       
[  5]   8.00-9.00   sec   280 MBytes  2.35 Gbits/sec    0    913 KBytes       
[  5]   9.00-10.00  sec   280 MBytes  2.35 Gbits/sec    0    913 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.73 GBytes  2.35 Gbits/sec    0             sender
[  5]   0.00-10.01  sec  2.73 GBytes  2.34 Gbits/sec                  receiver

(2.2 Gbps in the opposite direction.)
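
(Quick way to confirm which driver is actually bound, Realtek's r8125 vs the in-kernel r8169:)

lsmod | grep r8125
sudo lspci -k | grep -iA 3 ethernet   # look for 'Kernel driver in use: r8125'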

Well I'll be! Going to have to rebuild the RAID 5 array and re-test everything, drat.

geerlingguy commented 2 years ago

I pulled the array out of the Drivestor 4 Pro, plugged it straight into the Taco, then used mdadm to recover and run the array on the Pi as /dev/md0:

pi@taco:~ $ sudo mdadm -A /dev/md0 /dev/sd{a,b,c}4 --run
mdadm: /dev/md0 has been started with 3 drives.

pi@taco:~ $ sudo mount /dev/md0 /mnt

pi@taco:~ $ ls /mnt
aquota.user  home  lost+found  Public  Web

Going to re-run some tests now.

geerlingguy commented 2 years ago

SMB Network Copy

Re-testing Taco SMB copy tests with bcache disabled:

# Mac to Pi (write)
sent 8.59G bytes  received 35 bytes  75.70M bytes/sec
total size is 8.59G  speedup is 1.00

# Pi to Mac (read)
sent 8.59G bytes  received 35 bytes  97.09M bytes/sec
total size is 8.59G  speedup is 1.00

With bcache enabled (TODO).

PCIe Bus / Networking Benchmark

(Used iperf3 -c 10.0.100.100 -tinf and sudo fio --filename=/dev/md0 --direct=1 --rw=read --bs=1024k --ioengine=libaio --iodepth=64 --size=16G --runtime=120 --numjobs=4 --group_reporting --name=fio-rand-read-sequential --eta-newline=1 --readonly).

geerlingguy commented 2 years ago

More discussion on https://github.com/raspberrypi/linux/issues/4133 — seems like there are some strange things afoot with Samba performance on the Pi; not sure what's going on there, but it should be faster.

I was also trying out the bcmstat script by @MilhouseVH today, found one little snafu... https://github.com/MilhouseVH/bcmstat/issues/23

geerlingguy commented 2 years ago

On the Taco / Pi OS, I just rebuilt the kernel with the in-tree driver in rpi-5.15.y linux, and after a reboot:

pi@taco15:~ $ uname -a
Linux taco15 5.15.10-v8+ #1 SMP PREEMPT Mon Dec 20 03:16:02 UTC 2021 aarch64 GNU/Linux

pi@taco15:~ $ dmesg | grep r8169
[    4.714152] r8169 0000:04:00.0: enabling device (0000 -> 0002)
[    4.783518] r8169 0000:04:00.0: can't read MAC address, setting random one
[    4.818143] libphy: r8169: probed
[    4.822214] r8169 0000:04:00.0 eth1: RTL8125B, 9e:57:a0:b7:ee:01, XID 641, IRQ 73
[    4.822247] r8169 0000:04:00.0 eth1: jumbo features [frames: 9194 bytes, tx checksumming: ko]
[    6.791432] RTL8226B_RTL8221B 2.5Gbps PHY r8169-0-400:00: attached PHY driver (mii_bus:phy_addr=r8169-0-400:00, irq=MAC)
[    6.991585] r8169 0000:04:00.0 eth1: Link is Down
[   47.452192] r8169 0000:04:00.0 eth1: Link is Up - 2.5Gbps/Full - flow control rx/tx

pi@taco15:~ $ iperf3 -c 10.0.100.100
Connecting to host 10.0.100.100, port 5201
[  5] local 10.0.100.105 port 40554 connected to 10.0.100.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   164 MBytes  1.37 Gbits/sec    0    171 KBytes       
[  5]   1.00-2.00   sec   168 MBytes  1.40 Gbits/sec    0    171 KBytes       
[  5]   2.00-3.01   sec   168 MBytes  1.40 Gbits/sec    0    182 KBytes       
[  5]   3.01-4.00   sec   168 MBytes  1.41 Gbits/sec    0    182 KBytes       
[  5]   4.00-5.00   sec   169 MBytes  1.41 Gbits/sec    0    182 KBytes       
[  5]   5.00-6.01   sec   169 MBytes  1.41 Gbits/sec    0    201 KBytes       
[  5]   6.01-7.00   sec   166 MBytes  1.40 Gbits/sec    0    273 KBytes       
[  5]   7.00-8.01   sec   168 MBytes  1.40 Gbits/sec    0    273 KBytes       
[  5]   8.01-9.00   sec   165 MBytes  1.38 Gbits/sec    0    273 KBytes       
[  5]   9.00-10.00  sec   165 MBytes  1.38 Gbits/sec    0    273 KBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  1.63 GBytes  1.40 Gbits/sec    0             sender
[  5]   0.00-10.01  sec  1.63 GBytes  1.40 Gbits/sec                  receiver

Very strange—I wonder if something in the rpi kernel fork for 5.15 is screwed up in terms of PCIe or networking that isn't an issue in the 5.10 kernel? I did a clean clone of the repo and checked out the tip of rpi-5.15.y.

geerlingguy commented 2 years ago


I also popped apart the ASUSTOR Drivestor 4 and explored its innards:

geerlingguy commented 2 years ago

Video and blog post are coming up tomorrow :)

geerlingguy commented 2 years ago

Video and blog post are up:

formvoltron commented 2 years ago

Can the ASUSTOR Drivestor 4 handle a 2 drive failure?

geerlingguy commented 2 years ago

@formvoltron - That depends on which two drives and which RAID type you have set up :)

formvoltron commented 2 years ago

I was thinking of RAID 5.


geerlingguy commented 2 years ago

@formvoltron - With RAID 5, you can have one drive failure. You have to replace the failed drive and wait for it to be incorporated into the degraded array. RAID 6 or RAID 10 are safer if you are worried about more than one drive failure. RAID 5 is often not recommended for very large hard drives nowadays.
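
For example, with four 4 TB drives: RAID 5 gives ~12 TB usable and survives one drive failure; RAID 6 gives ~8 TB and survives any two; RAID 10 gives ~8 TB and survives two failures only if they land in different mirror pairs.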

formvoltron commented 2 years ago

First I'd heard of RAID 6. Reading up on it, it sounds like exactly what I'd want for ultra-reliable storage: handling two drive failures. Thank you for the excellent YT vid & review. If I were a younger man I'd go for the Pi. But seeing that I'm greying and crotchety, I'd certainly opt for the ready-made NAS.

ThomasKaiser commented 2 years ago

Wrt your benchmarking:

geerlingguy commented 2 years ago

@ThomasKaiser - For the benchmark monitoring, I ran the tests in three different conditions: 1. monitoring with atop at 2s intervals; 2. monitoring with top at 2s intervals; and 3. not running any monitoring tool at all during the benchmark. I do the third test because even though the impact should be minimal, injecting the calls to get monitoring data could conceivably affect performance (no matter how minimally).

Nothing else was running during the benchmarks (I actually ran them all again without OMV installed at all) besides what is preinstalled in the lite Pi OS 64-bit image.

For the network file copy I used this rsync command, and I also set up a separate comparison (that I didn't fully document in this issue) where I did the following:

  1. Took a 36 GB folder with about 1,200 files in it—one of my video projects.
  2. Started a screen recording in iShowU Instant on my mac at 60fps.
  3. Measured the time from dropping the file onto the NAS in the Finder to the moment the copy progress dialog disappeared, then used that time and the exact folder size in bytes to get a MB/s rating.
  4. Did the same copy operation using rsync to copy the folder.

And the final results between rsync and Finder were within about 1s of each other (which surprised me... it felt like rsync was slower, just watching my network graph in iStat Menus). I repeated that test twice.

I haven't done any more advanced checking of the SMB connection details—but it seems other people in the Pi community have noticed similar issues with samba file copies not being as fast as they were a year or two ago.

ThomasKaiser commented 2 years ago

As for the rsync command and your mention that there's no easy way to time Finder copies: here's a q&d script I wrote for a colleague a year ago:

(** The server path – in this case /Volumes/Server/lantest/ – needs to be world-
writeable, and there need to be two files inside named 100M and 1G with the
appropriate sizes. They must not consist of zeros but of true data, e.g. from /dev/urandom **)

set FileSize to "100M"
-- set FileSize to "1G"
set OriginalFile to "Server:lantest:" & FileSize
set DestinationFolder to (path to desktop)

-- Download test: server -> desktop
set StartTime to do shell script "perl -MTime::HiRes=time -e 'printf \"%.1f\", time'"
with timeout of 100000 seconds
    tell application "Finder"
        duplicate (OriginalFile as alias) to (DestinationFolder as alias) with replacing
    end tell
end timeout
set EndTime to do shell script "perl -MTime::HiRes=time -e 'printf \"%.1f\", time'"
set DownTimeDifference to do shell script "echo " & EndTime & " - " & StartTime & " | bc"

-- Upload test: desktop -> server (EndTime doubles as the upload start timestamp)
with timeout of 100000 seconds
    tell application "Finder"
        duplicate (((DestinationFolder as string) & FileSize) as alias) to ("Server:lantest:" as alias) with replacing
    end tell
end timeout
set StartTime to do shell script "perl -MTime::HiRes=time -e 'printf \"%.1f\", time'"
set UpTimeDifference to do shell script "echo " & StartTime & " - " & EndTime & " | bc"

-- Report both durations and copy the result to the clipboard
set LogMessage to ((DownTimeDifference as string) & " Sec down, " & UpTimeDifference as string) & " Sec up."
log LogMessage
set the clipboard to LogMessage

The requirement for 'real data' instead of zeroes (as you get with mkfile) was due to testing through different VPN solutions with different compression algorithms / efficiency.

But still, using either Finder or rsync misses the fact that you want to know which block sizes are used when doing sequential transfer tests. As already mentioned: using Helios Lantest is a good idea for this.

ThomasKaiser commented 2 years ago

Wrt monitoring: yep, both atop and top can add significant load, and AFAIK neither tool shows real CPU clockspeeds (just relative CPU utilisation, which is somewhat meaningless since a Pi busy at 100% at 600 MHz has less computing power than a Pi at 50% / 1800 MHz).

FlyingHavoc commented 2 years ago

> Wrt monitoring: yep, both atop and top can add significant load, and AFAIK neither tool shows real CPU clockspeeds (just relative CPU utilisation, which is somewhat meaningless since a Pi busy at 100% at 600 MHz has less computing power than a Pi at 50% / 1800 MHz).

You can try the new utility btop, which is a great replacement for top with extra features (C++; the Linux binaries are statically compiled): https://github.com/aristocratos/btop

Using this tool, you can see the actual CPU clock speed on the RPi 4 too.

ThomasKaiser commented 2 years ago

> You can try the new utility btop

Does this utility support querying ThreadX on the RPi? Otherwise it's rather useless here.

Even if btop tries to display the actual CPU clockspeeds, on all Raspberries you can't do this 'the Linux way' (relying on numbers from sysfs); you need to use a mailbox interface to ask the main OS (an RTOS called ThreadX that runs on the VideoCore CPU cores, not on the ARM cores). The Linux kernel on the RPi has no idea at which clockspeeds the ARM cores are actually running. That's why monitoring this is also important.

Sample output from sbc-bench -m on a RPi 3B+:

Time        fake/real   load %cpu %sys %usr %nice %io %irq   Temp   VCore
17:35:30: 1570/1200MHz  4.26  20%   0%  17%   0%   1%   0%  64.5°C  1.3312V
17:36:00: 1570/1200MHz  3.92  72%   1%  71%   0%   0%   0%  69.3°C  1.3312V
17:36:30: 1570/1200MHz  3.56  82%   1%  81%   0%   0%   0%  70.9°C  1.3312V
17:37:01: 1570/1200MHz  3.88  89%   1%  87%   0%   0%   0%  73.1°C  1.3312V
17:37:31: 1570/1200MHz  3.63  75%   1%  74%   0%   0%   0%  72.5°C  1.3312V

(the user thinks he's done some nasty overclocking, since 1570 MHz is reported by all Linux tools, while in reality ThreadX has silently clocked the ARM cores down to 1.2 GHz)
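
(On the Pi, the way to ask the firmware directly is vcgencmd, which uses that mailbox interface; compare it against what cpufreq claims:)

# What Linux/cpufreq thinks the ARM clock is (kHz):
cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_cur_freq

# What the firmware (ThreadX) reports as the real ARM clock (Hz):
vcgencmd measure_clock arm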

geerlingguy commented 2 years ago

@ThomasKaiser - There are two different goals in benchmarking, I think—and I am usually targeting a different goal in my tests than I think you may be.

First goal: what kind of performance can be reasonably expected doing end-user tasks on a system set up by an end user, like dragging a file to a NAS in the Finder, or synchronizing two directories on the command line with rsync.

Second goal: What is the reasonably stable measurable performance you can get with a known baseline.

I think your suggestions would help with the second, but my target is usually the first. Ideally you can meet both goals to give a full picture, but the tests that went into my review/comparison were more targeting the first, and I didn't take the time to target the second.

(And in reality, the two goals are usually mixed/intertwined a bit.)

I normally want whatever numbers I show people on screen and in my public blog posts to reflect the ground truth of what they'd get if they followed a tutorial and got all the default stuff running, then dragged one of their own files or folders over to a NAS.

There's definitely room for both numbers, though—and that's why I love digging into deeper articles from sites like anandtech, and benchmarks from you :)

(I just wanted to make that clear—I am always interested in expanding the benchmarks I run and being able to have a better understanding of surprising 'real world' numbers like those I've been seeing on the Pi with Samba.)

mi-hol commented 2 years ago
> • noticed that a bunch of former optimisations that were in my 'OMV for SBC install script' have disappeared in the meantime...

@ThomasKaiser could you please elaborate on what optimizations you mean?

ThomasKaiser commented 2 years ago

@mi-hol just a quick list:

Anyway, the most important bit is the smb.conf settings (check with testparm):

min receivefile size = 16384
socket options = TCP_NODELAY IPTOS_LOWDELAY
use sendfile = Yes
getwd cache = yes
write cache size = 524288
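
(To verify which of these Samba actually picked up:)

testparm -s 2>/dev/null | grep -Ei 'sendfile|receivefile|socket options|getwd'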

The stuff below doesn't help with benchmarks, but it does in real-life NAS situations when the ondemand governor is chosen, since it helps keep the CPU cores at high clockspeeds when the client is the bottleneck (e.g. copying thousands of small files):

echo 1 >/sys/devices/system/cpu/cpufreq/ondemand/io_is_busy
echo 25 >/sys/devices/system/cpu/cpufreq/ondemand/up_threshold
echo 10 >/sys/devices/system/cpu/cpufreq/ondemand/sampling_down_factor
echo 200000 >/sys/devices/system/cpu/cpufreq/ondemand/sampling_rate

(things can take twice as long without io_is_busy set to 1, but honestly I haven't looked into this in years, since I added this stuff to Armbian).

Just did a quick check with my RPi 4 and Buster, 5.10.63-v8+ (aarch64), an armhf userland and Samba 4.9.5-Debian:

[screenshot: Bildschirmfoto 2021-12-23 um 15 52 07]

Nothing to complain about. Getting 90/100 MB/s with a single-threaded SMB copy with just 1MB block size is totally fine.

As already mentioned, block sizes matter. Quick testing through 1M, 4M and 16M:

Command line used: iozone -e -I -a -s 500M -r 1024k -r 4096k -r 16384k -i 0 -i 1
Output is in kBytes/sec
Time Resolution = 0.000001 seconds.
Processor cache size set to 1024 kBytes.
Processor cache line size set to 32 bytes.
File stride size set to 17 * record size.

    kB  reclen    write  rewrite    read    reread
512000    1024    89421    88773   105749  9784603
512000    4096   102437    86328   105403  8770580
512000   16384   110293   111368   106379 10872846

And this is the stuff Finder and Windows Explorer do on their own: they auto-tune settings and increase block sizes more and more until there's no further benefit. They also start multiple copies in parallel. As such it's useless to test for 'NAS performance' if you don't take this into account, or at least check it. You might be testing one NAS with a small block size and the other with a huge one, and then tell your audience the difference is caused by hardware (something that happens all the time with kitchen-sink benchmarking).

Speaking of Finder... using the AppleScript above (for Jeff), with a file created by dd if=/dev/urandom of=1G bs=1M count=1024 on the RPi, the results are as follows (3 consecutive tests):

9.4 Sec down, 10.0 Sec up.
9.4 Sec down, 10.0 Sec up.
9.4 Sec down, 9.9 Sec up. 


~100 MB/sec in both directions. Fine with me, especially since the switch in between is some crappy ALLNET thingy that's the oldest GbE gear lying around here.

ThomasKaiser commented 2 years ago

> There are two different goals in benchmarking

True: there's passive benchmarking (also called generating/collecting numbers and graphs for a target audience that wants some entertainment), and there's active benchmarking, which means a) getting a clue why the numbers are what they are, and b) getting an idea how to improve them.

As an example of passive benchmarking gone wrong: in your RPi Zero 2 W review you reported 221 Mbps maximum transfer rates for wired Ethernet. No idea how you generated that number, but most probably you did not benchmark the Pi but your USB Ethernet dongle, which is likely based on an ASIX AX88179 and not the only reasonable choice for the job: the RTL8153B.

Reporting that 221 Mbps number matches the 'what kind of performance can be reasonably expected doing end-user tasks on a system set up by an end user' expectation, since those end users could end up buying the 'wrong' Ethernet dongle.

But wouldn't it be better if said end users learned that there are huge differences between those dongles, and that there's no need to stick with such low numbers, since any thingy based on the Realtek chipset achieves ~100 Mbps more? :)

geerlingguy commented 2 years ago

@ThomasKaiser - I believe you're implying I'm throwing out meaningless numbers... but that's simply not the case. Unlike 99% of reviewers/entertainers, I thoroughly document every step in my process, every part number I use and test, every system I test on, and every command or script I run, so at a minimum you can reproduce my exact numbers (and I often do, dozens of times, before publishing any result).

That does not mean my 'entertainment' numbers are incorrect, or wrong. It may mean they are incomplete, or don't paint the whole picture when it comes to benchmarking—that's fine with me.

But they're not wrong ;)

Edit: Additionally, in my latest video, I did explicitly mention I'm not sure why the numbers have changed in the default kernel and OS configurations that ship with Pi OS / Debian.

FlyingHavoc commented 2 years ago

Great content, Jeff! However, I do agree with Thomas that the USB Ethernet adapter chipset might be important here too... It's like your findings with UAS mode on the RPi 4, where JMicron chips were causing all sorts of issues compared to the ASM1153.

geerlingguy commented 2 years ago

@FlyingHavoc - No doubt, and there are people who dive deep into testing every single chipset out there (and I'm actually doing something approaching that in this particular project, but only via PCIe, not USB-to-whatever)... but I have limited time and budget, so ideally other people can also do the tests and the information can be promulgated through GitHub issues, forum posts, blog posts, etc.

It would be great if there were more central resources (like my page for PCIe cards on the Pi) for USB chipset support for Network, SATA, etc., but there just isn't, so when I do my testing, I have to work within my means.

I basically have gone as far as I can this year, literally spending over $10,000 on different devices, test equipment, etc., and it's obvious (especially from the few comments above) that that is nowhere near enough to paint a complete picture of every type of device I test.

There are groups like the Linus Media Group investing hundreds of thousands (possibly millions) of dollars into benchmarking in a new lab, and they'll probably still be dwarfed by even a medium sized manufacturer's testing lab in terms of hours and resources.

All that to say, I'm doing my best, always trying to improve, but also realizing I'm one person, trying to help a community, in the best ways I can. And if my benchmarking is taken as being misleading, that's not my intention, and it's also often a matter of perspective.

geerlingguy commented 2 years ago

Also, as this thread is going wildly off course, I'm considering locking it unless discussion stays on topic: SMB performance, the RTL8125B chip, and Realtek's or Broadcom's SoC performance are welcome. As are discussions around using a Pi as a NAS, or the ASUSTOR's performance (especially regarding the Realtek driver).

If you want to point out flaws in myriad other devices (USB to SATA, USB to Ethernet, graphics, etc.), please either find a more relevant issue for it, or open a new issue or discussion (especially for general benchmarking).

ThomasKaiser commented 2 years ago

> you're implying I'm throwing out meaningless numbers

Nope. Sorry for not being more clear, or for appearing rude (non-native English speaker here, always accused of the same).

It's not a matter of 'wrong' numbers but of methodology. Quoting one of my personal heroes: "Casual benchmarking: you benchmark A, but actually measure B, and conclude you've measured C."

To stay on topic: as already mentioned, you need to monitor and/or control the environment the benchmarks are running in. And with NAS performance measurements, it's block size that matters. As such I tried to give some suggestions, like Lantest, using iozone to test with different block sizes, an AppleScript snippet to time Finder copies, and so on.

BTW: you do an amazing job with all your extremely well-documented testing, especially compared to those YT guys who ignore people still able to read. And you do a great educational job too (you're the one who introduced the concept of random I/O to the RPi world :) ). As such, please forgive me for criticising your methodology here and there... my goal is to gain insights and improve the overall situation in this area :)

geerlingguy commented 2 years ago

@ThomasKaiser - Okay, thanks :) — and like I said, I am always eager to do better. The AppleScript alone will save a bit of time—I thought I would be resigned to having to screen record and count frames forever :P

geerlingguy commented 2 years ago

Just wanted to mention I got a new build of ADM today to test the Realtek driver. So I'm going to check with iperf3 if it's any faster.

Before (ADM 4.0.1.ROG1):

root@asustor-arm:/volume1/home/admin # docker run -it ajoergensen/iperf3 -c 10.0.100.100
Connecting to host 10.0.100.100, port 5201
[  5] local 172.17.0.2 port 48634 connected to 10.0.100.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   227 MBytes  1.91 Gbits/sec    7   7.26 MBytes       
[  5]   1.00-2.00   sec   225 MBytes  1.89 Gbits/sec    0   7.26 MBytes       
[  5]   2.00-3.00   sec   225 MBytes  1.89 Gbits/sec    0   7.26 MBytes       
[  5]   3.00-4.00   sec   225 MBytes  1.89 Gbits/sec    0   7.26 MBytes       
[  5]   4.00-5.00   sec   225 MBytes  1.89 Gbits/sec    0   7.44 MBytes       
[  5]   5.00-6.00   sec   225 MBytes  1.89 Gbits/sec    0   7.44 MBytes       
[  5]   6.00-7.00   sec   224 MBytes  1.88 Gbits/sec    0   7.44 MBytes       
[  5]   7.00-8.00   sec   225 MBytes  1.89 Gbits/sec    0   7.44 MBytes       
[  5]   8.00-9.00   sec   225 MBytes  1.89 Gbits/sec    0   7.44 MBytes       
[  5]   9.00-10.00  sec   225 MBytes  1.89 Gbits/sec    0   7.44 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.20 GBytes  1.89 Gbits/sec    7             sender
[  5]   0.00-10.01  sec  2.20 GBytes  1.89 Gbits/sec                  receiver

After (custom ADM build 4.0.2.BPE1):

root@asustor-arm:/volume1/home/admin # docker run -it ajoergensen/iperf3 -c 10.0.100.100
Connecting to host 10.0.100.100, port 5201
[  5] local 172.17.0.2 port 44632 connected to 10.0.100.100 port 5201
[ ID] Interval           Transfer     Bitrate         Retr  Cwnd
[  5]   0.00-1.00   sec   224 MBytes  1.88 Gbits/sec   45   7.19 MBytes       
[  5]   1.00-2.00   sec   222 MBytes  1.86 Gbits/sec    0   7.19 MBytes       
[  5]   2.00-3.00   sec   221 MBytes  1.86 Gbits/sec    0   7.49 MBytes       
[  5]   3.00-4.00   sec   221 MBytes  1.86 Gbits/sec    0   7.49 MBytes       
[  5]   4.00-5.00   sec   222 MBytes  1.87 Gbits/sec    0   7.49 MBytes       
[  5]   5.00-6.00   sec   221 MBytes  1.86 Gbits/sec    0   7.49 MBytes       
[  5]   6.00-7.00   sec   221 MBytes  1.86 Gbits/sec    0   7.49 MBytes       
[  5]   7.00-8.00   sec   222 MBytes  1.87 Gbits/sec    0   7.49 MBytes       
[  5]   8.00-9.00   sec   221 MBytes  1.86 Gbits/sec    0   7.49 MBytes       
[  5]   9.00-10.00  sec   222 MBytes  1.86 Gbits/sec    0   7.49 MBytes       
- - - - - - - - - - - - - - - - - - - - - - - - -
[ ID] Interval           Transfer     Bitrate         Retr
[  5]   0.00-10.00  sec  2.17 GBytes  1.86 Gbits/sec   45             sender
[  5]   0.00-10.02  sec  2.17 GBytes  1.86 Gbits/sec                  receiver

Samba performance test (after):

mkfile 8G test.zip && \
rsync -h --stats test.zip /Volumes/Public/test.zip && \
rm -f test.zip && sleep 60 && \
rsync -h --stats /Volumes/Public/test.zip test.zip && \
rm -f /Volumes/Public/test.zip  && \
rm -f test.zip

# Mac to ASUSTOR (write)
sent 8.59G bytes  received 35 bytes  89.04M bytes/sec
total size is 8.59G  speedup is 1.00

# ASUSTOR to Mac (read)
sent 8.59G bytes  received 35 bytes  135.31M bytes/sec
total size is 8.59G  speedup is 1.00

(Compare to earlier results: https://github.com/geerlingguy/raspberry-pi-pcie-devices/issues/162#issuecomment-996906725)

I did notice that with the new driver in place, dmesg mentions it's not using MSI/MSI-X—maybe due to bus constraints on the Realtek SoC in the Drivestor 4 Pro, it falls back to a mode that's as slow as the in-kernel driver (technically very slightly slower):

root@asustor-arm:/volume1/home/admin # dmesg | grep r8125
[   14.052836] r8125 2.5Gigabit Ethernet driver 9.007.01-NAPI loaded
[   14.059374] r8125 0001:01:00.0: enabling device (0000 -> 0003)
[   14.067619] r8125 0001:01:00.0: no MSI/MSI-X. Back to INTx.
[   14.089225] r8125: This product is covered by one or more of the following patents: US6,570,884, US6,115,776, and US6,327,625.
[   14.107688] r8125  Copyright (C) 2021  Realtek NIC software team <nicfae@realtek.com> 
[   27.097654] r8125: eth0: link up