Closed geerlingguy closed 4 years ago
Caveats: The Turing Pi cluster is running 7 Compute Modules, CM3+ boards with 1 GB of RAM each. The Pi Dramble cluster is running 4 Pi 4 2 GB models. The two clusters used the same exact K3s setup and Turing Pi Cluster configurations.
I did a 15 minute burn-in (running `wrk` on both the Drupal and Wordpress sites simultaneously) and it seems like the Pis were able to keep their cool, just barely.
Guess which one is `worker-03`?
Running the same burn-in test on the Dramble cluster (with the same configuration) now. I'll try to remember to grab an IR image as well, 10 minutes in or so.
One interesting takeaway: Wordpress is surprisingly CPU-intensive in its default config for non-authenticated users, whereas Drupal is much lighter and can serve up 2x the requests using its default caching mechanisms. The two are equal when you put a caching proxy in front or dump the site to static HTML.
So... I just re-tested the Dramble cluster running Raspberry Pi OS's 64-bit beta version, and while Wordpress and Minecraft were faster, Drupal was slower. I'm trying to see if maybe something in Drupal core got way slower in the most recent point release (doubtful), or if maybe Hypriot vs Raspberry Pi OS 32-bit might have some performance issue that Drupal runs into.
To be clear: I'm running the exact same hardware (4 2 GB Pi 4s), with the exact same microSD cards (physically—I am re-flashing the same cards with different OSes so it can't be a card-to-card variance), and running all tests at least 3 times allowing a 30 minute warm-up period before starting.
(Benchmarking is hard, but turns up interesting results sometimes.)
I'm also seeing a fair bit of variance in how long it takes Minecraft to generate a new world... so I'm going to have to re-test that a bit.
My guess is that Drupal/Wordpress are slower because on a memory-constrained 64-bit OS you can't squash as many workers into the same amount of memory. You might be able to fit 3 or 4 workers on a 512 MB instance running Drupal on 32-bit, but only 2 or 3 on 64-bit, since pointers are larger and these PHP applications store a lot of tiny bits of state in memory.
So 64-bit benefits purely-CPU-driven processing (e.g. encodes and such—see https://github.com/geerlingguy/drupal-pi/issues/45), and processes where you have massive amounts of RAM available (e.g. 4 GB or more), but actually presents a bit of a limit with highly-memory-constrained environments (e.g. < 2 GB of RAM).
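To make the memory argument concrete, here is a rough Python sketch of the worker-count arithmetic. The per-worker footprints are hypothetical numbers picked only to illustrate the effect; they are not measurements from these clusters.

```python
def max_workers(available_mb: int, per_worker_mb: int) -> int:
    """How many whole workers fit in the available memory."""
    return available_mb // per_worker_mb


AVAILABLE_MB = 512  # the 512 MB instance from the example above

# Hypothetical per-worker footprints (NOT measured): the same PHP worker is
# assumed to need more memory on 64-bit because pointers are twice as wide.
WORKER_MB_32BIT = 128
WORKER_MB_64BIT = 170

print(max_workers(AVAILABLE_MB, WORKER_MB_32BIT))  # 4
print(max_workers(AVAILABLE_MB, WORKER_MB_64BIT))  # 3
```

With these assumed sizes, a modest increase in per-worker memory is enough to lose a whole worker, which is the effect described above.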
Since I'm sure someone will ask: yes, you can max out the bandwidth of all the Turing Pi nodes at the same time, and the onboard Gigabit switch will handle the traffic. As an example, I set up a connection to two Pis and ran `iperf` to both of them from my Mac (which gets ~930 Mbps to a single Pi 4) at the same time:
$ iperf -c 10.0.100.163 & iperf -c 10.0.100.74
[1] 19606
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 10.0.100.74, TCP port 5001
Client connecting to 10.0.100.163, TCP port 5001
TCP window size: 129 KByte (default)
------------------------------------------------------------
TCP window size: 129 KByte (default)
------------------------------------------------------------
[ 4] local 10.0.100.118 port 54978 connected with 10.0.100.74 port 5001
[ 4] local 10.0.100.118 port 54977 connected with 10.0.100.163 port 5001
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 113 MBytes 94.7 Mbits/sec
[ ID] Interval Transfer Bandwidth
[ 4] 0.0-10.0 sec 112 MBytes 93.9 Mbits/sec
[1] + done iperf -c 10.0.100.163
I could confirm that the Mac's network stats showed double the throughput during the test (~24 MB/sec when doing both, vs ~12 MB/sec when doing just one).
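The two sets of numbers line up once the units are converted: `iperf` reports megabits per second, while the Mac's network stats are in megabytes per second. A quick Python sketch of the conversion, using the two bandwidth values from the output above (the function name is just illustrative):

```python
def mbits_to_mbytes(mbits_per_sec: float) -> float:
    """Convert megabits/sec (as iperf reports) to megabytes/sec (8 bits per byte)."""
    return mbits_per_sec / 8


# Bandwidth reported by the two simultaneous iperf clients above.
streams_mbits = [94.7, 93.9]

per_stream = [mbits_to_mbytes(s) for s in streams_mbits]
combined = sum(per_stream)

print([round(s, 1) for s in per_stream])  # [11.8, 11.7]
print(round(combined, 1))                 # 23.6
```

That is roughly 12 MB/sec per stream and 24 MB/sec combined, consistent with what the Mac showed.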
Leaving open just so I can re-run the Minecraft world generation benchmark again in the 3 scenarios where I didn't record the correct results.
Brad Manske posted this comment on my YouTube channel:
I didn't even notice—the two Inateck external cases I had were both the older 'non-UASP' type (link on Amazon — note the two options, with/without UASP). So I ordered a UASP type case and will be updating the benchmarks above (and in the next video).
Edit: Holy cow, big improvement!
Disk | hdparm | dd | 4K Random Read | 4K Random Write |
---|---|---|---|---|
Pi 4 Kingston USB 3.0 SSD w/o UASP | 172.13 MB/sec | 102.67 MB/sec | 14.41 MB/sec | 23.28 MB/sec |
Pi 4 Kingston USB 3.0 SSD w/ UASP | 296.71 MB/sec | 149.00 MB/sec | 20.59 MB/sec | 28.54 MB/sec |
% difference | 53% faster | 37% faster | 35% faster | 20% faster |
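
The "% difference" row appears to match the symmetric percent difference (the gap divided by the mean of the two values) rather than the relative increase over the non-UASP baseline. That reading is inferred from the numbers, not stated anywhere above. A small Python sketch computing both from the table values:

```python
def percent_difference(a: float, b: float) -> float:
    """Symmetric percent difference: |a - b| relative to the mean of a and b."""
    return abs(a - b) / ((a + b) / 2) * 100


def percent_increase(baseline: float, new: float) -> float:
    """Relative increase of `new` over `baseline`."""
    return (new - baseline) / baseline * 100


# (without UASP, with UASP) in MB/sec, copied from the table above.
results = {
    "hdparm": (172.13, 296.71),
    "dd": (102.67, 149.00),
    "4K random read": (14.41, 20.59),
    "4K random write": (23.28, 28.54),
}

for name, (without_uasp, with_uasp) in results.items():
    diff = percent_difference(without_uasp, with_uasp)
    gain = percent_increase(without_uasp, with_uasp)
    print(f"{name}: {diff:.0f}% difference, {gain:.0f}% increase over baseline")
```

The first figure reproduces the 53% / 37% / 35% / 20% row. Measured against the non-UASP baseline, the same results work out to roughly 72%, 45%, 43%, and 23% increases.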
To confirm that the drive is using UASP, check with `lsusb -t`:
$ lsusb -t
/: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 5000M
|__ Port 1: Dev 2, If 0, Class=Mass Storage, Driver=uas, 5000M
It shows `Driver=uas` for UASP-enabled drives (USB Attached SCSI Protocol), and `usb-storage` for non-UASP (Bulk-Only Transport/BOT).
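For scripting the same check, here is a minimal Python sketch that scans `lsusb -t` output for Mass Storage interfaces and reports which driver each one uses. The function name and the regular expression are illustrative, and the sketch only assumes the line format shown above.

```python
import re

# Matches lines such as:
#   |__ Port 1: Dev 2, If 0, Class=Mass Storage, Driver=uas, 5000M
MASS_STORAGE = re.compile(
    r"Port (\d+): Dev (\d+),.*Class=Mass Storage, Driver=([\w-]+)"
)


def storage_drivers(lsusb_tree: str) -> list[tuple[int, int, str]]:
    """Return (port, device, driver) for each Mass Storage line in `lsusb -t` text."""
    found = []
    for line in lsusb_tree.splitlines():
        match = MASS_STORAGE.search(line)
        if match:
            port, dev, driver = match.groups()
            found.append((int(port), int(dev), driver))
    return found


# Sample text copied from the output above.
sample = """\
/: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci_hcd/4p, 5000M
|__ Port 1: Dev 2, If 0, Class=Mass Storage, Driver=uas, 5000M
"""

for port, dev, driver in storage_drivers(sample):
    mode = "UASP" if driver == "uas" else "not UASP"
    print(f"Port {port}, Dev {dev}: {driver} -> {mode}")
```

On a live system the text could come from `subprocess.run(["lsusb", "-t"], capture_output=True, text=True).stdout` instead of the sample string.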
Also, regarding power consumption:
Note: Power measurements taken with Pi 4 headless with Pi-FAN plugged in via GPIO (uses ~0.20A), Ethernet, and the USB 3.0 drive. Used a Satechi USB-C power tester.
Good article on UASP from 2015: What Is A UASP Storage Enclosure?
Does that same improvement seen on the USB 3.0-native Pi 4 translate at all to USB 2.0 ports on the CM3+?
Disk | hdparm | dd | 4K Random Read | 4K Random Write |
---|---|---|---|---|
CM3+ Kingston USB 3.0 SSD w/o UASP | 32.00 MB/sec | 30.40 MB/sec | 8.87 MB/sec | 10.04 MB/sec |
CM3+ Kingston USB 3.0 SSD w/ UASP | 31.79 MB/sec | 31.70 MB/sec | 7.48 MB/sec | 8.55 MB/sec |
It was mounted with the `usb-storage` driver when I tested with the CM PoE Board:
Port 4: Dev 4, If 0, Class=Mass Storage, Driver=usb-storage, 480M
There's some unofficial confirmation that the BCM2835 doesn't support UASP because it lacks 'scatter gather', which is a requirement of the Linux UASP driver.
Benchmarking episode is live (Episode 5 - Benchmarking the Turing Pi), and I referenced it in the README in this commit: https://github.com/geerlingguy/turing-pi-cluster/commit/6191dd74cd1148f33434757c0120396199fafea9
I will likely be doing a little more disk IO testing later as I have some new hardware coming in, but that can wait for a new issue... this one's already a bit overloaded!
See related: Test Performance and Functionality on Raspberry Pi OS 64-bit
With the full configuration (excluding NextCloud), all tests were run 4x; I discarded the first result (unless noted), then averaged the next 3:
Disk Benchmarks
`iozone -e -I -a -s 100M -r 4k -i 0 -i 1 -i 2` (installed using these directions)
Another note: The onboard eMMC also does large file writes at ~5-10 MB/sec, whereas the microSD cards can do 20-40 MB/sec in some cases. But the eMMC is a way better option for general-purpose computing, since it's more durable and 3x faster than the fastest microSD cards for random IO (and 10-100x faster than the majority of microSD cards I've tested).
Network Benchmarks
`iperf -s` on the Pi master node, `iperf -c [pi-ip]` on my Mac, connected through a TRENDnet 10/100/1000 5-port network switch.
Note that the Turing Pi cluster does support the full 95 Mbps on each Pi simultaneously. So all seven nodes together can push roughly 7 × 95 ≈ 665 Mbps, about two-thirds of a 1 Gbps connection to the Turing Pi cluster as a whole.
Full System Benchmarks
7-node Turing Pi Cluster
32-bit HypriotOS
[Results table: `ab` and `wrk` measurements]
4-node Pi Dramble Cluster (Running K3s, same configs)
32-bit HypriotOS
[Results table: `ab` and `wrk` measurements]
32-bit Raspberry Pi OS
[Results table: `ab` and `wrk` measurements]
64-bit Raspberry Pi OS
[Results table: `ab` and `wrk` measurements]
`ab` test example: `ab -n 100 -c 10 -C "SESSION=COOKIE" http://drupal.10.0.100.74.nip.io/`
`wrk` test example: `wrk -t4 -c100 -d30 http://drupal.10.0.100.74.nip.io/`
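When repeating these runs, it helps to pull the throughput figure out of each tool's output automatically. A minimal Python sketch, assuming the usual summary lines (`Requests/sec:` for `wrk`, `Requests per second:` for `ab`). The function name and the sample text are illustrative, not results from these tests.

```python
import re
from typing import Optional

# wrk prints "Requests/sec:   1234.56"; ab prints
# "Requests per second:    123.45 [#/sec] (mean)".
THROUGHPUT = re.compile(r"Requests(?:/sec| per second):\s+([\d.]+)")


def requests_per_second(output: str) -> Optional[float]:
    """Return the requests/sec figure from wrk or ab output, or None if absent."""
    match = THROUGHPUT.search(output)
    return float(match.group(1)) if match else None


# Illustrative snippets only (made-up numbers, not from the benchmarks above).
wrk_sample = "Requests/sec:    100.50\nTransfer/sec:      1.20MB\n"
ab_sample = "Requests per second:    42.00 [#/sec] (mean)\n"

print(requests_per_second(wrk_sample))  # 100.5
print(requests_per_second(ab_sample))   # 42.0
```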
kubectl logs -n minecraft -l app=minecraft-minecraft | grep Done
1 This test needs to be re-run after a 15 minute cool-down, three times in succession. The first time I ran the tests I accidentally took only the first run instead of the average of the final three runs. Oops.
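The averaging rule used for these results (run 4x, discard the first result, average the next 3) is easy to get wrong by hand, as the footnote shows. A small Python sketch of it; the function name and the sample values are hypothetical.

```python
from statistics import mean


def warm_average(results: list[float], discard: int = 1, keep: int = 3) -> float:
    """Drop the first `discard` runs as warm-up, then average the next `keep`."""
    kept = results[discard:discard + keep]
    if len(kept) < keep:
        raise ValueError(f"need at least {discard + keep} runs, got {len(results)}")
    return mean(kept)


# Hypothetical requests/sec from four runs (not real measurements).
runs = [80.0, 100.0, 102.0, 98.0]
print(warm_average(runs))  # 100.0
```

Raising an error when there are too few runs guards against silently reporting a single run as if it were the average.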