That took quite some time—and mounting took a few minutes on its own, as each block device was slowly read!
Here are benchmark results on the linear array (probably close to what an individual drive gets):
Benchmark | Result |
---|---|
fio 1M sequential read | 214 MB/s |
iozone 1M random read | 95.44 MB/s |
iozone 1M random write | 130.94 MB/s |
iozone 4K random read | 22.87 MB/s |
iozone 4K random write | 17.48 MB/s |
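(For context on these rows: the fio number is a raw sequential read of a single block device, and the iozone numbers run against the mounted filesystem. A sketch of roughly equivalent invocations; the exact flags in my disk-benchmark.sh script may differ, and the device path and mount point here are assumptions:)

```bash
# ~1M sequential read from a raw block device (device path assumed)
sudo fio --name=seqread --filename=/dev/sda --direct=1 --rw=read \
  --bs=1M --ioengine=libaio --iodepth=64 --runtime=60 --time_based \
  --group_reporting

# iozone write/read/random tests at 4K and 1M record sizes (mount point assumed)
sudo iozone -e -I -a -s 100M -r 4k -r 1024k -i 0 -i 1 -i 2 \
  -f /mnt/array/iozone.tmp
```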
Testing a 73 GB SMB file copy from my Mac over the wired network, I was averaging 100-110 MB/sec throughout, though there were a number of periods where the speed would dip to 10-30 MB/sec and get spiky for up to 20 seconds or so:
smbd on the Pi was running at 130-150% CPU, and CPU time spent servicing IRQs was high, but only in the 50-80% range. I'm not 100% sure where the actual bottleneck was.
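(To pin down that kind of bottleneck, per-core interrupt time shows up in mpstat's %irq/%soft columns; a sketch of one way to watch it during a copy, using tools from the sysstat package:)

```bash
# Per-core CPU breakdown every second; watch the %irq and %soft columns
mpstat -P ALL 1

# Per-source interrupt counts, to see which device's IRQs dominate
watch -n1 'cat /proc/interrupts'
```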
From speaking with a Broadcom engineer, it sounds like the error I'm seeing in the other RAID configurations, namely:
[ 278.151884] mpt3sas_cm1 fault info from func: mpt3sas_base_make_ioc_ready
[ 278.151904] mpt3sas_cm1: fault_state(0x2623)!
[ 278.151911] mpt3sas_cm1: sending diag reset !!
is a PCI Express message pull error indicating potential data corruption, which is related either to the signal integrity of the cabling or to power.
#define IFAULT_IOP_PCI_EXPRESS_MSGPULLDMA_ERROR (0x2623) /**< Message Pull State Machine encountered an error. */
I wouldn't be surprised by either, honestly... I'm using one of those GPU mining boards, and as I've learned in the past, they're not the paragon of excellence.
That same engineer also recommended upgrading the cards' firmware. Right now they're all on version 05.00.00.00:
pi@sas:~ $ sudo ./storcli64 /c3 show
CLI Version = 007.2103.0000.0000 Dec 08, 2021
Operating system = Linux 5.15.35-v8+
Controller = 3
Status = Success
Description = None
Product Name = HBA 9405W-16i
Serial Number = SP93121358
SAS Address = 500605b00f3df7f0
PCI Address = 00:06:00:00
System Time = 05/13/2022 11:16:14
FW Package Build = 05.00.00.00
FW Version = 05.00.00.00
BIOS Version = 09.09.00.00_05.00.00.00
NVDATA Version = 04.03.00.03
PSOC Version = 00000000
Driver Name = mpt3sas
Driver Version = 39.100.00.00
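(One way to confirm the firmware level on all four cards at once, assuming StorCLI's `/call` all-controllers selector behaves the same on this build:)

```bash
# Report the controller number and firmware version for every attached HBA
sudo ./storcli64 /call show | grep -E 'Controller =|FW Version'
```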
The latest version is from January: https://docs.broadcom.com/docs/9405W_16i_Pkg_P22_SAS_SATA_NVMe_FW_BIOS_UEFI.zip (22.00.00.00).
Hmm...
$ sudo /home/pi/storcli64 /c0 download file=/home/pi/9405W_16i_Pkg_P22_SAS_SATA_NVMe_FW_BIOS_UEFI/Firmware/HBA_9405W-16i_Mixed_Profile.bin
Downloading image.Please wait...
CLI Version = 007.2103.0000.0000 Dec 08, 2021
Operating system = Linux 5.15.35-v8+
Controller = 0
Status = Failure
Description = The firmware flash image is invalid
Switching gears one more time... what if I create a RAID 6 array on the 16 drives attached to each storage controller, then stripe them together using mdadm?
./storcli64 /c0 add vd type=raid6 drives=0:0-15
./storcli64 /c1 add vd type=raid6 drives=0:0-15
./storcli64 /c2 add vd type=raid6 drives=0:0-15
./storcli64 /c3 add vd type=raid6 drives=0:0-15
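If the controllers accepted that, the four resulting virtual drives could then be striped with mdadm. A minimal sketch, assuming each VD shows up to Linux as a plain disk (the /dev/sdX names here are placeholders):

```bash
# Stripe the four hardware RAID 6 virtual drives into one md device.
# Device names are hypothetical; check lsblk after creating the VDs.
sudo mdadm --create /dev/md0 --level=0 --raid-devices=4 \
  /dev/sda /dev/sdb /dev/sdc /dev/sdd
sudo mkfs.ext4 /dev/md0
```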
However... it looks like the drives are all in JBOD mode right now, and if I try setting `jbod=off` with `sudo ./storcli64 /c0 set jbod=off`, I get 'Un-supported command'. So I'm still trying to figure out whether this will be possible.
Also, I might not be able to flash the firmware on the Pi; it might have to be done on an x86 machine :(
New video featuring this card is live here: https://www.youtube.com/watch?v=BBnomwpF_uY
Both @Coreforge and dumbasPL suggested in YouTube comments (example) forcing PCIe gen1 speeds for better stability. Might have to try that, then run the various RAID setups again.
A few things I'd like to test before swapping back to the Xeon setup and pulling these HBAs:
Wait, so this project started a year ago? Wow, that's one day after my birthday.
A data point:
I have a 9405W-16i card in my Ryzen 1600 server, and I can't upgrade to any firmware newer than 14.0.0.0; I get that same error message (I tried it from Linux and from a UEFI shell).
@MartijnVdS - Oh... the plot thickens. Also, one of the Broadcom engineers mentioned I might be able to do incremental upgrades, starting from an older version and slowly progressing. But it would be interesting to see whether I, too, can get to 14.x but no further. I'll test with an older revision early this week.
(To add a note since I didn't update this issue: I have tried upgrading to 22.x in my PC and it said the signature wasn't valid.)
That's how I got to version 14. But all firmware versions newer than that fail to install with that "The firmware flash image is invalid" message.
Alrighty then... after far more debugging than I'd ever like to attempt again—but will, inevitably—I found out you can use at least the P14 version of StorCLI, I think from sometime in April 2020, and flash any of the images to the card, including the latest P23.
Just like how one-line patches usually have a backstory...
Flashing with the newer StorCLI would fail with the `The firmware flash image is invalid` warning. Switched back to the P14 version of StorCLI, and the P20 flash was successful. So then I pulled the 2nd card out of the Storinator, plugged it in, and decided to try going straight from P05 to P23 using the P14 revision of StorCLI. It worked!
So then I pulled the other two cards and quickly flashed them up to P23. I'm wondering at this point if the flashing might've worked on the Pi, too, if I had run an older version of StorCLI...
Anyways, the proof is in the screenshot:
Just to confirm: the above procedure works for me as well.
The hardest part was finding the old version of StorCLI on Broadcom's support web site.
After a lot of searching I ended up here: https://www.broadcom.com/support/download-search?pg=Storage+Adapters,+Controllers,+and+ICs&pf=SAS/SATA/NVMe+Host+Bus+Adapters&pn=&pa=&po=&dk=9405w&pl=&l=false
which has a heading "Management Software and Tools", which in turn has a tab "Archive", where you can find the older version.
You can also download the firmware there (make sure you get the correct one -- 16i or 16e).
@MartijnVdS - Good to know I'm not alone there! And like you, I spent a while clicking around in vain trying to find older versions until I finally found the "Archive" link in each section. The results aren't really Google-able either, since they're all buried in a JavaScript frontend :(
All right, another BTRFS raid 0 attempt:
pi@sas:~ $ sudo mkfs.btrfs -L btrfs -d raid0 -m raid0 -f /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx /dev/sdy /dev/sdz /dev/sdaa /dev/sdab /dev/sdac /dev/sdad /dev/sdae /dev/sdaf /dev/sdag /dev/sdah /dev/sdai /dev/sdaj /dev/sdak /dev/sdal /dev/sdam /dev/sdan /dev/sdao /dev/sdap /dev/sdaq /dev/sdar /dev/sdas /dev/sdat /dev/sdau /dev/sdav /dev/sdaw /dev/sdax /dev/sday /dev/sdaz /dev/sdba /dev/sdbb /dev/sdbc /dev/sdbd /dev/sdbe /dev/sdbf /dev/sdbg /dev/sdbh
btrfs-progs v5.10.1
See http://btrfs.wiki.kernel.org for more information.
Label: btrfs
UUID: 23f7340d-1a34-46d0-acc4-58c9418a90f3
Node size: 16384
Sector size: 4096
Filesystem size: 1.07PiB
Block group profiles:
Data: RAID0 10.00GiB
Metadata: RAID0 1.88GiB
System: RAID0 58.00MiB
SSD detected: no
Incompat features: extref, skinny-metadata
Runtime features:
Checksum: crc32c
Number of devices: 60
Devices:
ID SIZE PATH
1 18.19TiB /dev/sda
2 18.19TiB /dev/sdb
3 18.19TiB /dev/sdc
4 18.19TiB /dev/sdd
5 18.19TiB /dev/sde
6 18.19TiB /dev/sdf
7 18.19TiB /dev/sdg
8 18.19TiB /dev/sdh
9 18.19TiB /dev/sdi
10 18.19TiB /dev/sdj
11 18.19TiB /dev/sdk
12 18.19TiB /dev/sdl
13 18.19TiB /dev/sdm
14 18.19TiB /dev/sdn
15 18.19TiB /dev/sdo
16 18.19TiB /dev/sdp
17 18.19TiB /dev/sdq
18 18.19TiB /dev/sdr
19 18.19TiB /dev/sds
20 18.19TiB /dev/sdt
21 18.19TiB /dev/sdu
22 18.19TiB /dev/sdv
23 18.19TiB /dev/sdw
24 18.19TiB /dev/sdx
25 18.19TiB /dev/sdy
26 18.19TiB /dev/sdz
27 18.19TiB /dev/sdaa
28 18.19TiB /dev/sdab
29 18.19TiB /dev/sdac
30 18.19TiB /dev/sdad
31 18.19TiB /dev/sdae
32 18.19TiB /dev/sdaf
33 18.19TiB /dev/sdag
34 18.19TiB /dev/sdah
35 18.19TiB /dev/sdai
36 18.19TiB /dev/sdaj
37 18.19TiB /dev/sdak
38 18.19TiB /dev/sdal
39 18.19TiB /dev/sdam
40 18.19TiB /dev/sdan
41 18.19TiB /dev/sdao
42 18.19TiB /dev/sdap
43 18.19TiB /dev/sdaq
44 18.19TiB /dev/sdar
45 18.19TiB /dev/sdas
46 18.19TiB /dev/sdat
47 18.19TiB /dev/sdau
48 18.19TiB /dev/sdav
49 18.19TiB /dev/sdaw
50 18.19TiB /dev/sdax
51 18.19TiB /dev/sday
52 18.19TiB /dev/sdaz
53 18.19TiB /dev/sdba
54 18.19TiB /dev/sdbb
55 18.19TiB /dev/sdbc
56 18.19TiB /dev/sdbd
57 18.19TiB /dev/sdbe
58 18.19TiB /dev/sdbf
59 18.19TiB /dev/sdbg
60 18.19TiB /dev/sdbh
Then:
pi@sas:~ $ sudo mount /dev/sda /btrfs
pi@sas:~ $ sudo btrfs filesystem usage /btrfs
Overall:
Device size: 1.07PiB
Device allocated: 11.93GiB
Device unallocated: 1.07PiB
Device missing: 0.00B
Used: 128.00KiB
Free (estimated): 1.07PiB (min: 1.07PiB)
Free (statfs, df): 1.07PiB
Data ratio: 1.00
Metadata ratio: 1.00
Global reserve: 3.25MiB (used: 0.00B)
Multiple profiles: no
Data,RAID0: Size:10.00GiB, Used:0.00B (0.00%)
/dev/sda 170.62MiB
/dev/sdb 170.62MiB
...
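(An aside: rather than typing out all 60 device paths, shell globs cover the same set, assuming no other sd* devices are attached:)

```bash
# sd[a-z] (26) + sda[a-z] (26) + sdb[a-h] (8) = 60 devices
sudo mkfs.btrfs -L btrfs -d raid0 -m raid0 -f \
  /dev/sd[a-z] /dev/sda[a-z] /dev/sdb[a-h]
```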
Comparing to the earlier btrfs RAID 0 results using `disk-benchmark.sh`:
Benchmark | Result (fw 05) | Result (fw 23) |
---|---|---|
fio 1M sequential read | 213 MB/s | 238 MB/s |
iozone 1M random read | 144.82 MB/s | 142.68 MB/s |
iozone 1M random write | 233.90 MB/s | 245.33 MB/s |
iozone 4K random read | 19.45 MB/s | 19.88 MB/s |
iozone 4K random write | 15.92 MB/s | 16.35 MB/s |
Doing a network file copy of 30 GB results in similar behavior as before, with faults like `0x5854` and `0x2623`, but it seems like the system is recovering in time for the file copy to progress.
I could also cancel an in-progress file copy from macOS Finder, and once the array recovered, the cancellation completed. Nice!
Afterwards, without having to reboot, I still had a clean btrfs mount:
pi@sas:~ $ sudo btrfs filesystem show
Label: 'btrfs' uuid: 23f7340d-1a34-46d0-acc4-58c9418a90f3
Total devices 60 FS bytes used 2.68GiB
devid 1 size 18.19TiB used 202.62MiB path /dev/sda
devid 2 size 18.19TiB used 202.62MiB path /dev/sdb
devid 3 size 18.19TiB used 203.62MiB path /dev/sdc
...
devid 59 size 18.19TiB used 203.62MiB path /dev/sdbg
devid 60 size 18.19TiB used 203.62MiB path /dev/sdbh
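A scrub would be the thorough way to verify every block's checksum after those resets; a sketch:

```bash
# Read and verify checksums across the whole array (-B runs in the foreground)
sudo btrfs scrub start -B /btrfs

# Or start it in the background and poll progress
sudo btrfs scrub start /btrfs
sudo btrfs scrub status /btrfs
```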
So it looks like things are more stable, but there's probably still a power issue for that PCIe riser, or a signaling issue with the USB 3.0 cable that goes from the Pi to the riser.
Other things I'd still like to try:
Testing the link speed switching using `pcie-set-speed.sh`:
pi@sas:~ $ sudo ./pcie-set-speed.sh 03:00.0 1
Link capabilities: 0173dc12
Max link speed: 2
Link status: 7012
Current link speed: 2
Configuring 0000:02:01.0...
Original link control 2: 00000002
Original link target speed: 2
New target link speed: 1
New link control 2: 00000001
Triggering link retraining...
Original link control: 70110040
New link control: 70110060
Link status: 7011
Current link speed: 1
(I repeated that for all the devices 3-6.) Then the device speeds according to `sudo lspci -vvvv` were:
LnkSta: Speed 2.5GT/s (downgraded), Width x1 (downgraded)
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
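(The repetition is easy to script; a sketch, assuming "devices 3-6" map to PCI buses 03 through 06 like the 03:00.0 address above:)

```bash
# Force each device behind the switch to PCIe Gen 1 (bus numbers assumed)
for bus in 03 04 05 06; do
  sudo ./pcie-set-speed.sh "${bus}:00.0" 1
done

# Confirm the negotiated link speeds afterwards
sudo lspci -vvv | grep 'LnkSta:'
```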
Trying the large network file copy again...
Network copy succeeded when the cards were all at Gen 1 link speeds, and the overall copy performance over SMB was about the same! I also re-ran the disk benchmark:
Benchmark | Result (fw 23 Gen 2) | Result (fw 23 Gen 1) |
---|---|---|
fio 1M sequential read | 238 MB/s | 113 MB/s |
iozone 1M random read | 142.68 MB/s | 119.99 MB/s |
iozone 1M random write | 245.33 MB/s | 142.53 MB/s |
iozone 4K random read | 19.88 MB/s | 15.96 MB/s |
iozone 4K random write | 16.35 MB/s | 12.77 MB/s |
As expected, raw performance is lower. So things like resilvering would be extremely slow on an array with disks this large. But as a slow network file copy destination, it seems like if you want actual RAID instead of linear disk storage, it's doable at PCIe gen 1 link speeds.
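(To put a rough number on "extremely slow": reading a single 18.19 TiB drive's worth of data at the Gen 1 sequential rate of ~113 MB/s would take about 20×10¹² bytes ÷ 113 MB/s ≈ 177,000 seconds, or roughly two days, and a real rebuild has to touch every drive.)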
I'm planning on testing banks of 15, 30, and 45 drives next, and for 15, testing with an HBA directly connected to the Pi, then through the PCIe switch, to see if the switch makes a difference even with just one card.
Testing with RAID 0 on one card only, via PCIe switch:
pi@sas:~ $ sudo mkfs.btrfs -L btrfs -d raid0 -m raid0 -f /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn
Label: btrfs
UUID: a0cbd908-ead8-421e-a6f5-6a68963ed655
Node size: 16384
Sector size: 4096
Filesystem size: 254.67TiB
Block group profiles:
Data: RAID0 10.00GiB
Metadata: RAID0 1023.75MiB
System: RAID0 15.75MiB
SSD detected: no
Incompat features: extref, skinny-metadata
Runtime features:
Checksum: crc32c
Number of devices: 14
Devices:
ID SIZE PATH
1 18.19TiB /dev/sda
2 18.19TiB /dev/sdb
3 18.19TiB /dev/sdc
4 18.19TiB /dev/sdd
5 18.19TiB /dev/sde
6 18.19TiB /dev/sdf
7 18.19TiB /dev/sdg
8 18.19TiB /dev/sdh
9 18.19TiB /dev/sdi
10 18.19TiB /dev/sdj
11 18.19TiB /dev/sdk
12 18.19TiB /dev/sdl
13 18.19TiB /dev/sdm
14 18.19TiB /dev/sdn
Benchmark | Result (btrfs RAID 0 single HBA, 15 drives) |
---|---|
fio 1M sequential read | 237.00 MB/s |
iozone 1M random read | 119.64 MB/s |
iozone 1M random write | 295.43 MB/s |
iozone 4K random read | 24.45 MB/s |
iozone 4K random write | 8.07 MB/s |
The network copy was successful through the PCIe switch, too, so it's definitely some sort of issue with multiple cards behind the switch.
Doing the same benchmark, but with the card connected directly to the Pi:
Benchmark | Result (single HBA, switch) | Result (single HBA, direct) |
---|---|---|
fio 1M sequential read | 237.00 MB/s | 272.00 MB/s |
iozone 1M random read | 119.64 MB/s | 114.77 MB/s |
iozone 1M random write | 295.43 MB/s | 294.09 MB/s |
iozone 4K random read | 24.45 MB/s | 24.09 MB/s |
iozone 4K random write | 8.07 MB/s | 9.36 MB/s |
So the switch doesn't seem to make much difference, except maybe in the case of raw block access to a single drive (that's what the `fio` benchmark I'm using is actually testing... `/dev/sda` in this case). The other tests are running through the Btrfs RAID 0 array.
Well, maybe I spoke too soon. With the card direct connected, the network file copy ran at about 70 MB/sec instead of the 50-55 MB/sec I was getting when I had the HBA behind the switch. Also, file copies where it's just a read (copying data from the Pi to my Mac) max out the throughput at about 110 MB/sec.
I noticed there is a cycle when writing the data to the drives:
I'm going to test once more through the switch to see if my network file copy testing was a fluke, or what.
Okay, so my earlier test with one card in the switch must've been strange, because now I'm getting identical performance through both the switch and the Pi direct. Anyways, next tests are trying 30 drives, then 45, to see when we start getting those weird errors.
System power draw:
Number of Drives | Idle draw | Benchmark draw | Maximum draw (boot) |
---|---|---|---|
15 | 199W | 217W | 315W |
60 | 502W | 512W | 632W |
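(That works out to roughly (502 W − 199 W) ÷ 45 ≈ 6.7 W of idle draw per additional drive, so the drives themselves account for most of the power budget.)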
A few other measurements:
Now trying a 30-drive RAID:
$ sudo mkfs.btrfs -L btrfs -d raid0 -m raid0 -f /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx /dev/sdy /dev/sdz /dev/sdaa /dev/sdab /dev/sdac /dev/sdad
Label: btrfs
UUID: 9c9023ac-5b97-44db-b2b0-a35b525854a0
Node size: 16384
Sector size: 4096
Filesystem size: 545.71TiB
Block group profiles:
Data: RAID0 10.00GiB
Metadata: RAID0 1023.75MiB
System: RAID0 30.00MiB
SSD detected: no
Incompat features: extref, skinny-metadata
Runtime features:
Checksum: crc32c
Number of devices: 30
Devices:
ID SIZE PATH
1 18.19TiB /dev/sda
...
And benchmark results, with both runs through the switch:
Benchmark | Result (single HBA, 15 drives) | Result (two HBA, 30 drives) |
---|---|---|
fio 1M sequential read | 237.00 MB/s | 221.00 MB/s |
iozone 1M random read | 119.64 MB/s | 144.93 MB/s |
iozone 1M random write | 295.43 MB/s | 201.44 MB/s |
iozone 4K random read | 24.45 MB/s | 23.63 MB/s |
iozone 4K random write | 8.07 MB/s | 13.42 MB/s |
The network copy definitely slows down with 30 drives, at least the write. I averaged around 50-70 MB/sec write speeds, though read is still at 110 MB/sec or so.
And with overclock at 2.147 GHz, I was able to get back up to 80-100 MB/sec speeds on the write. So CPU performance at the default 1.5 GHz clock definitely cripples us beyond one HBA. I'm going to test with 45 drives now.
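(For reference, the overclock is the usual /boot/config.txt change; a sketch, with the over_voltage value being my assumption rather than something verified in this thread:)

```
# /boot/config.txt: overclock the Pi to ~2.147 GHz
# (over_voltage value assumed; adjust for your board and cooling)
over_voltage=6
arm_freq=2147
```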
pi@sas:~ $ sudo mkfs.btrfs -L btrfs -d raid0 -m raid0 -f /dev/sda /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl /dev/sdm /dev/sdn /dev/sdo /dev/sdp /dev/sdq /dev/sdr /dev/sds /dev/sdt /dev/sdu /dev/sdv /dev/sdw /dev/sdx /dev/sdy /dev/sdz /dev/sdaa /dev/sdab /dev/sdac /dev/sdad /dev/sdae /dev/sdaf /dev/sdag /dev/sdah /dev/sdai /dev/sdaj /dev/sdak /dev/sdal /dev/sdam /dev/sdan /dev/sdao /dev/sdap /dev/sdaq /dev/sdar /dev/sdas
Label: btrfs
UUID: 2ff7d20c-1fd4-46e9-b40f-0ba489607be3
Node size: 16384
Sector size: 4096
Filesystem size: 818.57TiB
Block group profiles:
Data: RAID0 10.00GiB
Metadata: RAID0 1.41GiB
System: RAID0 45.00MiB
SSD detected: no
Incompat features: extref, skinny-metadata
Runtime features:
Checksum: crc32c
Number of devices: 45
Devices:
ID SIZE PATH
1 18.19TiB /dev/sda
...
And benchmark results:
Benchmark | Result (1xHBA, 15 drives) | Result (2xHBA, 30 drives) | Result (3xHBA, 45 drives) |
---|---|---|---|
fio 1M sequential read | 237.00 MB/s | 221.00 MB/s | 218.00 MB/s |
iozone 1M random read | 119.64 MB/s | 144.93 MB/s | 134.82 MB/s |
iozone 1M random write | 295.43 MB/s | 201.44 MB/s | 228.53 MB/s |
iozone 4K random read | 24.45 MB/s | 23.63 MB/s | 21.08 MB/s |
iozone 4K random write | 8.07 MB/s | 13.42 MB/s | 15.59 MB/s |
That's without overclock. Overclock comparison:
Benchmark | Result (45 drives, 1.5GHz) | Result (45 drives, 2.2 GHz) |
---|---|---|
fio 1M sequential read | 218.00 MB/s | 257.00 MB/s |
iozone 1M random read | 134.82 MB/s | 177.17 MB/s |
iozone 1M random write | 228.53 MB/s | 221.99 MB/s |
iozone 4K random read | 21.08 MB/s | 20.85 MB/s |
iozone 4K random write | 15.59 MB/s | 17.93 MB/s |
All right, with or without overclock, we start hitting the random card resets/PCIe errors with 3 HBAs (45 drives) during SMB copies. I'm going to swap to the 3rd and 4th HBA and see if maybe it's just one bad HBA (though I've seen multiple cards reset when running 60 drives... I just want to verify it's the number of HBAs, and not necessarily a bad HBA).
I swapped HBAs and still got the lockup—so at PCIe Gen 2 speeds, the Pi definitely starts having issues between 30 and 45 drives / 2-3 HBAs. Though I can't rule out the PCIe switch board I'm using either. The thing is... I've already invested at least a hundred or so hours into this (maybe more), and it's time to put a pin in it.
I think I can soundly recommend only running one HBA on a Raspberry Pi. 320 TB (16 drives × 20 TB) is good enough for anyone, right? Especially when you'll only reliably get 100 MB/sec of write speeds over the network, max.
I'll update the power specs in a bit. Not going to try a different USB 3.0 cable as I don't have a shorter one :P
One more test running all 60 drives in btrfs RAID 0 with 2.2 GHz overclock and using PCIe Gen 1 link speed:
Benchmark | Result (fw 23 Gen 2) | Result (fw 23 Gen 1) | Result (fw 23 Gen 1 OC) |
---|---|---|---|
fio 1M sequential read | 238 MB/s | 113 MB/s | 137 MB/s |
iozone 1M random read | 142.68 MB/s | 119.99 MB/s | 149.06 MB/s |
iozone 1M random write | 245.33 MB/s | 142.53 MB/s | 164.02 MB/s |
iozone 4K random read | 19.88 MB/s | 15.96 MB/s | 20.19 MB/s |
iozone 4K random write | 16.35 MB/s | 12.77 MB/s | 17.17 MB/s |
The LSI 9405W-16i HBA should be similar to the 9460-16i, and should hopefully be supported on ARM (to some extent), unlike older cards like the 9305-16i (see #195). (Adding the term 9405 so this will also pop up in search.)