Open geerlingguy opened 1 year ago
Note: Most benchmarks were run with 16k page size; Geekbench 6 would not work with 16k so I rebooted on 4k page size kernel8.
See video: Raspberry Pi 5: EVERYTHING you need to know
And blog post: Testing PCIe on the Raspberry Pi 5
sbc-bench results: https://github.com/ThomasKaiser/sbc-bench/issues/77
Can you please post output from lsusb and lspci?
lsusb:
pi@pi5:~ $ sudo lsusb
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
pi@pi5:~ $ sudo lsusb -t
/: Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 5000M
/: Bus 03.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/2p, 480M
/: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 5000M
/: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/2p, 480M
lspci:
pi@pi5:~ $ sudo lspci -vvvv
0001:00:00.0 PCI bridge: Broadcom Inc. and subsidiaries Device 2712 (rev 21) (prog-if 00 [Normal decode])
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx-
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 40
Bus: primary=00, secondary=01, subordinate=01, sec-latency=0
Memory behind bridge: 00000000-005fffff [size=6M] [32-bit]
Prefetchable memory behind bridge: 00000000fff00000-00000000000fffff [disabled] [64-bit]
Secondary status: 66MHz- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- <SERR- <PERR-
BridgeCtl: Parity- SERR- NoISA- VGA- VGA16- MAbort- >Reset- FastB2B-
PriDiscTmr- SecDiscTmr- DiscTmrStat- DiscTmrSERREn-
Capabilities: [48] Power Management version 3
Flags: PMEClk- DSI- D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold-)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=1 PME-
Capabilities: [ac] Express (v2) Root Port (Slot-), MSI 00
DevCap: MaxPayload 512 bytes, PhantFunc 0
ExtTag- RBE+
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr+ NoSnoop+
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr- NonFatalErr- FatalErr- UnsupReq- AuxPwr- TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x4, ASPM L1, Exit Latency L1 <2us
ClockPM+ Surprise- LLActRep- BwNot+ ASPMOptComp+
LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x4
TrErr- Train- SlotClk+ DLActive- BWMgmt+ ABWMgmt+
RootCap: CRSVisible+
RootCtl: ErrCorrectable- ErrNon-Fatal- ErrFatal- PMEIntEna+ CRSVisible-
RootSta: PME ReqID 0000, PMEStatus- PMEPending-
DevCap2: Completion Timeout: Range ABCD, TimeoutDis+ NROPrPrP- LTR+
10BitTagComp- 10BitTagReq- OBFF Via WAKE#, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- LN System CLS Not Supported, TPHComp- ExtTPHComp- ARIFwd+
AtomicOpsCap: Routing- 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled, ARIFwd-
AtomicOpsCtl: ReqEn- EgressBlck-
LnkCap2: Supported Link Speeds: 2.5-5GT/s, Crosslink- Retimer- 2Retimers- DRS+
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported, DRS-
DownstreamComp: Link Up - Present
Capabilities: [100 v1] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
RootCmd: CERptEn+ NFERptEn+ FERptEn+
RootSta: CERcvd- MultCERcvd- UERcvd- MultUERcvd-
FirstFatal- NonFatalMsg- FatalMsg- IntMsg 0
ErrorSrc: ERR_COR: 0000 ERR_FATAL/NONFATAL: 0000
Capabilities: [160 v1] Virtual Channel
Caps: LPEVC=0 RefClk=100ns PATEntryBits=1
Arb: Fixed- WRR32- WRR64- WRR128-
Ctrl: ArbSelect=Fixed
Status: InProgress-
VC0: Caps: PATOffset=00 MaxTimeSlots=1 RejSnoopTrans-
Arb: Fixed- WRR32- WRR64- WRR128- TWRR128- WRR256-
Ctrl: Enable+ ID=0 ArbSelect=Fixed TC/VC=ff
Status: NegoPending- InProgress-
Capabilities: [180 v1] Vendor Specific Information: ID=0000 Rev=0 Len=028 <?>
Capabilities: [240 v1] L1 PM Substates
L1SubCap: PCI-PM_L1.2+ PCI-PM_L1.1+ ASPM_L1.2+ ASPM_L1.1+ L1_PM_Substates+
PortCommonModeRestoreTime=8us PortTPowerOnTime=10us
L1SubCtl1: PCI-PM_L1.2- PCI-PM_L1.1- ASPM_L1.2- ASPM_L1.1-
T_CommonMode=1us LTR1.2_Threshold=0ns
L1SubCtl2: T_PwrOn=10us
Capabilities: [300 v1] Secondary PCI Express
LnkCtl3: LnkEquIntrruptEn- PerformEqu-
LaneErrStat: 0
Kernel driver in use: pcieport
0001:01:00.0 Ethernet controller: Device 1de4:0001
Control: I/O- Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
Latency: 0
Interrupt: pin A routed to IRQ 40
Region 0: Memory at 1f00410000 (32-bit, non-prefetchable) [size=16K]
Region 1: Memory at 1f00000000 (32-bit, non-prefetchable) [virtual] [size=4M]
Region 2: Memory at 1f00400000 (32-bit, non-prefetchable) [size=64K]
Capabilities: [40] Power Management version 3
Flags: PMEClk- DSI- D1+ D2- AuxCurrent=375mA PME(D0+,D1+,D2-,D3hot+,D3cold+)
Status: D0 NoSoftRst+ PME-Enable- DSel=0 DScale=0 PME-
Capabilities: [70] Express (v2) Endpoint, MSI 00
DevCap: MaxPayload 256 bytes, PhantFunc 0, Latency L0s unlimited, L1 unlimited
ExtTag- AttnBtn- AttnInd- PwrInd- RBE+ FLReset- SlotPowerLimit 0W
DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+
RlxdOrd+ ExtTag- PhantFunc- AuxPwr- NoSnoop-
MaxPayload 256 bytes, MaxReadReq 512 bytes
DevSta: CorrErr+ NonFatalErr- FatalErr- UnsupReq- AuxPwr+ TransPend-
LnkCap: Port #0, Speed 5GT/s, Width x4, ASPM L0s L1, Exit Latency L0s <1us, L1 <2us
ClockPM- Surprise- LLActRep- BwNot- ASPMOptComp+
LnkCtl: ASPM L1 Enabled; RCB 64 bytes, Disabled- CommClk+
ExtSynch- ClockPM- AutWidDis- BWInt- AutBWInt-
LnkSta: Speed 5GT/s, Width x4
TrErr- Train- SlotClk+ DLActive- BWMgmt- ABWMgmt-
DevCap2: Completion Timeout: Not Supported, TimeoutDis+ NROPrPrP- LTR-
10BitTagComp- 10BitTagReq- OBFF Not Supported, ExtFmt- EETLPPrefix-
EmergencyPowerReduction Not Supported, EmergencyPowerReductionInit-
FRS- TPHComp- ExtTPHComp-
AtomicOpsCap: 32bit- 64bit- 128bitCAS-
DevCtl2: Completion Timeout: 50us to 50ms, TimeoutDis- LTR- 10BitTagReq- OBFF Disabled,
AtomicOpsCtl: ReqEn-
LnkCap2: Supported Link Speeds: 2.5-5GT/s, Crosslink- Retimer- 2Retimers- DRS-
LnkCtl2: Target Link Speed: 5GT/s, EnterCompliance- SpeedDis-
Transmit Margin: Normal Operating Range, EnterModifiedCompliance- ComplianceSOS-
Compliance Preset/De-emphasis: -6dB de-emphasis, 0dB preshoot
LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete- EqualizationPhase1-
EqualizationPhase2- EqualizationPhase3- LinkEqualizationRequest-
Retimer- 2Retimers- CrosslinkRes: unsupported
Capabilities: [b0] MSI-X: Enable+ Count=61 Masked-
Vector table: BAR=0 offset=00000000
PBA: BAR=0 offset=00002000
Capabilities: [100 v2] Advanced Error Reporting
UESta: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UEMsk: DLP- SDES- TLP- FCP- CmpltTO- CmpltAbrt- UnxCmplt- RxOF- MalfTLP- ECRC- UnsupReq- ACSViol-
UESvrt: DLP+ SDES+ TLP- FCP+ CmpltTO- CmpltAbrt- UnxCmplt- RxOF+ MalfTLP+ ECRC- UnsupReq- ACSViol-
CESta: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr-
CEMsk: RxErr- BadTLP- BadDLLP- Rollover- Timeout- AdvNonFatalErr+
AERCap: First Error Pointer: 00, ECRCGenCap+ ECRCGenEn- ECRCChkCap+ ECRCChkEn-
MultHdrRecCap- MultHdrRecEn- TLPPfxPres- HdrLogCap-
HeaderLog: 00000000 00000000 00000000 00000000
Kernel driver in use: rp1
Note that the above lspci was run when I had nothing attached to the 'ext' port; it's currently booting off a microSD card in the official case, and working with the external connector with the Pi inside the case is a bit tricky :)
The new UART header is presumably to make it easier to attach the Pi Debug Probe. Do we get GPIO 14/15 as a second UART, then? Can we use the UART header for something other than the serial console?
Thank you. I'm just trying to figure out what the whole PCIe setup looks like...
Does the 'external' PCIe lane originate from the SoC or from RP1? When you force PCIe to Gen3 does this affect only the external 'slot' or also interconnection with RP1? Which USB controller is inside the thing?
The NIC has the RPi vendor ID (1de4:0001) but what about USB? Maybe some USB peripherals need to be connected for the controller showing up on the PCIe bus?
They got their own PCIe Vendor ID! :D Also interesting to see that most interfaces are packed into a single device that is named as an Ethernet device. I wonder how the CSI camera is sending back the data through it.
The new UART header is presumably to make it easier to attach the Pi Debug Probe. Do we get GPIO 14/15 as a second UART, then? Can we use the UART header for something other than the serial console?
@rgov you still get the separate UART via RP1 on GPIO 14/15, that one is configured as always via config.txt; the UART header is direct into the BCM2712 SoC.
Does the 'external' PCIe lane originate from the SoC or from RP1? When you force PCIe to Gen3 does this affect only the external 'slot' or also interconnection with RP1? Which USB controller is inside the thing?
@ThomasKaiser ext PCIe from SoC (so when you set dtparam=pciex1_gen=3 that only affects the external PCIe lane, not RP1).
The NIC has the RPi vendor ID (1de4:0001) but what about USB? Maybe some USB peripherals need to be connected for the controller showing up on the PCIe bus?
With a 1TB SSD (SanDisk Extreme) attached:
pi@pi5:~ $ lsusb
Bus 004 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 003 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
Bus 002 Device 002: ID 0781:5588 SanDisk Corp. Extreme Pro
Bus 002 Device 001: ID 1d6b:0003 Linux Foundation 3.0 root hub
Bus 001 Device 001: ID 1d6b:0002 Linux Foundation 2.0 root hub
pi@pi5:~ $ sudo lsusb -t
/: Bus 04.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 5000M
/: Bus 03.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/2p, 480M
/: Bus 02.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/1p, 5000M
|__ Port 1: Dev 2, If 0, Class=Mass Storage, Driver=uas, 5000M
/: Bus 01.Port 1: Dev 1, Class=root_hub, Driver=xhci-hcd/2p, 480M
pi@pi5:~ $ lspci
0001:00:00.0 PCI bridge: Broadcom Inc. and subsidiaries Device 2712 (rev 21)
0001:01:00.0 Ethernet controller: Device 1de4:0001
Not sure what magic they're pulling, it all seems to route through RP1 on those x4 lanes.
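For a rough sense of what the x4 link to RP1 reported by lspci above can carry, the arithmetic can be sketched as below. This is only an estimate: it assumes the standard line encodings (8b/10b for Gen 1/2, 128b/130b for Gen 3) and ignores packet and protocol overhead, and the names `ENCODING` and `link_bandwidth_gbps` are made up for illustration.

```python
# Rough PCIe link bandwidth estimate: raw line rate minus encoding overhead only.
# ENCODING and link_bandwidth_gbps are illustrative names, not part of any tool.

ENCODING = {
    2.5: 8 / 10,     # Gen 1 uses 8b/10b encoding
    5.0: 8 / 10,     # Gen 2 uses 8b/10b encoding
    8.0: 128 / 130,  # Gen 3 uses 128b/130b encoding
}


def link_bandwidth_gbps(gt_per_s: float, lanes: int) -> float:
    """Bandwidth in Gbit/s per direction, before packet/protocol overhead."""
    return gt_per_s * ENCODING[gt_per_s] * lanes


# Values taken from "LnkSta: Speed 5GT/s, Width x4" in the lspci output.
rp1_link = link_bandwidth_gbps(5.0, 4)
print(f"RP1 link: {rp1_link:.1f} Gbit/s ({rp1_link / 8:.1f} GB/s) per direction")
```

Real throughput will be lower than this figure once TLP headers and flow control are accounted for.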
Thanks for the idle power value: important for running off-grid, though still higher than my current 3B at about 1W...
https://www.earth.org.uk/note-on-Raspberry-Pi-3-setup.html
And a short note from me on the RPi5 power: http://www.earth.org.uk/note-on-site-technicals-76.html#2023-09-28
https://gist.github.com/cleverca22/97e46998b5fbd4e2f6772127c0cd034a device tree in here, where you can see all of the fun bits in the rp1, including an adc!
Thanks for this. So in the RP1 Ethernet is Cadence MACB/GEM IP and USB the usual Synopsys DesignWare 3 stuff while the SoC itself also seems to contain a RGMII capable GMAC (if RPi Trading Ltd. comes up with a new CM5 pinout they could maybe use this other Gigabit Ethernet as well)
ah, good catch, i missed that
Looks like it is still limited to USB OTG 2.0 like the Pi4 [1].
Rock 3/4 have USB OTG 3.0 but Rock 5 only has USB OTG 2.0 like all the raspberrypi's.
I need a SBC with USB OTG 3.0 for a fast piSCSI USB Drive.
@geerlingguy have you tried to power the RPi 5 with something else than their new 5V/5A power brick?
Curious whether it supports higher profiles with supply voltages exceeding 5V or whether this limitation is a hard one? Also curious whether debugfs is enabled and something shows up below /sys/kernel/debug/usb/ that hints at which USB PD chip they use. On Rock 5B for example you get detailed negotiation status/process when doing this:
cat /sys/kernel/debug/usb/fusb302-4-0022 /sys/kernel/debug/usb/tcpm-4-0022 | sort
Also curious whether USB-C details show up when doing something like grep "" /sys/class/typec/port0/* 2>/dev/null?
And in general providing dmesg output would be great since it would answer a bunch of questions without us annoying you :)
See video: Raspberry Pi 5: EVERYTHING you need to know
Graph at 9:25 looks suspect: you appear to be using a cumulative plot. So for example the blue "108.9" point is actually at y position 108.9+54.9=163.8. Screenshot: https://sphere.chronosempire.org.uk/~HEx/shots/latency.png
Here's what I'd expect it to look like: https://sphere.chronosempire.org.uk/~HEx/pi5.svg
Code:
import matplotlib.pyplot as plt
from matplotlib import ticker

# x positions are log2 of the block size; the formatter below prints 2**n
size = range(0, 17)
boards = {
    'Pi 4': [0.0,0.0,0.0,0.0,0.0,0.0,4.7,7.2,10.3,11.9,22.7,80.9,108.9,129.4,139.8,145.1,156.5],
    'Pi 5': [0.0,0.0,0.0,0.0,0.0,0.0,0.0,1.1,1.6,1.9,8.3,12.2,54.9,86.8,103.0,112.9,118.7]
}
plt.rcParams["figure.figsize"] = (12, 8)
fig, ax = plt.subplots()
ax.xaxis.set_major_formatter(ticker.FuncFormatter(lambda n, pos: "%d" % (2**n)))
ax.grid()
# Each board is plotted from zero (not stacked on top of the previous one)
for q in boards.keys():
    ax.plot(size, boards[q], marker='.', markeredgecolor='black', label=q)
    ax.fill_between(size, boards[q])
    # Label every data point with its value
    for xy in zip(size, boards[q]):
        ax.annotate('%.1f' % xy[1], xy=xy, textcoords='data')
ax.legend(loc='upper left')
ax.set_title('Memory latency, single random read')
ax.set_xlabel('Size (KiB)')
ax.set_ylabel('Latency (ns)')
plt.savefig("pi5.svg")
Looks like it is still limited to USB OTG 2.0 like the Pi4 [1]. Rock 3/4 have USB OTG 3.0 but Rock 5 only has USB OTG 2.0 like all the raspberrypi's.
@lukts30
ROCK 5A and ROCK 5B both have USB OTG 3.0. On the ROCK 5B, OTG is on the USB-C port. On the ROCK 5A, OTG is on the upper port of the double-layer USB 3.0 connector. Please check the Tech Spec section: https://radxa.com/rock5a/
Now let's just hope there is plenty of supply!
Graph at 9:25 looks suspect: you appear to be using a cumulative plot. So for example the blue "108.9" point is actually at y position 108.9+54.9=163.8.
@hexwab - Oh my! You're correct. Chalk that up to 'was up too late benchmarking' :O
I'm updating the graph and wording in my blog post.
@ThomasKaiser - Full dmesg output is below the fold here:
Following up on https://mast.hpc.social/deck/@geerlingguy@mastodon.social/111143860188995259
2 questions:
CPU: What ISA extensions are supported by the chip? The A76 should be an AArch64 chip that implements baseline Armv8.2 with select extensions, such as the dot product extensions from v8.4 and the speculative load/store memory extensions from v8.5, amongst others (lscpu -e should do it? At worst, cat /proc/cpuinfo >> ~/cpuinfo_dump.txt). Answered here: https://github.com/geerlingguy/sbc-reviews/issues/21#issuecomment-1739759454
GPU/Multimedia ASIC: What formats/profiles does the HEVC decoder support? Is it 8-bit 4:2:0 only? Does it implement 10-bit (required for "proper" HDR)? The easiest way to check would be vainfo, but that assumes that VAAPI is implemented and that it's installed.
Hello, can you test glmark2 and glmark2-es2 score and do a simple comparison between rpi4b?
@hexwab - Oh my! You're correct. Chalk that up to 'was up too late benchmarking' :O
I'm updating the graph and wording in my blog post.
Cheers. While I'm nitpicking: have you considered SVG graphs (rather than JPEG of all things)?
@geerlingguy, do you plan to focus more on "tweaking" in order to achieve the lowest possible idle power consumption? (like disabling hdmi, wifi, bluetooth, undervolting, …) BTW, thanks for your great report and video about RPi 5!
//edit: I like how they did the indicator of the memory version (8G, 4G, 2G, 1G); it's like <input type="radio" value="8G" checked /> in real life.
//edit2: Why is there a "2022"?! It's written on the board under the USB 2 header. Was it produced in 2022?
I would certainly like to know that set of tweaks for my application... B^>
What ISA extensions are supported by the chip?
Either grep for "Flags" here or see there:
fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp asimdhp cpuid asimdrdm lrcpc dcpop asimddp
Gotcha, so it looks like we get NEON (asimd), FP16 (fphp plus asimdhp, or "Advanced SIMD half precision"), rounding doubling multiply accumulate/subtract (asimdrdm), and SIMD dot products (asimddp).
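To make that flag list easier to scan, a small lookup can be sketched as below. The flag string is copied from the comment above; the descriptions are my own short glosses of the Linux arm64 feature names rather than anything stated in this thread, and `DESCRIPTIONS` and `describe` are hypothetical names. Flags without an entry fall through to a placeholder.

```python
# Annotate the "Features" flags from /proc/cpuinfo with short descriptions.
# The descriptions are informal glosses, not quotes from Arm or kernel docs.

FLAGS_LINE = ("fp asimd evtstrm aes pmull sha1 sha2 crc32 atomics fphp "
              "asimdhp cpuid asimdrdm lrcpc dcpop asimddp")

DESCRIPTIONS = {
    "fp": "scalar floating point",
    "asimd": "Advanced SIMD (NEON)",
    "aes": "AES instructions",
    "pmull": "polynomial multiply long",
    "sha1": "SHA-1 instructions",
    "sha2": "SHA-256 instructions",
    "crc32": "CRC32 instructions",
    "atomics": "Armv8.1 LSE atomics",
    "fphp": "half-precision (FP16) scalar floating point",
    "asimdhp": "half-precision (FP16) Advanced SIMD",
    "asimdrdm": "rounding doubling multiply accumulate (SQRDMLAH/SQRDMLSH)",
    "lrcpc": "RCpc load-acquire (LDAPR)",
    "dcpop": "data cache clean to point of persistence",
    "asimddp": "Advanced SIMD dot product (SDOT/UDOT)",
}


def describe(flags_line: str) -> dict:
    """Return {flag: description}, keeping the order in which flags appear."""
    return {flag: DESCRIPTIONS.get(flag, "(not in this table)")
            for flag in flags_line.split()}


features = describe(FLAGS_LINE)
for flag, text in features.items():
    print(f"{flag:10s} {text}")
```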
device tree in here, where you can see all of the fun bits in the rp1, including an adc!
BTW: sbc-bench output collected by Jeff shows lm-sensors output:
rp1_adc-isa-0000
in1: 1.47 V
in2: 2.53 V
in3: 1.41 V
in4: 1.38 V
temp1: +55.5 C
Also accessible via sysfs. There it's hwmon1:
/sys/class/hwmon/hwmon1/in1_input:1475
/sys/class/hwmon/hwmon1/in1_raw:1824
/sys/class/hwmon/hwmon1/in2_input:2539
/sys/class/hwmon/hwmon1/in2_raw:3149
/sys/class/hwmon/hwmon1/in3_input:1413
/sys/class/hwmon/hwmon1/in3_raw:1544
/sys/class/hwmon/hwmon1/in4_input:1385
/sys/class/hwmon/hwmon1/in4_raw:1522
/sys/class/hwmon/hwmon1/name:rp1_adc
/sys/class/hwmon/hwmon1/temp1_input:51985
/sys/class/hwmon/hwmon1/temp1_raw:823
/sys/class/hwmon/hwmon1/uevent:OF_NAME=adc
/sys/class/hwmon/hwmon1/uevent:OF_FULLNAME=/axi/pcie@120000/rp1/adc@c8000
/sys/class/hwmon/hwmon1/uevent:OF_COMPATIBLE_0=raspberrypi,rp1-adc
/sys/class/hwmon/hwmon1/uevent:OF_COMPATIBLE_N=1
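The `_input` values in that listing follow the usual hwmon sysfs convention of millivolts and millidegrees Celsius, which is how 1475 lines up with the 1.47 V that lm-sensors prints. A minimal sketch of the conversion, working on lines in the same `path:value` form as above (the function name `scale_hwmon` and the sample subset are illustrative):

```python
# Convert hwmon "path:value" lines into scaled readings.
# Assumes the hwmon sysfs convention: in*_input is millivolts,
# temp*_input is millidegrees Celsius. *_raw and uevent lines are skipped.

SAMPLE = """\
/sys/class/hwmon/hwmon1/in1_input:1475
/sys/class/hwmon/hwmon1/in1_raw:1824
/sys/class/hwmon/hwmon1/in2_input:2539
/sys/class/hwmon/hwmon1/name:rp1_adc
/sys/class/hwmon/hwmon1/temp1_input:51985
/sys/class/hwmon/hwmon1/temp1_raw:823
"""


def scale_hwmon(listing: str) -> dict:
    """Return {channel: (value, unit)} for every *_input line."""
    readings = {}
    for line in listing.splitlines():
        path, _, value = line.partition(":")
        name = path.rsplit("/", 1)[-1]
        if not name.endswith("_input"):
            continue
        channel = name[: -len("_input")]
        if channel.startswith("in"):
            readings[channel] = (int(value) / 1000, "V")
        elif channel.startswith("temp"):
            readings[channel] = (int(value) / 1000, "°C")
    return readings


readings = scale_hwmon(SAMPLE)
for channel, (value, unit) in readings.items():
    print(f"{channel}: {value:.3f} {unit}")
```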
Hello, can you test glmark2 and glmark2-es2 score and do a simple comparison between rpi4b?
@Headcrabed - mentioned in my video (linked earlier), GLMark2 scores:
hey @geerlingguy . First off great video. I have tried leaving a youtube comment though all my comments get deleted (this has been an ongoing thing for years now, gotta love youtube/google).
I am a developer/maintainer at pi-apps https://github.com/Botspot/pi-apps and just want to say that your experience with installing Steam is not expected or shown to occur by any of our own testing on Bookworm on pi4 (Bookworm beta has been available through invite only on the rpi forum for a month or two now).
If you still have that install I would be interested in seeing the Steam install logs from ~/pi-apps/logs/
My initial guess as to what caused the issue would be that the rpi foundation gave you an image to test and did not update the apt repository with new packages until later (likely today as I saw many updates). Installing the armhf mesa (and dependencies) which the box86/steam scripts do could have inadvertently removed critical arm64 OS packages if the repository had only older versions of armhf packages compared to the locally installed arm64 packages.
Note: Most benchmarks were run with 16k page size; Geekbench 6 would not work with 16k so I rebooted on 4k page size kernel8.
@geerlingguy wait wait wait... this is news to me. On the pi5 they intend on using the 16K pagesize kernel by default? Why? This is actually REALLY important information. Please send back the output of
getconf PAGE_SIZE
in each of the modes you are testing
CC: @botspot
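For scripts that need to adapt to either kernel, the same value that getconf PAGE_SIZE prints can be read from Python. This is just a sketch of the check; it uses only the standard library and assumes a Unix-like system.

```python
# Report the kernel page size, equivalent to `getconf PAGE_SIZE`.
import mmap
import os

page_size = os.sysconf("SC_PAGE_SIZE")  # 4096 on a 4k kernel, 16384 on a 16k kernel
print(f"page size: {page_size} bytes ({page_size // 1024}k)")

# mmap exposes the same value, which matters for code that maps files
# at page-aligned offsets.
print(f"mmap.PAGESIZE: {mmap.PAGESIZE}")
```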
Geekbench 6: (748 single / 1507 multi - https://browser.geekbench.com/v6/cpu/2808487)
Pretty obvious that there's something seriously wrong if a quad-core CPU scores only twice as much 'multi-threaded' compared to 'single-threaded'. Looks like some internal bottleneck with multithreaded tasks?
For anyone interested in how important memory access is, this is the comparison with another 5B (from Radxa) testing only the four A76 cores of Rockchip RK3588: https://browser.geekbench.com/v6/cpu/compare/2808487?baseline=2818821 (sbc-bench -G in Geekbench-Mode by default also tests each CPU cluster individually)
So this is two times four A76 cores (RPi 5B running at 2.4 GHz, Rock 5B at only ~2.3 GHz) in comparison:
(full results output: http://ix.io/4HGm)
Edit: retested with Geekbench 6.2 to match Jeff's version.
Edit 2: Since Jeff right now seems to be the only one who ran GB 6 let's look at the few GB5 scores. Multi-threaded to single-threaded ratio not as bad as with Jeff's GB6 result but still hinting at something seriously wrong since MT scores are just 2.5 higher than ST.
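The scaling concern can be put in numbers using the Geekbench 6 scores quoted above. The sketch below treats perfect linear scaling across four cores as the reference point, which is an idealised assumption (Geekbench 6 multi-core workloads are not expected to scale perfectly), and `mt_scaling` is a hypothetical helper name.

```python
# Multi-thread to single-thread ratio and per-core scaling efficiency.

def mt_scaling(single: float, multi: float, cores: int) -> tuple:
    """Return (MT/ST ratio, fraction of ideal linear scaling achieved)."""
    ratio = multi / single
    return ratio, ratio / cores


# Scores from the Pi 5 result linked above: 748 single / 1507 multi, 4 cores.
ratio, efficiency = mt_scaling(single=748, multi=1507, cores=4)
print(f"MT/ST ratio: {ratio:.2f}x, scaling efficiency: {efficiency:.0%}")
```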
Do we know if it supports non-PD USB-C at full power? I have a bunch of Pi 4b just hanging off a 5.1V Meanwell with USB-C cables and no negotiation.
I, too, am interested in the differences in performance between a 5V/3A supply and Raspberry Pi Trading's 5.1V/5A supply. Supposedly the board does something different when it recognizes the supply, but it's unclear to me whether that just means passing more power through to the USB ports or whether the CPU will use more of it, which also depends on cooling capacity, one assumes.
@dwillmore - There's a flag you can set to override the functionality, which already exists for the Pi 4: usb_max_current_enable. Setting usb_max_current_enable=1 in /boot/config.txt will remove the current limiting for a PSU that negotiates less than 5A, or for powering the Pi via GPIO pins.
There may also be a mechanism for throttling the CPU (either automatically or via some other flag) while preserving the USB max current (1.6A), otherwise it's best to find a suitable 5V/5A PSU.
Also, regarding overclocking, I haven't done much testing yet, but did push arm_freq=2600 and even 2800; 2600 was stable though temps were up about 3-5°C. 2800 was slightly unstable under load, but the system never crashed.
Instead of over_voltage, Pi engineers are recommending using over_voltage_delta (e.g. over_voltage_delta=50000 for an increase of 50 mV), and have explained the reasoning thusly:
over_voltage_delta adds the offset AFTER DVFS has run whereas over_voltage essentially defeats the voltage scaling.
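Since over_voltage_delta takes microvolts (50000 for 50 mV in the example above), it is easy to be off by a factor of 1000 when writing it by hand. The sketch below renders a config.txt fragment from values in MHz and mV. Combining arm_freq and over_voltage_delta under a [pi5] filter is my own assumption about how one might lay it out, not a recommendation from this thread, and `pi5_overclock_fragment` is a hypothetical helper.

```python
# Render a config.txt fragment for a Pi 5 overclock experiment.
# The layout is an assumption; only the individual option names and the
# 50 mV -> 50000 example come from the discussion above.

def pi5_overclock_fragment(arm_freq_mhz: int, extra_mv: int) -> str:
    """Build the fragment; over_voltage_delta is written in microvolts."""
    delta_uv = extra_mv * 1000  # millivolts -> microvolts
    return "\n".join([
        "[pi5]",
        f"arm_freq={arm_freq_mhz}",
        f"over_voltage_delta={delta_uv}",
    ])


fragment = pi5_overclock_fragment(arm_freq_mhz=2600, extra_mv=50)
print(fragment)
```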
Edit 2: Since Jeff right now seems to be the only one who ran GB 6 let's look at the few GB5 scores. Multi-threaded to single-threaded ratio not as bad as with Jeff's GB6 result but still hinting at something seriously wrong since MT scores are just 2.5 higher than ST.
@ThomasKaiser - A couple notes: Geekbench 6.2 doesn't run with 16K page size, so I had to switch to the kernel8 4K page size to get it to run; I have opened an issue with Primate Labs for it: Can't run Geekbench on systems with 16k page size.
Geekbench 6 in general is a strange beast on Arm (thus it's still in Preview officially ;). I think there are still a few bugs to be ironed out, though the single core Geekbench 6 numbers between my Pi 4 and Rock 5 B are identical: https://browser.geekbench.com/user/446940
@theofficialgman - Yes, currently they are trialling 16k page size in the Bookworm release for Pi 5, but switching is as simple as:
[pi5]
kernel=kernel8.img
The alpha testers have been tracking a few issues with software running 32-bit binaries that have issues... the default (between 4k and 16k) isn't set in stone yet, so now would be a good time to run it up the flagpole if there are more implications :)
Some of the multi-thread benches work in cycles and wait for all threads to finish before issuing another. In real use with the likes of whisper.cpp or llama.cpp, running on all cores either doesn't scale well or causes queueing while waiting for the small cores to finish, as the RK3588 is faster using 4 threads and slows on 8. So I guess Geekbench may be similar. Also under linux core 0-3 is little, 4-7 is big and often wondered if Geekbench just picks the 1st core in some tests which may be 0 and little. Geekbench is a bit pants and https://github.com/ThomasKaiser/sbc-bench is likely a bit better, dunno if it is optimised for newer Arm v8.2 though.
Hello, can you test glmark2 and glmark2-es2 score and do a simple comparison between rpi4b?
@Headcrabed - mentioned in my video (linked earlier), GLMark2 scores:
* 117 fullscreen * 905 windowed
Thank you, and could you test clpeak through rusticl or test vkpeak? So we could compare its compute performance between other sbcs.
under linux core 0-3 is little, 4-7 is big
Not with Amlogic A311D2, there it's the other way around.
and often wondered if Geekbench just picks the 1st core in some tests which may be 0 and little.
As written before: when using sbc-bench's Geekbench mode (-G) the clusters will also be tested individually. The little cores were all killed prior to running Geekbench to get an apples to apples comparison. Again: this is RK3588 castrated to meet BCM2712 capabilities: two times quad-core A76, no little cores involved, exact same software, huge multi-threaded difference: https://browser.geekbench.com/v6/cpu/compare/2808487?baseline=2818821
Geekbench 6 in general is a strange beast on Arm
Sure but still... how is it possible that one quad-core A76 (BCM2712) performs so poorly compared to another (RK3588 w/o little cores)?
Sure but still... how is it possible that one quad-core A76 (BCM2712) performs so poorly compared to another (RK3588 w/o little cores)?
Single LPDDR4X chip (dual-channel) on RPi vs two (quad-channel) on RK3588 boards.
RAM bandwidth has a great effect on multi-core performance.
device tree in here
BTW: according to the commented official bcm2712.dtsi there's three PCIe 3.0 controllers inside BCM2712 providing six Gen3 lanes in total (all downgraded to Gen2 speeds by default for obvious reasons like idle consumption)
@geerlingguy I'm interested in what is exposed as part of v4l2_m2m (if anything at all). Could you run the https://github.com/ayufan/camera-streamer/blob/main/tools/dump_cameras.sh?
@theofficialgman - Yes, currently they are trialling 16k page size in the Bookworm release for Pi 5, but switching is as simple as:
[pi5]
kernel=kernel8.img
The alpha testers have been tracking a few issues with software running 32-bit binaries that have issues... the default (between 4k and 16k) isn't set in stone yet, so now would be a good time to run it up the flagpole if there are more implications :)
Great thanks. CC: @ptitseb looks like you will be wanting to implement 16K pagesize in box86 after all.
Oh, thanks for the info. I'll add that in my todo.
@geerlingguy what does the armhf 32bit image use as its kernel? An armv6 kernel with 4k pages? An armv7 kernel with 4k pages? Armv8 with 4 or 16k pages?
There is a lot of Linux software that is pagesize specific (look at asahi Linux's list) and most of these things were only fixed for arm64 (and only in Arch Linux, not necessarily in the debian bookworm packages) https://github.com/AsahiLinux/docs/wiki/Broken-Software
Single LPDDR4X chip (dual-channel) on RPi vs two (quad-channel) on RK3588 boards.
This explains it partially, at least for Geekbench while 7-zip's internal benchmark for example is not affected since here memory latency is important and not so much bandwidth.
This is still crippled RK3588 (little cores killed) with big cores slightly lower clocked than the A76 in BCM2712 and DRAM clocked the first time at 2112 MHz and then only at 528 MHz:
So while Rock 5B with DRAM clocked at just 528 MHz shows significantly lower memcpy/memset scores it still easily outperforms RPi 5B multi-threaded in GB6 which hints at something else going on inside BCM2712 too.
And this is 528 MHz vs. 2112 MHz in direct comparison to see which of the individual Geekbench tests is affected by memory bandwidth and which isn't (that much): https://browser.geekbench.com/v6/cpu/compare/2826821?baseline=2818821
16KB pages + running 32-bit apps in general isn't expected to work all that well... I'm surprised that Raspberry Pi even tried that. Do you happen to know how well that works if at all?
@geerlingguy can you possibly add photo of the other side of the board below the one you have there? Maybe even x-ray photo would be nice :-)
There are 7 pairs of traces going to the PCI-E FPC connector, that may be too much for 1x pci-e? However two pairs go to another layer, maybe these are SD slot data pins after all - is the sd slot on the other side? However I guess there would be another pair of capacitors if there were two pcie lanes going that way.
EDIT: After zooming the flat cable there are only 16 pins so with power and some extra pci-e slot stuff it is just about right for 1 lane.
@geerlingguy Thank you very much for the detailed launch-day coverage!
Does the PTP support via the RP1 include a user-accessible PPS output?
edit: seems like the PTP support is handled by the PHY (BCM54213PE) (which was the reason why the RPi4 does not offer PTP vs. the CM4, right?). So the question would be if its sync pin is routed somewhere accessible?
Basic information
Linux/system information
Benchmark results
CPU
Power
stress-ng --matrix 0
): 9.7 Wtop500
HPL benchmark: 11 W (2.75 Gflops/W)Disk
SanDisk Extreme 128GB microSD
Kioxia XG8 1TB NVMe at PCIe Gen 2
Kioxia XG8 1TB NVMe at PCIe Gen 3
Single SanDisk Extreme PRO USB 3.1 Flash Drive
2x SanDisk Extreme PRO USB 3.1 Flash Drives (simultaneous)
Run benchmark on any attached storage device (e.g. eMMC, microSD, NVMe, SATA) and add results under an additional heading. Download the script with
curl -o disk-benchmark.sh [URL_HERE]
and runsudo DEVICE_UNDER_TEST=/dev/sda DEVICE_MOUNT_PATH=/mnt/sda1 ./disk-benchmark.sh
(assuming the device issda
).Network
Built-in Ethernet
`iperf3 -c $SERVER_IP`: 938 Mbps
`iperf3 --reverse -c $SERVER_IP`: 942 Mbps
`iperf3 --bidir -c $SERVER_IP`: 930 Mbps up, 600 Mbps down
Built-in WiFi
`iperf3 -c $SERVER_IP`: 186 Mbps
`iperf3 --reverse -c $SERVER_IP`: 207 Mbps
`iperf3 --bidir -c $SERVER_IP`: 2.15 Mbps up, 206 Mbps down
USB 3.0 Pluggable USBC-E2500 2.5 Gbps Adapter
`iperf3 -c $SERVER_IP`: 1.55 Gbps
`iperf3 --reverse -c $SERVER_IP`: 300 Mbps
`iperf3 --bidir -c $SERVER_IP`: 1.56 Gbps up, 153 Mbps down
ASUS XG-C100C 10G Network Adapter (Aquantia AQC107)
`iperf3 -c $SERVER_IP`: 5.63 Gbps
`iperf3 --reverse -c $SERVER_IP`: 6.05 Gbps
`iperf3 --bidir -c $SERVER_IP`: 4.40 Gbps up, 2.50 Gbps down
GPU
Compatibility reports:
Memory
tinymembench results:
```
tinymembench v0.4.10 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                             :  5676.0 MB/s (1.9%)
 C copy backwards (32 byte blocks)            :  5649.5 MB/s (0.6%)
 C copy backwards (64 byte blocks)            :  5688.0 MB/s (0.6%)
 C copy                                       :  4870.3 MB/s (0.2%)
 C copy prefetched (32 bytes step)            :  4808.4 MB/s
 C copy prefetched (64 bytes step)            :  4819.7 MB/s (0.2%)
 C 2-pass copy                                :  4887.0 MB/s (0.6%)
 C 2-pass copy prefetched (32 bytes step)     :  4849.2 MB/s
 C 2-pass copy prefetched (64 bytes step)     :  4845.1 MB/s
 C fill                                       : 13672.2 MB/s (0.5%)
 C fill (shuffle within 16 byte blocks)       : 13655.6 MB/s (0.5%)
 C fill (shuffle within 32 byte blocks)       : 13697.6 MB/s (0.5%)
 C fill (shuffle within 64 byte blocks)       : 13703.3 MB/s (0.4%)
 NEON 64x2 COPY                               :  4813.8 MB/s (0.1%)
 NEON 64x2x4 COPY                             :  4832.2 MB/s (0.1%)
 NEON 64x1x4_x2 COPY                          :  4830.8 MB/s (0.1%)
 NEON 64x2 COPY prefetch x2                   :  4298.5 MB/s
 NEON 64x2x4 COPY prefetch x1                 :  4335.5 MB/s
 NEON 64x2 COPY prefetch x1                   :  4306.5 MB/s
 NEON 64x2x4 COPY prefetch x1                 :  4336.0 MB/s
 ---
 standard memcpy                              :  4805.4 MB/s
 standard memset                              : 13676.9 MB/s (0.6%)
 ---
 NEON LDP/STP copy                            :  4793.9 MB/s
 NEON LDP/STP copy pldl2strm (32 bytes step)  :  4821.9 MB/s
 NEON LDP/STP copy pldl2strm (64 bytes step)  :  4829.9 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step)  :  4811.7 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step)  :  4812.1 MB/s
 NEON LD1/ST1 copy                            :  4818.2 MB/s
 NEON STP fill                                : 13659.2 MB/s (0.5%)
 NEON STNP fill                               : 13683.3 MB/s (0.4%)
 ARM LDP/STP copy                             :  4818.7 MB/s
 ARM STP fill                                 : 13557.6 MB/s
 ARM STNP fill                                : 13682.7 MB/s (0.4%)

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read
      1024 :    0.0 ns /    0.0 ns
      2048 :    0.0 ns /    0.0 ns
      4096 :    0.0 ns /    0.0 ns
      8192 :    0.0 ns /    0.0 ns
     16384 :    0.0 ns /    0.0 ns
     32768 :    0.0 ns /    0.0 ns
     65536 :    0.0 ns /    0.0 ns
    131072 :    1.1 ns /    1.5 ns
    262144 :    1.6 ns /    2.0 ns
    524288 :    1.9 ns /    2.2 ns
   1048576 :    8.3 ns /   11.2 ns
   2097152 :   12.2 ns /   14.4 ns
   4194304 :   54.9 ns /   84.1 ns
   8388608 :   86.8 ns /  119.4 ns
  16777216 :  103.0 ns /  131.6 ns
  33554432 :  112.9 ns /  138.3 ns
  67108864 :  118.7 ns /  142.2 ns
```
Phoronix Test Suite
Results from pi-general-benchmark.sh:
Other Data
Crypto performance as measured by OpenSSL (see sbc-bench ARMv8 Crypto Extensions):
PTP Hardware Timestamping support via Cadence GEM: