Raspberry Pi Compute Module 4

geerlingguy commented 1 year ago

Basic information

Board URL (official): https://www.raspberrypi.com/products/compute-module-4/?variant=raspberry-pi-cm4001000
Board purchased from: PiShop.us
Board purchase date: 2021-08-05
Board specs (as tested): Wireless, 8GB, Lite - CM4108000
Board price (as tested): $75.00

Linux/system information

# output of `neofetch`
       _,met$$$$$gg.          pi@cm4 
    ,g$$$$$$$$$$$$$$$P.       ------ 
  ,g$$P"     """Y$$.".        OS: Debian GNU/Linux 12 (bookworm) aarch64 
 ,$$P'              `$$$.     Host: Raspberry Pi Compute Module 4 Rev 1.0 
',$$P       ,ggs.     `$$b:   Kernel: 6.6.28+rpt-rpi-v8 
`d$$'     ,$P"'   .    $$$    Uptime: 36 mins 
 $$P      d$'     ,    $$P    Packages: 1553 (dpkg) 
 $$:      $$.   -    ,d$$'    Shell: bash 5.2.15 
 $$;      Y$b._   _,d$P'      Resolution: 1920x1080 
 Y$$.    `.`"Y$$$$P"'         Terminal: /dev/pts/0 
 `$$b      "-.__              CPU: (4) @ 1.500GHz 
  `Y$$                        Memory: 377MiB / 7810MiB 
   `Y$$.
     `$$b.                                            
       `Y$$b.                                         
          `"Y$b._
              `"""

# output of `uname -a`
Linux cm4 6.6.28+rpt-rpi-v8 #1 SMP PREEMPT Debian 1:6.6.28-1+rpt1 (2024-04-22) aarch64 GNU/Linux

Benchmark results

CPU

Geekbench 6: 256 single / 626 multi (https://browser.geekbench.com/v6/cpu/6192800)
HPL: 11.433 Gflops at 5.2 W - 2.20 Gflops/W
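
The Gflops/W figure is the HPL result divided by the wall power measured during the run. A minimal sketch of that arithmetic, using the two numbers reported above (the function name `gflops_per_watt` is made up for illustration):

```python
def gflops_per_watt(gflops: float, watts: float) -> float:
    """Return compute efficiency as Gflops per watt of wall power."""
    if watts <= 0:
        raise ValueError("watts must be positive")
    return gflops / watts


# Values from the HPL result above: 11.433 Gflops at 5.2 W.
efficiency = gflops_per_watt(11.433, 5.2)
print(f"{efficiency:.2f} Gflops/W")
```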

Power

Idle power draw (at wall): 2.5 W
Maximum simulated power draw (stress-ng --matrix 0): 4.6 W
During Geekbench multicore benchmark: 4.9 W
During top500 HPL benchmark: 5.2 W
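
To see how much each workload adds on top of idle, subtract the idle reading from each load reading. A quick sketch using only the wall measurements listed above (the names `IDLE_W`, `loads_w`, and `above_idle` are illustrative):

```python
# Wall power readings from the list above, in watts.
IDLE_W = 2.5
loads_w = {
    "stress-ng --matrix 0": 4.6,
    "Geekbench 6 multicore": 4.9,
    "top500 HPL": 5.2,
}


def above_idle(load_w: float, idle_w: float = IDLE_W) -> float:
    """Return the extra power drawn under load, rounded to 0.1 W."""
    return round(load_w - idle_w, 1)


for name, watts in loads_w.items():
    print(f"{name}: +{above_idle(watts):.1f} W over idle")
```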

Disk

Samsung 512GB Pro Plus microSD card

| Benchmark | Result |
| --- | --- |
| iozone 4K random read | 16.03 MB/s |
| iozone 4K random write | 5.54 MB/s |
| iozone 1M random read | 41.51 MB/s |
| iozone 1M random write | 30.47 MB/s |
| iozone 1M sequential read | 41.46 MB/s |
| iozone 1M sequential write | 30.74 MB/s |

Network

iperf3 results:

iperf3 -c $SERVER_IP: 938 Mbps
iperf3 -c $SERVER_IP --reverse: 942 Mbps
iperf3 -c $SERVER_IP --bidir: 940 Mbps up, 16.5 Mbps down
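
If you would rather not read the Mbps figure off the terminal, iperf3 can emit a JSON report with `--json`. A minimal sketch for pulling the receiver-side throughput out of such a report, assuming a TCP test and the usual `end.sum_received.bits_per_second` field; the `sample` string below is a hand-trimmed stand-in, not real captured output:

```python
import json


def received_mbps(report_text: str) -> float:
    """Return receiver-side throughput in Mbps from an iperf3 JSON report."""
    report = json.loads(report_text)
    bits_per_second = report["end"]["sum_received"]["bits_per_second"]
    return bits_per_second / 1_000_000


# Hand-written, heavily trimmed stand-in for `iperf3 -c $SERVER_IP --json`.
sample = '{"end": {"sum_received": {"bits_per_second": 938000000.0}}}'
print(f"{received_mbps(sample):.0f} Mbps")
```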


GPU

glmark2-es2 results:

=======================================================
    glmark2 2023.01
=======================================================
    OpenGL Information
    GL_VENDOR:      Broadcom
    GL_RENDERER:    V3D 4.2
    GL_VERSION:     OpenGL ES 3.1 Mesa 23.2.1-1~bpo12+rpt3
    Surface Config: buf=32 r=8 g=8 b=8 a=8 depth=24 stencil=0 samples=0
    Surface Size:   800x600 windowed
=======================================================
[build] use-vbo=false: FPS: 999 FrameTime: 1.001 ms
[build] use-vbo=true: FPS: 1519 FrameTime: 0.659 ms
[texture] texture-filter=nearest: FPS: 1238 FrameTime: 0.808 ms
[texture] texture-filter=linear: FPS: 1224 FrameTime: 0.817 ms
[texture] texture-filter=mipmap: FPS: 1206 FrameTime: 0.829 ms
[shading] shading=gouraud: FPS: 1178 FrameTime: 0.849 ms
[shading] shading=blinn-phong-inf: FPS: 925 FrameTime: 1.082 ms
[shading] shading=phong: FPS: 724 FrameTime: 1.383 ms
[shading] shading=cel: FPS: 688 FrameTime: 1.455 ms
[bump] bump-render=high-poly: FPS: 590 FrameTime: 1.695 ms
[bump] bump-render=normals: FPS: 1242 FrameTime: 0.805 ms
[bump] bump-render=height: FPS: 1150 FrameTime: 0.870 ms
[effect2d] kernel=0,1,0;1,-4,1;0,1,0;: FPS: 437 FrameTime: 2.292 ms
[effect2d] kernel=1,1,1,1,1;1,1,1,1,1;1,1,1,1,1;: FPS: 217 FrameTime: 4.619 ms
[pulsar] light=false:quads=5:texture=false: FPS: 1331 FrameTime: 0.751 ms
[desktop] blur-radius=5:effect=blur:passes=1:separable=true:windows=4: FPS: 114 FrameTime: 8.785 ms
[desktop] effect=shadow:windows=4: FPS: 434 FrameTime: 2.304 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 178 FrameTime: 5.646 ms
[buffer] columns=200:interleave=false:update-dispersion=0.9:update-fraction=0.5:update-method=subdata: FPS: 180 FrameTime: 5.561 ms
[buffer] columns=200:interleave=true:update-dispersion=0.9:update-fraction=0.5:update-method=map: FPS: 226 FrameTime: 4.426 ms
[ideas] speed=duration: FPS: 837 FrameTime: 1.195 ms
[jellyfish] <default>: FPS: 414 FrameTime: 2.420 ms
[terrain] <default>: FPS: 26 FrameTime: 39.617 ms
[shadow] <default>: FPS: 108 FrameTime: 9.317 ms
[refract] <default>: FPS: 35 FrameTime: 28.797 ms
[conditionals] fragment-steps=0:vertex-steps=0: FPS: 1401 FrameTime: 0.714 ms
[conditionals] fragment-steps=5:vertex-steps=0: FPS: 723 FrameTime: 1.384 ms
[conditionals] fragment-steps=0:vertex-steps=5: FPS: 1326 FrameTime: 0.755 ms
[function] fragment-complexity=low:fragment-steps=5: FPS: 1027 FrameTime: 0.974 ms
[function] fragment-complexity=medium:fragment-steps=5: FPS: 627 FrameTime: 1.595 ms
[loop] fragment-loop=false:fragment-steps=5:vertex-steps=5: FPS: 979 FrameTime: 1.022 ms
[loop] fragment-steps=5:fragment-uniform=false:vertex-steps=5: FPS: 980 FrameTime: 1.021 ms
[loop] fragment-steps=5:fragment-uniform=true:vertex-steps=5: FPS: 602 FrameTime: 1.663 ms
=======================================================
                                  glmark2 Score: 753 
=======================================================
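
FrameTime in the glmark2 output is roughly the reciprocal of FPS: 1000 ms divided by frames per second. A small sketch for checking that against a few rows from the run above. The reported pairs only agree approximately, and this sketch makes no attempt to explain how glmark2 rounds or averages either column:

```python
def frame_time_ms(fps: float) -> float:
    """Convert frames per second to average milliseconds per frame."""
    return 1000.0 / fps


# (FPS, reported FrameTime in ms) pairs copied from the run above.
samples = {
    "build use-vbo=false": (999, 1.001),
    "terrain": (26, 39.617),
    "refract": (35, 28.797),
}

for name, (fps, reported_ms) in samples.items():
    computed = frame_time_ms(fps)
    print(f"{name}: {computed:.3f} ms computed, {reported_ms} ms reported")
```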

TODO: See this issue for discussion about a full suite of standardized GPU benchmarks.

Memory

tinymembench results:

```
tinymembench v0.4.10 (simple benchmark for memory throughput and latency)

==========================================================================
== Memory bandwidth tests                                               ==
==                                                                      ==
== Note 1: 1MB = 1000000 bytes                                          ==
== Note 2: Results for 'copy' tests show how many bytes can be          ==
==         copied per second (adding together read and writen           ==
==         bytes would have provided twice higher numbers)              ==
== Note 3: 2-pass copy means that we are using a small temporary buffer ==
==         to first fetch data into it, and only then write it to the   ==
==         destination (source -> L1 cache, L1 cache -> destination)    ==
== Note 4: If sample standard deviation exceeds 0.1%, it is shown in    ==
==         brackets                                                     ==
==========================================================================

 C copy backwards                            :   2901.5 MB/s (1.2%)
 C copy backwards (32 byte blocks)           :   2900.6 MB/s
 C copy backwards (64 byte blocks)           :   2894.8 MB/s
 C copy                                      :   2509.5 MB/s (0.4%)
 C copy prefetched (32 bytes step)           :   2887.2 MB/s
 C copy prefetched (64 bytes step)           :   2883.8 MB/s (0.6%)
 C 2-pass copy                               :   1515.9 MB/s (0.2%)
 C 2-pass copy prefetched (32 bytes step)    :   2346.4 MB/s (0.3%)
 C 2-pass copy prefetched (64 bytes step)    :   2376.9 MB/s (0.3%)
 C fill                                      :   3227.5 MB/s (1.2%)
 C fill (shuffle within 16 byte blocks)      :   3225.9 MB/s (1.2%)
 C fill (shuffle within 32 byte blocks)      :   3203.1 MB/s (0.8%)
 C fill (shuffle within 64 byte blocks)      :   3152.5 MB/s (0.4%)
 NEON 64x2 COPY                              :   2855.8 MB/s (0.6%)
 NEON 64x2x4 COPY                            :   2854.5 MB/s
 NEON 64x1x4_x2 COPY                         :   2853.7 MB/s
 NEON 64x2 COPY prefetch x2                  :   2838.9 MB/s
 NEON 64x2x4 COPY prefetch x1                :   2843.2 MB/s
 NEON 64x2 COPY prefetch x1                  :   2842.5 MB/s
 NEON 64x2x4 COPY prefetch x1                :   2844.5 MB/s
 ---
 standard memcpy                             :   2831.0 MB/s
 standard memset                             :   3200.7 MB/s (0.8%)
 ---
 NEON LDP/STP copy                           :   2855.2 MB/s
 NEON LDP/STP copy pldl2strm (32 bytes step) :   2839.4 MB/s
 NEON LDP/STP copy pldl2strm (64 bytes step) :   2842.4 MB/s
 NEON LDP/STP copy pldl1keep (32 bytes step) :   2840.1 MB/s
 NEON LDP/STP copy pldl1keep (64 bytes step) :   2837.3 MB/s
 NEON LD1/ST1 copy                           :   2855.0 MB/s
 NEON STP fill                               :   3215.7 MB/s (0.8%)
 NEON STNP fill                              :   2373.0 MB/s (2.8%)
 ARM LDP/STP copy                            :   2856.0 MB/s
 ARM STP fill                                :   3224.5 MB/s (0.9%)
 ARM STNP fill                               :   2432.6 MB/s (2.4%)

==========================================================================
== Framebuffer read tests.                                              ==
==                                                                      ==
== Many ARM devices use a part of the system memory as the framebuffer, ==
== typically mapped as uncached but with write-combining enabled.       ==
== Writes to such framebuffers are quite fast, but reads are much       ==
== slower and very sensitive to the alignment and the selection of      ==
== CPU instructions which are used for accessing memory.                ==
==                                                                      ==
== Many x86 systems allocate the framebuffer in the GPU memory,         ==
== accessible for the CPU via a relatively slow PCI-E bus. Moreover,    ==
== PCI-E is asymmetric and handles reads a lot worse than writes.       ==
==                                                                      ==
== If uncached framebuffer reads are reasonably fast (at least 100 MB/s ==
== or preferably >300 MB/s), then using the shadow framebuffer layer    ==
== is not necessary in Xorg DDX drivers, resulting in a nice overall    ==
== performance improvement. For example, the xf86-video-fbturbo DDX     ==
== uses this trick.                                                     ==
==========================================================================

 NEON LDP/STP copy (from framebuffer)        :    763.2 MB/s (0.6%)
 NEON LDP/STP 2-pass copy (from framebuffer) :    680.4 MB/s
 NEON LD1/ST1 copy (from framebuffer)        :    836.6 MB/s
 NEON LD1/ST1 2-pass copy (from framebuffer) :    699.4 MB/s
 ARM LDP/STP copy (from framebuffer)         :    578.4 MB/s (0.2%)
 ARM LDP/STP 2-pass copy (from framebuffer)  :    549.3 MB/s (0.8%)

==========================================================================
== Memory latency test                                                  ==
==                                                                      ==
== Average time is measured for random memory accesses in the buffers   ==
== of different sizes. The larger is the buffer, the more significant   ==
== are relative contributions of TLB, L1/L2 cache misses and SDRAM      ==
== accesses. For extremely large buffer sizes we are expecting to see   ==
== page table walk with several requests to SDRAM for almost every      ==
== memory access (though 64MiB is not nearly large enough to experience ==
== this effect to its fullest).                                         ==
==                                                                      ==
== Note 1: All the numbers are representing extra time, which needs to  ==
==         be added to L1 cache latency. The cycle timings for L1 cache ==
==         latency can be usually found in the processor documentation. ==
== Note 2: Dual random read means that we are simultaneously performing ==
==         two independent memory accesses at a time. In the case if    ==
==         the memory subsystem can't handle multiple outstanding       ==
==         requests, dual random read has the same timings as two       ==
==         single reads performed one after another.                    ==
==========================================================================

block size : single random read / dual random read
      1024 :    0.0 ns          /     0.0 ns
      2048 :    0.0 ns          /     0.0 ns
      4096 :    0.0 ns          /     0.0 ns
      8192 :    0.0 ns          /     0.0 ns
     16384 :    0.0 ns          /     0.0 ns
     32768 :    0.0 ns          /     0.0 ns
     65536 :    5.7 ns          /     8.9 ns
    131072 :    8.6 ns          /    11.9 ns
    262144 :   12.3 ns          /    15.8 ns
    524288 :   14.2 ns          /    18.1 ns
   1048576 :   25.7 ns          /    38.4 ns
   2097152 :   80.2 ns          /   115.9 ns
   4194304 :  107.4 ns          /   139.2 ns
   8388608 :  128.2 ns          /   160.1 ns
  16777216 :  138.6 ns          /   169.8 ns
  33554432 :  144.0 ns          /   175.1 ns
  67108864 :  155.0 ns          /   193.6 ns
```

`sbc-bench` results

Before at 36.5°C:

    cpu0 (Cortex-A72): OPP: 1500, ThreadX: 1500, Measured: 1498 

After at 55.0°C:

    cpu0 (Cortex-A72): OPP: 1500, ThreadX: 1500, Measured: 1498 

### Performance baseline

  * memcpy: 2583.9 MB/s, memchr: 4872.6 MB/s, memset: 3117.5 MB/s
  * 16M latency: 153.5 157.7 157.6 157.9 157.4 159.2 164.1 200.5 
  * 128M latency: 175.3 173.7 172.0 172.4 174.6 181.9 186.9 210.3 
  * 7-zip MIPS (3 consecutive runs): 5051, 5078, 5095 (5070 avg), single-threaded: 1517
  * `aes-256-cbc      27610.00k    29333.76k    29992.53k    30169.77k    30217.56k    30212.10k`
  * `aes-256-cbc      27635.33k    29337.28k    30000.64k    30168.75k    30203.90k    30212.10k`

Phoronix Test Suite

Results from pi-general-benchmark.sh:

pts/encode-mp3: 29.619 sec
pts/x264 4K: 1.53 fps
pts/x264 1080p: 6.64 fps
pts/phpbench: 166730
pts/build-linux-kernel (defconfig): 7291.734 sec

github-actions[bot] commented 1 year ago

This issue has been marked 'stale' due to lack of recent activity. If there is no further activity, the issue will be closed in another 30 days. Thank you for your contribution!

Please read this blog post to see the reasons why I mark issues as stale.

aspencuozzo commented 5 months ago

I'd be very interested to see some of these numbers if anyone could run the tests. I'm particularly curious how power draw compares to the standard Pi 4, as the CM4 datasheet says it uses slightly less.

geerlingguy commented 5 months ago

@aspencuozzo - Ha! I completely forgot that I hadn't filled in all the details here. Will do so today.
