Closed geerlingguy closed 4 months ago
First step for bringup was getting the boards working. I plugged them into my original Turing Pi 2 prototype board... which is a bit worse for the wear :)
I couldn't get them to boot, and that old board didn't have PWM fan headers either, so I popped them into my kickstarter "v2.4" board, and they booted and had fans running full blast—but I was told I should upgrade to the 2.0.0 (currently RC1) firmware to get full functionality.
So I initially tried updating over the BMC Web UI, and found that later versions only ship an .img file, not an .swu file, so you have to use one of the more 'advanced' BMC update methods.
I initially tried using PhoenixSuit on my Windows PC, but kept running into all kinds of permissions issues (would not like to work in an ecosystem that requires disabling so many security features, heh...). Then I was told I could just flash the .img to microSD, boot off it, and get it upgraded that way.
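The microSD route boils down to decompressing the image and writing it raw to the card. A minimal sketch, assuming a downloaded file named `fw.img.xz` (the real filename differs); it writes to a temp file here for illustration, where on real hardware you would target the card's device node instead:

```shell
# Stand-in for the downloaded firmware image (assumed name fw.img.xz):
printf 'firmware-image-bytes' | xz > /tmp/fw.img.xz

# Decompress and write raw. On real hardware, replace /tmp/sdcard.img with
# the microSD device node (e.g. /dev/sdX, found via `lsblk`) and use sudo.
xz -dc /tmp/fw.img.xz | dd of=/tmp/sdcard.img bs=4M conv=fsync status=none

cat /tmp/sdcard.img   # → firmware-image-bytes
```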
So it was as easy as: type `CONFIRM` over UART and press Enter, and the firmware will upgrade. After removing the microSD card and rebooting the board, it shows:
```
 _____ _   _ ____  ___ _   _  ____
|_   _| | | |  _ \|_ _| \ | |/ ___|
  | | | | | | |_) || ||  \| | |  _
  | | | |_| |  _ < | || |\  | |_| |
  |_|  \___/|_| \_\___|_| \_|\____|

Welcome to Turing Pi

turingpi login: 2023-11-02T22:04:00.464Z INFO [bmcd] Turing Pi 2 BMC Daemon v2.0.0
```
Yay!
I also attempted to install the RK1 into a Jetson development board. The product page states:
The Turing RK1 compute module is compatible with Nvidia Jetson pin layout. This means you can plug it into any Nvidia Jetson carrier board.
Unfortunately, the tiny retention pins on the slot of my carrier board seemed to smash into two tiny components on either side of the RK1, making it close to impossible to reliably insert the board (without bending some of the metal tabs...).
The Jetson Nano has two tiny holes in the PCB in those areas to accept the little underside retention tabs, the RK1 does not. I'll upload a picture of the problem if I can remember :)
So back to Turing Pi 2 we go!
It looks like @Joshua-Riek maintains a bunch of images for Rockchip RK3588 devices, and supports RK1 in that repo: https://github.com/Joshua-Riek/ubuntu-rockchip/releases/tag/v1.29
I've downloaded ubuntu-22.04.3-preinstalled-server-arm64-turing-rk1.img.xz
and am flashing it to node 1 via the BMC web UI... fingers crossed!
In the BMC logs:
2023-11-02T22:20:05.360Z INFO [bmcd::persistency::app_persistency] commiting persistency to disk
2023-11-02T22:30:25.600Z INFO [bmcd::api::streaming_data_service] #1779206829 'node 1 upgrade service' 6.58 GB - started
2023-11-02T22:30:25.607Z INFO [bmcd::app::bmc_application] Powering off node Node1...
2023-11-02T22:30:26.219Z INFO [bmcd::app::bmc_application] Prerequisite settings toggled, powering on...
2023-11-02T22:30:27.441Z INFO [bmcd::app::bmc_application] Checking for presence of a USB device...
2023-11-02T22:30:27.448Z INFO [bmcd::firmware_update::rockusb_fwudate] Maskrom mode detected. loading usb-plug..
2023-11-02T22:30:28.888Z INFO [bmcd::app::firmware_runner] started writing to node 1
...
After all that, I got:
2023-11-02T22:30:28.888Z INFO [bmcd::app::firmware_runner] started writing to node 1
2023-11-02T22:35:26.221Z INFO [bmcd::persistency::app_persistency] commiting persistency to disk
2023-11-02T23:31:21.412Z INFO [bmcd::app::firmware_runner] Verifying checksum of written data to node 1
2023-11-02T23:46:51.843Z INFO [bmcd::app::firmware_runner] Flashing node 1 successful, restarting device...
2023-11-02T23:46:52.269Z INFO [bmcd::api::streaming_data_service] worker done. took 1h 16m 26s 669ms 421us 43ns (#1779206829)
thread 'main' panicked at 'source slice length (0) does not match destination slice length (16384)', src/utils/ring_buf.rs:22:33
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
And the web UI showed: Upgrade failed: [object Object]
Trying with an updated version from https://github.com/Joshua-Riek/ubuntu-rockchip/actions/runs/6740498946
After the flash, the BMC UI still showed `Upgrade failed: [object Object]`, but over in the UART console:
2023-11-03T13:54:23.395Z INFO [bmcd::app::firmware_runner] started writing to node 1
2023-11-03T14:55:39.577Z INFO [bmcd::app::firmware_runner] Verifying checksum of written data to node 1
2023-11-03T15:11:02.460Z INFO [bmcd::app::firmware_runner] Flashing node 1 successful, restarting device...
2023-11-03T15:11:02.949Z INFO [bmcd::api::streaming_data_service] worker done. took 1h 16m 42s 748ms 472us 61ns (#3864914314)
...
...it looks like the RK1 is booted, but it's not giving me any SSH access. Default user/pass should be `ubuntu` (both).
Rebooted again, and now I've found it on the network, logged in and running some tests :)
Once the Ubuntu image is loaded, the PWM fan ramps down at idle, which is nice. It only ramps up to full speed under sustained all-core load.
The little orange Ethernet LED also works now—with the factory buildroot test image, only the green ACT LED would be blinking for Ethernet.
I'm going to flash the other two now.
(The fans are blissfully quiet now that they're all on PWM.)
It might need a moment for cloud-init to do its thing. But the default credentials are ubuntu and ubuntu. You can connect to the node through UART on the BMC if ssh does not want to work.
On the BMC, use `microcom` to connect: `microcom -s 115200 /dev/ttySX`
Slot number to devnode:

| Slot | Devnode |
|---|---|
| Slot 1 | /dev/ttyS2 |
| Slot 2 | /dev/ttyS1 |
| Slot 3 | /dev/ttyS4 |
| Slot 4 | /dev/ttyS5 |
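For scripting against the BMC, the slot-to-devnode table above can be wrapped in a small (hypothetical) shell helper:

```shell
# Hypothetical helper based on the slot-to-devnode table above.
slot_to_tty() {
  case "$1" in
    1) echo /dev/ttyS2 ;;
    2) echo /dev/ttyS1 ;;
    3) echo /dev/ttyS4 ;;
    4) echo /dev/ttyS5 ;;
    *) echo "unknown slot: $1" >&2; return 1 ;;
  esac
}

# e.g. attach to slot 3's console:
#   microcom -s 115200 "$(slot_to_tty 3)"
slot_to_tty 3   # → /dev/ttyS4
```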
That's a nice flashing experience, better than I was expecting having worked with the CM4. So you insert an SD card with an image, then use the BMC over HTTP to flash it to a node's eMMC? Is the serial console over USB a must, or would the HTTP BMC work on its own?
@Joshua-Riek Your images look like a great community resource. Those build times look quite long, though. Are you using QEMU for them, or how do you build them? I ask because we've been building on bare-metal aarch64 servers and seeing huge improvements in build time.
I cross-compile the bootloader and kernel, but then the rootfs is built with QEMU. I have an army of RK3588 dev boards that would be perfect to set up as GitHub runners to speed things up, but lack the space right now to do so. I need to optimize the build process and figure out how best to proceed. I never thought I would need this much computational power when the project started, but it should only grow from here.
All my benchmarks are a few percentage points faster than Rock 5 B, which is interesting.
According to the submitted sbc-bench result, the DMC memory governor was set to `performance` (no idea whether that's @Joshua-Riek's intention or caused by running `sbc-bench -r` prior, which adjusts the DMC memory governor to 'always top speed'). So if you ran Geekbench after `sbc-bench -r` without any reboot in between, Geekbench was executed with DRAM clocked at 2112 MHz the whole time.

When you benchmarked your Rock 5B, most probably the software shipped back then defaulted to the `dmc_ondemand` memory governor, adjusting memory clock dynamically between 528 MHz and 2112 MHz based on CPU utilization. Actual benchmark scores then depended on the `up_threshold` value, see at the very bottom of:
The difference between clocking DRAM at either 528 MHz or 2112 MHz on RK3588 depends on the benchmark in question. Some (like the infamous `sysbench`, or older benchmarks in general where the whole data fits inside modern CPUs' caches) are not affected at all, but for example with Geekbench 6 it looks like this (the individual tests are the interesting ones, since they show which tests depend on memory latency/bandwidth and which rather don't):
https://browser.geekbench.com/v6/cpu/compare/3337274?baseline=3336736
There is a system script to set the CPU / GPU governor to `performance` on boot, to address some performance and micro-stutter issues for desktop users. But if the DMC memory governor is set to `performance` by default, that is not intentional.
> There is a system script to set the CPU / GPU governor to `performance` on boot
That's important to know for @geerlingguy since measuring idle performance is negatively affected by that.
With Jeff's use cases for the RK1 in mind, I would switch back to `ondemand` if I were him and use `/etc/sysfs.d/tk-optimize-rk3588.conf` from the aforementioned link to get both low idle consumption as well as high overall performance, since with RK's defaults (no `io_is_busy` and especially `up_threshold = 40`) the CPU clockspeeds won't ramp up quickly enough for mixed workloads (and also a lot of benchmarks).
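As a sketch of what such a sysfs tuning file might contain (paths are relative to `/sys`; the exact devfreq path and values here are assumptions from typical Rockchip kernels, and the `tk-optimize-rk3588.conf` from the linked thread is the authoritative version):

```
# Hypothetical /etc/sysfs.d/ fragment; paths relative to /sys, values illustrative.
class/devfreq/dmc/governor = dmc_ondemand
devices/system/cpu/cpufreq/ondemand/io_is_busy = 1
devices/system/cpu/cpufreq/ondemand/up_threshold = 25
```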
@geerlingguy re this comment:
https://github.com/geerlingguy/sbc-reviews/issues/25#issuecomment-1791634464
I flashed the same firmware from the link you gave to a microSD card, inserted it on the rear, booted and watched the serial console. It just boots normally without the prompt you saw. Did you do anything else like short any headers?
The docs also mention SSH being available, but this gives `Permission denied, please try again.`
It looks like the modules are complete! See: https://twitter.com/turingpi/status/1746937693110403303
I checked on my order from August 15 (for 4x 8 GB models plus heatsinks) and so far it still says 'Processing'.
Mine turned up last night. Just flashing them now!
@AnEvilPenguin - how goes testing? My shipment should arrive today! Hoping to re-test everything on a production unit and see how it goes.
Installation and flashing went smoothly (though I had to reboot each module myself - docs imply that they should handle it themselves).
Heatsinks are nice and quiet (especially compared to some of the Rpi PoE modules I've tried!). Detected all of my nvme drives with no issues (I've had issues there in the past).
Done a basic k3s setup and just need to figure out where I want to take it next. I've not really pushed them that hard yet, but so far they feel pretty snappy compared to my Rpi 4bs and Jetson Nano.
I can say that I'm pretty pleased with them! Overall a good experience so far! I look forward to hearing your thoughts as well!
Got two 16 GB modules yesterday, looks like they decreased eMMC from 64 GB to 32 GB.
All RK1 modules contain a 32GB flash. I checked the photo from the first post just to make sure about this specific unit, and the flash chip is a 32GB model. :)
- Daniel (Turing Machines)
@geerlingguy Could we correct the flash size in your post? The flash size is 32GB. There are 2 places that mention 64GB - the basic information section and the benchmark results section. Thank you! :)
- Daniel (Turing Machines)
@daniel-kukiela - Done! I hope to re-test with my new production copies this month!
Thank you!
As for the disk speeds, for anyone wondering, the maximum speed should be roughly twice what the benchmark results show right now - there was a software bug that caused the flash chip to not be configured correctly, but recent OS versions have this fixed.
I got a shipping notice for my couple of 32GB RAM ones today.
Following up with some testing of the 32GB module (since the 16GB ones I've been testing are earlier revision, and I would like to base my final review on the 32GB ones): https://github.com/geerlingguy/sbc-reviews/issues/38
Do you still consider updating the disk benchmark results here (or removing them)? The firmware version you used there was a pre-release and was not meant to be used for this purpose. The issue with the storage speed has been fixed with the first firmware release and these numbers here might suggest lower performance of the storage in 16GB modules. :)
@geerlingguy Just in case you are interested in the H264/H265 hardware decoding/encoding performance on RK3588. Try this ffmpeg-rockchip, which has RKMPP & RKRGA support, and it also has a Wiki page. Based on it, Jellyfin recently added hardware transcoding support for RK3588.
@daniel-kukiela - Do you have a good guide for upgrading the firmware on the 16GB boards? Or is it just `apt upgrade`? I'll re-test tomorrow.
@Joshua-Riek I believe `apt upgrade` should work in this case, right?
Yes, that is correct.
Thank you @Joshua-Riek .
So, yes @geerlingguy, `apt upgrade`. Thank you!
@daniel-kukiela - Updated the results for eMMC, looks like it is quite improved, thanks :)
Thank you! :)
NOTE: These benchmarks are preliminary (run on an early preproduction batch of the board)—I will be re-testing with my production boards soon!
Basic information
Linux/system information
Benchmark results
CPU
Power
(Currently can't measure individual node power consumption.)
- Maximum simulated power draw (`stress-ng --matrix 0`): TODO W
- top500 HPL benchmark: TODO W

Disk

Built-in eMMC (32GB)
curl https://raw.githubusercontent.com/geerlingguy/pi-cluster/master/benchmarks/disk-benchmark.sh | sudo bash
Run the benchmark on any attached storage device (e.g. eMMC, microSD, NVMe, SATA) and add results under an additional heading. Download the script with `curl -o disk-benchmark.sh [URL_HERE]` and run `sudo DEVICE_UNDER_TEST=/dev/sda DEVICE_MOUNT_PATH=/mnt/sda1 ./disk-benchmark.sh` (assuming the device is `sda`). Also consider running the PiBenchmarks.com script.
Network
`iperf3` results:

- `iperf3 -c $SERVER_IP`: 943 Mbps
- `iperf3 --reverse -c $SERVER_IP`: 927 Mbps
- `iperf3 --bidir -c $SERVER_IP`: 939 Mbps up, 236 Mbps down

(Be sure to test all interfaces, noting any that are non-functional.)
GPU
Memory
`tinymembench` results:

Click to expand memory benchmark result
``` tinymembench v0.4.10 (simple benchmark for memory throughput and latency) ========================================================================== == Memory bandwidth tests == == == == Note 1: 1MB = 1000000 bytes == == Note 2: Results for 'copy' tests show how many bytes can be == == copied per second (adding together read and writen == == bytes would have provided twice higher numbers) == == Note 3: 2-pass copy means that we are using a small temporary buffer == == to first fetch data into it, and only then write it to the == == destination (source -> L1 cache, L1 cache -> destination) == == Note 4: If sample standard deviation exceeds 0.1%, it is shown in == == brackets == ========================================================================== C copy backwards : 11768.5 MB/s (21.3%) C copy backwards (32 byte blocks) : 11745.9 MB/s C copy backwards (64 byte blocks) : 11746.9 MB/s C copy : 12041.0 MB/s C copy prefetched (32 bytes step) : 12222.2 MB/s C copy prefetched (64 bytes step) : 12252.1 MB/s C 2-pass copy : 5175.5 MB/s (0.5%) C 2-pass copy prefetched (32 bytes step) : 7927.3 MB/s C 2-pass copy prefetched (64 bytes step) : 8370.9 MB/s C fill : 30185.7 MB/s (0.4%) C fill (shuffle within 16 byte blocks) : 30183.5 MB/s C fill (shuffle within 32 byte blocks) : 30183.9 MB/s C fill (shuffle within 64 byte blocks) : 30183.6 MB/s NEON 64x2 COPY : 12238.2 MB/s NEON 64x2x4 COPY : 12171.9 MB/s NEON 64x1x4_x2 COPY : 5605.7 MB/s NEON 64x2 COPY prefetch x2 : 11016.5 MB/s NEON 64x2x4 COPY prefetch x1 : 11359.5 MB/s NEON 64x2 COPY prefetch x1 : 11104.4 MB/s NEON 64x2x4 COPY prefetch x1 : 11359.0 MB/s --- standard memcpy : 12208.5 MB/s standard memset : 30184.2 MB/s (0.6%) --- NEON LDP/STP copy : 12232.0 MB/s NEON LDP/STP copy pldl2strm (32 bytes step) : 12201.4 MB/s NEON LDP/STP copy pldl2strm (64 bytes step) : 12224.4 MB/s NEON LDP/STP copy pldl1keep (32 bytes step) : 12251.8 MB/s NEON LDP/STP copy pldl1keep (64 bytes step) : 12247.8 MB/s NEON LD1/ST1 copy : 
12175.6 MB/s NEON STP fill : 30185.9 MB/s (0.6%) NEON STNP fill : 30183.7 MB/s ARM LDP/STP copy : 12215.6 MB/s ARM STP fill : 30181.5 MB/s (0.6%) ARM STNP fill : 30182.1 MB/s ========================================================================== == Framebuffer read tests. == == == == Many ARM devices use a part of the system memory as the framebuffer, == == typically mapped as uncached but with write-combining enabled. == == Writes to such framebuffers are quite fast, but reads are much == == slower and very sensitive to the alignment and the selection of == == CPU instructions which are used for accessing memory. == == == == Many x86 systems allocate the framebuffer in the GPU memory, == == accessible for the CPU via a relatively slow PCI-E bus. Moreover, == == PCI-E is asymmetric and handles reads a lot worse than writes. == == == == If uncached framebuffer reads are reasonably fast (at least 100 MB/s == == or preferably >300 MB/s), then using the shadow framebuffer layer == == is not necessary in Xorg DDX drivers, resulting in a nice overall == == performance improvement. For example, the xf86-video-fbturbo DDX == == uses this trick. == ========================================================================== NEON LDP/STP copy (from framebuffer) : 1942.3 MB/s (0.1%) NEON LDP/STP 2-pass copy (from framebuffer) : 1654.2 MB/s NEON LD1/ST1 copy (from framebuffer) : 1941.0 MB/s NEON LD1/ST1 2-pass copy (from framebuffer) : 1661.1 MB/s ARM LDP/STP copy (from framebuffer) : 1875.6 MB/s ARM LDP/STP 2-pass copy (from framebuffer) : 1655.5 MB/s ========================================================================== == Memory latency test == == == == Average time is measured for random memory accesses in the buffers == == of different sizes. The larger is the buffer, the more significant == == are relative contributions of TLB, L1/L2 cache misses and SDRAM == == accesses. 
For extremely large buffer sizes we are expecting to see == == page table walk with several requests to SDRAM for almost every == == memory access (though 64MiB is not nearly large enough to experience == == this effect to its fullest). == == == == Note 1: All the numbers are representing extra time, which needs to == == be added to L1 cache latency. The cycle timings for L1 cache == == latency can be usually found in the processor documentation. == == Note 2: Dual random read means that we are simultaneously performing == == two independent memory accesses at a time. In the case if == == the memory subsystem can't handle multiple outstanding == == requests, dual random read has the same timings as two == == single reads performed one after another. == ========================================================================== block size : single random read / dual random read 1024 : 0.0 ns / 0.0 ns 2048 : 0.0 ns / 0.0 ns 4096 : 0.0 ns / 0.0 ns 8192 : 0.0 ns / 0.0 ns 16384 : 0.0 ns / 0.0 ns 32768 : 0.0 ns / 0.0 ns 65536 : 0.0 ns / 0.0 ns 131072 : 1.1 ns / 1.5 ns 262144 : 2.3 ns / 2.9 ns 524288 : 4.4 ns / 5.7 ns 1048576 : 10.1 ns / 13.0 ns 2097152 : 14.3 ns / 16.5 ns 4194304 : 62.3 ns / 92.7 ns 8388608 : 101.6 ns / 135.8 ns 16777216 : 97.3 ns / 139.9 ns 33554432 : 151.3 ns / 164.4 ns 67108864 : 141.1 ns / 161.7 ns
```

`sbc-bench` results: sbc-bench results
Phoronix Test Suite
Results from pi-general-benchmark.sh: