Closed geerlingguy closed 4 months ago
First step for bringup was getting the boards working. I plugged them into my original Turing Pi 2 prototype board... which is a bit worse for the wear :)
I couldn't get them to boot, and that old board didn't have PWM fan headers either, so I popped them into my kickstarter "v2.4" board, and they booted and had fans running full blast—but I was told I should upgrade to the 2.0.0 (currently RC1) firmware to get full functionality.
So I initially tried updating over the BMC Web UI, and found that later versions only ship an .img file, not an .swu file, so you have to use one of the more 'advanced' BMC update methods.
I initially tried using PhoenixSuit on my Windows PC, but kept running into all kinds of permissions issues (would not like to work in an ecosystem that requires disabling so many security features, heh...). Then I was told I could just flash the .img to microSD, boot off it, and get it upgraded that way.
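The microSD route boils down to decompressing the image and writing it raw to the card. A minimal sketch, assuming a downloaded file named `fw.img.xz` (the real filename differs); it writes to a temp file here for illustration, where on real hardware you would target the card's device node instead:

```shell
# Stand-in for the downloaded firmware image (assumed name fw.img.xz):
printf 'firmware-image-bytes' | xz > /tmp/fw.img.xz

# Decompress and write raw. On real hardware, replace /tmp/sdcard.img with
# the microSD device node (e.g. /dev/sdX, found via `lsblk`) and use sudo.
xz -dc /tmp/fw.img.xz | dd of=/tmp/sdcard.img bs=4M conv=fsync status=none

cat /tmp/sdcard.img   # → firmware-image-bytes
```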
So it was as easy as: type `CONFIRM` over UART and press Enter, and the firmware will upgrade. After removing the microSD card and rebooting the board, it shows:
```
 _____ _   _ ____  ___ _   _  ____
|_   _| | | |  _ \|_ _| \ | |/ ___|
  | | | | | | |_) || ||  \| | |  _
  | | | |_| |  _ < | || |\  | |_| |
  |_|  \___/|_| \_\___|_| \_|\____|

Welcome to Turing Pi

turingpi login: 2023-11-02T22:04:00.464Z INFO [bmcd] Turing Pi 2 BMC Daemon v2.0.0
```
Yay!
I also attempted to install the RK1 into a Jetson development board. The product page states:
The Turing RK1 compute module is compatible with Nvidia Jetson pin layout. This means you can plug it into any Nvidia Jetson carrier board.
Unfortunately, the tiny retention pins on the slot of my carrier board seemed to smash into two tiny components on either side of the RK1, making it close to impossible to reliably insert the board (without bending some of the metal tabs...).
The Jetson Nano has two tiny holes in the PCB in those areas to accept the little underside retention tabs, the RK1 does not. I'll upload a picture of the problem if I can remember :)
So back to Turing Pi 2 we go!
It looks like @Joshua-Riek maintains a bunch of images for Rockchip RK3588 devices, and supports RK1 in that repo: https://github.com/Joshua-Riek/ubuntu-rockchip/releases/tag/v1.29
I've downloaded ubuntu-22.04.3-preinstalled-server-arm64-turing-rk1.img.xz
and am flashing it to node 1 via the BMC web UI... fingers crossed!
In the BMC logs:
2023-11-02T22:20:05.360Z INFO [bmcd::persistency::app_persistency] commiting persistency to disk
2023-11-02T22:30:25.600Z INFO [bmcd::api::streaming_data_service] #1779206829 'node 1 upgrade service' 6.58 GB - started
2023-11-02T22:30:25.607Z INFO [bmcd::app::bmc_application] Powering off node Node1...
2023-11-02T22:30:26.219Z INFO [bmcd::app::bmc_application] Prerequisite settings toggled, powering on...
2023-11-02T22:30:27.441Z INFO [bmcd::app::bmc_application] Checking for presence of a USB device...
2023-11-02T22:30:27.448Z INFO [bmcd::firmware_update::rockusb_fwudate] Maskrom mode detected. loading usb-plug..
2023-11-02T22:30:28.888Z INFO [bmcd::app::firmware_runner] started writing to node 1
...
After all that, I got:
2023-11-02T22:30:28.888Z INFO [bmcd::app::firmware_runner] started writing to node 1
2023-11-02T22:35:26.221Z INFO [bmcd::persistency::app_persistency] commiting persistency to disk
2023-11-02T23:31:21.412Z INFO [bmcd::app::firmware_runner] Verifying checksum of written data to node 1
2023-11-02T23:46:51.843Z INFO [bmcd::app::firmware_runner] Flashing node 1 successful, restarting device...
2023-11-02T23:46:52.269Z INFO [bmcd::api::streaming_data_service] worker done. took 1h 16m 26s 669ms 421us 43ns (#1779206829)
thread 'main' panicked at 'source slice length (0) does not match destination slice length (16384)', src/utils/ring_buf.rs:22:33
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
And the web UI showed: Upgrade failed: [object Object]
Trying with an updated version from https://github.com/Joshua-Riek/ubuntu-rockchip/actions/runs/6740498946
After the flash, the BMC UI still showed `Upgrade failed: [object Object]`, but over in the UART console:
2023-11-03T13:54:23.395Z INFO [bmcd::app::firmware_runner] started writing to node 1
2023-11-03T14:55:39.577Z INFO [bmcd::app::firmware_runner] Verifying checksum of written data to node 1
2023-11-03T15:11:02.460Z INFO [bmcd::app::firmware_runner] Flashing node 1 successful, restarting device...
2023-11-03T15:11:02.949Z INFO [bmcd::api::streaming_data_service] worker done. took 1h 16m 42s 748ms 472us 61ns (#3864914314)
...
...it looks like the RK1 is booted, but it's not giving me any SSH access. Default user/pass should be `ubuntu` (both).
Rebooted again, and now I've found it on the network, logged in and running some tests :)
Once the Ubuntu image is loaded, the PWM fan ramps down at idle, which is nice. It only ramps up to full speed under sustained all-core load.
The little orange Ethernet LED also works now—with the factory buildroot test image, only the green ACT LED would be blinking for Ethernet.
I'm going to flash the other two now.
(The fans are blissfully quiet now that they're all on PWM.)
It might need a moment for cloud-init to do its thing. But the default credentials are ubuntu and ubuntu. You can connect to the node through UART on the BMC if ssh does not want to work.
On the BMC, use `microcom` to connect: `microcom -s 115200 /dev/ttySX`
Slot number to devnode:

| Slot | Devnode |
|---|---|
| Slot 1 | /dev/ttyS2 |
| Slot 2 | /dev/ttyS1 |
| Slot 3 | /dev/ttyS4 |
| Slot 4 | /dev/ttyS5 |
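For scripting against the BMC, the slot-to-devnode table above can be wrapped in a small (hypothetical) shell helper:

```shell
# Hypothetical helper based on the slot-to-devnode table above.
slot_to_tty() {
  case "$1" in
    1) echo /dev/ttyS2 ;;
    2) echo /dev/ttyS1 ;;
    3) echo /dev/ttyS4 ;;
    4) echo /dev/ttyS5 ;;
    *) echo "unknown slot: $1" >&2; return 1 ;;
  esac
}

# e.g. attach to slot 3's console:
#   microcom -s 115200 "$(slot_to_tty 3)"
slot_to_tty 3   # → /dev/ttyS4
```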
That's a nice flashing experience, better than I was expecting having worked with the CM4. So you insert an SD card with an image, then use the BMC over HTTP to flash it to a node's eMMC? Is the serial console over USB a must, or would the HTTP BMC work on its own?
@Joshua-Riek Your images look like a great community resource. Those build times look quite long, though. Are you using QEMU for them, or how do you build them? I ask because we've been building on bare-metal aarch64 servers and seeing huge improvements in build time.
I cross-compile the bootloader and kernel, but then the rootfs is built with QEMU. I have an army of RK3588 dev boards that would be perfect to set up as GitHub runners to speed things up, but lack the space right now to do so. I need to optimize the build process and figure out how best to proceed. I never thought I would need this much computational power when the project started, but it should only grow from here.
All my benchmarks are a few percentage points faster than Rock 5 B, which is interesting.
According to the submitted sbc-bench result, the DMC memory governor was set to `performance` (no idea whether that's @Joshua-Riek's intention or caused by running `sbc-bench -r` prior, which adjusts the DMC memory governor to 'always top speed'). So if you ran Geekbench after `sbc-bench -r` without any reboot in between, Geekbench was executed with DRAM clocked at 2112 MHz the whole time.

When you benchmarked your Rock 5B, most probably the software shipped back then defaulted to the `dmc_ondemand` memory governor, adjusting memory clock dynamically between 528 MHz and 2112 MHz based on CPU utilization. Actual benchmark scores then depended on the `up_threshold` value, see at the very bottom of:
The difference between clocking DRAM at either 528 MHz or 2112 MHz on RK3588 depends on the benchmark in question. Some (like the infamous `sysbench`, or older benchmarks in general where the whole data fits inside modern CPUs' caches) are not affected at all, but for example with Geekbench 6 it looks like this (the individual tests are the interesting ones, since they show which tests depend on memory latency/bandwidth and which rather don't):
https://browser.geekbench.com/v6/cpu/compare/3337274?baseline=3336736
There is a system script to set the CPU / GPU governor to `performance` on boot, to address some performance and micro-stutter issues for desktop users. But if the DMC memory governor is set to `performance` by default, that is not intentional.
> There is a system script to set the CPU / GPU governor to `performance` on boot
That's important to know for @geerlingguy since measuring idle performance is negatively affected by that.
With Jeff's use cases for the RK1 in mind, I would switch back to `ondemand` if I were him and use `/etc/sysfs.d/tk-optimize-rk3588.conf` from the aforementioned link to get both low idle consumption as well as high overall performance, since with RK's defaults (no `io_is_busy` and especially `up_threshold = 40`) the CPU clockspeeds won't ramp up quickly enough for mixed workloads (and also a lot of benchmarks).
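As a sketch of what such a sysfs tuning file might contain (paths are relative to `/sys`; the exact devfreq path and values here are assumptions from typical Rockchip kernels, and the `tk-optimize-rk3588.conf` from the linked thread is the authoritative version):

```
# Hypothetical /etc/sysfs.d/ fragment; paths relative to /sys, values illustrative.
class/devfreq/dmc/governor = dmc_ondemand
devices/system/cpu/cpufreq/ondemand/io_is_busy = 1
devices/system/cpu/cpufreq/ondemand/up_threshold = 25
```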
@geerlingguy re this comment:
https://github.com/geerlingguy/sbc-reviews/issues/25#issuecomment-1791634464
I flashed the same firmware from the link you gave to a microSD card, inserted it on the rear, booted and watched the serial console. It just boots normally without the prompt you saw. Did you do anything else like short any headers?
The docs also mention SSH being available, but this gives `Permission denied, please try again.`
It looks like the modules are complete! See: https://twitter.com/turingpi/status/1746937693110403303
I checked on my order from August 15 (for 4x 8 GB models plus heatsinks) and so far it still says 'Processing'.
Mine turned up last night. Just flashing them now!
@AnEvilPenguin - how goes testing? My shipment should arrive today! Hoping to re-test everything on a production unit and see how it goes.
Installation and flashing went smoothly (though I had to reboot each module myself - docs imply that they should handle it themselves).
Heatsinks are nice and quiet (especially compared to some of the Rpi PoE modules I've tried!). Detected all of my nvme drives with no issues (I've had issues there in the past).
Done a basic k3s setup and just need to figure out where I want to take it next. I've not really pushed them that hard yet, but so far they feel pretty snappy compared to my Rpi 4bs and Jetson Nano.
I can say that I'm pretty pleased with them! Overall a good experience so far! I look forward to hearing your thoughts as well!
Got two 16 GB modules yesterday, looks like they decreased eMMC from 64 GB to 32 GB.
All RK1 modules contain a 32GB flash. I checked the photo from the first post just to make sure about this specific unit, and the flash chip is a 32GB model. :)
- Daniel (Turing Machines)
@geerlingguy Could we correct the flash size in your post? The flash size is 32GB. There are 2 places that mention 64GB - the basic information section and the benchmark results section. Thank you! :)
- Daniel (Turing Machines)
@daniel-kukiela - Done! I hope to re-test with my new production copies this month!
Thank you!
As for the disk speeds, for anyone wondering, the maximum speed should be roughly twice what the benchmark results show right now - there was a software bug that caused the flash chip to not be configured correctly, but recent OS versions have this fixed.
I got a shipping notice for my couple of 32GB RAM ones today.
Following up with some testing of the 32GB module (since the 16GB ones I've been testing are earlier revision, and I would like to base my final review on the 32GB ones): https://github.com/geerlingguy/sbc-reviews/issues/38
Do you still consider updating the disk benchmark results here (or removing them)? The firmware version you used there was a pre-release and was not meant to be used for this purpose. The issue with the storage speed has been fixed with the first firmware release and these numbers here might suggest lower performance of the storage in 16GB modules. :)
@geerlingguy Just in case you are interested in the H264/H265 hardware decoding/encoding performance on RK3588. Try this ffmpeg-rockchip, which has RKMPP & RKRGA support, and it also has a Wiki page. Based on it, Jellyfin recently added hardware transcoding support for RK3588.
@daniel-kukiela - Do you have a good guide for upgrading the firmware on the 16GB boards? Or is it just `apt upgrade`? I'll re-test tomorrow.
@Joshua-Riek I believe `apt upgrade` should work in this case, right?
Yes, that is correct.
Thank you @Joshua-Riek .
So, yes @geerlingguy, `apt upgrade`. Thank you!
@daniel-kukiela - Updated the results for eMMC, looks like it is quite improved, thanks :)
Thank you! :)
NOTE: These benchmarks are preliminary (run on an early preproduction batch of the board)—I will be re-testing with my production boards soon!
Basic information
Linux/system information
Benchmark results
CPU
Power
(Currently can't measure individual node power consumption.)
- Maximum simulated power draw (`stress-ng --matrix 0`): TODO W
- top500 HPL benchmark: TODO W

Disk

Built-in eMMC (32GB)
curl https://raw.githubusercontent.com/geerlingguy/pi-cluster/master/benchmarks/disk-benchmark.sh | sudo bash
Run the benchmark on any attached storage device (e.g. eMMC, microSD, NVMe, SATA) and add results under an additional heading. Download the script with `curl -o disk-benchmark.sh [URL_HERE]` and run `sudo DEVICE_UNDER_TEST=/dev/sda DEVICE_MOUNT_PATH=/mnt/sda1 ./disk-benchmark.sh` (assuming the device is `sda`). Also consider running the PiBenchmarks.com script.
Network
`iperf3` results:

- `iperf3 -c $SERVER_IP`: 943 Mbps
- `iperf3 --reverse -c $SERVER_IP`: 927 Mbps
- `iperf3 --bidir -c $SERVER_IP`: 939 Mbps up, 236 Mbps down

(Be sure to test all interfaces, noting any that are non-functional.)
GPU
Memory
`tinymembench` results:

Click to expand memory benchmark result
``` tinymembench v0.4.10 (simple benchmark for memory throughput and latency) ========================================================================== == Memory bandwidth tests == == == == Note 1: 1MB = 1000000 bytes == == Note 2: Results for 'copy' tests show how many bytes can be == == copied per second (adding together read and writen == == bytes would have provided twice higher numbers) == == Note 3: 2-pass copy means that we are using a small temporary buffer == == to first fetch data into it, and only then write it to the == == destination (source -> L1 cache, L1 cache -> destination) == == Note 4: If sample standard deviation exceeds 0.1%, it is shown in == == brackets == ========================================================================== C copy backwards : 11768.5 MB/s (21.3%) C copy backwards (32 byte blocks) : 11745.9 MB/s C copy backwards (64 byte blocks) : 11746.9 MB/s C copy : 12041.0 MB/s C copy prefetched (32 bytes step) : 12222.2 MB/s C copy prefetched (64 bytes step) : 12252.1 MB/s C 2-pass copy : 5175.5 MB/s (0.5%) C 2-pass copy prefetched (32 bytes step) : 7927.3 MB/s C 2-pass copy prefetched (64 bytes step) : 8370.9 MB/s C fill : 30185.7 MB/s (0.4%) C fill (shuffle within 16 byte blocks) : 30183.5 MB/s C fill (shuffle within 32 byte blocks) : 30183.9 MB/s C fill (shuffle within 64 byte blocks) : 30183.6 MB/s NEON 64x2 COPY : 12238.2 MB/s NEON 64x2x4 COPY : 12171.9 MB/s NEON 64x1x4_x2 COPY : 5605.7 MB/s NEON 64x2 COPY prefetch x2 : 11016.5 MB/s NEON 64x2x4 COPY prefetch x1 : 11359.5 MB/s NEON 64x2 COPY prefetch x1 : 11104.4 MB/s NEON 64x2x4 COPY prefetch x1 : 11359.0 MB/s --- standard memcpy : 12208.5 MB/s standard memset : 30184.2 MB/s (0.6%) --- NEON LDP/STP copy : 12232.0 MB/s NEON LDP/STP copy pldl2strm (32 bytes step) : 12201.4 MB/s NEON LDP/STP copy pldl2strm (64 bytes step) : 12224.4 MB/s NEON LDP/STP copy pldl1keep (32 bytes step) : 12251.8 MB/s NEON LDP/STP copy pldl1keep (64 bytes step) : 12247.8 MB/s NEON LD1/ST1 copy : 
12175.6 MB/s NEON STP fill : 30185.9 MB/s (0.6%) NEON STNP fill : 30183.7 MB/s ARM LDP/STP copy : 12215.6 MB/s ARM STP fill : 30181.5 MB/s (0.6%) ARM STNP fill : 30182.1 MB/s ========================================================================== == Framebuffer read tests. == == == == Many ARM devices use a part of the system memory as the framebuffer, == == typically mapped as uncached but with write-combining enabled. == == Writes to such framebuffers are quite fast, but reads are much == == slower and very sensitive to the alignment and the selection of == == CPU instructions which are used for accessing memory. == == == == Many x86 systems allocate the framebuffer in the GPU memory, == == accessible for the CPU via a relatively slow PCI-E bus. Moreover, == == PCI-E is asymmetric and handles reads a lot worse than writes. == == == == If uncached framebuffer reads are reasonably fast (at least 100 MB/s == == or preferably >300 MB/s), then using the shadow framebuffer layer == == is not necessary in Xorg DDX drivers, resulting in a nice overall == == performance improvement. For example, the xf86-video-fbturbo DDX == == uses this trick. == ========================================================================== NEON LDP/STP copy (from framebuffer) : 1942.3 MB/s (0.1%) NEON LDP/STP 2-pass copy (from framebuffer) : 1654.2 MB/s NEON LD1/ST1 copy (from framebuffer) : 1941.0 MB/s NEON LD1/ST1 2-pass copy (from framebuffer) : 1661.1 MB/s ARM LDP/STP copy (from framebuffer) : 1875.6 MB/s ARM LDP/STP 2-pass copy (from framebuffer) : 1655.5 MB/s ========================================================================== == Memory latency test == == == == Average time is measured for random memory accesses in the buffers == == of different sizes. The larger is the buffer, the more significant == == are relative contributions of TLB, L1/L2 cache misses and SDRAM == == accesses. 
For extremely large buffer sizes we are expecting to see == == page table walk with several requests to SDRAM for almost every == == memory access (though 64MiB is not nearly large enough to experience == == this effect to its fullest). == == == == Note 1: All the numbers are representing extra time, which needs to == == be added to L1 cache latency. The cycle timings for L1 cache == == latency can be usually found in the processor documentation. == == Note 2: Dual random read means that we are simultaneously performing == == two independent memory accesses at a time. In the case if == == the memory subsystem can't handle multiple outstanding == == requests, dual random read has the same timings as two == == single reads performed one after another. == ========================================================================== block size : single random read / dual random read 1024 : 0.0 ns / 0.0 ns 2048 : 0.0 ns / 0.0 ns 4096 : 0.0 ns / 0.0 ns 8192 : 0.0 ns / 0.0 ns 16384 : 0.0 ns / 0.0 ns 32768 : 0.0 ns / 0.0 ns 65536 : 0.0 ns / 0.0 ns 131072 : 1.1 ns / 1.5 ns 262144 : 2.3 ns / 2.9 ns 524288 : 4.4 ns / 5.7 ns 1048576 : 10.1 ns / 13.0 ns 2097152 : 14.3 ns / 16.5 ns 4194304 : 62.3 ns / 92.7 ns 8388608 : 101.6 ns / 135.8 ns 16777216 : 97.3 ns / 139.9 ns 33554432 : 151.3 ns / 164.4 ns 67108864 : 141.1 ns / 161.7 ns
```

`sbc-bench` results: sbc-bench results
Phoronix Test Suite
Results from pi-general-benchmark.sh: