CramBL opened 1 month ago
On Raspberry Pi 5:
flamegraph --root -o listen_fgph.svg -- sudo /usr/sbin/sshd -D -p 54321
On Desktop Pop OS 22.04:
scp 400mb.data pi@192.168.0.200:~/tmp/
Note that most of the latter half of the daemon invocation is just idle time after the transfer completes.
On Raspberry Pi 5:
flamegraph -o listen_fgph.svg -- /usr/bin/qft listen --port 54321 -o test.data
On Desktop Pop OS 22.04:
qft send ip 192.168.0.200 --port 54321 -f 400mb.data
Once more
qft listen --port 54321 -o test.data -d lz4
It seems scp exits before the transfer is complete, while qft waits for the server to OK the end of transfer, which explains some of the discrepancy.
The disk IO (SD card) is the bottleneck: the flamegraphs show most of the time being spent in generic_perform_write.
More profiling on transfers between MCUs instead of on a Desktop to optimize for disk IO both on the sender and receiver.
io_uring on the receiver

pi@rpi5:~$ time scp 94mb.json alice@192.168.0.234:~/
94mb.json
real 0m25.565s
user 0m0.503s
sys 0m0.433s
pi@rpi5:~$ time scp 94mb.json alice@192.168.0.234:~/
real 0m23.280s
user 0m0.493s
sys 0m0.421s
pi@rpi5:~$ time qft ssh 94mb.json alice@192.168.0.234:~/
real 0m17.328s
user 0m4.995s
sys 0m11.831s
pi@rpi5:~$ time qft ssh 94mb.json alice@192.168.0.234:~/
real 0m18.261s
user 0m5.234s
sys 0m12.525s
time scp 394mb.json alice@192.168.0.234:~/
394mb.json
real 1m49.440s
user 0m2.004s
sys 0m1.186s
time qft ssh 394mb.json alice@192.168.0.234:~/
real 1m16.037s
user 0m22.163s
sys 0m53.494s
time rsync 394mb.json alice@192.168.0.234:~/bla.json
real 1m49.678s
user 0m1.944s
sys 0m0.932s
hyperfine "qft ssh -q 94mb.json alice@192.168.0.234:~/" "scp -q 94mb.json alice@192.168.0.234:~/" --export-markdown bench_94mb.m
| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| `qft ssh -q 94mb.json alice@192.168.0.234:~/` | 19.172 ± 3.453 | 15.593 | 28.252 | 1.00 |
| `scp -q 94mb.json alice@192.168.0.234:~/` | 24.929 ± 2.170 | 22.629 | 28.436 | 1.30 ± 0.26 |
Not bad for qft.
time scp 94mb.json pi@192.168.0.200:~/
real 0m22.450s
user 0m7.481s
sys 0m13.186s
real 0m21.835s
user 0m7.870s
sys 0m13.151s
time qft ssh 94mb.json pi@192.168.0.200:~/
real 0m23.296s
user 0m6.028s
sys 0m15.434s
real 0m23.490s
user 0m6.976s
sys 0m14.965s
qft listen
qft ssh 94mb.json pi@192.168.0.200:~/
real 1m33.412s
user 0m26.302s
sys 1m0.735s
real 1m34.077s
user 0m25.386s
sys 1m1.792s
real 1m33.451s
user 0m32.473s
sys 0m53.082s
time scp -q 394mb.json pi@192.168.0.200:~/
real 1m37.472s
user 0m33.170s
sys 0m52.426s
real 1m34.407s
user 0m33.463s
sys 0m52.478s
real 1m34.007s
user 0m27.359s
sys 1m0.803s
On a Raspberry Pi 4B, transferring a 500 MB RAUC bundle to a Zynq-7000 (32-bit Arm) with qft v0.10.1, the results are decent:
time qft ssh test.raucb root@hostname.local.:~/
real 0m58.592s
user 0m16.036s
sys 0m42.955s
time scp test.raucb root@hostname.local.:~/
real 1m7.451s
user 0m6.639s
sys 0m4.616s
TCP_STREAM_BUFSIZE

Varying TCP_STREAM_BUFSIZE on both hosts; the suffix on the qft binary is the factor that 1024 is multiplied by to determine TCP_STREAM_BUFSIZE, e.g. qft8 has a TCP_STREAM_BUFSIZE of 8*1024 bytes.
Benchmark 1: qft1 ssh cargolock.data alice@pi.hole:~/
Time (mean ± σ): 9.895 s ± 1.213 s [User: 2.743 s, System: 6.428 s]
Range (min … max): 6.670 s … 10.814 s 10 runs
Benchmark 2: qft2 ssh cargolock.data alice@pi.hole:~/
Time (mean ± σ): 9.454 s ± 0.194 s [User: 2.612 s, System: 6.099 s]
Range (min … max): 9.180 s … 9.710 s 10 runs
Benchmark 3: qft3 ssh cargolock.data alice@pi.hole:~/
Time (mean ± σ): 9.765 s ± 0.531 s [User: 2.597 s, System: 6.071 s]
Range (min … max): 9.241 s … 10.808 s 10 runs
Benchmark 4: qft4 ssh cargolock.data alice@pi.hole:~/
Time (mean ± σ): 9.726 s ± 0.503 s [User: 2.664 s, System: 6.286 s]
Range (min … max): 9.098 s … 10.717 s 10 runs
Benchmark 5: qft8 ssh cargolock.data alice@pi.hole:~/
Time (mean ± σ): 9.333 s ± 0.277 s [User: 2.618 s, System: 5.971 s]
Range (min … max): 8.926 s … 9.859 s 10 runs
Benchmark 6: qft16 ssh cargolock.data alice@pi.hole:~/
Time (mean ± σ): 9.361 s ± 0.245 s [User: 2.583 s, System: 6.029 s]
Range (min … max): 9.050 s … 9.725 s 10 runs
Summary
qft8 ssh cargolock.data alice@pi.hole:~/ ran
1.00 ± 0.04 times faster than qft16 ssh cargolock.data alice@pi.hole:~/
1.01 ± 0.04 times faster than qft2 ssh cargolock.data alice@pi.hole:~/
1.04 ± 0.06 times faster than qft4 ssh cargolock.data alice@pi.hole:~/
1.05 ± 0.06 times faster than qft3 ssh cargolock.data alice@pi.hole:~/
1.06 ± 0.13 times faster than qft1 ssh cargolock.data alice@pi.hole:~/
RAUC bundle of 458 MB.
Not bad! This batches file reads in 10 batches of 8192 bytes each. The TCP writes are not batched, but they do use io_uring.
The first transfer is completely cold and still outperforms the second transfer. Quite a change, considering the bottleneck is definitely the RPi0's TCP and file IO. Unfortunately, that RPi does not support io_uring.
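io_uring availability depends on the kernel: it landed in Linux 5.1, the send/recv opcodes arrived around 5.6, and distributions can still compile it out or restrict it. A crude version-string probe (illustrative only, not qft code, and no substitute for an actual io_uring_setup attempt) looks like:

```rust
// Sketch: parse a `uname -r` style release string and check it against the
// kernel where io_uring send/recv opcodes became available (~5.6).
// A kernel that is new enough may still have io_uring disabled via config,
// so a real check should try io_uring_setup and fall back on ENOSYS.
fn supports_io_uring(release: &str) -> bool {
    let mut parts = release.split(|c: char| !c.is_ascii_digit());
    let major: u32 = parts.next().and_then(|s| s.parse().ok()).unwrap_or(0);
    let minor: u32 = parts.next().and_then(|s| s.parse().ok()).unwrap_or(0);
    (major, minor) >= (5, 6)
}

fn main() {
    assert!(supports_io_uring("6.5.0-1020-raspi")); // the rpi5 kernel above
    assert!(!supports_io_uring("4.19.118-v7l+")); // an older Pi kernel
    println!("ok");
}
```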
pi@rpi5:~$ time ./qft ssh /tmp/test.raucb alice@pi.hole:/tmp --io-uring
INFO Sending 1 file(s)
INFO Using io_uring 🚀
INFO Sent /tmp/test.raucb 457.13 MiB [479335834 B]
real 1m20.959s
user 0m24.134s
sys 0m57.551s
pi@rpi5:~$ uname -r
6.5.0-1020-raspi
pi@rpi5:~$ time ./qft ssh /tmp/test.raucb alice@pi.hole:/tmp
INFO Sending 1 file(s)
INFO Sent /tmp/test.raucb 457.13 MiB [479335834 B]
real 1m27.113s
user 0m25.585s
sys 1m0.767s
pi@rpi5:~$ time scp /tmp/test.raucb alice@pi.hole:/tmp
test.raucb 100% 457MB 3.8MB/s 02:00
real 2m1.584s
user 0m2.364s
sys 0m1.320s
perf stat qft send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb --io-uring
INFO Sending 1 file(s)
INFO Using io_uring 🚀
INFO Sent /tmp/test.raucb 457.13 MiB [479335834 B]
Performance counter stats for 'qft send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb --io-uring':
999.34 msec task-clock # 0.238 CPUs utilized
25722 context-switches # 25.739 K/sec
12 cpu-migrations # 12.008 /sec
275 page-faults # 275.182 /sec
1599109722 cycles # 1.600 GHz
2383469888 instructions # 1.49 insn per cycle
442937654 branches # 443.230 M/sec
1886663 branch-misses # 0.43% of all branches
4.201624551 seconds time elapsed
0.396254000 seconds user
0.650373000 seconds sys
perf stat qft send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb
INFO Sending 1 file(s)
INFO Sent /tmp/test.raucb 457.13 MiB [479335834 B]
Performance counter stats for 'qft send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb':
372.24 msec task-clock # 0.088 CPUs utilized
585 context-switches # 1.572 K/sec
3 cpu-migrations # 8.059 /sec
210 page-faults # 564.153 /sec
594685363 cycles # 1.598 GHz
789440136 instructions # 1.33 insn per cycle
110214995 branches # 296.086 M/sec
342408 branch-misses # 0.31% of all branches
4.253325202 seconds time elapsed
0.058846000 seconds user
0.313847000 seconds sys
Performance counter stats for 'qft send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb' (10 runs):
358.28 msec task-clock # 0.084 CPUs utilized ( +- 1.55% )
555 context-switches # 1.549 K/sec ( +- 0.95% )
1 cpu-migrations # 2.791 /sec ( +- 31.45% )
210 page-faults # 586.130 /sec ( +- 0.05% )
573778347 cycles # 1.601 GHz ( +- 1.60% )
789707298 instructions # 1.38 insn per cycle ( +- 0.72% )
110284347 branches # 307.814 M/sec ( +- 1.06% )
344538 branch-misses # 0.31% of all branches ( +- 5.74% )
4.2875 +- 0.0281 seconds time elapsed ( +- 0.65% )
Performance counter stats for 'qft send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb --io-uring' (10 runs):
993.93 msec task-clock # 0.232 CPUs utilized ( +- 0.44% )
25708 context-switches # 25.865 K/sec ( +- 0.33% )
12 cpu-migrations # 12.073 /sec ( +- 13.96% )
277 page-faults # 278.691 /sec ( +- 0.21% )
1588412005 cycles # 1.598 GHz ( +- 0.84% )
2393716992 instructions # 1.51 insn per cycle ( +- 0.46% )
445005094 branches # 447.722 M/sec ( +- 0.50% )
1971949 branch-misses # 0.44% of all branches ( +- 3.67% )
4.2919 +- 0.0116 seconds time elapsed ( +- 0.27% )
No CPU migrations and few context switches; it will take well-tuned non-blocking IO to beat this.
Performance counter stats for 'qft send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q' (10 runs):
342.07 msec task-clock # 0.073 CPUs utilized ( +- 0.97% )
572 context-switches # 1.672 K/sec ( +- 1.32% )
0 cpu-migrations # 0.000 /sec
229 page-faults # 669.452 /sec
544408042 cycles # 1.592 GHz ( +- 1.00% )
780091073 instructions # 1.43 insn per cycle ( +- 0.56% )
108305757 branches # 316.618 M/sec ( +- 0.83% )
335093 branch-misses # 0.31% of all branches ( +- 4.92% )
4.655 +- 0.162 seconds time elapsed ( +- 3.49% )
Lots of CPU migrations and context switches
Performance counter stats for 'qft send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring' (10 runs):
624.08 msec task-clock # 0.143 CPUs utilized ( +- 0.56% )
12212 context-switches # 19.568 K/sec ( +- 0.24% )
7 cpu-migrations # 11.217 /sec ( +- 23.18% )
294 page-faults # 471.096 /sec ( +- 0.24% )
980788675 cycles # 1.572 GHz ( +- 0.77% )
1372125800 instructions # 1.40 insn per cycle ( +- 0.59% )
234772016 branches # 376.191 M/sec ( +- 0.71% )
1082223 branch-misses # 0.46% of all branches ( +- 4.98% )
4.3620 +- 0.0959 seconds time elapsed ( +- 2.20% )
The buffer size and batch count are indicated by the binary name, following the pattern qft_<BUFSIZE>_<BATCH_COUNT>, e.g. qft_4k_32 uses a uring depth of 32 with a slice/buffer size of 4096 bytes.
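Decoding that naming scheme is mechanical; a hypothetical helper (not part of qft, purely for illustration) would be:

```rust
// Hypothetical helper: decode the benchmark binary naming scheme
// qft_<BUFSIZE>_<BATCH_COUNT>, e.g. "qft_4k_32" -> (4096, 32).
fn parse_variant(name: &str) -> Option<(usize, usize)> {
    let mut it = name.strip_prefix("qft_")?.split('_');
    let bufsize = it.next()?.strip_suffix('k')?.parse::<usize>().ok()? * 1024;
    let batch = it.next()?.parse::<usize>().ok()?;
    Some((bufsize, batch))
}

fn main() {
    assert_eq!(parse_variant("qft_4k_32"), Some((4096, 32)));
    assert_eq!(parse_variant("qft_16k_256"), Some((16384, 256)));
    assert_eq!(parse_variant("scp"), None);
    println!("ok");
}
```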
Performance counter stats for 'qft_4k_16 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring' (10 runs):
2920.23 msec task-clock # 0.687 CPUs utilized ( +- 0.09% )
234145 context-switches # 80.180 K/sec ( +- 0.06% )
228 cpu-migrations # 78.076 /sec ( +- 2.27% )
248 page-faults # 84.925 /sec ( +- 0.29% )
4857468621 cycles # 1.663 GHz ( +- 0.23% )
7379999588 instructions # 1.52 insn per cycle ( +- 0.10% )
1523850908 branches # 521.825 M/sec ( +- 0.10% )
9703408 branch-misses # 0.64% of all branches ( +- 1.03% )
4.2487 +- 0.0210 seconds time elapsed ( +- 0.49% )
Performance counter stats for 'qft_4k_32 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring' (10 runs):
2932.89 msec task-clock # 0.679 CPUs utilized ( +- 0.10% )
234405 context-switches # 79.923 K/sec ( +- 0.08% )
230 cpu-migrations # 78.421 /sec ( +- 1.14% )
268 page-faults # 91.377 /sec ( +- 0.18% )
4861247035 cycles # 1.657 GHz ( +- 0.31% )
7362958065 instructions # 1.51 insn per cycle ( +- 0.16% )
1520494682 branches # 518.429 M/sec ( +- 0.15% )
9641872 branch-misses # 0.63% of all branches ( +- 1.05% )
4.3167 +- 0.0373 seconds time elapsed ( +- 0.87% )
Performance counter stats for 'qft_4k_64 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring' (10 runs):
2948.66 msec task-clock # 0.678 CPUs utilized ( +- 0.08% )
234115 context-switches # 79.397 K/sec ( +- 0.06% )
236 cpu-migrations # 80.036 /sec ( +- 1.56% )
304 page-faults # 103.098 /sec ( +- 0.23% )
4922522759 cycles # 1.669 GHz ( +- 0.27% )
7392280292 instructions # 1.50 insn per cycle ( +- 0.12% )
1526417610 branches # 517.664 M/sec ( +- 0.11% )
10003319 branch-misses # 0.66% of all branches ( +- 0.83% )
4.3501 +- 0.0393 seconds time elapsed ( +- 0.90% )
Performance counter stats for 'qft_4k_64 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring' (10 runs):
2944.85 msec task-clock # 0.687 CPUs utilized ( +- 0.11% )
234216 context-switches # 79.534 K/sec ( +- 0.06% )
232 cpu-migrations # 78.781 /sec ( +- 1.29% )
304 page-faults # 103.231 /sec ( +- 0.11% )
4909014114 cycles # 1.667 GHz ( +- 0.27% )
7384111146 instructions # 1.50 insn per cycle ( +- 0.10% )
1524792489 branches # 517.782 M/sec ( +- 0.10% )
9898644 branch-misses # 0.65% of all branches ( +- 0.94% )
4.2884 +- 0.0288 seconds time elapsed ( +- 0.67% )
Performance counter stats for 'qft_4k_128 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring' (10 runs):
2976.79 msec task-clock # 0.691 CPUs utilized ( +- 0.08% )
234439 context-switches # 78.756 K/sec ( +- 0.07% )
244 cpu-migrations # 81.967 /sec ( +- 2.17% )
379 page-faults # 127.318 /sec ( +- 0.08% )
4969488126 cycles # 1.669 GHz ( +- 0.48% )
7425788983 instructions # 1.49 insn per cycle ( +- 0.33% )
1533401150 branches # 515.119 M/sec ( +- 0.32% )
10298570 branch-misses # 0.67% of all branches ( +- 1.20% )
4.3066 +- 0.0350 seconds time elapsed ( +- 0.81% )
Performance counter stats for 'qft_4k_256 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring' (10 runs):
3041.61 msec task-clock # 0.712 CPUs utilized ( +- 0.30% )
234150 context-switches # 76.982 K/sec ( +- 0.07% )
230 cpu-migrations # 75.618 /sec ( +- 2.24% )
524 page-faults # 172.277 /sec ( +- 0.11% )
5174027620 cycles # 1.701 GHz ( +- 0.60% )
7419969792 instructions # 1.43 insn per cycle ( +- 0.14% )
1532175650 branches # 503.739 M/sec ( +- 0.14% )
10165183 branch-misses # 0.66% of all branches ( +- 1.16% )
4.2696 +- 0.0345 seconds time elapsed ( +- 0.81% )
With this io_uring implementation and a 458 MB RAUC bundle:
use std::collections::VecDeque;
use std::fs::File;
use std::io;
use std::net::TcpStream;
use std::path::Path;

use rio::Rio;

pub fn incremental_rw_io_uring_batch_alt2<const BUFSIZE: usize, const BATCH_COUNT: usize>(
socket: &TcpStream,
file_path: &Path,
) -> anyhow::Result<u64> {
let file = File::open(file_path)?;
let buffers = [[0u8; BUFSIZE]; BATCH_COUNT];
let mut total_read: u64 = 0;
let mut total_write: u64 = 0;
let config = rio::Config {
depth: BATCH_COUNT,
..Default::default()
};
let uring = config.start()?;
let mut read_futures = VecDeque::with_capacity(BATCH_COUNT);
// Initialize read futures
for batch_idx in 0..BATCH_COUNT {
let file_offset = batch_idx as u64 * BUFSIZE as u64;
let read_future = uring.read_at(&file, &buffers[batch_idx], file_offset);
read_futures.push_back((batch_idx, read_future));
}
while let Some((idx, read_future)) = read_futures.pop_front() {
let bytes_read = read_future.wait()?;
if bytes_read == 0 {
break; // EOF
}
total_read += bytes_read as u64;
let buf = &buffers[idx][..bytes_read];
total_write += write_at_socket(&uring, socket, buf, bytes_read)? as u64;
debug_assert_eq!(total_read, total_write);
// Re-issue the read for the same buffer
let file_offset = total_read + ((BUFSIZE as u64 * BATCH_COUNT as u64) - BUFSIZE as u64);
read_futures.push_back((idx, uring.read_at(&file, &buffers[idx], file_offset)));
}
assert_eq!(
total_read, total_write,
"Total bytes read does not equal total bytes written"
);
Ok(total_read)
}
#[inline(always)]
fn write_at_socket(
uring: &Rio,
socket: &TcpStream,
buf: &[u8],
write_count: usize,
) -> std::io::Result<usize> {
let mut total_written = 0;
while total_written < write_count {
let written_bytes = uring.send(socket, &buf[total_written..]).wait()?;
if written_bytes == 0 {
return Err(io::Error::new(
io::ErrorKind::WriteZero,
"Failed writing any bytes to the socket",
));
}
total_written += written_bytes;
}
Ok(total_written)
}
Same as above with the 94 MB json.
A couple of medium-sized files:
| Command | Mean [s] | Min [s] | Max [s] | Relative |
|---|---|---|---|---|
| `qft send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q` | 4.215 ± 0.035 | 4.183 | 4.289 | 1.00 ± 0.01 |
| `qft_2k_16 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring` | 4.339 ± 0.138 | 4.191 | 4.634 | 1.03 ± 0.03 |
| `qft_2k_128 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring` | 4.410 ± 0.117 | 4.194 | 4.597 | 1.05 ± 0.03 |
| `qft_4k_16 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring` | 4.300 ± 0.152 | 4.189 | 4.657 | 1.02 ± 0.04 |
| `qft_4k_32 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring` | 4.228 ± 0.050 | 4.181 | 4.300 | 1.01 ± 0.01 |
| `qft_4k_64 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring` | 4.199 ± 0.015 | 4.186 | 4.240 | 1.00 |
| `qft_4k_128 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring` | 4.229 ± 0.085 | 4.176 | 4.451 | 1.01 ± 0.02 |
| `qft_4k_256 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring` | 4.224 ± 0.053 | 4.173 | 4.312 | 1.01 ± 0.01 |
| `qft_8k_16 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring` | 4.255 ± 0.065 | 4.190 | 4.385 | 1.01 ± 0.02 |
| `qft_8k_32 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring` | 4.215 ± 0.038 | 4.187 | 4.288 | 1.00 ± 0.01 |
| `qft_8k_64 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring` | 4.214 ± 0.046 | 4.172 | 4.288 | 1.00 ± 0.01 |
| `qft_8k_128 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring` | 4.288 ± 0.090 | 4.190 | 4.453 | 1.02 ± 0.02 |
| `qft_8k_256 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring` | 4.270 ± 0.139 | 4.181 | 4.549 | 1.02 ± 0.03 |
| `qft_16k_8 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring` | 4.231 ± 0.059 | 4.183 | 4.358 | 1.01 ± 0.01 |
| `qft_16k_16 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring` | 4.246 ± 0.131 | 4.171 | 4.606 | 1.01 ± 0.03 |
| `qft_16k_32 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring` | 4.221 ± 0.043 | 4.189 | 4.286 | 1.01 ± 0.01 |
| `qft_16k_64 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring` | 4.237 ± 0.103 | 4.187 | 4.523 | 1.01 ± 0.02 |
| `qft_16k_128 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring` | 4.277 ± 0.141 | 4.177 | 4.619 | 1.02 ± 0.03 |
| `qft_16k_256 send ip 192.168.0.181 --port 54321 -f /tmp/test.raucb -q --io-uring` | 4.242 ± 0.062 | 4.188 | 4.371 | 1.01 ± 0.02 |
Experienced while transferring from a Raspberry Pi 3B to a Zynq-7000 (32-bit Arm). Especially bad on a slow network, where scp's performance also degraded significantly.