imrn99 opened this issue 2 weeks ago
I'm running experiments with a fork of the stm crate. I tried replacing some internal data structures with no-alloc variants (using `heapless` & `arrayvec`). There seems to be little to no change in performance, although this needs confirmation with benchmarks on more consistent machines.
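For illustration, the substitutions were of this general shape; this is a minimal sketch, not code from the actual fork, and the element type and capacity are arbitrary placeholders:

```rust
// Minimal sketch of the kind of substitution tried: a growable Vec replaced by a
// fixed-capacity, stack-allocated ArrayVec. The element type and the capacity 16
// are arbitrary choices for this example, not values from the actual fork.
use arrayvec::ArrayVec;

fn main() {
    // was (roughly): let mut log: Vec<usize> = Vec::new();
    let mut log: ArrayVec<usize, 16> = ArrayVec::new();
    for addr in 0..10usize {
        // try_push reports an error instead of reallocating when capacity runs out
        log.try_push(addr).expect("log capacity exceeded");
    }
    assert_eq!(log.len(), 10);
}
```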
There is one structure that I haven't been able to replace though: a `BTreeMap` used to keep track of accessed variables in a `Transaction`. There is no available equivalent at the moment (`heapless`'s `LinearMap` lacks one method).
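To make that limitation concrete, here is a rough sketch of the bounded replacement with `heapless::LinearMap`; the key/value types and capacity are placeholders, not stm internals, and since the missing method isn't named above, none is singled out here:

```rust
// Rough sketch, not stm internals: a fixed-capacity LinearMap standing in for
// the transaction's BTreeMap. Key/value types and the capacity 32 are placeholders.
use heapless::LinearMap;

fn main() {
    // was (roughly): let mut vars: BTreeMap<usize, u32> = BTreeMap::new();
    let mut vars: LinearMap<usize, u32, 32> = LinearMap::new();

    // Basic operations carry over directly...
    vars.insert(0xbeef, 7).expect("transaction log full");
    assert_eq!(vars.get(&0xbeef), Some(&7));

    // ...but insertion is fallible (fixed capacity) and the API surface is
    // smaller than BTreeMap's, so it is not a drop-in replacement.
}
```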
Essentially, `TVar` inits and operations take much more time (several orders of magnitude for inits) than atomics (even `Atomic<T>` from the `atomic` crate). This raises multiple questions:
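For reference, the kind of comparison being made is roughly the following. This is a sketch assuming the public APIs of the `stm` and `atomic` crates, not the project's `builder` benchmark, and the timing is purely illustrative:

```rust
// Illustrative comparison only (not the project's benchmark): a TVar must be
// allocated and every access goes through a transaction log, while Atomic<T>
// compiles down to plain atomic instructions.
use std::time::Instant;

use atomic::{Atomic, Ordering};
use stm::{atomically, TVar};

fn main() {
    let t = Instant::now();
    let tvar = TVar::new(0u32);
    atomically(|trans| {
        let v = tvar.read(trans)?;
        tvar.write(trans, v + 1)
    });
    println!("TVar init + transactional increment: {:?}", t.elapsed());

    let t = Instant::now();
    let a = Atomic::new(0u32);
    a.store(a.load(Ordering::Relaxed) + 1, Ordering::Relaxed);
    println!("Atomic<u32> init + load/store increment: {:?}", t.elapsed());
}
```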
Perf report of the sequential impl, benching the `builder` binary:
Perf report of the parallel impl:
It seems like the stm variables require that many more cycles to run; for example, the share of cycles spent computing orbits drops from about 10% to less than 1%.
`perf stat -d` of the `builder` binary, sequential impl:
```
 Performance counter stats for './target/profiling/builder 1024 1024':

            137,67 msec task-clock                #    0,997 CPUs utilized
                 2      context-switches          #   14,527 /sec
                 1      cpu-migrations            #    7,264 /sec
            36 951      page-faults               #  268,402 K/sec
       742 525 510      cycles                    #    5,394 GHz                        (70,98%)
       151 260 278      stalled-cycles-frontend   #   20,37% frontend cycles idle       (71,01%)
     2 331 338 369      instructions              #    3,14  insn per cycle
                                                  #    0,06  stalled cycles per insn    (70,91%)
       418 462 962      branches                  #    3,040 G/sec                      (71,55%)
         5 624 721      branch-misses             #    1,34% of all branches            (72,30%)
       699 688 406      L1-dcache-loads           #    5,082 G/sec                      (72,01%)
         9 674 762      L1-dcache-load-misses     #    1,38% of all L1-dcache accesses  (71,24%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses

       0,138110185 seconds time elapsed

       0,085484000 seconds user
       0,051671000 seconds sys
```
Parallel impl:

```
 Performance counter stats for 'taskset -c 0 ./target/profiling/builder 1024 1024':

          2 723,26 msec task-clock                #    0,998 CPUs utilized
               137      context-switches          #   50,307 /sec
                 1      cpu-migrations            #    0,367 /sec
           712 863      page-faults               #  261,768 K/sec
    14 732 320 056      cycles                    #    5,410 GHz                        (71,36%)
     2 820 127 554      stalled-cycles-frontend   #   19,14% frontend cycles idle       (71,43%)
    44 049 373 349      instructions              #    2,99  insn per cycle
                                                  #    0,06  stalled cycles per insn    (71,43%)
     8 484 034 434      branches                  #    3,115 G/sec                      (71,46%)
       110 288 231      branch-misses             #    1,30% of all branches            (71,49%)
    13 633 379 729      L1-dcache-loads           #    5,006 G/sec                      (71,45%)
       227 469 230      L1-dcache-load-misses     #    1,67% of all L1-dcache accesses  (71,39%)
   <not supported>      LLC-loads
   <not supported>      LLC-load-misses

       2,728805716 seconds time elapsed

       1,909099000 seconds user
       0,798234000 seconds sys
```
I noticed that a huge performance regression was introduced by the changes of #201.
Observed changes
While we can expect some loss of speed due to synchronization overhead, the observed slowdown is absurdly large:
(benchmark plots: builder-time and grisubal-time)
Cause
I haven't fully identified the cause(s) yet, but my first impression is that the initialization of `TVar`s is very costly; see the snippet below, taken from the stm crate's source code: