Following some guidelines of the Rust Performance Book, here are some things we can try to improve performance:

- Add `codegen-units = 1` to the release build.
- Use a faster allocator, e.g. mimalloc, which works on all operating systems.

Not so easy:

- Properly profile to identify hot parts.
- Remove clones/allocations where not needed.
- Use profile-guided optimization (e.g. via cargo-pgo).
  - Unfortunately, this currently does not work together with LTO, and the PGO version is 10-20% slower than the LTO version.
  - It might become available in `maturin` directly in the future, see here.
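
To make the allocator and "remove clones/allocations" items more concrete, here is a minimal sketch of a counting allocator built on the standard `#[global_allocator]` hook. All names in it (`CountingAllocator`, `count_allocations`, `scaled_copy`, `scale_in_place`) are hypothetical and not part of this code base. It only illustrates how one might check whether a code path allocates; it is not a replacement for a real profiler. The same hook is what a crate such as mimalloc uses to replace the default allocator.

```rust
use std::alloc::{GlobalAlloc, Layout, System};
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical helper: forwards to the system allocator and counts requests.
struct CountingAllocator;

static ALLOCATIONS: AtomicUsize = AtomicUsize::new(0);

unsafe impl GlobalAlloc for CountingAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        ALLOCATIONS.fetch_add(1, Ordering::Relaxed);
        unsafe { System.alloc(layout) }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        unsafe { System.dealloc(ptr, layout) }
    }
}

// Install the wrapper for the whole program. Swapping in another allocator
// (e.g. mimalloc) would use this same attribute with that crate's type.
#[global_allocator]
static GLOBAL: CountingAllocator = CountingAllocator;

/// Runs `f` and returns its result together with the number of allocations
/// observed meanwhile (allocations from other threads are counted as well).
fn count_allocations<T>(f: impl FnOnce() -> T) -> (T, usize) {
    let before = ALLOCATIONS.load(Ordering::Relaxed);
    let result = f();
    let after = ALLOCATIONS.load(Ordering::Relaxed);
    (result, after - before)
}

/// Allocating variant: builds and returns a new vector.
fn scaled_copy(values: &[f64], factor: f64) -> Vec<f64> {
    values.iter().map(|v| v * factor).collect()
}

/// Non-allocating variant: overwrites the caller's buffer.
fn scale_in_place(values: &mut [f64], factor: f64) {
    for v in values.iter_mut() {
        *v *= factor;
    }
}
```

Wrapping a hot function in `count_allocations` during a test gives a rough indication of whether a refactoring (for example, moving from `scaled_copy` to `scale_in_place`) actually removed an allocation.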
Quick tests with `codegen-units = 1` added to `release-lto` (see here) show benchmark performance improvements of up to 12% (about 7% on average), while for `dual_number` the changes are somewhat smaller (see below).

Proper benchmarks (across all benchmarks), compared against the current release workflow, are still needed, but this might be an easy improvement if it turns out to be faster in all cases.
Benchmark: dual_numbers
System: methane/CO2
- main: main branch + lto
- main_codegen: main branch + lto + `codegen-units = 1`
- develop, develop_codegen: same options as main and main_codegen, on the develop branch
Execution times in µs

| name | f64 | dual | dual2 | hyperdual | dual3 |
|---|---|---|---|---|---|
| main | 1.1382 | 1.2325 | 1.4539 | 1.6267 | 1.7563 |
| main_codegen | 1.0229 | 1.1741 | 1.3708 | 1.5777 | 1.6316 |
| develop | 1.0138 | 1.1989 | 1.4465 | 1.589 | 1.7549 |
| develop_codegen | 0.9761 | 1.1681 | 1.4195 | 1.5446 | 1.6304 |
Slowdown t_d / t_f64 for each branch/option

| name | f64 | dual | dual2 | hyperdual | dual3 |
|---|---|---|---|---|---|
| main | 1 | 1.08285 | 1.27737 | 1.42919 | 1.54305 |
| main_codegen | 1 | 1.14782 | 1.34011 | 1.54238 | 1.59507 |
| develop | 1 | 1.18258 | 1.42681 | 1.56737 | 1.73101 |
| develop_codegen | 1 | 1.1967 | 1.45426 | 1.58242 | 1.67032 |
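
As a sanity check on how the slowdown values relate to the execution times, the sketch below divides each timing by the `f64` timing of the same row. The function name `slowdown` and the constant `MAIN_TIMES_US` are hypothetical; the numbers are copied from the `main` row of the execution-time table.

```rust
/// Execution times in µs for the `main` row: f64, dual, dual2, hyperdual, dual3.
const MAIN_TIMES_US: [f64; 5] = [1.1382, 1.2325, 1.4539, 1.6267, 1.7563];

/// Slowdown of each number type relative to the first entry (f64),
/// i.e. t_d / t_f64 for every column of one row.
fn slowdown(times_us: &[f64]) -> Vec<f64> {
    let base = times_us[0];
    times_us.iter().map(|t| t / base).collect()
}
```

Calling `slowdown(&MAIN_TIMES_US)` gives 1 for `f64` and values above 1 for the dual number types, matching the tabulated `main` row to the printed precision.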
**Relative difference in % w.r.t. main + lto for each dual number:** `(t_d_branch - t_d_main) / t_d_main * 100`
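
The relative difference can be recomputed from the execution times with a few lines of Rust. This is only a sketch: `relative_difference_percent` and `relative_differences` are hypothetical helper names, and the two arrays are copied from the `main` and `main_codegen` rows of the timing table.

```rust
/// Execution times in µs: f64, dual, dual2, hyperdual, dual3.
const MAIN_US: [f64; 5] = [1.1382, 1.2325, 1.4539, 1.6267, 1.7563];
const MAIN_CODEGEN_US: [f64; 5] = [1.0229, 1.1741, 1.3708, 1.5777, 1.6316];

/// (t_d_branch - t_d_main) / t_d_main * 100; negative means the branch is faster.
fn relative_difference_percent(t_branch: f64, t_main: f64) -> f64 {
    (t_branch - t_main) / t_main * 100.0
}

/// Applies the formula column by column to one branch row and the main row.
fn relative_differences(branch: &[f64], main: &[f64]) -> Vec<f64> {
    branch
        .iter()
        .zip(main.iter())
        .map(|(b, m)| relative_difference_percent(*b, *m))
        .collect()
}
```

For example, `relative_differences(&MAIN_CODEGEN_US, &MAIN_US)` yields roughly -10.1% for `f64` and roughly -7.1% for `dual3`, i.e. `main_codegen` is faster than `main` in both columns.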