[Performance] Single evaluation results

Here is how long SymbolicRegression.jl takes to evaluate a single expression with 48 nodes, measured at each tagged release since v0.5.0:

v0.5.0   11.709 μs (58 allocations: 18.64 KiB)
v0.5.1   11.625 μs (58 allocations: 18.64 KiB)
v0.5.2   11.750 μs (58 allocations: 19.03 KiB)
v0.5.3   11.792 μs (58 allocations: 19.03 KiB)
v0.5.4   11.792 μs (58 allocations: 19.03 KiB)
v0.5.5   11.834 μs (58 allocations: 19.03 KiB)
v0.5.6   11.750 μs (58 allocations: 19.03 KiB)
v0.5.7   11.625 μs (58 allocations: 19.03 KiB)
v0.5.8   11.708 μs (58 allocations: 19.03 KiB)
v0.5.9   11.958 μs (58 allocations: 19.03 KiB)
v0.5.10   11.666 μs (58 allocations: 18.64 KiB)
v0.5.11   11.625 μs (58 allocations: 18.64 KiB)
v0.5.12   11.625 μs (58 allocations: 18.64 KiB)
v0.5.13   11.791 μs (58 allocations: 19.42 KiB)
v0.5.14   11.833 μs (58 allocations: 19.42 KiB)
v0.5.15   11.708 μs (58 allocations: 19.42 KiB)
v0.5.16   11.750 μs (58 allocations: 19.42 KiB)
v0.6.0   11.500 μs (58 allocations: 19.42 KiB)
v0.6.1   11.750 μs (58 allocations: 19.81 KiB)
v0.6.2   11.666 μs (58 allocations: 19.81 KiB)
v0.6.3   11.666 μs (58 allocations: 19.81 KiB)
v0.6.4   14.583 μs (58 allocations: 19.81 KiB)
v0.6.5   14.583 μs (58 allocations: 19.81 KiB)
v0.6.6   14.375 μs (58 allocations: 19.81 KiB)
v0.6.7   14.625 μs (58 allocations: 19.81 KiB)
v0.6.8   14.542 μs (58 allocations: 19.81 KiB)
v0.6.9   14.625 μs (58 allocations: 19.81 KiB)
v0.6.10   14.416 μs (58 allocations: 19.81 KiB)
v0.7.0   14.500 μs (58 allocations: 20.20 KiB)
v0.7.1   14.625 μs (58 allocations: 20.20 KiB)
v0.7.2   14.583 μs (58 allocations: 20.20 KiB)
v0.7.3   14.458 μs (58 allocations: 20.20 KiB)
v0.7.4   14.541 μs (58 allocations: 20.20 KiB)
v0.7.5   14.625 μs (58 allocations: 20.20 KiB)
v0.7.6   14.417 μs (58 allocations: 20.20 KiB)
v0.7.7   14.458 μs (58 allocations: 20.20 KiB)
v0.7.8   14.458 μs (58 allocations: 20.20 KiB)
v0.7.9   14.541 μs (58 allocations: 20.59 KiB)
v0.7.10   14.416 μs (58 allocations: 21.38 KiB)
v0.7.11   14.500 μs (58 allocations: 21.38 KiB)
v0.7.12   14.458 μs (58 allocations: 21.38 KiB)

As the table shows, a major performance regression occurred between v0.6.3 and v0.6.4: evaluation time rose from 11.666 μs to 14.583 μs, which is about 25% slower. The change can be seen here: https://github.com/MilesCranmer/SymbolicRegression.jl/compare/v0.6.3...v0.6.4

This was a necessary change to deal with NaNs and Infs, but I'm not sure it should impact performance that badly...

It looks like checking for NaNs/Infs inside the SIMD loop is a major obstacle for the compiler. I will test whether moving the NaN/Inf checks out of the loop recovers the lost performance.

To reproduce these measurements, run:

git tag > tags.txt

# (Manually edit tags.txt to remove all tags before v0.5.0)

# Collect data:
for x in $(cat tags.txt); do git checkout "$x" > /dev/null 2>&1 && echo -n "${x} " && julia --project=. -O3 single_eval.jl; done >> benchmark_results.txt

# Sort and parse (requires vim-stream)
cat benchmark_results.txt | grep -v HEAD | vims -l 'xf.r f.r ' | sort -k1n -k2n -k3n | vims -l 'Iv\<esc>f r.f r.' |vims -l 'fndf '
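For anyone without vim-stream, the sort step could also be done in Julia itself. The following is only a sketch: the function name `sort_benchmark_lines` is hypothetical, and it assumes every result line starts with a tag of the form `vMAJOR.MINOR.PATCH` followed by whitespace, as in the table above. Any other line (for example, a git message) is dropped.

```julia
# Sketch: sort benchmark lines such as "v0.5.10   11.666 μs (...)" by version.
# `sort_benchmark_lines` is a hypothetical helper, not part of the package.
function sort_benchmark_lines(lines)
    # Keep only lines that begin with a version tag; everything else is dropped.
    kept = filter(l -> occursin(r"^v\d+\.\d+\.\d+\s", l), lines)
    # Compare as VersionNumber so that v0.5.10 sorts after v0.5.9.
    return sort(kept; by = l -> VersionNumber(lstrip(first(split(l)), 'v')))
end

# Usage (file name taken from the collection step):
# foreach(println, sort_benchmark_lines(readlines("benchmark_results.txt")))
```

Sorting on `VersionNumber` rather than on the raw string avoids the lexicographic ordering problem where `v0.5.10` would land before `v0.5.2`.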

Okay, I sped things up. The new time is:

nodes=48   10.917 μs (58 allocations: 21.38 KiB)

which is awesome.

Changes included:

- Moving the NaN check outside of the SIMD loop
- Removing redundant NaN checks
- A faster NaN checker built with LoopVectorization.jl (thanks @oscardssmith: https://discourse.julialang.org/t/fastest-way-to-check-for-inf-or-nan-in-an-array/76954/20)
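To illustrate the first and third changes, here is a minimal sketch of the general idea. It is not the package's actual code. All function names are hypothetical, and it uses Base's `@simd` rather than LoopVectorization.jl so that it runs without extra dependencies. The whole-array check relies on the IEEE rule that `0 * v` is NaN when `v` is NaN or ±Inf, so a single reduction can detect any non-finite element.

```julia
# Hypothetical helper names for illustration; not the package's API.

# Variant A: check every element inside the loop. The early `return` puts a
# branch in the loop body, which can keep the compiler from vectorizing it.
function apply_checked_inside!(out::Vector{Float64}, op, x::Vector{Float64})
    @inbounds for i in eachindex(out, x)
        v = op(x[i])
        if !isfinite(v)
            return false
        end
        out[i] = v
    end
    return true
end

# Whole-array check: v * 0.0 is NaN if v is NaN or ±Inf and zero otherwise,
# so the accumulated sum stays finite only if every element is finite.
function all_finite(x::Vector{Float64})
    acc = 0.0
    @inbounds @simd for i in eachindex(x)
        acc += x[i] * 0.0
    end
    return isfinite(acc)
end

# Variant B: branch-free loop first, then one check over the finished output.
function apply_checked_outside!(out::Vector{Float64}, op, x::Vector{Float64})
    @inbounds @simd for i in eachindex(out, x)
        out[i] = op(x[i])
    end
    return all_finite(out)
end
```

The trade-off is that Variant B always computes the full array even when an early element is already bad, whereas Variant A stops at the first non-finite value. Timing both with `@btime` from BenchmarkTools.jl on a representative operator would show whether the hoisted check pays off on a given machine.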

I also tried using LoopVectorization's @turbo for the other expression evaluations in EvaluateEquation.jl, but it seems to slow them down compared to plain @inbounds @simd. I'm not sure why.

The full set of file changes can be viewed here: https://github.com/MilesCranmer/SymbolicRegression.jl/compare/ac19f1f..e1f1127

I experimented with turning off kernel fusion, but it turns out to be very important for speed. For future speedups, it would be good to support more complex kernel fusions, or LRU caching of outputs.
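For readers unfamiliar with the term, here is a rough sketch of what kernel fusion means in this context, using the small expression `cos(x * c)` as an assumed example. The function names are hypothetical and the package's real fused kernels are more general. The unfused version evaluates each node in its own pass and materializes an intermediate array, while the fused version evaluates both nodes in a single pass.

```julia
# Hypothetical illustration of kernel fusion for the expression cos(x * c).

# Unfused: one loop per node, with an intermediate array between them.
function eval_unfused(x::Vector{Float64}, c::Float64)
    tmp = similar(x)
    @inbounds @simd for i in eachindex(x, tmp)
        tmp[i] = x[i] * c        # inner node: x * c
    end
    out = similar(x)
    @inbounds @simd for i in eachindex(tmp, out)
        out[i] = cos(tmp[i])     # outer node: cos(...)
    end
    return out
end

# Fused: both nodes in a single loop, with a single output array.
function eval_fused(x::Vector{Float64}, c::Float64)
    out = similar(x)
    @inbounds @simd for i in eachindex(x, out)
        out[i] = cos(x[i] * c)
    end
    return out
end
```

The fused form saves one allocation and one full pass over memory per fused node, which is a plausible reason why disabling fusion hurts on expressions of this size.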
