cortner closed this issue 5 years ago
I think two things are happening.
I pushed a branch inlining where I added a function inline_evaluate in which everything gets inlined.
The results are
@btime f3($r_3) # 1.871 ns (0 allocations: 0 bytes)
@btime StaticPolynomials.evaluate($P3, $r_3) # 15.597 ns (0 allocations: 0 bytes)
@btime StaticPolynomials.inline_evaluate($P3, $r_3) # 11.098 ns (0 allocations: 0 bytes)
@btime f4($r_4) # 8.719 ns
@btime StaticPolynomials.evaluate($P4, $r_4) # 24.174 ns (0 allocations: 0 bytes)
@btime StaticPolynomials.inline_evaluate($P4, $r_4) # 14.211 ns (0 allocations: 0 bytes)
So the overhead drops significantly. My guess is that the remaining overhead comes from the multiplications with the coefficients and from fetching the heap-allocated coefficient array.
Thank you. Have you tried using Calculus.simplify?
I do not see where this would be useful, could you elaborate?
using Calculus
ex = :( 1 * x^1 * y^2 )
Calculus.simplify(ex)
# :(x * y ^ 2)
Testing the derivatives now as well: the good news is that your derivatives are much faster than ForwardDiff (though I'm not sure why ForwardDiff feels the need to allocate here!), but the overhead is even bigger than before.
@btime Df3($r_3) # 3.701 ns
@btime StaticPolynomials.jacobian($P3, $r_3) # 26.666 ns
@btime ForwardDiff.jacobian($f3, $r_3, $cfg3) # 80.462 ns
@btime Df4($r_4) # 12.988 ns
@btime StaticPolynomials.jacobian($P4, $r_4) # 51.952 ns
@btime ForwardDiff.jacobian($f4, $r_4, $cfg4) # 199.193 ns
As I said above I am running these tests because (a) I really need the best performance I can get and (b) it feels that with code generation it should in principle be possible to match a "hand-coded" implementation.
I pushed some changes to the inlining branch which enable inlining for the Jacobian as well and reduce that overhead quite a bit. But for large problems the performance even seems to get worse, so this cannot be applied without some care.
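For context, a minimal sketch of what inlining buys in Julia (the names below are illustrative, not the package's internals): the @inline hint lets the compiler splice a small helper into its caller, removing per-call overhead in hot loops.

```julia
# Illustrative only: @inline asks the compiler to splice the small
# helper into the caller, avoiding function-call overhead.
@inline small_step(x) = muladd(2.0, x, 1.0)   # computes 2x + 1

function accumulate_steps(xs)
    s = 0.0
    @inbounds for x in xs
        s += small_step(x)   # call is inlined into the loop body
    end
    return s
end

accumulate_steps([1.0, 2.0, 3.0])  # → 15.0
```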
(b) it feels that with code generation it should in principle be possible to match a "hand-coded" implementation.
I agree there, but one reason for the performance difference is that we currently do not generate the same source code as you wrote, since the code is optimized for general polynomials. Tricks like Horner's method (and an extension of it for the multivariate case) are applied, which in general produce faster code, but for your special polynomials (no variable has an exponent higher than 1) this does not seem to be the case.
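For readers unfamiliar with the trick: Horner's method evaluates a univariate polynomial with one multiply-add per coefficient instead of computing explicit powers. A minimal sketch (the multivariate extension used by the package is more involved):

```julia
# Horner's method: evaluates c[1] + c[2]*x + ... + c[n]*x^(n-1)
# with n-1 fused multiply-adds and no explicit powers of x.
function horner(coeffs, x)
    acc = coeffs[end]
    for i in length(coeffs)-1:-1:1
        acc = muladd(acc, x, coeffs[i])
    end
    return acc
end

horner([1.0, 2.0, 3.0], 2.0)  # 1 + 2*2 + 3*4 = 17.0
```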
Similarly, your specific polynomials only have coefficient one. That is a fact which can be nicely exploited in your application, but polynomials in general have non-unit coefficients, and dealing with this special case just didn't seem worth it to me at the moment. I chose to optimize for the general case, and obviously there will always be specific examples where this is not the best choice. But as a user you get something quite fast with minimal effort.
Ok, fair enough. For those small polynomials I can do everything by hand. Here is one component of a system that shows a bit better what I am interested in:
p10 = Polynomial(
48*x1^3 + 72*x1^2*x2 + 72*x1^2*x3 + 72*x1^2*x4 + 72*x1^2*x5 + 72*x1^2*x7 +
72*x1^2*x8 + 72*x1*x2^2 + 144*x1*x2*x4 + 144*x1*x2*x7 + 72*x1*x3^2 +
144*x1*x3*x5 + 144*x1*x3*x8 + 72*x1*x4^2 + 144*x1*x4*x7 + 72*x1*x5^2 +
144*x1*x5*x8 + 72*x1*x7^2 + 72*x1*x8^2 + 48*x2^3 + 72*x2^2*x3 +
72*x2^2*x4 + 72*x2^2*x6 + 72*x2^2*x7 + 72*x2^2*x9 + 72*x2*x3^2 +
144*x2*x3*x6 + 144*x2*x3*x9 + 72*x2*x4^2 + 144*x2*x4*x7 + 72*x2*x6^2 +
144*x2*x6*x9 + 72*x2*x7^2 + 72*x2*x9^2 + 48*x3^3 + 72*x3^2*x5 +
72*x3^2*x6 + 72*x3^2*x8 + 72*x3^2*x9 + 72*x3*x5^2 + 144*x3*x5*x8 +
72*x3*x6^2 + 144*x3*x6*x9 + 72*x3*x8^2 + 72*x3*x9^2 + 48*x4^3 +
72*x4^2*x5 + 72*x4^2*x6 + 72*x4^2*x7 + 72*x4^2*x10 + 72*x4*x5^2 +
144*x4*x5*x6 + 144*x4*x5*x10 + 72*x4*x6^2 + 144*x4*x6*x10 + 72*x4*x7^2 +
72*x4*x10^2 + 48*x5^3 + 72*x5^2*x6 + 72*x5^2*x8 + 72*x5^2*x10 +
72*x5*x6^2 + 144*x5*x6*x10 + 72*x5*x8^2 + 72*x5*x10^2 + 48*x6^3 +
72*x6^2*x9 + 72*x6^2*x10 + 72*x6*x9^2 + 72*x6*x10^2 + 48*x7^3 +
72*x7^2*x8 + 72*x7^2*x9 + 72*x7^2*x10 + 72*x7*x8^2 + 144*x7*x8*x9 +
144*x7*x8*x10 + 72*x7*x9^2 + 144*x7*x9*x10 + 72*x7*x10^2 + 48*x8^3 +
72*x8^2*x9 + 72*x8^2*x10 + 72*x8*x9^2 + 144*x8*x9*x10 + 72*x8*x10^2 +
48*x9^3 + 72*x9^2*x10 + 72*x9*x10^2 + 48*x10^3 )
function f10(x)
f = 48*x[1]^3 + 72*x[1]^2*x[2] + 72*x[1]^2*x[3] + 72*x[1]^2*x[4] + 72*x[1]^2*x[5] + 72*x[1]^2*x[7]
f += 72*x[1]^2*x[8] + 72*x[1]*x[2]^2 + 144*x[1]*x[2]*x[4] + 144*x[1]*x[2]*x[7] + 72*x[1]*x[3]^2
f += 144*x[1]*x[3]*x[5] + 144*x[1]*x[3]*x[8] + 72*x[1]*x[4]^2 + 144*x[1]*x[4]*x[7] + 72*x[1]*x[5]^2
f += 144*x[1]*x[5]*x[8] + 72*x[1]*x[7]^2 + 72*x[1]*x[8]^2 + 48*x[2]^3 + 72*x[2]^2*x[3]
f += 72*x[2]^2*x[4] + 72*x[2]^2*x[6] + 72*x[2]^2*x[7] + 72*x[2]^2*x[9] + 72*x[2]*x[3]^2
f += 144*x[2]*x[3]*x[6] + 144*x[2]*x[3]*x[9] + 72*x[2]*x[4]^2 + 144*x[2]*x[4]*x[7] + 72*x[2]*x[6]^2
f += 144*x[2]*x[6]*x[9] + 72*x[2]*x[7]^2 + 72*x[2]*x[9]^2 + 48*x[3]^3 + 72*x[3]^2*x[5]
f += 72*x[3]^2*x[6] + 72*x[3]^2*x[8] + 72*x[3]^2*x[9] + 72*x[3]*x[5]^2 + 144*x[3]*x[5]*x[8]
f += 72*x[3]*x[6]^2 + 144*x[3]*x[6]*x[9] + 72*x[3]*x[8]^2 + 72*x[3]*x[9]^2 + 48*x[4]^3
f += 72*x[4]^2*x[5] + 72*x[4]^2*x[6] + 72*x[4]^2*x[7] + 72*x[4]^2*x[10] + 72*x[4]*x[5]^2
f += 144*x[4]*x[5]*x[6] + 144*x[4]*x[5]*x[10] + 72*x[4]*x[6]^2 + 144*x[4]*x[6]*x[10] + 72*x[4]*x[7]^2
f += 72*x[4]*x[10]^2 + 48*x[5]^3 + 72*x[5]^2*x[6] + 72*x[5]^2*x[8] + 72*x[5]^2*x[10]
f += 72*x[5]*x[6]^2 + 144*x[5]*x[6]*x[10] + 72*x[5]*x[8]^2 + 72*x[5]*x[10]^2 + 48*x[6]^3
f += 72*x[6]^2*x[9] + 72*x[6]^2*x[10] + 72*x[6]*x[9]^2 + 72*x[6]*x[10]^2 + 48*x[7]^3
f += 72*x[7]^2*x[8] + 72*x[7]^2*x[9] + 72*x[7]^2*x[10] + 72*x[7]*x[8]^2 + 144*x[7]*x[8]*x[9]
f += 144*x[7]*x[8]*x[10] + 72*x[7]*x[9]^2 + 144*x[7]*x[9]*x[10] + 72*x[7]*x[10]^2 + 48*x[8]^3
f += 72*x[8]^2*x[9] + 72*x[8]^2*x[10] + 72*x[8]*x[9]^2 + 144*x[8]*x[9]*x[10] + 72*x[8]*x[10]^2
f += 48*x[9]^3 + 72*x[9]^2*x[10] + 72*x[9]*x[10]^2 + 48*x[10]^3
return f
end
@btime f10($x) # 34.261 ns (0 allocations: 0 bytes)
@btime StaticPolynomials.evaluate($p10, $x) # 47.131 ns (0 allocations: 0 bytes)
@btime ForwardDiff.gradient($f10, $x) # 2.017 μs (73 allocations: 6.84 KiB)
@btime StaticPolynomials.gradient($p10, $x) # 102.675 ns (0 allocations: 0 bytes)
This is very impressive already, especially the fact that the gradient costs only about a factor of 2 more than the evaluation.
With a bit of hand-tuning I can get my f10 down to < 20 ns, though I wonder whether this is really feasible in practice when I have about 50 of them, and some are much more complex than this one.
It looks like I really need to give this a go - any concerns about "production use"? Any plans to register the package?
Incidentally, while I have your attention :), I am trying to generate polynomials that are invariant w.r.t. certain permutation groups. Do you know of anything / anybody in the Julia environment that could help me?
Okay, the evaluation cost really does not seem to be so bad; maybe I can optimize this a little bit more when I have some spare time. And yes, the gradient implementation is really nice. Note that you get the evaluation basically for free:
@btime StaticPolynomials.gradient($p10, $x) # 93.121 ns (0 allocations: 0 bytes)
@btime StaticPolynomials.evaluate_and_gradient($p10, $x) # 97.407 ns (0 allocations: 0 bytes)
So if you have something like Newton's method, where you need both values, this can come in really handy.
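To illustrate the point, here is a hedged sketch of a scalar Newton iteration driven by a function that returns the value and the derivative in one call, the same access pattern that a combined value-and-gradient routine serves (the names here are hypothetical):

```julia
# Newton's method for a scalar root, where `fdf` returns
# (value, derivative) in a single call -- analogous to getting
# the value and gradient together instead of separately.
function newton(fdf, x0; tol = 1e-12, maxiter = 50)
    x = x0
    for _ in 1:maxiter
        f, df = fdf(x)
        abs(f) < tol && return x
        x -= f / df   # Newton update
    end
    return x
end

newton(x -> (x^2 - 2, 2x), 1.0)  # ≈ sqrt(2)
```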
It looks like I really need to give this a go - any concerns about "production use"? Any plans to register the package?
Not really. I tried it in another package of mine and it worked flawlessly. You only need to keep in mind that the Polynomial constructor is type-unstable, i.e., some function barrier is necessary.
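The function-barrier pattern referred to here, in a generic sketch (not the package's API): construct the type-unstable object once in an outer function, then hand it to an inner function so the compiler specializes the hot code on the concrete type.

```julia
# `run_barrier` is type-unstable because the concrete type of `p`
# is only known at runtime; `kernel` acts as the barrier and is
# compiled for whatever concrete type it receives.
function kernel(p, xs)
    s = 0.0
    for x in xs
        s += p(x)   # fast: `p` has a concrete type inside here
    end
    return s
end

function run_barrier(make_poly, xs)
    p = make_poly()        # type-unstable construction, done once
    return kernel(p, xs)   # dynamic dispatch paid once, at the barrier
end

run_barrier(() -> (x -> 2x + 1), [1.0, 2.0, 3.0])  # → 15.0
```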
I actually wanted to get the package registered today, but now I got sidetracked by this discussion :D But I think I will get this done this weekend.
I am trying to generate polynomials that are invariant w.r.t. certain permutation groups. Do you know of anything / anybody in the Julia environment that could help me?
I don't know of anything in the Julia environment which could help with that. Is the tricky part figuring out the math or the software? If this only needs to be done once, you could maybe use a computer algebra system like Macaulay2. Funny thing: my PhD supervisor suggested that I should start to look into invariant theory. So I am not sure whether I can currently help you with this problem, but I would love to know more about the concrete problem. Also, maybe Paul (@PBrdng) could help you. No promises, but maybe you could send an email with your concrete problem to the two of us?
At the moment we are using MAGMA, but it takes forever and gives the most horrendous polynomials. We are now starting over just "guessing" the primary invariants by hand and then checking with MAGMA that they actually are correct, and then let MAGMA construct the secondary invariants.
So btw you work with Sturmfels? Applications, algorithms, theory? He usually has some pretty exciting applications lined up.
My own application is that we use invariant polynomials as descriptors of atomic environments and then construct high-accuracy approximations to the potential energy landscape (you can call it machine learning if you want, but I hate that term).
I am trying your inline branch but I'm getting
julia> @btime StaticPolynomials.inline_evaluate($INV6Q, $Q)
ERROR: UndefVarError: inline_evaluate not defined
Btw, how would you rewrite this system (by hand) if you wanted to try and make StaticPolynomials a bit faster?
INV6Q = StaticPolynomials.system(
[ Q1,
Q2^2 + Q3^2 + Q4^2,
Q2 * Q3 * Q4,
Q3^2 * Q4^2 + Q2^2 * Q4^2 + Q2^2 * Q3^2,
Q5^2 + Q6^2,
Q6^3 - 3*Q5^2 * Q6,
1.0,
Q6 * (2*Q2^2 - Q3^2 - Q4^2) + √3 * Q5 * (Q3^2 - Q4^2),
(Q6^2 - Q5^2) * (2*Q2^2 - Q3^2 - Q4^2) - 2 * √3 * Q5 * Q6 * (Q3^2 - Q4^2),
Q6 * (2*Q3^2 * Q4^2 - Q2^2 * Q4^2 - Q2^2 * Q3^2) + √3 * Q2 * (Q2^2 * Q4^2 - Q2^2 * Q3^2),
(Q6^2 - Q5^2)*(2*Q3^2*Q4^2 - Q2^2*Q4^2 -Q2^2*Q3^2) - 2*√3 * Q5 * Q6 * (Q2^2*Q4^2 - Q2^2*Q3^2),
(Q3^2 - Q4^2) * (Q4^2 - Q2^2) * (Q2^2 - Q3^2) * Q5 * (3*Q6^2 - Q5^2)
])
With StaticPolynomials this evaluates in ca. 50 ns, my hand-coded version in under 20 ns (but I had to create some intermediate variables).
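To make "intermediate variables" concrete: the main hand-tuning opportunity in a system like the one above is hoisting repeated subexpressions such as the squares into locals so each is computed only once. A sketch for two of the invariants (my own illustrative function, not the actual hand-coded version):

```julia
# Two of the invariants above, with the repeated squares hoisted
# into locals; the full hand-coded system continues in this style.
function inv_sketch(Q2, Q3, Q4)
    Q2sq, Q3sq, Q4sq = Q2^2, Q3^2, Q4^2        # computed once, reused
    i2 = Q2sq + Q3sq + Q4sq                    # Q2^2 + Q3^2 + Q4^2
    i4 = Q3sq*Q4sq + Q2sq*Q4sq + Q2sq*Q3sq     # pairwise products of squares
    return i2, i4
end

inv_sketch(1.0, 2.0, 3.0)  # → (14.0, 49.0)
```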
So btw you work with Sturmfels? Applications, algorithms, theory? He usually has some pretty exciting applications lined up.
I am working on the solution of systems of polynomials by (numerical) homotopy continuation. For this I am developing the package HomotopyContinuation.jl. There are already some software packages which do this (e.g. Bertini, PHCpack) but by using Julia it becomes much easier to adapt the algorithm to specific applications. Also things like this package help tremendously with the performance.
I am trying your inline branch but I'm getting
Sorry, I updated the branch and just inlined the normal evaluate (as well as the gradient) in order to benchmark the changes, and got rid of the special function. Here are the results of the benchmark, which in general look good, but I am a little bit concerned about the regression in the larger system rps10.
I just tagged a first version (without any inlining) to get the registration done, since this can take some time in my experience. Once I figure out how to improve the performance consistently, I will make a point update.
Thanks - I also saw that the performance was not much better for a slightly larger system.
Incidentally, while I have your attention :), I am trying to generate polynomials that are invariant w.r.t. certain permutation groups. Do you know of anything / anybody in the Julia environment that could help me?
Hi, as Sascha asked me to enter the discussion: Could you explain in more detail what you are aiming at?
Happy to write in more detail. Here, or better in a Gitter channel?
Email or here, either is fine for me.
@saschatimme I've been experimenting with my own version of @generated fast polynomial evaluation. Since I have a single use case, I can optimise quite a bit, and in particular can sacrifice "user-friendliness". That said, it still feels that it should be possible to make StaticPolynomials nearly as fast, so I'd be curious to explore this some more.
One possibility is that I am just sacrificing numerical stability for speed. Is this at all a concern for you? (Is your Horner scheme numerically stable?) EDIT: if this is the main issue, then it might be interesting to have two types of StaticPolynomials, a stable one and an unstable but very fast one. Even now, while I am very happy with the performance I have, I can still see many more optimisations to explore.
PS: I can post some benchmarks here, if it is of interest, once I've managed to polish the code and put it in a public repository.
I think it would be great to have a faster evaluation implementation available. So please share what you have :)
One possibility is that I am just sacrificing numerical stability for speed. Is this at all a concern for you?
The current evaluation and gradient implementation should be numerically stable, and for certain applications it is probably important. Since the current evaluation is already implemented we could easily add another type parameter to give the user a choice whether speed is more important than accuracy.
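A hypothetical sketch of how such a speed/accuracy switch could be dispatched at compile time (none of these names are the package's API):

```julia
# Hypothetical API sketch: select the evaluation scheme via a
# zero-cost type tag, resolved at compile time by dispatch.
abstract type EvalMode end
struct Stable <: EvalMode end   # Horner-style, numerically stable
struct Fast   <: EvalMode end   # explicit powers, potentially faster

evalpoly_mode(coeffs, x, ::Stable) = evalpoly(x, coeffs)   # Base.evalpoly uses Horner
evalpoly_mode(coeffs, x, ::Fast) =
    sum(c * x^(i - 1) for (i, c) in pairs(coeffs))         # naive powers

evalpoly_mode((1.0, 2.0, 3.0), 2.0, Stable())  # → 17.0
```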
Did you do something recently to StaticPolynomials? I am suddenly getting much better performance. Maybe I just used it incorrectly before, or maybe in a different context. The gradients are still a bit slower than my own, but not by much.
I found some inefficiencies in the generated code and got rid of them with #10.
I think there might still be some things to be exploited, but I just wanted to say I am eternally grateful for your package and wanted to share a quick benchmark:
Profile Assembly Performance
E PIP: 681.886 ms (1717 allocations: 1.44 MiB)
E fast-PIP: 27.518 ms (1717 allocations: 1.44 MiB)
F PIP: 907.064 ms (1715 allocations: 1.45 MiB)
F fast-PIP: 51.129 ms (1715 allocations: 1.45 MiB)
PIP are polynomials of invariants. I generate the invariants myself (this is still a bit faster than StaticPolynomials, but I want to test this carefully again when I find the time), but the "outer" polynomials I then represent as StaticPolynomials => fast-PIP. You can see the performance gain. And I thought my own code was actually reasonably optimised already. Fantastic!
(P.S.: and I don't have to implement gradients)
@cortner That's great to hear! I also think that there are some things left to be exploited. I think it would be interesting to implement the algorithm described in this paper for the evaluation.
Also note that in Julia 0.7 the compile times as well as the performance improve significantly. :)
Timings for your example polynomial p10 above in 0.7:
julia> @btime evaluate($p10, $w)
39.783 ns (0 allocations: 0 bytes)
julia> @btime gradient($p10, $w)
76.963 ns (0 allocations: 0 bytes)
vs. 0.6
julia> @btime evaluate($p10, $w)
56.208 ns (0 allocations: 0 bytes)
julia> @btime gradient($p10, $w)
103.934 ns (0 allocations: 0 bytes)
@cortner Closing this since we now have, for the example system p10 above:
julia> @btime f10($y)
31.780 ns (0 allocations: 0 bytes)
394.1220308523676
julia> @btime $p10($y)
28.717 ns (0 allocations: 0 bytes)
394.1220308523676
julia> @btime gradient($p10, $y)
73.109 ns (0 allocations: 0 bytes)
julia> versioninfo()
Julia Version 0.7.0
Commit a4cb80f3ed (2018-08-08 06:46 UTC)
Platform Info:
OS: macOS (x86_64-apple-darwin14.5.0)
CPU: Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz
WORD_SIZE: 64
LIBM: libopenlibm
LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
Environment:
JULIA_NUM_THREADS = 4
To further improve the evaluation performance I think #22 is the way to go.
I've been playing around a bit, testing the performance of this package. Taking one of your own tests (from some .md I found):
This is quite a high overhead. I also tested more complex polynomials, where my "manual" implementation is more in the range of 50-100 ns, and in that case the overhead seems to shrink.
Questions:
I should say that my use case is that I generate polynomials once and then call them millions of times, so a 20 ns overhead can matter.
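Back-of-envelope for why that matters, using the numbers mentioned above (roughly 50 polynomials, each called on the order of a million times):

```julia
# 50 polynomials × 1e6 calls each × 20 ns overhead per call
overhead_seconds = 50 * 1e6 * 20e-9
# ≈ 1.0 second of pure overhead per sweep
```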