JuliaAlgebra / StaticPolynomials.jl

Fast evaluation of multivariate polynomials
https://juliaalgebra.github.io/StaticPolynomials.jl/latest/

Overhead #7

Closed: cortner closed this issue 5 years ago

cortner commented 6 years ago

I've been testing the performance of this package a bit, taking one of your own examples (from some .md I found):

using DynamicPolynomials: @polyvar
using StaticPolynomials, StaticArrays, BenchmarkTools

f3(r::SVector{3}) = SVector(
         r[1] + r[2] + r[3],
         r[1]*r[2] + r[2]*r[3] + r[1]*r[3],
         r[1]*r[2]*r[3] )
f4(r::SVector{4}) = SVector(
         r[1] + r[2] + r[3] + r[4],
         r[1]*r[2] + r[1]*r[3] + r[1]*r[4] + r[2]*r[3] + r[2]*r[4] + r[3]*r[4],
         r[1]*r[2]*r[3] + r[1]*r[2]*r[4] + r[1]*r[3]*r[4] + r[2]*r[3]*r[4],
         r[1]*r[2]*r[3]*r[4] )

@polyvar r1 r2 r3 r4
P3 = system([   r1 + r2 + r3,
                  r1*r2 + r2*r3 + r1*r3,
                  r1*r2*r3 ])
P4 = system([ r1 + r2 + r3 + r4, 
               r1*r2 + r1*r3 + r1*r4 + r2*r3 + r2*r4 + r3*r4,
              r1*r2*r3 + r1*r2*r4 + r1*r3*r4 + r2*r3*r4,
              r1*r2*r3*r4   ])

r_3 = @SVector rand(3)
r_4 = @SVector rand(4)

@btime f3($r_3)    # 2.634 ns (0 allocations: 0 bytes)
@btime StaticPolynomials.evaluate($P3, $r_3)   # 18.491 ns (0 allocations: 0 bytes)
@btime f4($r_4)         # 10.042 ns (0 allocations: 0 bytes)
@btime StaticPolynomials.evaluate($P4, $r_4)   # 28.735 ns (0 allocations: 0 bytes)

This is quite a high overhead. I also tested more complex polynomials, where my "manual" implementation is more in the range of 50-100 ns, and in that case the relative overhead seems smaller.

Questions:

  1. what is the origin of the overhead?
  2. is there any hope of reducing it?

I should say that my use-case is that I generate polynomials once and then call them millions of times, so a 20ns overhead can matter.

saschatimme commented 6 years ago

I think two things are happening.

  1. The current implementation evaluates each polynomial in a system separately, so there is a constant overhead from the function calls. (My intention was to reduce compilation overhead if you have similar systems during the computation, but maybe this is not such an issue.)
  2. Your polynomials "cheat" because they do not evaluate their coefficients. Since the functions are only specialised for a given support, we cannot know that the coefficients are just 1 for each monomial.

I pushed a branch named inlining with a function inline_evaluate in which everything gets inlined.

The results are

@btime f3($r_3)    # 1.871 ns (0 allocations: 0 bytes)
@btime StaticPolynomials.evaluate($P3, $r_3)   # 15.597 ns (0 allocations: 0 bytes)
@btime StaticPolynomials.inline_evaluate($P3, $r_3)   # 11.098 ns (0 allocations: 0 bytes)
@btime f4($r_4)        # 8.719 ns
@btime StaticPolynomials.evaluate($P4, $r_4)   # 24.174 ns (0 allocations: 0 bytes)
@btime StaticPolynomials.inline_evaluate($P4, $r_4)   # 14.211 ns (0 allocations: 0 bytes)

So the overhead drops significantly. My guess is that the remaining overhead comes from the multiplications with the coefficients and from fetching the heap-allocated coefficient array.

cortner commented 6 years ago

Thank you. Have you tried using Calculus.simplify?

saschatimme commented 6 years ago

I do not see where this would be useful, could you elaborate?

cortner commented 6 years ago
using Calculus
ex = :( 1 * x^1 * y^2 )
Calculus.simplify(ex)
# :(x * y ^ 2)
cortner commented 6 years ago

Now testing the derivatives as well: the good news is that your derivatives are much faster than ForwardDiff's (though I'm not sure why ForwardDiff feels the need to allocate here!), but the overhead is even bigger than before.

@btime Df3($r_3)  #  3.701 ns
@btime StaticPolynomials.jacobian($P3, $r_3) # 26.666 ns
@btime ForwardDiff.jacobian($f3, $r_3, $cfg3)   # 80.462 ns
@btime Df4($r_4)   #  12.988 ns
@btime StaticPolynomials.jacobian($P4, $r_4)  #  51.952 ns
@btime ForwardDiff.jacobian($f4, $r_4, $cfg4)   #   199.193 ns

As I said above I am running these tests because (a) I really need the best performance I can get and (b) it feels that with code generation it should in principle be possible to match a "hand-coded" implementation.

saschatimme commented 6 years ago

I pushed some changes to the inlining branch which enables the inlining also for the Jacobian and reduces that overhead quite a bit. But for large problems it seems that the performance even gets worse, so this cannot be applied without some care.

(b) it feels that with code generation it should in principle be possible to match a "hand-coded" implementation.

I agree there, but one reason for the performance difference is that we currently do not generate the same source code as you wrote, since the code is optimized for general polynomials. So tricks like Horner's method (and an extension of it for the multivariate case) are applied, which in general produce faster code, but for your special polynomials (no variable has an exponent higher than 1) this does not seem to pay off.
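For reference, the univariate idea is as follows (a generic sketch, not the package's actual generated code):

```julia
# Univariate Horner: evaluate p(x) = 3x^3 + 2x^2 - x + 5 as
# ((3x + 2)x - 1)x + 5, i.e. one multiply-add per coefficient
# instead of computing each power separately.
function horner(coeffs, x)              # coeffs from highest to lowest degree
    acc = coeffs[1]
    for c in Iterators.drop(coeffs, 1)
        acc = muladd(acc, x, c)         # acc * x + c, fused where supported
    end
    return acc
end

horner([3.0, 2.0, -1.0, 5.0], 2.0)      # 35.0
```

The multivariate extension factors out one variable at a time and applies the same recursion to the coefficient polynomials.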

Similarly, your specific polynomials only have coefficient one. That is a fact which can be nicely exploited in your application, but polynomials in general have non-unit coefficients, and dealing with this special case just didn't seem worth it to me at the moment. I chose to optimize for the general case, and obviously there will always be specific examples where this is not the best choice. But as a user you get something quite fast with minimal effort.

cortner commented 6 years ago

Ok, fair enough. For those small polynomials I can do everything by hand. Here is one component of a system that shows a bit better what I am interested in:

p10 = Polynomial(
   48*x1^3 + 72*x1^2*x2 + 72*x1^2*x3 + 72*x1^2*x4 + 72*x1^2*x5 + 72*x1^2*x7 +
   72*x1^2*x8 + 72*x1*x2^2 + 144*x1*x2*x4 + 144*x1*x2*x7 + 72*x1*x3^2 +
   144*x1*x3*x5 + 144*x1*x3*x8 + 72*x1*x4^2 + 144*x1*x4*x7 + 72*x1*x5^2 +
   144*x1*x5*x8 + 72*x1*x7^2 + 72*x1*x8^2 + 48*x2^3 + 72*x2^2*x3 +
   72*x2^2*x4 + 72*x2^2*x6 + 72*x2^2*x7 + 72*x2^2*x9 + 72*x2*x3^2 +
   144*x2*x3*x6 + 144*x2*x3*x9 + 72*x2*x4^2 + 144*x2*x4*x7 + 72*x2*x6^2 +
   144*x2*x6*x9 + 72*x2*x7^2 + 72*x2*x9^2 + 48*x3^3 + 72*x3^2*x5 +
   72*x3^2*x6 + 72*x3^2*x8 + 72*x3^2*x9 + 72*x3*x5^2 + 144*x3*x5*x8 +
   72*x3*x6^2 + 144*x3*x6*x9 + 72*x3*x8^2 + 72*x3*x9^2 + 48*x4^3 +
   72*x4^2*x5 + 72*x4^2*x6 + 72*x4^2*x7 + 72*x4^2*x10 + 72*x4*x5^2 +
   144*x4*x5*x6 + 144*x4*x5*x10 + 72*x4*x6^2 + 144*x4*x6*x10 + 72*x4*x7^2 +
   72*x4*x10^2 + 48*x5^3 + 72*x5^2*x6 + 72*x5^2*x8 + 72*x5^2*x10 +
   72*x5*x6^2 + 144*x5*x6*x10 + 72*x5*x8^2 + 72*x5*x10^2 + 48*x6^3 +
   72*x6^2*x9 + 72*x6^2*x10 + 72*x6*x9^2 + 72*x6*x10^2 + 48*x7^3 +
   72*x7^2*x8 + 72*x7^2*x9 + 72*x7^2*x10 + 72*x7*x8^2 + 144*x7*x8*x9 +
   144*x7*x8*x10 + 72*x7*x9^2 + 144*x7*x9*x10 + 72*x7*x10^2 + 48*x8^3 +
   72*x8^2*x9 + 72*x8^2*x10 + 72*x8*x9^2 + 144*x8*x9*x10 + 72*x8*x10^2 +
   48*x9^3 + 72*x9^2*x10 + 72*x9*x10^2 + 48*x10^3 )

function f10(x)
   f  = 48*x[1]^3 + 72*x[1]^2*x[2] + 72*x[1]^2*x[3] + 72*x[1]^2*x[4] + 72*x[1]^2*x[5] + 72*x[1]^2*x[7]
   f += 72*x[1]^2*x[8] + 72*x[1]*x[2]^2 + 144*x[1]*x[2]*x[4] + 144*x[1]*x[2]*x[7] + 72*x[1]*x[3]^2
   f += 144*x[1]*x[3]*x[5] + 144*x[1]*x[3]*x[8] + 72*x[1]*x[4]^2 + 144*x[1]*x[4]*x[7] + 72*x[1]*x[5]^2
   f += 144*x[1]*x[5]*x[8] + 72*x[1]*x[7]^2 + 72*x[1]*x[8]^2 + 48*x[2]^3 + 72*x[2]^2*x[3]
   f += 72*x[2]^2*x[4] + 72*x[2]^2*x[6] + 72*x[2]^2*x[7] + 72*x[2]^2*x[9] + 72*x[2]*x[3]^2
   f += 144*x[2]*x[3]*x[6] + 144*x[2]*x[3]*x[9] + 72*x[2]*x[4]^2 + 144*x[2]*x[4]*x[7] + 72*x[2]*x[6]^2
   f += 144*x[2]*x[6]*x[9] + 72*x[2]*x[7]^2 + 72*x[2]*x[9]^2 + 48*x[3]^3 + 72*x[3]^2*x[5]
   f += 72*x[3]^2*x[6] + 72*x[3]^2*x[8] + 72*x[3]^2*x[9] + 72*x[3]*x[5]^2 + 144*x[3]*x[5]*x[8]
   f += 72*x[3]*x[6]^2 + 144*x[3]*x[6]*x[9] + 72*x[3]*x[8]^2 + 72*x[3]*x[9]^2 + 48*x[4]^3
   f += 72*x[4]^2*x[5] + 72*x[4]^2*x[6] + 72*x[4]^2*x[7] + 72*x[4]^2*x[10] + 72*x[4]*x[5]^2
   f += 144*x[4]*x[5]*x[6] + 144*x[4]*x[5]*x[10] + 72*x[4]*x[6]^2 + 144*x[4]*x[6]*x[10] + 72*x[4]*x[7]^2
   f += 72*x[4]*x[10]^2 + 48*x[5]^3 + 72*x[5]^2*x[6] + 72*x[5]^2*x[8] + 72*x[5]^2*x[10]
   f += 72*x[5]*x[6]^2 + 144*x[5]*x[6]*x[10] + 72*x[5]*x[8]^2 + 72*x[5]*x[10]^2 + 48*x[6]^3
   f += 72*x[6]^2*x[9] + 72*x[6]^2*x[10] + 72*x[6]*x[9]^2 + 72*x[6]*x[10]^2 + 48*x[7]^3
   f += 72*x[7]^2*x[8] + 72*x[7]^2*x[9] + 72*x[7]^2*x[10] + 72*x[7]*x[8]^2 + 144*x[7]*x[8]*x[9]
   f += 144*x[7]*x[8]*x[10] + 72*x[7]*x[9]^2 + 144*x[7]*x[9]*x[10] + 72*x[7]*x[10]^2 + 48*x[8]^3
   f += 72*x[8]^2*x[9] + 72*x[8]^2*x[10] + 72*x[8]*x[9]^2 + 144*x[8]*x[9]*x[10] + 72*x[8]*x[10]^2
   f += 48*x[9]^3 + 72*x[9]^2*x[10] + 72*x[9]*x[10]^2 + 48*x[10]^3
   return f
end

@btime f10($x)   # 34.261 ns (0 allocations: 0 bytes)
@btime StaticPolynomials.evaluate($p10, $x)   # 47.131 ns (0 allocations: 0 bytes)
@btime ForwardDiff.gradient($f10, $x)   # 2.017 μs (73 allocations: 6.84 KiB)
@btime StaticPolynomials.gradient($p10, $x)   # 102.675 ns (0 allocations: 0 bytes)

This is very impressive already, especially the fact that the cost of the gradient is only ~ factor 2.

With a bit of hand-tuning I can get my f10 down to < 20 ns, though I wonder whether this is really feasible in practice when I have about 50 of them, and some are much more complex than this one.

cortner commented 6 years ago

It looks like I really need to give this a go - any concerns about "production use"? Any plans to register the package?

cortner commented 6 years ago

Incidentally, while I have your attention :), I am trying to generate polynomials that are invariant w.r.t. certain permutation groups. Do you know of anything / anybody in the Julia environment that could help me?

saschatimme commented 6 years ago

Okay, the evaluation cost really does not seem to be so great; maybe I can optimize this a little bit more when I have some spare time. And yes, the gradient implementation is really nice. Note that you get the evaluation basically for free:

@btime StaticPolynomials.gradient($p10, $x) # 93.121 ns (0 allocations: 0 bytes)
@btime StaticPolynomials.evaluate_and_gradient($p10, $x) # 97.407 ns (0 allocations: 0 bytes)

So if you have something like Newton's method where you need both values this can come really really handy.
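A minimal sketch of why the fused call pays off in that setting (fdf below is a placeholder for any function returning value and derivative together, not this package's API):

```julia
# Scalar Newton iteration: each step needs both f(x) and f'(x), so a fused
# "value and derivative" evaluation roughly halves the work per step
# compared to two separate calls.
function newton(fdf, x; tol = 1e-12, maxiter = 50)
    for _ in 1:maxiter
        f, df = fdf(x)                  # one fused evaluation per step
        abs(f) < tol && return x
        x -= f / df
    end
    return x
end

newton(x -> (x^2 - 2, 2x), 1.0)         # ≈ sqrt(2)
```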

It looks like I really need to give this a go - any concerns about "production use"? Any plans to register the package?

Not really. I tried it in another package of mine and it worked flawlessly. You only need to keep in mind that the Polynomial constructor is type-unstable, i.e., some function barrier is necessary. I actually wanted to get the package registered today, but now I got sidetracked by this discussion :D But I think I will get this done this weekend.
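For anyone hitting the same issue, the usual pattern looks roughly like this (a sketch; the polynomial and loop body are made up for illustration):

```julia
# Function-barrier pattern: construct the (type-unstable) polynomial once at
# top level, then pass it into a function. Julia specializes `total` on the
# concrete type of `p`, so the hot loop runs at full speed.
using DynamicPolynomials: @polyvar
using StaticPolynomials, StaticArrays

function total(p, xs)                   # barrier: `p` is concretely typed here
    s = 0.0
    for x in xs
        s += StaticPolynomials.evaluate(p, x)
    end
    return s
end

@polyvar a b
p = StaticPolynomials.Polynomial(a^2 + 2a * b + b^2)  # unstable, done once
xs = [@SVector rand(2) for _ in 1:1_000]
total(p, xs)                            # fast: all iterations inside the barrier
```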

I am trying to generate polynomials that are invariant w.r.t. certain permutation groups. Do you know of anything / anybody in the Julia environment that could help me?

I don't know of anything in the Julia environment which could help with that. Is the tricky part figuring out the math or the software? If this only needs to be done once you could maybe use a computer algebra system like Macaulay2. Funny thing, my PhD supervisor suggested that I should start to look into invariant theory. So I am not sure whether I can currently help you with this problem, but I would love to know more about the concrete problem. Also maybe Paul (@PBrdng) could help you. No promises, but maybe you could send an email with your concrete problem to the two of us?

cortner commented 6 years ago

At the moment we are using MAGMA, but it takes forever and gives the most horrendous polynomials. We are now starting over just "guessing" the primary invariants by hand and then checking with MAGMA that they actually are correct, and then let MAGMA construct the secondary invariants.

cortner commented 6 years ago

So btw you work with Sturmfels? Applications, algorithms, theory? He usually has some pretty exciting applications lined up.

My own application is that we use invariant polynomials as descriptors of atomic environments and then construct high-accuracy approximations to the potential energy landscape (you can call it machine learning if you want but I hate that term)

cortner commented 6 years ago

I am trying your inline branch but I'm getting

julia> @btime StaticPolynomials.inline_evaluate($INV6Q, $Q)
ERROR: UndefVarError: inline_evaluate not defined
cortner commented 6 years ago

Btw, how would you rewrite this system (by hand) if you wanted to try and make StaticPolynomials a bit faster?

INV6Q = StaticPolynomials.system(
   [  Q1,
      Q2^2 + Q3^2 + Q4^2,
      Q2 * Q3 * Q4,
      Q3^2 * Q4^2 + Q2^2 * Q4^2 + Q2^2 * Q3^2,
      Q5^2 + Q6^2,
      Q6^3 - 3*Q5^2 * Q6,
      1.0,
      Q6 * (2*Q2^2 - Q3^2 - Q4^2) + √3 * Q5 * (Q3^2 - Q4^2),
      (Q6^2 - Q5^2) * (2*Q2^2 - Q3^2 - Q4^2) - 2 * √3 * Q5 * Q6 * (Q3^2 - Q4^2),
      Q6 * (2*Q3^2 * Q4^2 - Q2^2 * Q4^2 - Q2^2 * Q3^2) + √3 * Q2 * (Q2^2 * Q4^2 - Q2^2 * Q3^2),
      (Q6^2 - Q5^2)*(2*Q3^2*Q4^2 - Q2^2*Q4^2 -Q2^2*Q3^2) - 2*√3 * Q5 * Q6 * (Q2^2*Q4^2 - Q2^2*Q3^2),
      (Q3^2 - Q4^2) * (Q4^2 - Q2^2) * (Q2^2 - Q3^2) * Q5 * (3*Q6^2 - Q5^2)
   ])

With StaticPolynomials this evaluates in ca. 50 ns, my hand-coded version in under 20 ns (but I had to create some intermediate variables).
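For concreteness, this is the kind of intermediate variable I mean (my own sketch and naming, shown for three of the twelve components above): the squares Q2^2, Q3^2, Q4^2 are shared across components, so hoisting them removes repeated work.

```julia
# Hoist shared subexpressions so each square is computed once instead of
# once per component.
function inv6q_partial(Q)
    s2, s3, s4 = Q[2]^2, Q[3]^2, Q[4]^2        # shared squares
    return (s2 + s3 + s4,                      # Q2^2 + Q3^2 + Q4^2
            Q[2] * Q[3] * Q[4],                # Q2 * Q3 * Q4
            s3 * s4 + s2 * s4 + s2 * s3)       # Q3^2*Q4^2 + Q2^2*Q4^2 + Q2^2*Q3^2
end
```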

saschatimme commented 6 years ago

So btw you work with Sturmfels? Applications, algorithms, theory? He usually has some pretty exciting applications lined up.

I am working on the solution of systems of polynomials by (numerical) homotopy continuation. For this I am developing the package HomotopyContinuation.jl. There are already some software packages which do this (e.g. Bertini, PHCpack) but by using Julia it becomes much easier to adapt the algorithm to specific applications. Also things like this package help tremendously with the performance.

I am trying your inline branch but I'm getting

Sorry, I updated the branch and just inlined the normal evaluate (as well as the gradient) in order to benchmark the changes and got rid of the special function. Here are the results of the benchmark, which in general look good but I am a little bit concerned about the regression in the larger system rps10.

saschatimme commented 6 years ago

I just tagged a first version (without any inlining) to get the registration going, since this can take some time in my experience. Once I figure out how to improve the performance consistently, I will make a point release.

cortner commented 6 years ago

Thanks - I also saw that the performance was not much better for a slightly larger system.

PBrdng commented 6 years ago

Incidentally, while I have your attention :), I am trying to generate polynomials that are invariant w.r.t. certain permutation groups. Do you know of anything / anybody in the Julia environment that could help me?

Hi, as Sascha asked me to enter the discussion: Could you explain in more detail what you are aiming at?

cortner commented 6 years ago

Happy to write more detail. Here, or better in a Gitter channel?

PBrdng commented 6 years ago

Email or here, either is fine for me.

cortner commented 6 years ago

@saschatimme I've been experimenting with my own version of @generated fast polynomial evaluation. Since I have a single use-case, I can optimise quite a bit, and in particular can sacrifice "user-friendliness". That said, it still feels that it should be possible to make StaticPolynomials nearly as fast, so I'd be curious to explore this some more.

One possibility is that I am just sacrificing numerical stability for speed. Is this at all a concern for you? (Is your Horner scheme numerically stable?) EDIT: if this is the main issue, then it might be interesting to have two types of StaticPolynomials, a stable one and an unstable but very fast one. Even now, while I am very happy with the performance I have, I can still see many more optimisations to explore.
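To illustrate the kind of trade-off I mean (a classic example, not specific to this package):

```julia
# Near a multiple root, evaluating the expanded form of (x - 1)^6 suffers
# catastrophic cancellation in Float64, while the factored form stays
# accurate. Different evaluation schemes sit at different points on this
# accuracy/speed spectrum.
expanded(x) = x^6 - 6x^5 + 15x^4 - 20x^3 + 15x^2 - 6x + 1
factored(x) = (x - 1)^6

factored(1.0001)   # ≈ 1.0e-24
expanded(1.0001)   # typically wrong by many orders of magnitude
```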

PS: I can post some benchmarks here, if it is of interest, once I've managed to polish the code and put it in a public repository.

saschatimme commented 6 years ago

I think it would be great to have a faster evaluation implementation available. So please share what you have :)

One possibility is that I am just sacrificing numerical stability for speed. Is this at all a concern for you?

The current evaluation and gradient implementation should be numerically stable, and for certain applications it is probably important. Since the current evaluation is already implemented we could easily add another type parameter to give the user a choice whether speed is more important than accuracy.

cortner commented 6 years ago

Did you do something recently to StaticPolynomials? I am suddenly getting much better performance. Maybe I just used it incorrectly before, or maybe in a different context. The gradients are still a bit slower than my own, but not very much.

saschatimme commented 6 years ago

I found some inefficiencies in the generated code and got rid of them with #10

cortner commented 6 years ago

I think there might still be some things to be exploited, but I just wanted to say I am eternally grateful for your package and wanted to share a quick benchmark:

Profile Assembly Performance
 E      PIP:  681.886 ms (1717 allocations: 1.44 MiB)
 E fast-PIP:  27.518 ms (1717 allocations: 1.44 MiB)
 F      PIP:  907.064 ms (1715 allocations: 1.45 MiB)
 F fast-PIP:  51.129 ms (1715 allocations: 1.45 MiB)

PIP are polynomials of invariants. I generate the invariants myself (this is still a bit faster than StaticPolynomials, but I want to test this carefully again when I find the time), but the "outer" polynomials I then represent as StaticPolynomials => fast-PIP. You can see the performance gain. And I thought my own code was actually reasonably optimised already. Fantastic!

(P.S.: and I don't have to implement gradients)

saschatimme commented 6 years ago

@cortner That's great to hear! I also think that there are some things left to be exploited. I think it would be interesting to implement the algorithm described in this paper for the evaluation.

Also note that on Julia 0.7 both the compile times and the runtime performance improve significantly. :) Timings for your example polynomial p10 above on 0.7:

julia> @btime evaluate($p10, $w)
  39.783 ns (0 allocations: 0 bytes)
julia> @btime gradient($p10, $w)
  76.963 ns (0 allocations: 0 bytes)

vs. 0.6

julia> @btime evaluate($p10, $w)
  56.208 ns (0 allocations: 0 bytes)
julia> @btime gradient($p10, $w)
  103.934 ns (0 allocations: 0 bytes)
saschatimme commented 5 years ago

@cortner Closing this, since for the example polynomial p10 above we now have:

julia> @btime f10($y)
  31.780 ns (0 allocations: 0 bytes)
394.1220308523676

julia> @btime $p10($y)
  28.717 ns (0 allocations: 0 bytes)
394.1220308523676

julia> @btime gradient($p10, $y)
  73.109 ns (0 allocations: 0 bytes)

julia> versioninfo()
Julia Version 0.7.0
Commit a4cb80f3ed (2018-08-08 06:46 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin14.5.0)
  CPU: Intel(R) Core(TM) i5-7500 CPU @ 3.40GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-6.0.0 (ORCJIT, skylake)
Environment:
  JULIA_NUM_THREADS = 4
saschatimme commented 5 years ago

To further improve the evaluation performance I think #22 is the way to go.