roccaa opened this issue 6 years ago:
We saw that a "loop-based" approach that (1) uses Julia's column-major ordering (see this performance tip), i.e. makes the inner loops run through the left-most indices, and (2) avoids duplicate multiplications using auxiliary variables, looks like the following in the particular case of dimension 8:
function fauxc(a, b, c, d, e, f, g, h)
    A = Array{Float64}(undef, length(a), length(b), length(c), length(d),
                       length(e), length(f), length(g), length(h))
    @inbounds begin
        for i8 in eachindex(h)
            for i7 in eachindex(g)
                aux_i7_i8 = g[i7] * h[i8]
                for i6 in eachindex(f)
                    aux_i6_i7_i8 = f[i6] * aux_i7_i8
                    for i5 in eachindex(e)
                        aux_i5_i6_i7_i8 = e[i5] * aux_i6_i7_i8
                        for i4 in eachindex(d)
                            aux_i4_i5_i6_i7_i8 = d[i4] * aux_i5_i6_i7_i8
                            for i3 in eachindex(c)
                                aux_i3_i4_i5_i6_i7_i8 = c[i3] * aux_i4_i5_i6_i7_i8
                                for i2 in eachindex(b)
                                    aux_i2_i3_i4_i5_i6_i7_i8 = b[i2] * aux_i3_i4_i5_i6_i7_i8
                                    # innermost loop runs over the left-most index (column-major)
                                    for i1 in eachindex(a)
                                        A[i1, i2, i3, i4, i5, i6, i7, i8] = a[i1] * aux_i2_i3_i4_i5_i6_i7_i8
                                    end
                                end
                            end
                        end
                    end
                end
            end
        end
    end
    return A
end
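For reference, a minimal usage sketch (my own example, not from the package):

vs = ntuple(_ -> rand(2), 8)   # eight small input vectors
A = fauxc(vs...)
size(A)                        # (2, 2, 2, 2, 2, 2, 2, 2)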
Notes:
(1) if we go for this approach, it should be possible to generalize it with the Base.Cartesian technology (I never used it), see the docs;
(2) the output is a multi-dimensional array, but the code with kron returns a (very large) vector, and converting between the two takes time, especially in high dimensions.
Here is a general version that should do the same but works for an arbitrary number of vectors. However, it is a lot slower due to heavy allocations (which I do not fully understand). EDIT: If you allocate A outside and pass it to this function, the majority of the allocations is gone, but there are still a lot of them. The sorting is not implemented, but I expect it to further boost performance for vectors of different lengths.
function kron_nested(vectors::Vector{Vector{Float64}},
                     A=Array{Float64}(undef, ntuple(i -> length(vectors[i]), length(vectors))))
    n = length(vectors)
    @assert n > 1 "need at least two vectors"
    # sort by length, in descending order (i.e., smallest last)
    # vectors = sort(TODO)
    # precompute first multiplications: cache[i] holds the product of the
    # current elements of vectors[i+1], ..., vectors[n]
    cache = Vector{Float64}(undef, n - 1)
    cache[n-1] = vectors[n][1]
    i = n - 1
    while i > 1
        cache[i-1] = vectors[i][1] * cache[i]
        i -= 1
    end
    # allocate result (commented and moved to input argument)
    # A = Array{Float64}(undef, ntuple(i -> length(vectors[i]), n))
    # fill result
    indices = ones(Int, n)
    v = vectors[1]        # leftmost vector
    length_v = length(v)  # length of the leftmost vector
    lengths = [length(v) for v in vectors]
    while true
        # innermost loop where the result is written (only one multiplication missing)
        indices[1] = 1
        while indices[1] <= length_v
            A[indices...] = v[indices[1]] * cache[1]
            indices[1] += 1
        end
        # reset indices that reached their bounds
        i = 2
        while i <= n && indices[i] == lengths[i]
            indices[i] = 1
            i += 1
        end
        # termination criterion: the last index reached its bound
        if i == n + 1
            break
        end
        # increment the next index
        indices[i] += 1
        # update the cache for all indices that have changed
        if i == n
            # special case: the next index is the last index
            cache[i-1] = vectors[i][indices[i]]
        else
            # normal case
            cache[i-1] = vectors[i][indices[i]] * cache[i]
        end
        i -= 1
        while i > 1
            cache[i-1] = vectors[i][1] * cache[i]
            i -= 1
        end
    end
    return A
end
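As a quick correctness check (my own snippet; kron runs through the factors right-to-left, hence the reversed order):

using LinearAlgebra

vs = [rand(3), rand(4), rand(2)]
A = kron_nested(vs)
@assert vec(A) ≈ reduce(kron, reverse(vs))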
The generalization of fauxc from above can be done with Base.Cartesian. I failed to get it right, but @chriselrod gave the correct answer and a full explanation on Discourse. I copy the approach here; it uses the dimensions of the preallocated output array A to know how many loops should be generated:
using Base.Cartesian: @nloops, @nref

# v is a VectorOfArray (e.g. from RecursiveArrayTools.jl): v[d] is the d-th
# vector and v[i, d] its i-th entry
@generated function fauxc(A::AbstractArray{T,n}, v::VectorOfArray) where {T,n}
    quote
        cache = Vector{Float64}(undef, $n - 2)
        @inbounds for $(Symbol(:i_, n)) in eachindex(v[$n])
            @nloops($(n - 1), i, i -> eachindex(v[i]),
                    d -> (d == 1 ? nothing :
                          d == $n - 1 ? (cache[d-1] = v[i_{d}, d] * v[i_{d+1}, d+1]) :
                                        (cache[d-1] = cache[d] * v[i_{d}, d])),
                    begin
                        (@nref $n A i) = v[i_1, 1] * cache[1]
                    end)
        end
    end
end
The arguments to @nloops are N itersym rangeexpr preexpr bodyexpr; the anonymous function d -> d == 1 ... is the pre-expression that defines the cache, and the last part, begin (@nref $n A i) = v[i_1, 1] * cache[1] end, is just the content of the inner-most loop.
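To make that mapping concrete, here is a toy two-dimensional example I wrote (independent of the generated function above), with the range given as an anonymous function of the dimension and no pre-expression:

using Base.Cartesian: @nloops, @nref

v = [rand(2), rand(3)]
A = zeros(2, 3)
# expands to: for i_2 in eachindex(v[2]); for i_1 in eachindex(v[1]); A[i_1, i_2] = ...
@nloops 2 i d -> eachindex(v[d]) begin
    (@nref 2 A i) = v[1][i_1] * v[2][i_2]
end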
Great! (You did my job again Marcelo :P) To update this issue: the kron() computation is dropped until further notice. The function is kept as a reference performance baseline, and to verify the correctness of future methods (a method must pass the correctness checks before being merged into the master branch).
I want to compare with a single-loop implementation (using a conversion from a single index to a multi-index); I think this is the most naive implementation.
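A minimal sketch of such a single-loop variant, using CartesianIndices for the index conversion (names are mine):

function kron_singleloop(vectors::Vector{Vector{Float64}})
    A = Array{Float64}(undef, ntuple(i -> length(vectors[i]), length(vectors)))
    @inbounds for I in CartesianIndices(A)  # one loop over all multi-indices
        p = 1.0
        for d in 1:length(vectors)
            p *= vectors[d][I[d]]
        end
        A[I] = p
    end
    return A
end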
Next steps:
(1) Look at how to define workers for parallel computing in Julia. How can we parallelize the current _loop_tensor!() function while keeping the allocations as small as possible? (A minimal threaded sketch is given at the end of this comment.)
(2) The current implementation only handles multivariate monomials; how does it fare when dealing with polynomials (sums of multivariate monomials times a scalar coefficient)?
(3) I am curious about implementing the matrix methods from Sapo and the recent paper by J. Garloff and J. Titi (using the Pascal triangle; I will send him a mail to ask whether he has an implementation, and also about the paper on the method using FFTs).
I will open an issue for each of these sub-problems. Let's keep this issue for the general discussions.
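Regarding point (1), a minimal threaded sketch of the idea in dimension 3, assuming a fauxc-style layout (the actual _loop_tensor!() signature may differ):

function kron_threaded!(A::AbstractArray{Float64,3}, a, b, c)
    # parallelize the outermost (right-most index) loop; each thread
    # writes to disjoint slices of A, so no locking is needed
    Threads.@threads for i3 in eachindex(c)
        @inbounds for i2 in eachindex(b)
            aux = b[i2] * c[i3]
            for i1 in eachindex(a)
                A[i1, i2, i3] = a[i1] * aux
            end
        end
    end
    return A
end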
Thanks for the clear write-up.
While reading Smith's thesis, I noticed he pays special attention to the computation of rigorous error bounds using interval arithmetic. While we don't support this yet, it shouldn't be hard to set up here, extending our interface and using Interval and IntervalBox from the IntervalArithmetic package.
I also liked his approach for an API dealing with either the Bernstein tensor form or the implicit Bernstein form (see Appendix B towards the end of the thesis): they are fields of some bigger data type that contains the polynomial and the (box) domain. I think we should follow this idea, using a type parametric in a concrete subtype of an AbstractBernsteinForm.
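One possible shape for that design (hypothetical names, just to make the idea concrete):

abstract type AbstractBernsteinForm end
struct TensorForm <: AbstractBernsteinForm end
struct ImplicitForm <: AbstractBernsteinForm end

# a bigger type holding the polynomial and the (box) domain,
# parametric in the concrete form
struct BernsteinExpansion{F<:AbstractBernsteinForm,P,D}
    form::F
    polynomial::P
    domain::D
end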
I plan to send a PR in these directions that sets the ground for your point (2) in "Next steps" above. Points (1) and (3) are quite independent and can be done separately.
@mforets Yes. The implicit Bernstein form is important when one wants to compute a given subset of the Bernstein coefficients (for example when bounding a polynomial). The computation of the full set of Bernstein coefficients is necessary when one wants to provide a precise convex enclosure of the polynomial function.
To summarize:
Implicit form: efficient for computing a subset of the Bernstein coefficients, in particular in the context of bounding a polynomial (finding its minimum and maximum over a set). This is not implemented yet, and it requires a careful implementation (see A. P. Smith's thesis and the Fast Bound paper).
Full Bernstein expansion: exponential cost in the dimension, but necessary when one wants to compute a convex enclosure of a polynomial (as a polytope in vertex form), for example in the approximation of non-linear differential systems by a convex differential inclusion, to perform a non-linear lift, for 3D modelling, etc. This is what we are currently implementing. Note that the latter can also be used for bounding a polynomial, even though it is less efficient.
In my previous work, I only used the Bernstein expansion for the min/max application. However, from a library point of view, I think it is important to offer both possibilities, as one may need the full Bernstein expansion.
About the interval arithmetic: Bernstein coefficients of monomials are rational, so rational arithmetic is enough currently. Interval arithmetic is important to allow us to handle cases where the coefficients of the polynomial are not rational (e.g. sqrt(2), pi, etc.).
Example: a*x^2*y^2 + b*x^3*y. If a and b are rational, then rational arithmetic is enough. If a or b is not rational (for example irrational, or a parameter ranging in a set), then we can still compute the Bernstein expansions of x^2*y^2 and x^3*y independently using rational arithmetic, and then output that of a*x^2*y^2 + b*x^3*y symbolically (as done in Sapo using GiNaC) or approximate its solution using intervals as you propose. (I do not see the point of complex numbers currently, so I consider Bernstein expansions of polynomials over the reals.)
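A small sketch of the two arithmetics (using the linear-precision fact that the coefficients of x in the degree-n Bernstein basis over [0,1] are i/n; the rest is my own toy code, assuming IntervalArithmetic.jl):

# exact, rational coefficients of x in the degree-2 basis over [0,1]
bx = [0//1, 1//2, 1//1]

using IntervalArithmetic
a = sqrt(Interval(2.0))                          # rigorous enclosure of the irrational coefficient
coeffs = [a * Interval(Float64(b)) for b in bx]  # 0, 1/2 and 1 are exact in Float64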
Idea
Let M be a multivariate monomial. Given four variables x_1, x_2, x_3, x_4, one example would be M = (x_1^2)*(x_2^2)*(x_4). In this example, M is constituted by the product of 4 univariate monomials M_i: M_1 = x_1^2, M_2 = x_2^2, M_3 = x_3^0 = 1, M_4 = x_4.
Assuming B_i is the Bernstein expansion of M_i, one can obtain the Bernstein expansion B of M by computing the Kronecker products (noted here (x)) of its constituting univariate monomial Bernstein expansions B_i: B = ((B_1 (x) B_2) (x) B_3) (x) B_4.
Please note that the ordering of the operations is important, so commutativity does not hold.
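In code, the idea reads directly as a chain of kron calls. As a toy example (my own), expanding every variable to degree 2 over [0,1]: x^2 has Bernstein coefficients [0, 0, 1], x has [0, 1/2, 1] (linear precision), and the constant 1 has [1, 1, 1]:

using LinearAlgebra

B1 = [0.0, 0.0, 1.0]   # x_1^2
B2 = [0.0, 0.0, 1.0]   # x_2^2
B3 = [1.0, 1.0, 1.0]   # x_3^0 = 1
B4 = [0.0, 0.5, 1.0]   # x_4
B = kron(kron(kron(B1, B2), B3), B4)   # ((B_1 (x) B_2) (x) B_3) (x) B_4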
Current implementation
Points to improve
- The berncoeffs array is always re-allocated, copied, etc.
- kron using a GPU implementation? (e.g. the @cuda macro) Only when dealing with Float32?
- Parallel scheme: Worker 1: B12 = (B_1 (x) B_2); Worker 2: B34 = (B_3 (x) B_4); synchronize; Worker 1: B = (B12 (x) B34).
- Dictionaries can be used to keep the Bernstein expansions of often-used multivariate polynomials. Key = [M_degrees, expansion_degree] (in the example: M_degrees = 2,2,0,1), Data = B. (A small memoization sketch follows below.)
- Bernstein expansions of multivariate monomials are sparse when expansion_degree = M_degrees and on the [0,1]^n box; currently, we do not take advantage of this.