Open ghost opened 4 years ago
I can see that you have implemented a LinearSolver
in your code which seems like a standard LU solver with partial pivoting. May I ask what are the size of your LHS
matrices?
You can by-pass all that code and use one of Fastor's solvers if you want to get a bit more performance
// inversion based solve - very fast
Tensor<T,N> x = solve(A,b);
// or equivalent of what you are doing but faster using template recursion
Tensor<T,N> x = solve<SolveCompType::BlockLUPiv>(A,b);
// or exactly equivalent of what you are doing I guess
Tensor<T,N> x = solve<SolveCompType::SimpleLUPiv>(A,b);
You will gain significant speed up if your matrix sizes are less than 32x32
using inversion based or block LU decomposition. For matrices of bigger sizes you will still get the speed-up but given that Fastor implements most things using compile time recursion rather than for loops, the compilation time might blow-up quite quickly, specially with GCC that has nearly quadratic template instantiation cost. Clang should be fine.
These solvers are not heavily tested with Visual Studio but compiling with clang Fastor solvers are faster than Eigen, MKL and straight-forward for loop style implementations (such as yours) by quite a big margin. Here are the results from a comparison I did a while ago on my macbook with Julia 1.4
backslash solve and Fastor BlockLU
solve compiled with clang using -mavx2 -mfma -O3
if it is of any interest
Size | Julia | Julia StaticArrays | Fastor |
---|---|---|---|
8x8 | 942ns | 350ns | 95ns |
16x16 | 2.831 μs | 3.036 μs | 807ns |
32x32 | 25.851 μs | 25.517 μs | 4.11538 µs |
64x64 | 95.105 μs | 104.029 μs | 25.8607 µs |
This is quite exciting! It may take me bit to take a closer look, though.
Regarding your questions:
LinearSolver
basically does LU decomposition, so you should be able to replace it. I'm not using the one that ships with Julia to avoid allocations and the overhead when calling out to LAPACK which can be significant for small sizes.Obviously, ACME was never meant to be used for effect simulation "in production", e.g. by wrapping it (including the whole Julia runtime) in a plugin (though I do think that would be a fun project to do). It's mainly intended as a test bed for algorithm and model development. So I've optimized it to a point where waiting for results is not too much of a torture, but haven't tried to squeeze the last bit of performance out of it. So being able to speed it up by quite a bit when switching to C++ is not unexpected. What I had envisioned (but haven't come around to actualy doing so far) is to be able to "export" a model from Julia to C++. While this is trivial for all the matrices, it needs some thought on how to approach the functions defining the nonlinearities. No I'm not 100% clear: You have created the C++ code from the circuit description within C++, right? Or have you done it from Julia? Further I've always shied away from reimplementing the solvers in C++. Now with latter having been done by someone else...
Tangentially, what happened to @maxprod2016?
Who is @maxprod2016? :)
I perfectly understand the meaning of the project and it's objective. ACME is already a very good and solid project, my goal is not to criticize it. However, for my personal case (old computer), ACME is extremely slow not only for the running part, but the discretization part take several minutes which is clearly unmanageable for testing and research. As you known, I think, i'm autodidact and independant researcher.
This is why I undertook to develop a version in C ++ which is not only faster at the level of the exploitation of the discrete model, but which reduces the discretization to a handful of seconds in release mode.
Nevertheless, my C++ clone is not intended to replace your work, already because it was not written for that, and moreover it is not sufficiently well written, and very limited in deployment (C ++ 17 , VS2019 ...), and depends on a few auxiliary libraries which can constrain the ease of dissemination of the project.
And of course, the rough code to reproduce the Sparse Arrays library of Julia that is hacky to simulate the one-based indexing of Julia. For example, the blockdiag
code become :
template <typename... Tv, typename... Ti>
constexpr auto blockdiag(const SparseMatrix<Tv, Ti>& ...args)
-> SparseMatrix<std::common_type_t<Tv...>, std::common_type_t<Ti...>>
{
constexpr auto num = sizeof...(Tv);
auto mX = details::apply<size_t>([](auto x) { return size(x, 1); }, args...);
auto nX = details::apply<size_t>([](auto x) { return size(x, 2); }, args...);
auto m = details::sum(mX);
auto n = details::sum(nX);
typedef typename std::common_type<Tv...>::type Tnv;
typedef typename std::common_type<Ti...>::type Tni;
auto colptr = Array<Tni>(n+1);
auto nnzX = details::apply<size_t>([](auto x) { return nnz(x); }, args...);
auto nnz_res = details::sum(nnzX);
auto rowval = Array<Tni>(nnz_res);
auto nzval = Array<Tnv>(nnz_res);
auto nnz_sofar = Tni(0);
auto nX_sofar = Tni(0);
auto mX_sofar = Tni(0);
auto X = details::collect(args...);
for (auto i : range(1, num))
{
colptr.apply((regspace<Tni>(1, nX[i-1]+1) + nX_sofar), X[i-1].colptr + nnz_sofar);
rowval.apply((regspace<Tni>(1, nnzX[i-1]) + nnz_sofar), X[i-1].rowval + mX_sofar);
nzval.apply((regspace<Tni>(1, nnzX[i-1]) + nnz_sofar), X[i-1].nzval);
nnz_sofar += nnzX[i-1];
nX_sofar += nX[i-1];
mX_sofar += mX[i-1];
}
colptr(n+1) = nnz_sofar + 1;
return SparseMatrix<Tnv, Tni>(m, n, colptr, rowval, nzval);
}
You're right, the Fastor-based code is generated inside my C++ code but this is not so hard to potentialy include such idea inside the Julia ACME code. I've do that very quickly and this is not very well coded, but works for the proof of concept.
For example, for the Diode element, the cpp code include both of pre-generated encoded string of the non-linearity code, and the precomputed lambda function used in the armadillo-based internal process (slow one) :
Diode::Diode(double is, double η)
: Element(2, 1, 0, 2, 0, 0)
{
this->operator[](Matrice::mv)(0, 0) = 1;
this->operator[](Matrice::mi)(1, 0) = 1;
this->operator[](Matrice::mq)(0, 0) = -1;
this->operator[](Matrice::mq)(1, 1) = -1;
auto vt = 25e-3;
auto vt_x_n = vt * η;
auto oo_vt_x_n = 1.0 / vt_x_n;
auto iso_vt_x_n = is / vt_x_n;
std::string code;
code += " // Diode\n";
code += " {\n";
code += " auto ex = exp(q(%Q_IND_0%) * T(" + fmt::format("{}", oo_vt_x_n) + "));\n";
code += " res(%RES_ROW_OFFSET_0%) = T(" + fmt::format("{}", is) + ") * (ex - T(1.0)) - q(%Q_IND_1%);\n";
code += " Jq(%JQ_ROW_OFFSET_0%, %JQ_COL_OFFSET_0%) = T(" + fmt::format("{}", iso_vt_x_n) + ") * ex;\n";
code += " Jq(%JQ_ROW_OFFSET_0%, %JQ_COL_OFFSET_1%) = T(-1.0);\n";
code += " }\n";
Element::nonlinear_eq_code(code);
Element::nonlinear_eq([=](arma::vec const& q, arma::vec& res, arma::mat& Jq, arma::uword& ror, arma::uword& jor, arma::uword& joc)
{
auto v = q[0];
auto i = q[1];
auto ex = exp(v * oo_vt_x_n); // exp(v * (1.0 / (25e-3 * η)));
res(ror) = is * (ex - 1.0) - i;
ror += 1;
Jq(jor, joc) = iso_vt_x_n * ex; // is / (25e-3 * η) * ex;
Jq(jor, joc+1) = -1.0;
jor += 1;
joc += 2;
});
make_pins({ Pin{ "+", "-" } });
}
This is ofcourse not elegant, but works for the current proof of concept, I will optimize that soon to externalize the template-like strings inside a specific file. The indexing in the c++ lambda function nonlinear_eq
is added to avoid the diagonal concatenation that is for my code a performance bottleneck. Some variables are precomputed for optimization.
The string-template code is updated when the indexing is know using simple string replace, something like:
for (auto&& [i, q] : iter::enumerate(q_indices))
{
FAST::findAndReplaceAll(code, fmt::format("%Q_IND_{}%", i), std::to_string(q));
}
I still hope that this little study, because it is nothing other than that, is still interesting. I will update the code to replace the decomposition and activate the potentiometers for dynamic manipulation.
@romeric, I answer you as soon as I've update the code. For information, the LHS size is (for the current case), 7x7 without potentiometers, and 13x13 with activated potentiometers.
I perfectly understand the meaning of the project and it's objective. ACME is already a very good and solid project, my goal is not to criticize it.
No worries, I didn't feel criticized in any way, just wanted to set this straight for others following along. (I would have doubted there are any, but @romeric proofed me wrong...)
However, for my personal case (old computer), ACME is extremely slow not only for the running part, but the discretization part take several minutes which is clearly unmanageable for testing and research.
Ah, that's bad. Is that only the first time in a fresh Julia session (when a lot of internal code gets compiled), or every time? And is that for the SuperOver example or for a more complicated circuit? But maybe that's worth it's own issue to not derail this one.
For example, for the Diode element, the cpp code include both of pre-generated encoded string of the non-linearity code, and the precomputed lambda function used in the armadillo-based internal process (slow one)
Yes, I had considered doing something similar in the Julia project (Julia code + stringified C++ code template) but the introspection functionality of Julia might allow synthesizing C++ code from the Julia code, at least in not-too-complicated cases, i.e. only calls to a limited set of functions ( exp
, ...). That would avoid the code duplication and thereby enforce equivalence of the Julia and the generated C++ code.
No worries, I didn't feel criticized in any way, just wanted to set this straight for others following along. (I would have doubted there are any, but @romeric proofed me wrong...)
You're right. @romeric (Roman) is the creator of the Fastor library, he helped me with the introduction of some useful functions for this project. https://github.com/romeric/Fastor/issues/90
Ah, that's bad. Is that only the first time in a fresh Julia session (when a lot of internal code gets compiled), or every time? And is that for the SuperOver example or for a more complicated circuit? But maybe that's worth it's own issue to not derail this one.
Ah ok, my fault (pure Julia newbie), I run script from windows console, not in Julia instance, so I think each time I run my script, the code is compiled.
After a quick test (Julia 1.4.2 64 bits) on SuperOver example with no other process on the computer, I get 61.01500 seconds overall timing for the discretization (without decomposition), 64.106999 (with decomposition) :
start = time()
model=superover(drive=1.0, tone=1.0, level=1.0)
elapsed = time() - start
When calling timing on function, I get 39.420185 seconds (without decomposition), 41.874466 seconds (with decomposition)
@time model=superover(drive=1.0, tone=1.0, level=1.0)
In all case, the same test in C++ is reduce for me to 0.374724 seconds (without decomposition), 0.509646 (with decomposition). I think my code is maybe not so bad :)
I've update the code after removing the temporary isfinite helpers (thank's @romeric) and the LinearSolver
instances in favor of Fastor's solver.
Unfortunatly, the solver become unstable and go to NaN after 7189 processed samples. Maybe i've do something wrong by replacing the last_linsolver
by a last_J
tensor. I've tested with SimpleLUPiv
& BlockLUPiv
, same result.
That is a bit strange since SimpleLUPiv
is essentially identical to what you have implemented. If you think this is related to Fastor and can narrow it down to a minimal working example, you can submit an issue on Fastor. Would be interesting to know why you are observing this discrepency.
By the way it seems like you are doing LU decomposition twice once for solving and once for determinant on nleq->J
. I would just decompose once and get the factors and work on them directly
Tensor<T,N,N> L, U;
Tensor<size_t,N> p;
lu<LUCompType::SimpleLUPiv>(nleq->J, L, U, p);
Tensor<T,N> y = forward_subs(L, b);
Tensor<T,N> x = backward_subs(U, y); // x is your solution
T _det = product(diag(U)); // _det is your determinant
Oh, thank you Roman. So, the best way to get both decomposition and determinant is to reintroduce the LinearSolver
class design and split the process like that:
template <typename T, size_t N>
class LinearSolver
{
public:
inline bool setlhs(Tensor<T,N,N> const& J)
{
Fastor::lu<LUCompType::SimpleLUPiv>(J, L, U, p);
return product(diag(U)) != 0;
}
inline Tensor<T,N>& solve(Tensor<T,N>& x, Tensor<T,N> const& b)
{
if (all_of(x != b)) x = b;
x = Fastor::internal::backward_subs(U,
Fastor::internal:: forward_subs(L, x));
return x;
}
private:
Tensor<T,N,N> L, U;
Tensor<size_t,N> p;
};
By the way, that allow me to keep the same design for the SimpleSolver
where the solve method allow to swap the rhs
when difference occur : if (all_of(x != b)) x = b;
.
Unfortunatly, that don't change the problem. I think the method is slightly different and let me see you the different result of the setlhs
method :
Julia ACME LinearSolver (one-based in pivot indices)
┌ Info:
│ A =
│ 7×7 Array{Float64,2}:
│ -1.93782e-10 -3.27309e-10 0.0 -0.000110635 -4.81249e-11 0.0 0.0
│ 1.89659e-10 3.20345e-10 0.0 0.000110635 9.97722e-5 0.0 0.0
│ -1.13501e-6 3.49486e-12 0.0 0.000203447 5.48017e-13 0.0 -1.0
│ 5.54995e-8 -1.70891e-13 0.0 -5.22136e-8 -2.67969e-14 -5.2211e-8 -1.0
│ 0.0 0.0 0.0 0.0 0.0 9.22111e-8 -1.0
│ 5.98052e-18 3.08622e-12 -0.000111011 1.58109e-16 1.58398e-18 0.0 0.0
└ -3.61183e-6 4.03106e-5 0.000111011 1.37001e-9 1.3725e-11 0.0 0.0
┌ Info:
│ factors =
│ 7×7 Array{Float64,2}:
│ -276868.0 4.03106e-5 0.000111011 1.37001e-9 1.3725e-11 0.0 0.0
│ 0.314247 -78942.3 -3.4885e-5 0.000203446 -3.76504e-12 0.0 -1.0
│ -1.65582e-12 -2.43639e-7 -9008.09 4.95675e-11 6.6669e-19 0.0 -2.43639e-7
│ 5.3652e-5 0.00019657 -8.11971e-6 -9035.46 -4.81249e-11 0.0 0.00019657
│ -5.25105e-5 -0.000192388 7.94695e-6 -0.999991 10022.8 0.0 4.18064e-6
│ -0.0 -0.0 -0.0 -0.0 0.0 1.08447e7 -1.0
└ -0.015366 -0.0488979 -3.81507e-18 -0.089414 -4.31286e-8 -0.566211 -0.61916
┌ Info:
│ solver.ipiv =
│ 7-element Array{Int32,1}:
│ 7
│ 3
│ 6
│ 7
│ 6
│ 6
└ 7
C++ ACME LinearSolver (zero-based in pivot indices)
A
-1.9378e-10 -3.2731e-10 0 -1.1064e-04 -4.8125e-11 0 0
1.8966e-10 3.2035e-10 0 1.1063e-04 9.9772e-05 0 0
-1.1350e-06 3.4949e-12 0 2.0345e-04 5.4802e-13 0 -1.0000e+00
5.5500e-08 -1.7089e-13 0 -5.2214e-08 -2.6797e-14 -5.2211e-08 -1.0000e+00
0 0 0 0 0 9.2211e-08 -1.0000e+00
5.9805e-18 3.0862e-12 -1.1101e-04 1.5811e-16 1.5840e-18 0 0
-3.6118e-06 4.0311e-05 1.1101e-04 1.3700e-09 1.3725e-11 0 0
factors
-2.7687e+05 4.0311e-05 1.1101e-04 1.3700e-09 1.3725e-11 0 0
3.1425e-01 -7.8942e+04 -3.4885e-05 2.0345e-04 -3.7650e-12 0 -1.0000e+00
-1.6558e-12 -2.4364e-07 -9.0081e+03 4.9568e-11 6.6669e-19 0 -2.4364e-07
5.3652e-05 1.9657e-04 -8.1197e-06 -9.0355e+03 -4.8125e-11 0 1.9657e-04
-5.2510e-05 -1.9239e-04 7.9469e-06 -9.9999e-01 1.0023e+04 0 4.1806e-06
0 0 0 0 0 1.0845e+07 -1.0000e+00
-1.5366e-02 -4.8898e-02 -3.8151e-18 -8.9414e-02 -4.3129e-08 -5.6621e-01 -6.1916e-01
ipiv
6
2
5
6
5
5
6
C++ Fastor LU Solver (SimpleLUPiv)
A
[-1.93781844857805e-10, -3.27309180513646e-10, 0, -0.000110635007888986, -4.81248665931391e-11, 0, 0]
[ 1.89658827001645e-10, 3.20345155598241e-10, 0, 0.000110634898880709, 9.97722159650126e-05, 0, 0]
[-1.13500832994181e-06, 3.49485842178722e-12, 0, 0.000203446518193683, 5.48017351963061e-13, 0, -1]
[ 5.54995375369355e-08, -1.7089127986946e-13, 0, -5.22136409305715e-08, -2.67969042991298e-14, -5.22109661173598e-08, -1]
[ 0, 0, 0, 0, 0, 9.22110678118406e-08, -1]
[ 5.98052186065483e-18, 3.08622390873733e-12, -0.000111011340985907, 1.58109417995967e-16, 1.58397712569323e-18, 0, 0]
[-3.61182967708414e-06, 4.03105545826988e-05, 0.000111011340979684, 1.37000530292441e-09, 1.37250335205588e-11, 0, 0]
L
[ 1, 0, 0, 0, 0, 0, 0]
[ -65.0785544776915, 1, 0, 0, 0, 0, 0]
[ 0.00341730463745621, 7.94694668140661e-06, 1, 0, 0, 0, 0]
[ -20.4507709489733, 1.00196313121033e-23, 1.26081521794353e-18, 1, 0, 0, 0]
[ -0.00349159386650459, -8.11970638675431e-06, -1.0217391297908, 0.0118836483836328, 1, 0, 0]
[ 1.07758048554454e-10, 7.65612081542758e-08, 125834.501816522, -68790.4044028179, -123157.230208897, 1, 0]
[ 0, 0, -0, 0, 0, -1.28223381615304e-06, 1]
U
[ 5.54995375369355e-08, -1.7089127986946e-13, 0, -5.22136409305715e-08, -2.67969042991298e-14, -5.22109661173598e-08, -1]
[ 0, 4.03105434613414e-05, 0.000111011340979684, -3.3966182704759e-06, 1.19811297242944e-11, -3.39781420280151e-06, -65.0785544776915]
[ 0, 0, -8.82201207796999e-10, 0.00011063510430337, 9.97722159650089e-05, 2.05423024941912e-10, 0.00393448043999344]
[ 0, 0, 0, 0.0002023787089826, -1.25794328216752e-22, -1.06775450909073e-06, -21.4507709489733]
[ 0, 0, 0, 0, 0.000101941128992526, 1.26888191460727e-08, 0.254913419515445]
[ 0, 0, 0, 0, 0, -0.0719143939663768, -1444707.87102258]
[ 0, 0, 0, 0, 0, 0, -2.85245328668762]
p
[3]
[6]
[1]
[2]
[0]
[5]
[4]
Note for @martinholters (if you read that) : This is the first call to setlhs
during the process of ModelRunner, so after the initial solution computation. As you can see, the Julia's matrix A is slightly different in element orders than the original ACME code. It's because, for obscur C++ reasons and code simplification, I change the container of the pins inside the struct Element
from Dict
to OrderedDict
(pins :: OrderedDict{Symbol, Vector{Tuple{Int, Int}}}
). All the subsequents matrices and vectors are reordered so, no change in the process and result, but help me to debug my code easily !
The updated full code is always here : https://codeshare.io/5QEE4q
So @romeric, I think the result difference occur in the method employed (pivot indices difference) in the LU factorization. I'm not able to go in deep in this problem, my science background is limited. But i'm sure that, because the intensive calls to the linear solver during the process, the gain of performance can be greatly increase using SIMD vectorization on small tensors (7x7, 13x 13 ... up to ?)
In any case, thank you very much for your help.
From a superficial glance: Shouldn't solve
(in the SimpleLUPiv
-based implementation) also apply the permutation p
somewhere?
@martinholters You're right.
inline Tensor<T,N>& solve(Tensor<T,N>& x, Tensor<T,N> const& b)
{
if (all_of(x != b)) x = b;
x = Fastor::internal::backward_subs(U,
Fastor::internal:: forward_subs(L, p, x));
return x;
}
Anyway, the problem stay the same than the initial implementation. This implementation allow to solve the double call. I'm currently investigate the question, maybe (because the Julia implementation works on single factors matrix), the idea will be to keep your initial code and see with @romeric what is the best way to optimize it.
My bad for giving incorrect instructions. Copy pasted the code in a haste. You do need to apply the pivot. Specifically in the forward substitution step:
Tensor<T,N> y = internal::forward_subs(L, p, b);
Also, I hope I am not daydreaming here, but a pivot is nothing but a permutation vector of rows (or columns) and you shouldn't get duplicated entries in a pivot vector (in your own case you do which is a bit odd).
Different permutations of rows however (different pivots) does not impact the final result and in fact Julia, NumPy and Fastor will all give you different pivots while giving you the same final solution vector.
Here is an example of solving your matrix A
with a right hand side vector of all ones in Julia and Fastor
Tensor<double,7,7> L, U;
Tensor<size_t, 7> p;
lu<LUCompType::SimpleLUPiv>(A, L, U, p);
Tensor<double,7> b; b.ones();
Tensor<double,7> y = internal::forward_subs(L, p, b);
Tensor<double,7> x = internal::backward_subs(U, y);
print(x);
[-1570845.87851703]
[-91132.8413880232]
[-9008.09117721191]
[-9035.71802215123]
[ 20045.5893366909]
[-600389.114233065]
[-1.05536252132604]
Julia F = lu(A); F.U\(F.L\b[F.p])
7-element Array{Float64,1}:
-1.57084587851703e6
-91132.84138897745
-9008.091176865402
-9035.718022151223
20045.58933669087
-600389.1142330621
-1.0553625213260358
While Fastor pivot vector
[3]
[6]
[1]
[2]
[0]
[5]
[4]
and Julia pivot vector
7-element Array{Int64,1}:
7
3
6
1
2
5
4
are different. So maybe your LU decomposition does not do a conventional LU or does not do what it is intended to do.
Same observation by using the Julia standard lu! function
┌ Info: A
│ A =
│ 7×7 Array{Float64,2}:
│ -1.93782e-10 -3.27309e-10 0.0 -0.000110635 -4.81249e-11 0.0 0.0
│ 1.89659e-10 3.20345e-10 0.0 0.000110635 9.97722e-5 0.0 0.0
│ -1.13501e-6 3.49486e-12 0.0 0.000203447 5.48017e-13 0.0 -1.0
│ 5.54995e-8 -1.70891e-13 0.0 -5.22136e-8 -2.67969e-14 -5.2211e-8 -1.0
│ 0.0 0.0 0.0 0.0 0.0 9.22111e-8 -1.0
│ 5.98052e-18 3.08622e-12 -0.000111011 1.58109e-16 1.58398e-18 0.0 0.0
└ -3.61183e-6 4.03106e-5 0.000111011 1.37001e-9 1.3725e-11 0.0 0.0
┌ Info: L
│ l =
│ 7×7 Array{Float64,2}:
│ 1.0 0.0 0.0 0.0 0.0 0.0 0.0
│ 0.314247 1.0 0.0 0.0 0.0 0.0 0.0
│ -1.65582e-12 -2.43639e-7 1.0 0.0 0.0 0.0 0.0
│ 5.3652e-5 0.00019657 -8.11971e-6 1.0 0.0 0.0 0.0
│ -5.25105e-5 -0.000192388 7.94695e-6 -0.999991 1.0 0.0 0.0
│ -0.0 -0.0 -0.0 -0.0 0.0 1.0 0.0
└ -0.015366 -0.0488979 -1.08603e-18 -0.089414 -4.31286e-8 -0.566211 1.0
┌ Info: U
│ u =
│ 7×7 Array{Float64,2}:
│ -3.61183e-6 4.03106e-5 0.000111011 1.37001e-9 1.3725e-11 0.0 0.0
│ 0.0 -1.26675e-5 -3.4885e-5 0.000203446 -3.76504e-12 0.0 -1.0
│ 0.0 0.0 -0.000111011 4.95675e-11 6.6669e-19 0.0 -2.43639e-7
│ 0.0 0.0 0.0 -0.000110675 -4.81249e-11 0.0 0.00019657
│ 0.0 0.0 0.0 0.0 9.97722e-5 0.0 4.18064e-6
│ 0.0 0.0 0.0 0.0 0.0 9.22111e-8 -1.0
└ 0.0 0.0 0.0 0.0 0.0 0.0 -1.61509
┌ Info: p
│ p =
│ 7-element Array{Int32,1}:
│ 7
│ 3
│ 6
│ 1
│ 2
│ 5
└ 4
I take a look to the difference in implementation right now
Check your final solution vector to see if you are actually getting the same results
OK. I think the solution is in @martinholters code comment:
based on Julia's generic_lufact!, but storing inverses on the diagonal; faster than calling out to dgetrf for sizes up to about 60×60
The only difference is here:
# Scale first column
fkkinv = factors[k,k] = inv(factors[k,k])
while initial implementation is:
# Scale first column
Akkinv = inv(A[k,k])
The diagonal become reciprocal of is value.
Also, I hope I am not daydreaming here, but a pivot is nothing but a permutation vector of rows (or columns) and you shouldn't get duplicated entries in a pivot vector (in your own case you do which is a bit odd).
I guess you're referring to ipiv
. That's not a permutation vector as such. IIRC, it encodes the permutation as "for i from 1 to N, exchange entries (rows) i and ipiv[i]". Here, duplicate entries do make sense.
@martinholters it's me, i'm confuse. I talk about pivot indices but the variable name induce me in error. Sorry.
OK. I think the solution is in @martinholters code comment:
based on Julia's generic_lufact!, but storing inverses on the diagonal; faster than calling out to dgetrf for sizes up to about 60×60
Hah, right! Sometimes I actually do leave helpful comments :smile: Yes, the idea here is to avoid repeated reciprocal/division when solving for multiple RHSs.
Ah cool 😄 ! Now the big question is : is the change "conventional" in sense of @romeric question?
It's not conventional in how the result is stored: Instead of storing L (except for the implicit ones on the diagonal) and U, it stores U with the diagonal elements inverted. Of course, this is dealt with in the back-substitution by replacing a division in the conventional implementation with a multiplication. As long as the LU decomposition and the solving part agree on the storage format, that shouldn't make much difference, though. Likewise, different pivoting should only result in different numerical round-off noise which I don't expect to be problematic here.
OK. Thank's for the clarification Martin. So, this is not a possible method to be implemented inside Fastor, Therefore, there's two solutions. First one, to duplicate the current Fastor solver and modify it as the indication you provide. Another solution is to keep the current julia algorithm and attempt to adapt the code with the help of Fastor optimization. I need the @romeric expertize to know what to do, in particular on the potential gain because it would seem to be substantial.
As long as the LU decomposition and the solving part agree on the storage format, that shouldn't make much difference, though. Likewise, different pivoting should only result in different numerical round-off noise which I don't expect to be problematic here.
@dorpxam as long as your implementation is supposed to do a dgetrf
based solve, the observable behaviour should be the same. Of course numerical round-off error will acrue and make two seemingly identical implementations diverge especially if you have an iterative (nonlinear) solver implemented on top of that, which you do, but not to the point where you'd get completely different results. In your specific case you are working with matrices that have a pretty high condition number as well.
If I were you I would stick with your own (Julia or Julia style) implementation and keep things at a high level using Fastor's tensors. Given that your matrix sizes are small and compile time constants the compiler will do a pretty good job optimising them. Fastor's internal solver routines (like most optimised BLAS routines) get pretty low-level to squeeze that last bit of performance. But that is going to be futile exercise for you here.
Also don't directly compare the result of a determinant product(diag(U))
to zero but check if it is close enough (you define what is an acceptable "close-enough" for your application).
Hi Martin,
First of all, thank you very much for all your scientific works, and especially for ACME that is a great project.
Since a moment, I attempt to port ACME to C++ using Armadillo for Linear Algebra. Unfortunatly, the Sparse Matrix of Armadillo do not cover the potential of Julia's Sparse Arrays standard library, in this case the maintain of structural zero. So i've write a C++ version of the Julia Sparse Arrays library.
After that, I've follow the ACME code design using latest standards of C++, heavy usage of meta programming to keep in place with the flexibility of the Julia code. The discretization of the circuit is, for my own configuration more faster in C++ than the Julia code. I mean, computation of incidence and topology matrices, the non-linear decomposition and so on. But of course, no possibility to reproduce the symbolic trick for the build of the circuit.
For example, the build of the SuperOver circuit become:
For a build and design simplification, the DiscreteModel class follow the PIMPL idiom and hide the Model Runner :
Now, I can run the circuit using a 30 seconds unprocessed guitar excerpt. No problem, I get the same result than Julia ACME process ... but in 57 seconds where the excerpt take 28 seconds with the Julia code !!! What ?!?!
My fault, by following the ACME design using lambda function (for genericity) make it unusable in such of process. Optimization with Intel MKL and cblas call do not have effect. Poor performance.
So after reading this : https://medium.com/@romanpoya/a-look-at-the-performance-of-expression-templates-in-c-eigen-vs-blaze-vs-fastor-vs-armadillo-vs-2474ed38d982 , I take a look to the famous Fastor library.
After some code change, I'm able to produce on-the-fly a static header only file that is a freezed version of the DiscreteModel of a circuit using Fastor at backend for the matrices and linear algebra. No dependencies except Fastor that is header only too. There's some changes on the design to allow some generic code, for example LinearSolver, SimpleSolver, HomotopySolver become generic (template based), ParametricNonLinEq become abstract and template base to allow multiple solvers instance (same as ACME core solvers module).
So, in a subnamespace details, you will find reusable code :
The others classes are produced programmaticaly including compile time static versions of the precomputed matrices as Fastor's Tensor. Because the possibility of multiple solvers for the non-linear parts, the code produce overrided classes of ParametricNonLinEq base class and for the SuperOver with fixed potentiometer values case, only one :
Ofcourse, this is not yet fully optimized, but this class can be used by the generic Solver classes and be runned with a class that is a mixed version of the DiscreteModel and ModelRunner of ACME :
As you can see, for the moment the SuperOver class can be instanciated with a compile-time chunck size for the I/O, fixed by default at 1024 columns. A static method return the sampling rate used for the discretization of the circuit.
SuperOver-Fastor The full code (single file header-only, no dependencies except Fastor) is available here for the proof of concept : https://codeshare.io/5QEE4q
Now this is the report for the performance with my configuration, all the tests are running 10 times on a 30 seconds audio sample file at 44100 Hz with the SuperOver circuit discretized with the 3 fixed potentiometers at full value (1.0) :
Julia 32bits and 64 bits
min : 23.407315 seconds max : 29.414382 seconds mean : 25.026536 seconds
21.512428 seconds (430.89 k allocations: 21.301 MiB) 21.094247 seconds (430.88 k allocations: 21.295 MiB) 20.971705 seconds (430.88 k allocations: 21.295 MiB) 21.836396 seconds (430.88 k allocations: 21.295 MiB) 21.668114 seconds (430.88 k allocations: 21.295 MiB) 20.907806 seconds (430.88 k allocations: 21.295 MiB) 20.982056 seconds (430.88 k allocations: 21.295 MiB) 21.016102 seconds (430.88 k allocations: 21.295 MiB) 20.938037 seconds (430.88 k allocations: 21.295 MiB) 21.450474 seconds (430.88 k allocations: 21.295 MiB)
min : 20.907806 seconds max : 21.836396 seconds mean : 21.237736 seconds
17.2277 seconds 17.4744 seconds 17.4055 seconds 17.3357 seconds 17.2751 seconds 17.3368 seconds 17.4383 seconds 17.3296 seconds 17.3059 seconds 17.2541 seconds
min : 17.2277 seconds max : 17.4744 seconds mean : 17.3383 seconds
13.9769 seconds 13.9896 seconds 14.1125 seconds 13.9835 seconds 13.9530 seconds 14.7414 seconds 14.2672 seconds 14.1663 seconds 14.7137 seconds 14.3044 seconds
min : 13.9769 seconds max : 14.7414 seconds mean : 14.2208 seconds
9.24264 seconds 9.26479 seconds 9.35169 seconds 9.35638 seconds 9.51796 seconds 9.44793 seconds 9.34992 seconds 9.52753 seconds 9.87442 seconds 9.74665 seconds
min : 9.24264 seconds max : 9.87442 seconds mean : 9.46799 seconds
8.62546 seconds 8.29256 seconds 8.29380 seconds 8.30038 seconds 8.30197 seconds 8.32373 seconds 8.51325 seconds 8.30997 seconds 8.41125 seconds 8.31047 seconds
min : 8.29380 seconds max : 8.62546 seconds mean : 8.36828 seconds