HSU-ANT / ACME.jl

ACME.jl - Analog Circuit Modeling and Emulation for Julia

Proof of concept - C++ backend using Fastor Library #28

Open ghost opened 4 years ago

ghost commented 4 years ago

Hi Martin,

First of all, thank you very much for all your scientific work, and especially for ACME, which is a great project.

For a while now, I have been attempting to port ACME to C++ using Armadillo for linear algebra. Unfortunately, Armadillo's sparse matrices do not cover the full capabilities of Julia's SparseArrays standard library, in particular the preservation of structural zeros. So I have written a C++ version of the Julia SparseArrays library.

After that, I followed the ACME code design using the latest C++ standards, with heavy use of metaprogramming to keep up with the flexibility of the Julia code. The discretization of the circuit (computation of the incidence and topology matrices, the non-linear decomposition and so on) is, on my configuration, faster in C++ than in the Julia code. But of course, there is no way to reproduce the symbolic trick used to build the circuit.

For example, the construction of the SuperOver circuit becomes:

static Circuit superover(std::optional<double> drive = std::nullopt, 
                         std::optional<double> tone  = std::nullopt, 
                         std::optional<double> level = std::nullopt,
                         bool sym = false)
{
    Circuit circ;

    // power supply
    circ.add("j3", VoltageSource2(9));
    circ.connect({ Pin{ "j3", "+" }, "vcc" });
    circ.connect({ Pin{ "j3", "-" }, "gnd" });
    circ.add("d4", Diode(12e-9, 2));
    circ.connect({ Pin{ "d4", "-" }, "vcc" });
    circ.connect({ Pin{ "d4", "+" }, "gnd" });
    circ.add("c11", Capacitor(100e-6));
    circ.connect({ Pin{ "c11", "1" }, "vcc" });
    circ.connect({ Pin{ "c11", "2" }, "gnd" });
    circ.add("r17", Resistor(33e3));
    circ.connect({ Pin{ "r17", "1" }, "vcc" });
    circ.connect({ Pin{ "r17", "2" }, "vb" });
    circ.add("r18", Resistor(33e3));
    circ.connect({ Pin{ "r18", "1" }, "vb" });
    circ.connect({ Pin{ "r18", "2" }, "gnd" });
    circ.add("c12", Capacitor(47e-6));
    circ.connect({ Pin{ "c12", "1" }, "vb" });
    circ.connect({ Pin{ "c12", "2" }, "gnd" });

    // input stage
    circ.add("j1", VoltageSource1());
    circ.connect({ Pin{ "j1", "-" }, "gnd" });
    circ.add("r1", Resistor(2.2e6));
    circ.connect({ Pin{ "r1", "1" }, Pin{ "j1", "+" } });
    circ.connect({ Pin{ "r1", "2" }, "gnd" });

    [...]

    circ.add("r16", Resistor(100e3));
    circ.connect({ Pin{ "r16", "1" }, Pin{ "c10", "2" } });
    circ.connect({ Pin{ "r16", "2" }, "gnd" });
    circ.add("j2", VoltageProbe());
    circ.connect({ Pin{ "j2", "+" }, Pin{ "c10", "2" } });
    circ.connect({ Pin{ "j2", "-" }, "gnd" });

    if (sym)
        circ.connect({ Pin{ "d3", "-" }, Pin{ "d3", "+" } });

    return circ;
}

To simplify the build and the design, the DiscreteModel class follows the PIMPL idiom and hides the ModelRunner:

#pragma once

#include "Circuit.h"

namespace ACME {

class DiscreteModel
{
public:
    DiscreteModel(Circuit& circ, int fs, bool decompose_nonlinearity = true);

    DiscreteModel(DiscreteModel&&) noexcept;
    DiscreteModel(DiscreteModel const&);
    ~DiscreteModel();

    DiscreteModel& operator=(DiscreteModel&&) noexcept;
    DiscreteModel& operator=(DiscreteModel const&);

    arma::mat run(arma::mat& u, bool showprogress = true);

private:
    struct impl;
    std::unique_ptr<impl> pimpl;
    friend class ModelRunner;
};

} // namespace ACME
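
For reference, the hidden side of the PIMPL idiom lives in the translation unit; a minimal sketch of what it could look like (member names and contents are assumptions, not the actual proof-of-concept code):

// DiscreteModel.cpp -- hypothetical sketch of the implementation side of the
// PIMPL idiom shown above; the real code certainly differs.
#include "DiscreteModel.h"

namespace ACME {

struct DiscreteModel::impl
{
    // precomputed state-space matrices, solver state, model-runner data, ...
    arma::mat a, b, c, x0;
};

DiscreteModel::DiscreteModel(Circuit& circ, int fs, bool decompose_nonlinearity)
    : pimpl(std::make_unique<impl>())
{
    // discretize circ at sample rate fs and fill pimpl->... here
}

// special members are defined here, where impl is a complete type
DiscreteModel::~DiscreteModel() = default;
DiscreteModel::DiscreteModel(DiscreteModel&&) noexcept = default;
DiscreteModel& DiscreteModel::operator=(DiscreteModel&&) noexcept = default;

} // namespace ACME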

Now I can run the circuit on a 30-second unprocessed guitar excerpt. No problem, I get the same result as the Julia ACME process ... but in 57 seconds, where the excerpt takes 28 seconds with the Julia code!!! What?!?!

Note that the poor absolute performance is due to my computer configuration, an old Dell workstation with an old 4-core Intel Xeon CPU at 2.33 GHz.

My fault: following the ACME design with lambda functions (for genericity) makes it unusable in such a process. Optimization with Intel MKL and cblas calls had no effect. Poor performance.
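
To illustrate what I mean (a hypothetical reduction, not the actual port code): calling the per-sample nonlinearity through a type-erased std::function prevents inlining, while a template parameter lets the compiler inline it into the inner loop:

// hypothetical illustration of dynamic vs. static dispatch of the nonlinearity
#include <cmath>
#include <functional>

struct DiodeLaw
{
    double operator()(double v) const { return 4e-9 * (std::exp(20.0 * v) - 1.0); }
};

// one indirect call per sample -- what the lambda-based design ends up doing
double step_erased(const std::function<double(double)>& nl, double v) { return nl(v); }

// statically dispatched -- the functor is typically inlined into the loop
template <typename NL>
double step_static(const NL& nl, double v) { return nl(v); }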

So after reading this: https://medium.com/@romanpoya/a-look-at-the-performance-of-expression-templates-in-c-eigen-vs-blaze-vs-fastor-vs-armadillo-vs-2474ed38d982 , I took a look at the famous Fastor library.

After some code changes, I'm able to produce on the fly a static, header-only file that is a frozen version of the DiscreteModel of a circuit, using Fastor as the backend for the matrices and linear algebra. There are no dependencies except Fastor, which is header-only too. There are some design changes to allow for generic code: for example, LinearSolver, SimpleSolver and HomotopySolver become generic (template based), and ParametricNonLinEq becomes an abstract template base class to allow multiple solver instances (same as the ACME core solvers module).

So, in a details subnamespace, you will find the reusable code:

template <typename T, size_t N>
class LinearSolver;

template <typename T, size_t NN, size_t NP, size_t NQ>
class SimpleSolver;

template <typename T, size_t NN, size_t NP, size_t NQ, typename BaseSolver>
class HomotopySolver : public BaseSolver;

template <typename T, size_t NN, size_t NP, size_t NQ>
class ParametricNonLinEq;
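
To give an idea of the intent (this is only a rough sketch with assumed behaviour, not the actual generated code), the homotopy wrapper retries the inner solver along the segment from the last converged parameter vector when a direct solve fails:

// rough sketch of a homotopy wrapper around a Newton-type base solver
#include <cstddef>
#include <Fastor/Fastor.h>

template <typename T, size_t NN, size_t NP, size_t NQ, typename BaseSolver>
class HomotopySolver : public BaseSolver
{
public:
    using BaseSolver::BaseSolver;

    Fastor::Tensor<T,NN> solve(const Fastor::Tensor<T,NP>& p)
    {
        Fastor::Tensor<T,NN> z = BaseSolver::solve(p);
        if (!BaseSolver::hasconverged() && have_last)
        {
            // retry along the segment from the last converged p, halving the step
            for (T a = T(0.5); a >= T(0.0625) && !BaseSolver::hasconverged(); a *= T(0.5))
            {
                bool ok = true;
                for (T t = a; t <= T(1) && ok; t += a)
                {
                    Fastor::Tensor<T,NP> pt = last_p + t * (p - last_p);
                    z = BaseSolver::solve(pt);
                    ok = BaseSolver::hasconverged();
                }
            }
        }
        if (BaseSolver::hasconverged()) { last_p = p; have_last = true; }
        return z; // caller checks hasconverged()
    }

private:
    Fastor::Tensor<T,NP> last_p;
    bool have_last = false;
};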

The other classes are produced programmatically, including compile-time static versions of the precomputed matrices as Fastor Tensors. Because of the possibility of multiple solvers for the non-linear parts, the code produces classes derived from the ParametricNonLinEq base class; for the SuperOver with fixed potentiometer values, there is only one:

template <typename T>
class ParametricNonLinEq1 : public details::ParametricNonLinEq<T,7,5,14>
{
public:
    ParametricNonLinEq1()
        : details::ParametricNonLinEq<T,7,5,14>()
    {}

    using details::ParametricNonLinEq<T,7,5,14>::pfull;
    using details::ParametricNonLinEq<T,7,5,14>::res;
    using details::ParametricNonLinEq<T,7,5,14>::Jp;
    using details::ParametricNonLinEq<T,7,5,14>::Jq;
    using details::ParametricNonLinEq<T,7,5,14>::J;

    void set_p(Tensor<T,5> const& p) override
    {
        pfull = pexp % p + q0;
    }

    void calc_Jp() override
    {
        Jp = Jq % pexp;
    }

    void evaluate(Tensor<T,7> const& z) override
    {
        auto q = fq % z + pfull;
        circuit_nl_functions(q, res, Jq);
        J = Jq % fq;
    }

private:
    Tensor<T,14,5> pexp
    {
        { T(0.00010993688221983862), T(-0.00010993677005192213), T(-0.00020110133949630878), T(0.0), T(3.377304698293206e-06) }, 
        { T(-0.0001099368771198124), T(0.00020970893278571498), T(0.0002011012362826367), T(0.0), T(-3.3773116728561845e-06) }, 
        { T(1.0), T(0.0), T(0.0), T(0.0), T(0.0) }, 
        { T(0.0), T(1.0), T(0.0), T(0.0), T(0.0) }, 
        { T(-1.2076550429373785e-11), T(1.4662213917948478e-11), T(6.456138776923646e-07), T(0.0), T(2.265254354456357e-06) }, 
        { T(0.0), T(0.0), T(1.0), T(0.0), T(0.0) }, 
        { T(6.038275214686893e-12), T(-7.331106958974239e-12), T(-3.228069388461823e-07), T(0.0), T(-1.1326271772281786e-06) }, 
        { T(0.0), T(0.0), T(-0.5), T(0.0), T(0.0) }, 
        { T(6.038275214686893e-12), T(-7.331106958974239e-12), T(-3.228069388461823e-07), T(0.0), T(-1.1326271772281786e-06) }, 
        { T(0.0), T(0.0), T(-0.5), T(0.0), T(0.0) }, 
        { T(0.0), T(0.0), T(0.0), T(0.00010989987877699456), T(-0.00010989987877699456) }, 
        { T(3.252427955830663e-11), T(-3.9487927218946627e-11), T(-6.582221707599249e-10), T(-0.00010989987877699456), T(0.00015021030443268768) }, 
        { T(0.0), T(0.0), T(0.0), T(1.0), T(0.0) }, 
        { T(0.0), T(0.0), T(0.0), T(0.0), T(1.0) }
    };
    Tensor<T,14> q0
    {
        T(8.999929592414643),
        T(0.0),
        T(7.040758535545553e-09),
        T(-0.0008979564112870561),
        T(7.040758535545553e-05),
        T(-6.815835949221252e-11),
        T(-7.040758535545553e-05),
        T(0.0),
        T(0.0),
        T(0.0),
        T(7.040758535545553e-05),
        T(-8.999929592414643),
        T(0.0),
        T(-1.882825269815799e-09)
    };
    Tensor<T,14,7> fq
    {
        { T(-1.937818386579128e-06), T(-3.2730917004165573e-06), T(0.0), T(-1.0063500466925097), T(0.9999994867571643), T(0.0), T(0.0) }, 
        { T(0.0), T(0.0), T(0.0), T(0.0), T(1.0), T(0.0), T(0.0) }, 
        { T(1.9378183865791285e-10), T(3.2730917004165573e-10), T(0.0), T(0.00011063500466925096), T(5.132428355931152e-11), T(0.0), T(0.0) }, 
        { T(-1.8965882081412746e-10), T(-3.203451451471524e-10), T(0.0), T(-0.00011063489566740015), T(-9.977221915804349e-05), T(0.0), T(0.0) }, 
        { T(-1.0629862204078668), T(3.2730917004165573e-06), T(0.0), T(1.0000512308698843), T(5.132428355931152e-07), T(0.0), T(0.0) }, 
        { T(1.0290282869388836e-06), T(-3.1685302036946343e-12), T(0.0), T(-0.0002033468127878947), T(-4.968468882798792e-13), T(0.0), T(1.0) }, 
        { T(1.0629862204078668), T(-3.2730917004165573e-06), T(0.0), T(-1.0000512308698843), T(-5.132428355931152e-07), T(-1.0), T(0.0) }, 
        { T(0.0), T(0.0), T(0.0), T(0.0), T(0.0), T(0.0), T(1.0) }, 
        { T(0.0), T(0.0), T(0.0), T(0.0), T(0.0), T(1.0), T(0.0) }, 
        { T(0.0), T(0.0), T(0.0), T(0.0), T(0.0), T(0.0), T(1.0) }, 
        { T(1.937818386579128e-06), T(1.0000032730917003), T(-1.010113378684807), T(5.123086988438217e-05), T(5.132428355931152e-07), T(0.0), T(0.0) }, 
        { T(1.937818386579128e-06), T(1.0000032730917003), T(0.0), T(5.123086988438217e-05), T(5.132428355931152e-07), T(0.0), T(0.0) }, 
        { T(0.0), T(0.0), T(0.00011101133786848072), T(0.0), T(0.0), T(0.0), T(0.0) }, 
        { T(3.6118296770781734e-06), T(-4.0310557662762626e-05), T(-0.00011101133786848072), T(-1.3700054607182438e-09), T(-1.3725035101374279e-11), T(0.0), T(0.0) }
    };

    inline void circuit_nl_functions(Tensor<T,14> const& q, Tensor<T,7>& res, Tensor<T,7,14>& Jq)
    {
        // BJT
        {
            auto expE = exp(q(0) * T(40.0));
            auto expC = exp(q(1) * T(40.0));
            auto i_f = T(7.984031936127745e-14) * (expE - T(1.0));
            auto i_r = T(7.272727272727272e-14) * (expC - T(1.0));
            auto di_f1 = T(3.1936127744510976e-12) * expE;
            auto di_r2 = T(2.909090909090909e-12) * expC;
            auto i_cc = i_f - i_r;
            auto di_cc1 =  di_f1;
            auto di_cc2 = -di_r2;
            auto iBE = T(0.002) * i_f;
            auto diBE1 = T(0.002) * di_f1;
            auto iBC = T(0.1) * i_r;
            auto diBC2 = T(0.1) * di_r2;
            res(0) =  i_cc + iBE - q(2);
            res(1) = -i_cc + iBC - q(3);
            Jq(0, 0) = di_cc1 + diBE1;
            Jq(0, 1) = di_cc2;
            Jq(0, 2) = T(-1.0);
            Jq(0, 3) = T(0.0);
            Jq(1, 0) = -di_cc1;
            Jq(1, 1) = -di_cc2 + diBC2;
            Jq(1, 2) = T(0.0);
            Jq(1, 3) = T(-1.0);
        }
        // Diode
        {
            auto ex = exp(q(4) * T(20.0));
            res(2) = T(4e-09) * (ex - T(1.0)) - q(5);
            Jq(2, 4) = T(8e-08) * ex;
            Jq(2, 5) = T(-1.0);
        }
        // Diode
        {
            auto ex = exp(q(6) * T(20.0));
            res(3) = T(3e-09) * (ex - T(1.0)) - q(7);
            Jq(3, 6) = T(6e-08) * ex;
            Jq(3, 7) = T(-1.0);
        }
        // Diode
        {
            auto ex = exp(q(8) * T(20.0));
            res(4) = T(5e-09) * (ex - T(1.0)) - q(9);
            Jq(4, 8) = T(1e-07) * ex;
            Jq(4, 9) = T(-1.0);
        }
        // BJT
        {
            auto expE = exp(q(10) * T(40.0));
            auto expC = exp(q(11) * T(40.0));
            auto i_f = T(7.984031936127745e-14) * (expE - T(1.0));
            auto i_r = T(7.272727272727272e-14) * (expC - T(1.0));
            auto di_f1 = T(3.1936127744510976e-12) * expE;
            auto di_r2 = T(2.909090909090909e-12) * expC;
            auto i_cc = i_f - i_r;
            auto di_cc1 =  di_f1;
            auto di_cc2 = -di_r2;
            auto iBE = T(0.002) * i_f;
            auto diBE1 = T(0.002) * di_f1;
            auto iBC = T(0.1) * i_r;
            auto diBC2 = T(0.1) * di_r2;
            res(5) =  i_cc + iBE - q(12);
            res(6) = -i_cc + iBC - q(13);
            Jq(5, 10) = di_cc1 + diBE1;
            Jq(5, 11) = di_cc2;
            Jq(5, 12) = T(-1.0);
            Jq(5, 13) = T(0.0);
            Jq(6, 10) = -di_cc1;
            Jq(6, 11) = -di_cc2 + diBC2;
            Jq(6, 12) = T(0.0);
            Jq(6, 13) = T(-1.0);
        }
    }
};

Of course, this is not yet fully optimized, but this class can be used by the generic Solver classes and run with a class that is a merged version of ACME's DiscreteModel and ModelRunner:

template <typename T, size_t NN, size_t NP, size_t NQ>
using Solver = details::HomotopySolver<T, NN, NP, NQ, details::SimpleSolver<T, NN, NP, NQ>>;

template <typename T, size_t N = 1024>
class SuperOver
{
public:
    SuperOver()
    {
        x.zeros();
        z.zeros();
        {
            Tensor<T,5> initial_p; initial_p.zeros();
            Tensor<T,7> initial_z
            {
                T(-0.010348922599724008),
                T(-0.0009709058692962849),
                T(6.295088302373311e-10),
                T(-5.944270660783201e-05),
                T(-8.99999859753792),
                T(-0.004054501064337741),
                T(-3.894532151393544e-10)
            };
            solver1 = Solver<T,7,5,14>(new ParametricNonLinEq1<T>(), initial_p, initial_z);
        }
    }

    static unsigned sample_rate() { return 44100; }

    inline auto run(Tensor<T,1,N>& y, Tensor<T,1,N>& u)
    {
        size_t n;
        for (n = 0; n < N; ++n)
        {
            ucur(all) = u(all, n);

            zoff = 0;
            zoff = step1(ucur, z, zoff);

            y(all, n) = dy % x
                      + ey % ucur
                      + fy % z
                      + y0;

                    x = a % x
                      + b % ucur
                      + c % z
                      + x0;
        }
    }

private:
    Solver<T,7,5,14> solver1;
    Tensor<T,1> ucur;
    Tensor<T,7> z;
    size_t zoff;
    Tensor<T,11> x;
    Tensor<T,11,11> a
    {
        { T(-1.0), T(0.0), T(0.0), T(0.0), T(0.0), T(0.0), T(0.0), T(0.0), T(0.0), T(0.0), T(0.0) }, 
        { T(0.0), T(0.9998327049430342), T(-3.3119064337876245e-09), T(-0.2753099030712581), T(-0.09763196245844691), T(0.151132392937088), T(0.12112438150133976), T(-0.057154146168772166), T(-1.3485730457405402e-06), T(0.006063406659081113), T(4.553925562512561e-13) }, 
        { T(0.0), T(-1.4385808878982962e-12), T(0.9528901879906215), T(-3.7085740309887527e-09), T(-1.2781321416587701e-09), T(2.2301907007175122e-11), T(1.787376379112158e-11), T(-8.433972546580727e-12), T(-1.9900267622314502e-16), T(-6.258834122664423e-12), T(7.40304118278617e-14) }, 
        { T(0.0), T(-4.79402435217444e-06), T(-1.4963396643229237e-10), T(0.9874818825516254), T(2.3430474867751796e-07), T(-3.623010527883805e-07), T(-2.903645610543879e-07), T(1.3701236992116482e-07), T(3.232857131013707e-12), T(-1.4534002380296183e-08), T(-1.5069414997066776e-14) }, 
        { T(0.0), T(-9.702084592242064e-05), T(-3.028272894651297e-09), T(-0.2533400448094604), T(0.9023646062656903), T(-7.3322019326349085e-06), T(-5.876360500048623e-06), T(2.772838653929169e-06), T(6.542614524997446e-11), T(-2.9413726380743705e-07), T(-3.049728752229398e-13) }, 
        { T(0.0), T(2.8407621000674715e-07), T(-1.407748033050824e-09), T(-0.11776971818541009), T(-1.1425050994910129e-08), T(0.8814885257958858), T(1.7159718679487037e-08), T(-8.097040888597346e-09), T(-1.9105265015029784e-13), T(8.773246472772191e-10), T(-1.8959306913411251e-13) }, 
        { T(0.0), T(1.805923670328223e-07), T(-8.94930798564265e-10), T(-0.0748683325907033), T(-7.263110848291436e-09), T(1.1960961687376217), T(0.7963340231287017), T(-0.8476240378135657), T(-1.214556133289602e-13), T(5.577310916118805e-10), T(-1.2052773136875375e-13) }, 
        { T(0.0), T(2.0106769209323297e-07), T(-9.963968754989889e-10), T(-0.08335680567354949), T(-8.086592803880034e-09), T(1.3317079792520914), T(1.2145561332896021e-08), T(0.05627346356811614), T(-1.3522609103065637e-13), T(6.209659092549509e-10), T(-1.3419300703414533e-13) }, 
        { T(0.0), T(-2.8659962443149402e-05), T(-1.638869105894781e-10), T(-0.01371051102731143), T(1.3994657850289228e-06), T(0.21910000479758424), T(0.1755967205461711), T(-0.08285764194174144), T(0.9999980449435422), T(-0.01290200016382254), T(-1.5835748515331856e-12) }, 
        { T(0.0), T(6.073667126633634e-06), T(-5.085630512712335e-11), T(-0.00425453736382261), T(-2.964285628808909e-07), T(0.06795759377249123), T(0.05446431009293129), T(-0.025699707205508387), T(-6.063940194020241e-07), T(0.9810341912808801), T(-2.3022384477962127e-12) }, 
        { T(0.0), T(1.688189231015496e-14), T(1.8285460518107934e-17), T(-1.0956965857370556e-11), T(-1.2481173136804766e-15), T(1.7499491724910554e-10), T(1.4024889506314758e-10), T(-6.617830158628519e-11), T(-1.5615013033107816e-15), T(-4.8838075442745065e-11), T(0.9997755129539692) }
    };
    Tensor<T,11,1> b
    {
        { T(0.0) }, 
        { T(1.5565960238801836e-16) }, 
        { T(2.2141611644407873e-09) }, 
        { T(7.032796422317741e-18) }, 
        { T(1.4232882604861094e-16) }, 
        { T(6.616415755338873e-17) }, 
        { T(4.206174753252045e-17) }, 
        { T(4.683065314845248e-17) }, 
        { T(7.702684797705469e-18) }, 
        { T(2.3902463409747976e-18) }, 
        { T(-8.594166443510728e-25) }
    };
    Tensor<T,11,7> c
    {
        { T(0.0), T(0.0), T(0.0), T(0.0), T(0.0), T(0.0), T(0.0) }, 
        { T(1.8215492833843807e-10), T(3.076706198391564e-10), T(0.0), T(4.815701769131923e-09), T(4.8244826545752824e-11), T(0.0), T(0.0) }, 
        { T(0.0), T(0.0), T(0.0), T(0.0), T(-2.2141612097799968e-09), T(0.0), T(0.0) }, 
        { T(0.0), T(0.0), T(0.0), T(2.2675736961451245e-10), T(0.0), T(0.0), T(0.0) }, 
        { T(0.0), T(0.0), T(0.0), T(4.589086371487518e-09), T(0.0), T(0.0), T(0.0) }, 
        { T(2.2675736961451243e-09), T(0.0), T(0.0), T(0.0), T(0.0), T(0.0), T(0.0) }, 
        { T(1.4415374705206309e-09), T(0.0), T(0.0), T(0.0), T(0.0), T(0.0), T(0.0) }, 
        { T(1.6049770930286387e-09), T(0.0), T(0.0), T(0.0), T(0.0), T(0.0), T(0.0) }, 
        { T(2.640585846566829e-10), T(-6.063963264090016e-10), T(0.0), T(-1.0016023325381089e-13), T(-1.0034286406783248e-15), T(0.0), T(0.0) }, 
        { T(8.190089970698806e-11), T(-8.913958653687671e-10), T(0.0), T(-3.1065883462998725e-14), T(-3.112252857454485e-16), T(0.0), T(0.0) }, 
        { T(0.0), T(0.0), T(2.2675736961451245e-10), T(0.0), T(0.0), T(0.0), T(0.0) }
    };
    Tensor<T,11> x0
    {
        T(0.0018),
        T(6.618313023412819e-09),
        T(-1.9927450888019973e-08),
        T(0.0),
        T(0.0),
        T(0.0),
        T(0.0),
        T(0.0),
        T(-1.3765216534396293e-13),
        T(-4.269445056271653e-14),
        T(0.0)
    };
    Tensor<T,1,11> dy
    {
        { T(0.0), T(7.444914508778337e-05), T(8.063888088485599e-08), T(-0.048320219431004154), T(-5.5041973533309025e-06), T(0.7717275850685554), T(0.6184976272284809), T(-0.2918463099955177), T(-6.886220747600547e-06), T(-0.21537591270250575), T(-989987.8729959748) }
    };
    Tensor<T,1,1> ey
    {
        { T(-3.790027401588231e-15) }
    };
    Tensor<T,1,7> fy
    {
        { T(0.0), T(0.0), T(1.0), T(0.0), T(0.0), T(0.0), T(0.0) }
    };
    Tensor<T,1> y0
    {
        T(0.0)
    };

    inline size_t step1(Tensor<T,1>& ucur, Tensor<T,7>& z, size_t zoff)
    {
        static Tensor<T,5,11> dq
        {
            { T(0.0), T(-0.21141634654818867), T(-7.30066103277782e-05), T(-552.0486499942749), T(0.010448290258156558), T(-0.015979491098510375), T(-0.012806691791209369), T(0.006043007406936469), T(1.4258697663228571e-07), T(-0.000640384167472097), T(-7.351548241013832e-09) }, 
            { T(0.0), T(0.25668184770318436), T(-2077.5426356800776), T(552.0422513319257), T(-0.01271575195351654), T(0.019401328839459733), T(0.015549108369917042), T(-0.007337053048716599), T(-1.7312045826823736e-07), T(0.000777345385719729), T(1.0687437942602077e-08) }, 
            { T(0.0), T(2.8524128701175555), T(8.903122312237315e-05), T(7448.197317395306), T(2870.480575787507), T(0.21556673243808452), T(0.17276499518999064), T(-0.08152145476859855), T(-1.9235286312536373e-06), T(0.008647636778678377), T(8.951729642173757e-09) }, 
            { T(0.0), T(8.264699199359463e-09), T(8.95183005124492e-12), T(-5.36409220513433e-06), T(-6.110283120854142e-10), T(8.567051168847211e-05), T(6.866024906711453e-05), T(-3.239824932458178e-05), T(-7.644485780488263e-10), T(-2.3909168213750276e-05), T(-9.899878254854752) }, 
            { T(0.0), T(0.2891235279864489), T(-2.2428073226842565e-06), T(-187.6280211810524), T(-0.014111136906504044), T(2996.931407037596), T(2401.8772946344843), T(-1133.35766321726), T(-0.026741989833685528), T(-836.3920759740552), T(9.899878150750983) }
        };
        static Tensor<T,5,1> eq
        {
            { T(3.4313106854055753e-12) }, 
            { T(9.764450387696363e-05) }, 
            { T(-4.184467486751538e-12) }, 
            { T(-4.207360124085112e-19) }, 
            { T(1.0541194416616006e-13) }
        };
        static Tensor<T,5> p
        {
            T(0.0),
            T(0.0),
            T(0.0),
            T(0.0),
            T(0.0)
        };

        p = dq % x
          + eq % ucur;

        auto zsub = solver1.solve(p);
        if (!solver1.hasconverged())
        {
            if (Fastor::isfinite(zsub))
                std::cout << "Failed to converge while solving non-linear equation." << std::endl;
            else
                throw std::runtime_error("Failed to converge while solving non-linear equation, got non-finite result.");
        }
        z(seq(zoff, 7)) = zsub;
        zoff += 7;
        return zoff;
    }
};

As you can see, for the moment the SuperOver class can be instantiated with a compile-time chunk size for the I/O, fixed by default at 1024 columns. A static method returns the sampling rate used for the discretization of the circuit.
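
For example, a hypothetical usage of the generated class could look like this (the header name and buffer handling are just placeholders):

// hypothetical usage example of the generated, frozen model
#include <Fastor/Fastor.h>
#include "SuperOver.h" // assumed name of the generated header

int main()
{
    constexpr size_t N = 1024;
    SuperOver<double, N> model;               // chunk size fixed at compile time

    Fastor::Tensor<double,1,N> u; u.zeros();  // one chunk of input samples
    Fastor::Tensor<double,1,N> y; y.zeros();  // one chunk of output samples

    // SuperOver::sample_rate() gives the rate the circuit was discretized at (44100 Hz)
    for (int chunk = 0; chunk < 10; ++chunk)
    {
        // fill u with the next N input samples here ...
        model.run(y, u);
        // ... consume the N output samples in y
    }
}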

SuperOver-Fastor: the full code (a single header-only file, no dependencies except Fastor) is available here for the proof of concept: https://codeshare.io/5QEE4q

Now here is the performance report for my configuration. All the tests run 10 times on a 30-second audio sample file at 44100 Hz, with the SuperOver circuit discretized with the 3 potentiometers fixed at full value (1.0):

**Julia 32bits and 64 bits**
> Julia Version 1.4.2
> Commit 44fa15b150* (2020-05-23 18:35 UTC)
> Platform Info:
>   OS: Windows (i686-w64-mingw32)
>   CPU: Intel(R) Xeon(R) CPU E5410 @ 2.33GHz
>   WORD_SIZE: 32
>   LIBM: libopenlibm
>   LLVM: libLLVM-8.0.1 (ORCJIT, penryn)


28.947501 seconds (429.69 k allocations: 15.341 MiB, 0.19% gc time)
23.705922 seconds (429.70 k allocations: 15.464 MiB)
29.414382 seconds (429.70 k allocations: 15.464 MiB, 0.16% gc time)
23.898812 seconds (429.70 k allocations: 15.464 MiB, 0.19% gc time)
23.407315 seconds (429.70 k allocations: 15.464 MiB)
23.861988 seconds (429.70 k allocations: 15.464 MiB)
25.481217 seconds (429.70 k allocations: 15.464 MiB)
24.148169 seconds (429.70 k allocations: 15.464 MiB)
23.860177 seconds (429.70 k allocations: 15.464 MiB, 0.19% gc time)
23.539879 seconds (429.70 k allocations: 15.464 MiB)

min : 23.407315 seconds max : 29.414382 seconds mean : 25.026536 seconds

> Julia Version 1.4.2
> Commit 44fa15b150* (2020-05-23 18:35 UTC)
> Platform Info:
>   OS: Windows (x86_64-w64-mingw32)
>   CPU: Intel(R) Xeon(R) CPU           E5410  @ 2.33GHz
>   WORD_SIZE: 64
>   LIBM: libopenlibm
>   LLVM: libLLVM-8.0.1 (ORCJIT, penryn)

21.512428 seconds (430.89 k allocations: 21.301 MiB)
21.094247 seconds (430.88 k allocations: 21.295 MiB)
20.971705 seconds (430.88 k allocations: 21.295 MiB)
21.836396 seconds (430.88 k allocations: 21.295 MiB)
21.668114 seconds (430.88 k allocations: 21.295 MiB)
20.907806 seconds (430.88 k allocations: 21.295 MiB)
20.982056 seconds (430.88 k allocations: 21.295 MiB)
21.016102 seconds (430.88 k allocations: 21.295 MiB)
20.938037 seconds (430.88 k allocations: 21.295 MiB)
21.450474 seconds (430.88 k allocations: 21.295 MiB)

min : 20.907806 seconds max : 21.836396 seconds mean : 21.237736 seconds

**Visual Studio 2019 32bits and 64 bits** 
> Visual Studio 2019 16.5.5 (v142 - native)
> x86 Release (/O2 /Ob2 /Oi /Ot)

17.2277 seconds
17.4744 seconds
17.4055 seconds
17.3357 seconds
17.2751 seconds
17.3368 seconds
17.4383 seconds
17.3296 seconds
17.3059 seconds
17.2541 seconds

min : 17.2277 seconds max : 17.4744 seconds mean : 17.3383 seconds

> Visual Studio 2019 16.5.5 (v142 - native)
> x64 Release (/O2 /Ob2 /Oi /Ot)

13.9769 seconds
13.9896 seconds
14.1125 seconds
13.9835 seconds
13.9530 seconds
14.7414 seconds
14.2672 seconds
14.1663 seconds
14.7137 seconds
14.3044 seconds

min : 13.9769 seconds max : 14.7414 seconds mean : 14.2208 seconds

**LLVM/Clang 32bits and 64 bits** 
> Visual Studio 2019 16.5.5 (LLVM - clang-cl)
> x86 Release (/O2 /Ob2 /Oi /Ot)

9.24264 seconds
9.26479 seconds
9.35169 seconds
9.35638 seconds
9.51796 seconds
9.44793 seconds
9.34992 seconds
9.52753 seconds
9.87442 seconds
9.74665 seconds

min : 9.24264 seconds max : 9.87442 seconds mean : 9.46799 seconds

> Visual Studio 2019 16.5.5 (LLVM - clang-cl)
> x64 Release (/O2 /Ob2 /Oi /Ot)

8.62546 seconds
8.29256 seconds
8.29380 seconds
8.30038 seconds
8.30197 seconds
8.32373 seconds
8.51325 seconds
8.30997 seconds
8.41125 seconds
8.31047 seconds

min : 8.29380 seconds max : 8.62546 seconds mean : 8.36828 seconds

**Conclusion:**

| | Compiler | Min (s) | Max (s) | Mean (s) | Gain |
| --- | --- | --- | --- | --- | --- |
| x86 | Julia | 23.4073 | 29.4143 | 25.0265 | 0.00 % |
| x86 | Visual Studio 2019 | 17.2277 | 17.4744 | 17.3383 | 30.72 % |
| x86 | LLVM / Clang | 9.2426 | 9.8744 | 9.4679 | 62.16 % |
| x64 | Julia | 20.9078 | 21.8363 | 21.2377 | 0.00 % |
| x64 | Visual Studio 2019 | 13.9769 | 14.7414 | 14.2208 | 33.03 % |
| x64 | LLVM / Clang | 8.2938 | 8.6254 | 8.3682 | 60.59 % |
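
(The Gain column is presumably computed relative to the Julia mean of the same word size, i.e. Gain ≈ (1 − mean_C++ / mean_Julia) × 100 %; e.g. for x86 Visual Studio: (1 − 17.3383 / 25.0265) × 100 % ≈ 30.72 %.)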

As you can see, the performance gain is substantial in both release builds (x86 and x64). The LLVM/Clang build was produced with the clang-cl toolchain embedded in the Visual Studio 2019 IDE, which is possibly an out-of-date version, but the result is clear.

This is a first and possibly buggy attempt to make ACME usable from C++, so that it could be used inside an audio plugin. I need your help on a few points that are still unclear to me:
1) I want to remove the LinearSolver code and use Fastor's built-in solver, but I still have doubts about the method used in the ACME version.
2) I need a small example of / some help with generating a lookup table for the circuit_nl_functions part, which is clearly time-consuming (a generic sketch of what I mean follows below).
3) Do you think you'll add Triode/Pentode model elements in the future? If not, maybe I can contribute them to the ACME project.
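
As a generic illustration of what I mean by a lookup table (not necessarily the approach ACME would take), the exp() calls in the nonlinearity could be replaced by a precomputed table with linear interpolation, for example:

// generic exp() lookup table with linear interpolation -- illustration only
#include <array>
#include <cmath>
#include <cstddef>

template <typename T, size_t Size = 4096>
class ExpTable
{
public:
    ExpTable(T xmin, T xmax) : xmin_(xmin), scale_(T(Size - 1) / (xmax - xmin))
    {
        for (size_t i = 0; i < Size; ++i)
            table_[i] = std::exp(xmin_ + T(i) / scale_);
    }

    // linear interpolation between the two nearest samples; arguments outside
    // [xmin, xmax] are clamped, which is only acceptable if the solver keeps
    // q within the tabulated range
    T operator()(T x) const
    {
        T t = (x - xmin_) * scale_;
        if (t <= T(0)) return table_.front();
        if (t >= T(Size - 1)) return table_.back();
        size_t i = static_cast<size_t>(t);
        T frac = t - T(i);
        return table_[i] + frac * (table_[i + 1] - table_[i]);
    }

private:
    T xmin_, scale_;
    std::array<T, Size> table_{};
};

// e.g. replace exp(q(4) * T(20.0)) by a table covering the expected diode voltages:
// static const ExpTable<T> diode_exp(T(-20.0), T(20.0)); auto ex = diode_exp(q(4) * T(20.0));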

If you see any errors in my code, don't hesitate to correct me. Maybe this little study can be an entry point towards a built-in C++ code generator inside the Julia ACME core code. It may also partially answer the implementation aspects of your 2017 paper, "Automatic Decomposition of Non-Linear Equation Systems in Audio Effect Circuit Simulation".

Cheers,
Maxime Coorevits
romeric commented 4 years ago

I can see that you have implemented a LinearSolver in your code, which looks like a standard LU solver with partial pivoting. May I ask what the sizes of your LHS matrices are?

You can bypass all that code and use one of Fastor's solvers if you want to get a bit more performance:

// inversion based solve - very fast
Tensor<T,N> x = solve(A,b);
// or equivalent of what you are doing but faster using template recursion
Tensor<T,N> x = solve<SolveCompType::BlockLUPiv>(A,b); 
// or exactly equivalent of what you are doing I guess
Tensor<T,N> x = solve<SolveCompType::SimpleLUPiv>(A,b); 

You will gain a significant speed-up if your matrix sizes are less than 32x32, using the inversion-based or block LU decomposition. For bigger matrices you will still get the speed-up, but given that Fastor implements most things using compile-time recursion rather than for loops, the compilation time might blow up quite quickly, especially with GCC, which has a nearly quadratic template instantiation cost. Clang should be fine.

These solvers are not heavily tested with Visual Studio, but compiling with clang, Fastor's solvers are faster than Eigen, MKL and straightforward for-loop style implementations (such as yours) by quite a big margin. Here are the results from a comparison I did a while ago on my MacBook between Julia 1.4's backslash solve and Fastor's BlockLU solve compiled with clang using -mavx2 -mfma -O3, if it is of any interest:

| Size | Julia | Julia StaticArrays | Fastor |
| --- | --- | --- | --- |
| 8x8 | 942 ns | 350 ns | 95 ns |
| 16x16 | 2.831 μs | 3.036 μs | 807 ns |
| 32x32 | 25.851 μs | 25.517 μs | 4.11538 µs |
| 64x64 | 95.105 μs | 104.029 μs | 25.8607 µs |
martinholters commented 4 years ago

This is quite exciting! It may take me a bit to take a closer look, though.

Regarding your questions:

  1. Yes, LinearSolver basically does LU decomposition, so you should be able to replace it. I'm not using the one that ships with Julia to avoid allocations and the overhead when calling out to LAPACK which can be significant for small sizes.
  2. I'll need to take a closer look at what this is about.
  3. If you want to add a triode and/or pentode model to ACME, that would be much appreciated. Otherwise I would probably do it myself eventually (unless someone else steps up to do it), but not in the near future.

Obviously, ACME was never meant to be used for effect simulation "in production", e.g. by wrapping it (including the whole Julia runtime) in a plugin (though I do think that would be a fun project to do). It's mainly intended as a test bed for algorithm and model development. So I've optimized it to a point where waiting for results is not too much of a torture, but haven't tried to squeeze the last bit of performance out of it. So being able to speed it up by quite a bit when switching to C++ is not unexpected. What I had envisioned (but haven't come around to actually doing so far) is to be able to "export" a model from Julia to C++. While this is trivial for all the matrices, it needs some thought on how to approach the functions defining the nonlinearities. Now, I'm not 100% clear: you have created the C++ code from the circuit description within C++, right? Or have you done it from Julia? Further, I've always shied away from reimplementing the solvers in C++. Now with the latter having been done by someone else...

Tangentially, what happened to @maxprod2016?

ghost commented 4 years ago

Who is @maxprod2016? :)

I perfectly understand the purpose of the project and its objective. ACME is already a very good and solid project; my goal is not to criticize it. However, in my personal case (old computer), ACME is extremely slow not only in the running part: the discretization part takes several minutes, which is clearly unmanageable for testing and research. As you may know, I am a self-taught and independent researcher.

This is why I undertook to develop a version in C++ which is not only faster at exploiting the discrete model, but which also reduces the discretization to a handful of seconds in release mode.

Nevertheless, my C++ clone is not intended to replace your work: first because it was not written for that, and moreover it is not sufficiently well written, is very limited in deployment (C++17, VS2019, ...), and depends on a few auxiliary libraries, which can constrain how easily the project can be disseminated.

And of course there is the rough code reproducing Julia's SparseArrays library, which is hacky in order to simulate Julia's one-based indexing. For example, the blockdiag code becomes:

template <typename... Tv, typename... Ti>
constexpr auto blockdiag(const SparseMatrix<Tv, Ti>& ...args)
    -> SparseMatrix<std::common_type_t<Tv...>, std::common_type_t<Ti...>>
{
    constexpr auto num = sizeof...(Tv);

    auto mX = details::apply<size_t>([](auto x) { return size(x, 1); }, args...);
    auto nX = details::apply<size_t>([](auto x) { return size(x, 2); }, args...);
    auto m = details::sum(mX);
    auto n = details::sum(nX);

    typedef typename std::common_type<Tv...>::type Tnv;
    typedef typename std::common_type<Ti...>::type Tni;

    auto colptr = Array<Tni>(n+1);
    auto nnzX = details::apply<size_t>([](auto x) { return nnz(x); }, args...);
    auto nnz_res = details::sum(nnzX);
    auto rowval = Array<Tni>(nnz_res);
    auto nzval  = Array<Tnv>(nnz_res);

    auto nnz_sofar = Tni(0);
    auto nX_sofar  = Tni(0);
    auto mX_sofar  = Tni(0);

    auto X = details::collect(args...);

    for (auto i : range(1, num))
    {
        colptr.apply((regspace<Tni>(1,   nX[i-1]+1) +  nX_sofar), X[i-1].colptr + nnz_sofar);
        rowval.apply((regspace<Tni>(1, nnzX[i-1])   + nnz_sofar), X[i-1].rowval +  mX_sofar);
         nzval.apply((regspace<Tni>(1, nnzX[i-1])   + nnz_sofar), X[i-1].nzval);
        nnz_sofar += nnzX[i-1];
         nX_sofar +=   nX[i-1];
         mX_sofar +=   mX[i-1];
    }
    colptr(n+1) = nnz_sofar + 1;

    return SparseMatrix<Tnv, Tni>(m, n, colptr, rowval, nzval);
}

You're right, the Fastor-based code is generated inside my C++ code, but it would not be so hard to potentially include such an idea inside the Julia ACME code. I did this very quickly and it is not very well coded, but it works for the proof of concept.

For example, for the Diode element, the C++ code includes both a pre-generated string template of the non-linearity code and the precomputed lambda function used in the Armadillo-based internal process (the slow one):

Diode::Diode(double is, double η)
    : Element(2, 1, 0, 2, 0, 0)
{
    this->operator[](Matrice::mv)(0, 0) = 1;
    this->operator[](Matrice::mi)(1, 0) = 1;
    this->operator[](Matrice::mq)(0, 0) = -1;
    this->operator[](Matrice::mq)(1, 1) = -1;

    auto vt = 25e-3;
    auto vt_x_n = vt * η;
    auto oo_vt_x_n = 1.0 / vt_x_n;
    auto iso_vt_x_n = is / vt_x_n;

    std::string code;
    code += "        // Diode\n";
    code += "        {\n";
    code += "            auto ex = exp(q(%Q_IND_0%) * T(" + fmt::format("{}", oo_vt_x_n) + "));\n";
    code += "            res(%RES_ROW_OFFSET_0%) = T(" + fmt::format("{}", is) + ") * (ex - T(1.0)) - q(%Q_IND_1%);\n";
    code += "            Jq(%JQ_ROW_OFFSET_0%, %JQ_COL_OFFSET_0%) = T(" + fmt::format("{}", iso_vt_x_n) + ") * ex;\n";
    code += "            Jq(%JQ_ROW_OFFSET_0%, %JQ_COL_OFFSET_1%) = T(-1.0);\n";
    code += "        }\n";
    Element::nonlinear_eq_code(code);

    Element::nonlinear_eq([=](arma::vec const& q, arma::vec& res, arma::mat& Jq, arma::uword& ror, arma::uword& jor, arma::uword& joc)
    {
        auto v = q[0]; 
        auto i = q[1];

        auto ex = exp(v * oo_vt_x_n); // exp(v * (1.0 / (25e-3 * η)));

        res(ror) = is * (ex - 1.0) - i;

        ror += 1;

        Jq(jor, joc) = iso_vt_x_n * ex; // is / (25e-3 * η) * ex;
        Jq(jor, joc+1) = -1.0;

        jor += 1;
        joc += 2;
    });

    make_pins({ Pin{ "+", "-" } });
}

This is of course not elegant, but it works for the current proof of concept; I will soon optimize this to externalize the template-like strings into a dedicated file. The indexing in the C++ lambda function nonlinear_eq is added to avoid the diagonal concatenation, which is a performance bottleneck in my code. Some variables are precomputed for optimization.

The string-template code is updated once the indexing is known, using a simple string replace, something like:

for (auto&& [i, q] : iter::enumerate(q_indices))
{
    FAST::findAndReplaceAll(code, fmt::format("%Q_IND_{}%", i), std::to_string(q));
}
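
For reference, a hypothetical stand-in for the FAST::findAndReplaceAll helper used above could look like this (the actual helper in the proof of concept may differ):

// replace every occurrence of `from` in `str` with `to` -- illustration only
#include <string>

namespace FAST {
inline void findAndReplaceAll(std::string& str, const std::string& from, const std::string& to)
{
    for (std::size_t pos = str.find(from); pos != std::string::npos;
         pos = str.find(from, pos + to.length()))
    {
        str.replace(pos, from.length(), to);
    }
}
} // namespace FAST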

I still hope that this little study, because it is nothing more than that, is still interesting. I will update the code to replace the decomposition and to activate the potentiometers for dynamic manipulation.

@romeric, I will answer you as soon as I've updated the code. For information, the LHS size is (for the current case) 7x7 without potentiometers and 13x13 with the potentiometers activated.

martinholters commented 4 years ago

> I perfectly understand the purpose of the project and its objective. ACME is already a very good and solid project; my goal is not to criticize it.

No worries, I didn't feel criticized in any way, just wanted to set this straight for others following along. (I would have doubted there are any, but @romeric proved me wrong...)

> However, in my personal case (old computer), ACME is extremely slow not only in the running part: the discretization part takes several minutes, which is clearly unmanageable for testing and research.

Ah, that's bad. Is that only the first time in a fresh Julia session (when a lot of internal code gets compiled), or every time? And is that for the SuperOver example or for a more complicated circuit? But maybe that's worth its own issue to not derail this one.

> For example, for the Diode element, the C++ code includes both a pre-generated string template of the non-linearity code and the precomputed lambda function used in the Armadillo-based internal process (the slow one)

Yes, I had considered doing something similar in the Julia project (Julia code + stringified C++ code template) but the introspection functionality of Julia might allow synthesizing C++ code from the Julia code, at least in not-too-complicated cases, i.e. only calls to a limited set of functions ( exp, ...). That would avoid the code duplication and thereby enforce equivalence of the Julia and the generated C++ code.

ghost commented 4 years ago

> No worries, I didn't feel criticized in any way, just wanted to set this straight for others following along. (I would have doubted there are any, but @romeric proved me wrong...)

You're right. @romeric (Roman) is the creator of the Fastor library; he helped me with the introduction of some useful functions for this project. https://github.com/romeric/Fastor/issues/90

> Ah, that's bad. Is that only the first time in a fresh Julia session (when a lot of internal code gets compiled), or every time? And is that for the SuperOver example or for a more complicated circuit? But maybe that's worth its own issue to not derail this one.

Ah OK, my fault (pure Julia newbie): I run the script from the Windows console, not from a Julia session, so each time I run my script the code gets compiled.

After a quick test (Julia 1.4.2, 64 bits) on the SuperOver example with no other processes on the computer, I get an overall timing of 61.01500 seconds for the discretization without decomposition, and 64.106999 seconds with decomposition:

start = time()
model=superover(drive=1.0, tone=1.0, level=1.0)
elapsed = time() - start

When timing just the function call, I get 39.420185 seconds (without decomposition) and 41.874466 seconds (with decomposition):

@time model=superover(drive=1.0, tone=1.0, level=1.0)

In any case, the same test in C++ goes down to 0.374724 seconds for me (without decomposition) and 0.509646 seconds (with decomposition). I think my code is maybe not so bad :)

ghost commented 4 years ago

I've updated the code after removing the temporary isfinite helpers (thanks @romeric) and the LinearSolver instances in favor of Fastor's solver.

https://codeshare.io/5QEE4q

Unfortunately, the solver becomes unstable and goes to NaN after 7189 processed samples. Maybe I've done something wrong by replacing the last_linsolver with a last_J tensor. I've tested with both SimpleLUPiv and BlockLUPiv, same result.

romeric commented 4 years ago

That is a bit strange since SimpleLUPiv is essentially identical to what you have implemented. If you think this is related to Fastor and can narrow it down to a minimal working example, you can submit an issue on Fastor. It would be interesting to know why you are observing this discrepancy.

By the way, it seems like you are doing the LU decomposition twice: once for solving and once for the determinant of nleq->J. I would just decompose once, get the factors and work on them directly:

    Tensor<T,N,N> L, U;
    Tensor<size_t,N> p;
    lu<LUCompType::SimpleLUPiv>(nleq->J, L, U, p);
    Tensor<T,N> y = forward_subs(L, b);
    Tensor<T,N> x = backward_subs(U, y); // x is your solution
    T _det = product(diag(U)); // _det is your determinant
ghost commented 4 years ago

Oh, thank you Roman. So the best way to get both the decomposition and the determinant is to reintroduce the LinearSolver class design and split the process like this:

template <typename T, size_t N>
class LinearSolver
{
public:
    inline bool setlhs(Tensor<T,N,N> const& J)
    {
        Fastor::lu<LUCompType::SimpleLUPiv>(J, L, U, p);
        return product(diag(U)) != 0;
    }

    inline Tensor<T,N>& solve(Tensor<T,N>& x, Tensor<T,N> const& b)
    {
        if (all_of(x != b)) x = b;
        x = Fastor::internal::backward_subs(U, 
            Fastor::internal:: forward_subs(L, x));
        return x;
    }

private:
    Tensor<T,N,N> L, U;
    Tensor<size_t,N> p;
};

By the way, that allows me to keep the same design for the SimpleSolver, where the solve method allows swapping the rhs when it differs: if (all_of(x != b)) x = b;.

Unfortunately, that doesn't change the problem. I think the method is slightly different; let me show you the different results of the setlhs method:

Julia ACME LinearSolver (one-based pivot indices)

┌ Info:
│   A =
│    7×7 Array{Float64,2}:
│     -1.93782e-10  -3.27309e-10   0.0          -0.000110635  -4.81249e-11   0.0          0.0
│      1.89659e-10   3.20345e-10   0.0           0.000110635   9.97722e-5    0.0          0.0
│     -1.13501e-6    3.49486e-12   0.0           0.000203447   5.48017e-13   0.0         -1.0
│      5.54995e-8   -1.70891e-13   0.0          -5.22136e-8   -2.67969e-14  -5.2211e-8   -1.0
│      0.0           0.0           0.0           0.0           0.0           9.22111e-8  -1.0
│      5.98052e-18   3.08622e-12  -0.000111011   1.58109e-16   1.58398e-18   0.0          0.0
└     -3.61183e-6    4.03106e-5    0.000111011   1.37001e-9    1.3725e-11    0.0          0.0
┌ Info:
│   factors =
│    7×7 Array{Float64,2}:
│     -276868.0               4.03106e-5       0.000111011      1.37001e-9       1.3725e-11    0.0         0.0
│           0.314247     -78942.3             -3.4885e-5        0.000203446     -3.76504e-12   0.0        -1.0
│          -1.65582e-12      -2.43639e-7   -9008.09             4.95675e-11      6.6669e-19    0.0        -2.43639e-7
│           5.3652e-5         0.00019657      -8.11971e-6   -9035.46            -4.81249e-11   0.0         0.00019657
│          -5.25105e-5       -0.000192388      7.94695e-6      -0.999991     10022.8           0.0         4.18064e-6
│          -0.0              -0.0             -0.0             -0.0              0.0           1.08447e7  -1.0
└          -0.015366         -0.0488979       -3.81507e-18     -0.089414        -4.31286e-8   -0.566211   -0.61916
┌ Info:
│   solver.ipiv =
│    7-element Array{Int32,1}:
│     7
│     3
│     6
│     7
│     6
│     6
└     7

C++ ACME LinearSolver (zero-based pivot indices)

A
  -1.9378e-10  -3.2731e-10            0  -1.1064e-04  -4.8125e-11            0            0
   1.8966e-10   3.2035e-10            0   1.1063e-04   9.9772e-05            0            0
  -1.1350e-06   3.4949e-12            0   2.0345e-04   5.4802e-13            0  -1.0000e+00
   5.5500e-08  -1.7089e-13            0  -5.2214e-08  -2.6797e-14  -5.2211e-08  -1.0000e+00
            0            0            0            0            0   9.2211e-08  -1.0000e+00
   5.9805e-18   3.0862e-12  -1.1101e-04   1.5811e-16   1.5840e-18            0            0
  -3.6118e-06   4.0311e-05   1.1101e-04   1.3700e-09   1.3725e-11            0            0

factors
  -2.7687e+05   4.0311e-05   1.1101e-04   1.3700e-09   1.3725e-11            0            0
   3.1425e-01  -7.8942e+04  -3.4885e-05   2.0345e-04  -3.7650e-12            0  -1.0000e+00
  -1.6558e-12  -2.4364e-07  -9.0081e+03   4.9568e-11   6.6669e-19            0  -2.4364e-07
   5.3652e-05   1.9657e-04  -8.1197e-06  -9.0355e+03  -4.8125e-11            0   1.9657e-04
  -5.2510e-05  -1.9239e-04   7.9469e-06  -9.9999e-01   1.0023e+04            0   4.1806e-06
            0            0            0            0            0   1.0845e+07  -1.0000e+00
  -1.5366e-02  -4.8898e-02  -3.8151e-18  -8.9414e-02  -4.3129e-08  -5.6621e-01  -6.1916e-01

ipiv
        6
        2
        5
        6
        5
        5
        6

C++ Fastor LU Solver (SimpleLUPiv)

A
[-1.93781844857805e-10, -3.27309180513646e-10,                     0, -0.000110635007888986, -4.81248665931391e-11,                     0,                     0]
[ 1.89658827001645e-10,  3.20345155598241e-10,                     0,  0.000110634898880709,  9.97722159650126e-05,                     0,                     0]
[-1.13500832994181e-06,  3.49485842178722e-12,                     0,  0.000203446518193683,  5.48017351963061e-13,                     0,                    -1]
[ 5.54995375369355e-08,  -1.7089127986946e-13,                     0, -5.22136409305715e-08, -2.67969042991298e-14, -5.22109661173598e-08,                    -1]
[                    0,                     0,                     0,                     0,                     0,  9.22110678118406e-08,                    -1]
[ 5.98052186065483e-18,  3.08622390873733e-12, -0.000111011340985907,  1.58109417995967e-16,  1.58397712569323e-18,                     0,                     0]
[-3.61182967708414e-06,  4.03105545826988e-05,  0.000111011340979684,  1.37000530292441e-09,  1.37250335205588e-11,                     0,                     0]

L
[                    1,                     0,                     0,                     0,                     0,                     0,                     0]
[    -65.0785544776915,                     1,                     0,                     0,                     0,                     0,                     0]
[  0.00341730463745621,  7.94694668140661e-06,                     1,                     0,                     0,                     0,                     0]
[    -20.4507709489733,  1.00196313121033e-23,  1.26081521794353e-18,                     1,                     0,                     0,                     0]
[ -0.00349159386650459, -8.11970638675431e-06,      -1.0217391297908,    0.0118836483836328,                     1,                     0,                     0]
[ 1.07758048554454e-10,  7.65612081542758e-08,      125834.501816522,     -68790.4044028179,     -123157.230208897,                     1,                     0]
[                    0,                     0,                    -0,                     0,                     0, -1.28223381615304e-06,                     1]

U
[ 5.54995375369355e-08,  -1.7089127986946e-13,                     0, -5.22136409305715e-08, -2.67969042991298e-14, -5.22109661173598e-08,                    -1]
[                    0,  4.03105434613414e-05,  0.000111011340979684,  -3.3966182704759e-06,  1.19811297242944e-11, -3.39781420280151e-06,     -65.0785544776915]
[                    0,                     0, -8.82201207796999e-10,   0.00011063510430337,  9.97722159650089e-05,  2.05423024941912e-10,   0.00393448043999344]
[                    0,                     0,                     0,    0.0002023787089826, -1.25794328216752e-22, -1.06775450909073e-06,     -21.4507709489733]
[                    0,                     0,                     0,                     0,  0.000101941128992526,  1.26888191460727e-08,     0.254913419515445]
[                    0,                     0,                     0,                     0,                     0,   -0.0719143939663768,     -1444707.87102258]
[                    0,                     0,                     0,                     0,                     0,                     0,     -2.85245328668762]

p
[3]
[6]
[1]
[2]
[0]
[5]
[4]

Note for @martinholters (if you are reading this): this is the first call to setlhs during the ModelRunner process, so right after the initial solution computation. As you can see, the Julia matrix A has a slightly different element order than in the original ACME code. That is because, for obscure C++ reasons and code simplification, I changed the container of the pins inside the struct Element from a Dict to an OrderedDict (pins :: OrderedDict{Symbol, Vector{Tuple{Int, Int}}}). All the subsequent matrices and vectors are reordered accordingly, so there is no change in the process or result, but it helps me debug my code more easily!

The updated full code is still here: https://codeshare.io/5QEE4q

So @romeric, I think the difference in results comes from the method employed in the LU factorization (the pivot index difference). I'm not able to go deep into this problem, my science background is limited. But I'm sure that, because of the intensive calls to the linear solver during the process, the performance gain could be greatly increased using SIMD vectorization on small tensors (7x7, 13x13 ... up to ?).

In any case, thank you very much for your help.

martinholters commented 4 years ago

From a superficial glance: Shouldn't solve (in the SimpleLUPiv-based implementation) also apply the permutation p somewhere?

ghost commented 4 years ago

@martinholters You're right.

    inline Tensor<T,N>& solve(Tensor<T,N>& x, Tensor<T,N> const& b)
    {
        if (all_of(x != b)) x = b;
        x = Fastor::internal::backward_subs(U, 
            Fastor::internal:: forward_subs(L, p, x));
        return x;
    }

Anyway, the problem stays the same as with the initial implementation. This implementation does avoid the double call, though. I'm currently investigating the question; maybe (because the Julia implementation works on a single factors matrix) the idea will be to keep your initial code and see with @romeric what the best way is to optimize it.

romeric commented 4 years ago

My bad for giving incorrect instructions. I copy-pasted the code in haste. You do need to apply the pivot, specifically in the forward substitution step:

Tensor<T,N> y = internal::forward_subs(L, p, b);

Also, I hope I am not daydreaming here, but a pivot is nothing but a permutation vector of rows (or columns) and you shouldn't get duplicated entries in a pivot vector (in your own case you do which is a bit odd).

Different permutations of rows, however (different pivots), do not impact the final result, and in fact Julia, NumPy and Fastor will all give you different pivots while giving you the same final solution vector.

Here is an example of solving your matrix A with a right-hand-side vector of all ones, in Julia and Fastor:

    Tensor<double,7,7> L, U;
    Tensor<size_t, 7> p;
    lu<LUCompType::SimpleLUPiv>(A, L, U, p);
    Tensor<double,7> b; b.ones();
    Tensor<double,7> y = internal::forward_subs(L, p, b);
    Tensor<double,7> x = internal::backward_subs(U, y);
    print(x);
[-1570845.87851703]
[-91132.8413880232]
[-9008.09117721191]
[-9035.71802215123]
[ 20045.5893366909]
[-600389.114233065]
[-1.05536252132604]

and in Julia:

F = lu(A); F.U\(F.L\b[F.p])

7-element Array{Float64,1}:
      -1.57084587851703e6
  -91132.84138897745
   -9008.091176865402
   -9035.718022151223
   20045.58933669087
 -600389.1142330621
      -1.0553625213260358

While the Fastor pivot vector

[3]
[6]
[1]
[2]
[0]
[5]
[4]

and the Julia pivot vector

7-element Array{Int64,1}:
 7
 3
 6
 1
 2
 5
 4

are different. So maybe your LU decomposition does not do a conventional LU or does not do what it is intended to do.

ghost commented 4 years ago

The same observation when using Julia's standard lu! function:

┌ Info: A
│   A =
│    7×7 Array{Float64,2}:
│     -1.93782e-10  -3.27309e-10   0.0          -0.000110635  -4.81249e-11   0.0          0.0
│      1.89659e-10   3.20345e-10   0.0           0.000110635   9.97722e-5    0.0          0.0
│     -1.13501e-6    3.49486e-12   0.0           0.000203447   5.48017e-13   0.0         -1.0
│      5.54995e-8   -1.70891e-13   0.0          -5.22136e-8   -2.67969e-14  -5.2211e-8   -1.0
│      0.0           0.0           0.0           0.0           0.0           9.22111e-8  -1.0
│      5.98052e-18   3.08622e-12  -0.000111011   1.58109e-16   1.58398e-18   0.0          0.0
└     -3.61183e-6    4.03106e-5    0.000111011   1.37001e-9    1.3725e-11    0.0          0.0
┌ Info: L
│   l =
│    7×7 Array{Float64,2}:
│      1.0           0.0           0.0           0.0        0.0          0.0       0.0
│      0.314247      1.0           0.0           0.0        0.0          0.0       0.0
│     -1.65582e-12  -2.43639e-7    1.0           0.0        0.0          0.0       0.0
│      5.3652e-5     0.00019657   -8.11971e-6    1.0        0.0          0.0       0.0
│     -5.25105e-5   -0.000192388   7.94695e-6   -0.999991   1.0          0.0       0.0
│     -0.0          -0.0          -0.0          -0.0        0.0          1.0       0.0
└     -0.015366     -0.0488979    -1.08603e-18  -0.089414  -4.31286e-8  -0.566211  1.0
┌ Info: U
│   u =
│    7×7 Array{Float64,2}:
│     -3.61183e-6   4.03106e-5   0.000111011   1.37001e-9    1.3725e-11   0.0          0.0
│      0.0         -1.26675e-5  -3.4885e-5     0.000203446  -3.76504e-12  0.0         -1.0
│      0.0          0.0         -0.000111011   4.95675e-11   6.6669e-19   0.0         -2.43639e-7
│      0.0          0.0          0.0          -0.000110675  -4.81249e-11  0.0          0.00019657
│      0.0          0.0          0.0           0.0           9.97722e-5   0.0          4.18064e-6
│      0.0          0.0          0.0           0.0           0.0          9.22111e-8  -1.0
└      0.0          0.0          0.0           0.0           0.0          0.0         -1.61509
┌ Info: p
│   p =
│    7-element Array{Int32,1}:
│     7
│     3
│     6
│     1
│     2
│     5
└     4

I'll take a look at the difference in implementation right now.

romeric commented 4 years ago

Check your final solution vector to see if you are actually getting the same results

ghost commented 4 years ago

OK. I think the solution is in @martinholters' code comment:

> based on Julia's generic_lufact!, but storing inverses on the diagonal; faster than calling out to dgetrf for sizes up to about 60×60

The only difference is here:

# Scale first column
fkkinv = factors[k,k] = inv(factors[k,k])

while the initial implementation is:

# Scale first column
Akkinv = inv(A[k,k])

The diagonal entries become the reciprocals of their values.

martinholters commented 4 years ago

> Also, I hope I am not daydreaming here, but a pivot is nothing but a permutation vector of rows (or columns) and you shouldn't get duplicated entries in a pivot vector (in your own case you do which is a bit odd).

I guess you're referring to ipiv. That's not a permutation vector as such. IIRC, it encodes the permutation as "for i from 1 to N, exchange entries (rows) i and ipiv[i]". Here, duplicate entries do make sense.
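
For illustration, here is roughly how the two encodings would be applied to a right-hand side, in zero-based C++ terms (hypothetical helper functions, not from Fastor or ACME):

// LAPACK-style ipiv is applied as successive row swaps,
// a permutation vector is applied as a gather
#include <utility>
#include <vector>

// ipiv[i] = row swapped with row i during factorization (entries may repeat)
void apply_ipiv(std::vector<double>& b, const std::vector<int>& ipiv)
{
    for (std::size_t i = 0; i < ipiv.size(); ++i)
        std::swap(b[i], b[static_cast<std::size_t>(ipiv[i])]);
}

// p[i] = index of the row that ends up at position i (each index appears once)
std::vector<double> apply_permutation(const std::vector<double>& b, const std::vector<int>& p)
{
    std::vector<double> out(b.size());
    for (std::size_t i = 0; i < p.size(); ++i)
        out[i] = b[static_cast<std::size_t>(p[i])];
    return out;
}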

ghost commented 4 years ago

@martinholters it's my mistake, I got confused. I was talking about pivot indices but the variable name misled me. Sorry.

martinholters commented 4 years ago

> OK. I think the solution is in @martinholters' code comment:
>
> based on Julia's generic_lufact!, but storing inverses on the diagonal; faster than calling out to dgetrf for sizes up to about 60×60

Hah, right! Sometimes I actually do leave helpful comments :smile: Yes, the idea here is to avoid repeated reciprocal/division when solving for multiple RHSs.

ghost commented 4 years ago

Ah cool 😄! Now the big question is: is the change "conventional" in the sense of @romeric's question?

martinholters commented 4 years ago

It's not conventional in how the result is stored: Instead of storing L (except for the implicit ones on the diagonal) and U, it stores U with the diagonal elements inverted. Of course, this is dealt with in the back-substitution by replacing a division in the conventional implementation with a multiplication. As long as the LU decomposition and the solving part agree on the storage format, that shouldn't make much difference, though. Likewise, different pivoting should only result in different numerical round-off noise which I don't expect to be problematic here.
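
For illustration only (a minimal sketch under assumptions, not ACME's actual solver code), this is roughly what an in-place LU factorization that stores reciprocals on the diagonal looks like, together with the matching substitution that multiplies instead of divides:

// in-place LU with reciprocals stored on the diagonal -- no pivoting shown, for brevity
#include <cstddef>

template <typename T, std::size_t N>
void lu_inplace_inv_diag(T (&A)[N][N])
{
    for (std::size_t k = 0; k < N; ++k)
    {
        A[k][k] = T(1) / A[k][k];               // store the reciprocal of the pivot
        for (std::size_t i = k + 1; i < N; ++i)
        {
            A[i][k] *= A[k][k];                 // L factor (multiply, not divide)
            for (std::size_t j = k + 1; j < N; ++j)
                A[i][j] -= A[i][k] * A[k][j];
        }
    }
}

template <typename T, std::size_t N>
void solve_inplace(const T (&LU)[N][N], T (&b)[N])
{
    // forward substitution with the unit-diagonal L part
    for (std::size_t i = 1; i < N; ++i)
        for (std::size_t j = 0; j < i; ++j)
            b[i] -= LU[i][j] * b[j];
    // back substitution; the stored diagonal is already 1/U(i,i)
    for (std::size_t i = N; i-- > 0; )
    {
        for (std::size_t j = i + 1; j < N; ++j)
            b[i] -= LU[i][j] * b[j];
        b[i] *= LU[i][i];                       // multiply by the stored reciprocal
    }
}

The advantage only shows when the same factorization is reused for multiple right-hand sides, as Martin notes below.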

ghost commented 4 years ago

OK. Thanks for the clarification, Martin. So this is not a method that could be implemented inside Fastor as-is. Therefore, there are two solutions. The first is to duplicate the current Fastor solver and modify it according to the indications you provided. The other is to keep the current Julia algorithm and attempt to adapt the code with the help of Fastor's optimizations. I need @romeric's expertise to know what to do, in particular regarding the potential gain, because it would seem to be substantial.

romeric commented 4 years ago

> As long as the LU decomposition and the solving part agree on the storage format, that shouldn't make much difference, though. Likewise, different pivoting should only result in different numerical round-off noise which I don't expect to be problematic here.

@dorpxam as long as your implementation is supposed to do a dgetrf-based solve, the observable behaviour should be the same. Of course numerical round-off error will accrue and make two seemingly identical implementations diverge, especially if you have an iterative (nonlinear) solver implemented on top of that, which you do, but not to the point where you'd get completely different results. In your specific case you are also working with matrices that have a pretty high condition number.

If I were you I would stick with your own (Julia or Julia-style) implementation and keep things at a high level using Fastor's tensors. Given that your matrix sizes are small and compile-time constants, the compiler will do a pretty good job optimising them. Fastor's internal solver routines (like most optimised BLAS routines) get pretty low-level to squeeze out that last bit of performance. But that is going to be a futile exercise for you here.

Also don't directly compare the result of a determinant product(diag(U)) to zero but check if it is close enough (you define what is an acceptable "close-enough" for your application).
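
For example, one possible way to express such a check (the threshold is an application-specific assumption, not a Fastor or ACME constant):

// treat the matrix as singular when |det| is negligible relative to the problem's scale
#include <cmath>
#include <limits>

template <typename T>
bool is_singular(T det, T scale = T(1))
{
    return std::abs(det) <= scale * std::numeric_limits<T>::epsilon() * T(100);
}

// e.g. in setlhs:  return !is_singular(product(diag(U)));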