Closed videlec closed 3 years ago
I am adding the test programs.
Old (first e-antic version specified above):
#include <sstream>
#include <cstdlib>
#include "e-antic/renfxx.h"
#include <gperftools/profiler.h>

using namespace std;

int main(void) {
    ProfilerStart("renf_time_old.prof");
    istringstream is("min_poly (a^2-2) embedding 1.4+/-0.1");
    renf_class NF;
    is >> NF;
    cout << NF << endl;
    renf_elem_class elem(NF.get_renf());
    renf_elem_class squ(NF.get_renf());
    stringstream inout;
    inout >> set_renf(NF.get_renf());
    inout << "( 37*a - 4)";
    inout >> elem;
    for (long i = 0; i < 100000000; ++i) {
        squ = elem * elem;
    }
    cout << elem << endl;
    ProfilerStop();
}
New:
#include <sstream>
#include <cstdlib>
#include "e-antic/renfxx.h"
#include <gperftools/profiler.h>

using namespace std;

int main(void) {
    ProfilerStart("renf_time_new.prof");
    string mp_string = "a^2-2";
    string emb_string = "1.4+/-0.1";
    renf_class Renf(mp_string, "a", emb_string);
    char *res, *res1;
    res = fmpq_poly_get_str_pretty(Renf.get_renf()->nf->pol, "a");
    res1 = arb_get_str(Renf.get_renf()->emb, 64, 0);
    cout << "min_poly "
         << "(" << res << ")"
         << " embedding " << res1 << endl
         << endl;
    flint_free(res);
    flint_free(res1);
    stringstream inout;
    Renf.set_istream(inout);
    inout << "( 37*a - 4)";
    renf_elem_class elem, squ;
    inout >> elem;
    for (long i = 0; i < 100000000; ++i) {
        squ = elem * elem;
        // cout << i << endl;
    }
    cout << elem << endl;
    ProfilerStop();
}
Let me add some more information. The test program needs 12 sec with the old version on my PC and 20 sec with the new version. gperftools gives the following data, from which one can see that the new version spends much more time on Flint memory management.
Old:
Total: 1217 samples
115 9.4% 9.4% 115 9.4% fmpz_mul
87 7.1% 16.6% 310 25.5% arb_mul
69 5.7% 22.3% 95 7.8% fmpz_mul_si
66 5.4% 27.7% 471 38.7% _nf_elem_mul_red
61 5.0% 32.7% 61 5.0% _fmpz_demote (inline)
61 5.0% 37.7% 83 6.8% fmpz_submul_ui
58 4.8% 42.5% 589 48.4% nf_elem_mul_red
50 4.1% 46.6% 50 4.1% mag_fast_addmul (inline)
42 3.5% 50.0% 53 4.4% arf_mul_rnd_down
42 3.5% 53.5% 57 4.7% fmpz_set
41 3.4% 56.9% 76 6.2% fmpz_addmul_ui
41 3.4% 60.2% 41 3.4% mag_fast_mul (inline)
37 3.0% 63.3% 37 3.0% mag_fast_init_set_arf (inline)
35 2.9% 66.1% 35 2.9% mag_fast_add_2exp_si (inline)
29 2.4% 68.5% 80 6.6% arb_set
...
New:
Total: 2066 samples
249 12.1% 12.1% 249 12.1% _fmpz_demote
127 6.1% 18.2% 127 6.1% fmpz_init@418a9a
121 5.9% 24.1% 121 5.9% fmpz_swap
111 5.4% 29.4% 490 23.7% _nf_elem_mul_red
94 4.5% 34.0% 94 4.5% mag_fast_addmul (inline)
93 4.5% 38.5% 220 10.6% nf_elem_init
85 4.1% 42.6% 412 19.9% nf_elem_clear
78 3.8% 46.4% 327 15.8% fmpz_clear
75 3.6% 50.0% 75 3.6% fmpz_mul
63 3.0% 53.0% 299 14.5% arb_mul
56 2.7% 55.8% 56 2.7% fmpz_init@4134ab
53 2.6% 58.3% 74 3.6% fmpz_mul_si
53 2.6% 60.9% 1048 50.7% nf_elem_mul_red
52 2.5% 63.4% 85 4.1% fmpz_set
...
I have repeated the experiment in degree 12:
min_poly (a^12 + 1*a^6 + 1*a^5 + 1*a^2 - 5) embedding [1.0350235612733869786695407243642552311 +/- 5.66e-38]
elem
( a^10-3a^9+a^6-a^5+a-22)
For these data the new version is slightly faster: 97 sec vs. 107 sec old.
On master, the C++ code definitely has an overhead problem. Here are two simple programs running multiplication in C and C++ (over quadratic numbers). Compiling with

$ gcc b-mul_quad_wordsize.c -o b-mul_quad_wordsize_c -leantic -larb -lflint -lgmp
$ g++ b-mul_quad_wordsize.cpp -o b-mul_quad_wordsize_cpp -std=c++17 -leanticxx -leantic -larb -lflint -lgmp

I obtain the following 7x slowdown of the C++ code compared to the C code:
$ time -p ./b-mul_quad_wordsize_c -n 10000000 -p 64
Running 10000000 multiplication with arb precision 64
real 1.04
user 1.04
sys 0.00
$ time -p ./b-mul_quad_wordsize_cpp -n 10000000 -p 64
Running 10000000 multiplication with arb precision 64
real 6.86
user 6.84
sys 0.00
Same procedure with cubic fields
$ time -p ./b-mul_cubic_wordsize_c -n 10000000 -p 64
Running 10000000 multiplication with arb precision 64
real 2.05
user 2.05
sys 0.00
$ time -p ./b-mul_cubic_wordsize_cpp -n 10000000 -p 64
Running 10000000 multiplication with arb precision 64
real 7.50
user 7.49
sys 0.00
It is of course possible that the old version has the same overhead problem, except that it is not so heavy.
I don't see such a big gap with your test programs @videlec. For the cubic case I get 4.80s vs 2.66s. Not great of course.
The C++ code is expected to be a bit slower because it needs one or two extra assignments (depending on how exactly you write it). This accounts for about 9% of the C++ runtime ~but this is not the problem here~. (Btw., using some heavy machinery such as yap one can get rid of these extra assignments.)
@videlec I think that your tests are not fair comparisons actually. The C++ code needs to do much more work. When you write z = x*y, the C equivalent would be

// Use the copy constructor to create a temporary = x
renf_elem_t tmp;
renf_elem_init(tmp, nf);
renf_elem_set(tmp, x, nf);
// tmp *= y (omitting checks for parent compatibility)
renf_elem_mul(tmp, tmp, y, nf);
// z = tmp (omitting setting the parent here)
renf_elem_set(z, tmp, nf);
renf_elem_clear(tmp);
(at least after applying a small patch that is in https://github.com/videlec/e-antic/pull/76)
Changing the C benchmark to this, I get 4.84s for C++ and 4.24s for C.
To improve this to the compact C level, I think we need to use something like yap which lets us get rid of the temporary variable.
@videlec @w-bruns, the version of that branch at https://github.com/mkoeppe/e-antic seems to use ANTIC directly which uses unreleased features of FLINT. It's a bit more work to build this whole stack to reproduce this.
Could you try to see whether b-mul_cubic_wordsize.c shows a regression between these two releases as well?
@videlec @saraedum The combination of antic and the unreleased version of Flint is most likely the explanation. I could have seen this earlier. Sorry.
But it leaves the question whether it could make sense to import the critical Flint (or antic) functions.
Another question is whether it could make sense to go to the C level at some critical points in the Normaliz code. I see two of them: scalar products and elementary row transformations of matrices.
@w-bruns We could think about having more efficient linear algebra directly in antic.
@videlec But would this mean to link the unreleased version of Flint? With Bill Hart I compared the 64-bit linear algebra of Normaliz and Flint. For this number type Flint is not an advantage, but antic could be better as soon as GMP integers come into play.
One remark on multiplication and subsequent reduction modulo the minimal polynomial: reduction is a linear map --- could it make sense to realize it by matrix multiplication? Which would mean to store this matrix with the number field. (Perhaps antic does this.)
Antic does not exactly do linear algebra (I guess that this would be inefficient). The basis is 1, a, a^2, ..., a^{n-1}, and you need to reduce the monomials a^{i+j} coming from a product a^i a^j only when i+j >= n. These reductions are stored in the antic data structure (depending on the degree).
> @videlec I think that your tests are not fair comparisons actually. The C++ code needs to do much more work. When you write z = x*y, the C equivalent would be
>
>     // Use the copy constructor to create a temporary = x
>     renf_elem_t tmp;
>     renf_elem_init(tmp, nf);
>     renf_elem_set(tmp, x, nf);
>     // tmp *= y (omitting checks for parent compatibility)
>     renf_elem_mul(tmp, tmp, y, nf);
>     // z = tmp (omitting setting the parent here)
>     renf_elem_set(z, tmp, nf);
>     renf_elem_clear(tmp);
>
> (at least after applying a small patch that is in #76)
>
> Changing the C benchmark to this, I get 4.84s for C++ and 4.24s for C.
Here I don't understand. Why does z = x * y perform any allocation for z? Isn't the allocation supposed to be done once and for all in the declaration renf_elem_class x(K), y(K), z(K);?
> Here I don't understand. Why does z = x * y perform any allocation for z? Isn't the allocation supposed to be done once and for all in the declaration renf_elem_class x(K), y(K), z(K);?
It does not allocate for z. It allocates for the result of x * y before it assigns to z.

... which is tmp.
> Here I don't understand. Why does z = x * y perform any allocation for z? Isn't the allocation supposed to be done once and for all in the declaration renf_elem_class x(K), y(K), z(K);?
>
> It does not allocate for z. It allocates for the result of x * y before it assigns to z.
Wouldn't it be possible to use z directly for that purpose (for which allocation is already done)?
That's what yap can do. You'd need to know about the semantics of operator= and operator* to know that it's safe to make that optimization.
Is yap smart enough to deal with the standard add-mul z += x * y? And with more complicated arithmetic expressions?
Yes but it's not trivial to set up.
It's essentially an easier version of the example on https://www.boost.org/doc/libs/1_71_0/doc/html/yap.html.
How does GMP deal with the same problem?
From my viewpoint, the most essential operation is a +/-= b*c, where / indicates alternatives. A handy function call for it would be good.
Do we have renf_elem_t renf_elem_class::get_renf_elem_t()?
Since the performance problem has been fixed/mitigated, I moved the proposal about the specialized operator to #209.
As reported by W. Bruns using Normaliz with quadratic numbers, there is a huge performance drop between

and