Closed videlec closed 3 years ago
I am adding the test programs.
Old (first e-antic version specified above):
#include <sstream>
#include <cstdlib>
#include "e-antic/renfxx.h"
#include <gperftools/profiler.h>

using namespace std;

int main(void) {
    ProfilerStart("renf_time_old.prof");
    istringstream is("min_poly (a^2-2) embedding 1.4+/-0.1");
    renf_class NF;
    is >> NF;
    cout << NF << endl;
    renf_elem_class elem(NF.get_renf());
    renf_elem_class squ(NF.get_renf());
    stringstream inout;
    inout >> set_renf(NF.get_renf());
    inout << "( 37*a - 4)";
    inout >> elem;
    for (long i = 0; i < 100000000; ++i) {
        squ = elem * elem;
    }
    cout << elem << endl;
    ProfilerStop();
}
New:
#include <sstream>
#include <cstdlib>
#include "e-antic/renfxx.h"
#include <gperftools/profiler.h>

using namespace std;

int main(void) {
    ProfilerStart("renf_time_new.prof");
    string mp_string = "a^2-2";
    string emb_string = "1.4+/-0.1";
    renf_class Renf(mp_string, "a", emb_string);
    char *res, *res1;
    res = fmpq_poly_get_str_pretty(Renf.get_renf()->nf->pol, "a");
    res1 = arb_get_str(Renf.get_renf()->emb, 64, 0);
    cout << "min_poly "
         << "(" << res << ")"
         << " embedding " << res1 << endl
         << endl;
    flint_free(res);
    flint_free(res1);
    stringstream inout;
    Renf.set_istream(inout);
    inout << "( 37*a - 4)";
    renf_elem_class elem, squ;
    inout >> elem;
    for (long i = 0; i < 100000000; ++i) {
        squ = elem * elem;
        // cout << i << endl;
    }
    cout << elem << endl;
    ProfilerStop();
}
Let me add some more information. The test program needs 12 sec with the old version on my PC and 20 sec with the new version. gperftools gives the following data, from which one can see that the new version spends much more time on Flint memory management.
Old:
Total: 1217 samples
115 9.4% 9.4% 115 9.4% fmpz_mul
87 7.1% 16.6% 310 25.5% arb_mul
69 5.7% 22.3% 95 7.8% fmpz_mul_si
66 5.4% 27.7% 471 38.7% _nf_elem_mul_red
61 5.0% 32.7% 61 5.0% _fmpz_demote (inline)
61 5.0% 37.7% 83 6.8% fmpz_submul_ui
58 4.8% 42.5% 589 48.4% nf_elem_mul_red
50 4.1% 46.6% 50 4.1% mag_fast_addmul (inline)
42 3.5% 50.0% 53 4.4% arf_mul_rnd_down
42 3.5% 53.5% 57 4.7% fmpz_set
41 3.4% 56.9% 76 6.2% fmpz_addmul_ui
41 3.4% 60.2% 41 3.4% mag_fast_mul (inline)
37 3.0% 63.3% 37 3.0% mag_fast_init_set_arf (inline)
35 2.9% 66.1% 35 2.9% mag_fast_add_2exp_si (inline)
29 2.4% 68.5% 80 6.6% arb_set
...
New:
Total: 2066 samples
249 12.1% 12.1% 249 12.1% _fmpz_demote
127 6.1% 18.2% 127 6.1% fmpz_init@418a9a
121 5.9% 24.1% 121 5.9% fmpz_swap
111 5.4% 29.4% 490 23.7% _nf_elem_mul_red
94 4.5% 34.0% 94 4.5% mag_fast_addmul (inline)
93 4.5% 38.5% 220 10.6% nf_elem_init
85 4.1% 42.6% 412 19.9% nf_elem_clear
78 3.8% 46.4% 327 15.8% fmpz_clear
75 3.6% 50.0% 75 3.6% fmpz_mul
63 3.0% 53.0% 299 14.5% arb_mul
56 2.7% 55.8% 56 2.7% fmpz_init@4134ab
53 2.6% 58.3% 74 3.6% fmpz_mul_si
53 2.6% 60.9% 1048 50.7% nf_elem_mul_red
52 2.5% 63.4% 85 4.1% fmpz_set
...
I have repeated the experiment in degree 12:
min_poly (a^12 + 1*a^6 + 1*a^5 + 1*a^2 - 5) embedding [1.0350235612733869786695407243642552311 +/- 5.66e-38]
elem
( a^10-3a^9+a^6-a^5+a-22)
For these data the new version is slightly faster: 97 sec vs. 107 sec old.
On master, the C++ code definitely has an overhead problem. Here are two simple programs running multiplication in C and C++ (over quadratic numbers). Compiling with

$ gcc b-mul_quad_wordsize.c -o b-mul_quad_wordsize_c -leantic -larb -lflint -lgmp
$ g++ b-mul_quad_wordsize.cpp -o b-mul_quad_wordsize_cpp -std=c++17 -leanticxx -leantic -larb -lflint -lgmp

I obtain the following 7x slowdown of the C++ code compared to the C code:
$ time -p ./b-mul_quad_wordsize_c -n 10000000 -p 64
Running 10000000 multiplication with arb precision 64
real 1.04
user 1.04
sys 0.00
$ time -p ./b-mul_quad_wordsize_cpp -n 10000000 -p 64
Running 10000000 multiplication with arb precision 64
real 6.86
user 6.84
sys 0.00
Same procedure with cubic fields
$ time -p ./b-mul_cubic_wordsize_c -n 10000000 -p 64
Running 10000000 multiplication with arb precision 64
real 2.05
user 2.05
sys 0.00
$ time -p ./b-mul_cubic_wordsize_cpp -n 10000000 -p 64
Running 10000000 multiplication with arb precision 64
real 7.50
user 7.49
sys 0.00
It is of course possible that the old version has the same overhead problem, except that it is not so heavy.
I don't see such a big gap with your test programs @videlec. For the cubic case I get 4.80s vs 2.66s. Not great of course.
The C++ code is expected to be a bit slower because it needs one or two extra assignments (depending on how exactly you write it). This accounts for about 9% of the C++ runtime ~but this is not the problem here~. (Btw., using some heavy machinery such as yap one can get rid of these extra assignments.)
@videlec I think that your tests are not fair comparisons actually. The C++ code needs to do much more work. When you write z = x*y, the C equivalent would be

// Use the copy constructor to create a temporary = x
renf_elem_t tmp;
renf_elem_init(tmp, nf);
renf_elem_set(tmp, x, nf);
// tmp *= y (omitting checks for parent compatibility)
renf_elem_mul(tmp, tmp, y, nf);
// z = tmp (omitting setting the parent here)
renf_elem_set(z, tmp, nf);
renf_elem_clear(tmp);
(at least after applying a small patch that is in https://github.com/videlec/e-antic/pull/76)
Changing the C benchmark to this, I get 4.84s for C++ and 4.24s for C.
To improve this to the compact C level, I think we need to use something like yap which lets us get rid of the temporary variable.
@videlec @w-bruns, the version of that branch at https://github.com/mkoeppe/e-antic seems to use ANTIC directly which uses unreleased features of FLINT. It's a bit more work to build this whole stack to reproduce this.
Could you try to see whether b-mul_cubic_wordsize.c shows a regression between these two releases as well?
@videlec @saraedum The combination of antic and the unreleased version of Flint is most likely the explanation. I could have seen this earlier. Sorry.
But it leaves the question whether it could make sense to import the critical Flint (or antic) functions.
Another question is whether it could make sense to go to the C level at some critical points in the Normaliz code. I see two of them: scalar products and elementary row transformations of matrices.
@w-bruns We could think about having more efficient linear algebra directly in antic.
@videlec But would this mean to link the unreleased version of Flint? With Bill Hart I compared the 64-bit linear algebra of Normaliz and Flint. For this number type Flint is not an advantage, but antic could be better as soon as GMP integers come into play.
One remark on multiplication and subsequent reduction modulo the minimal polynomial: reduction is a linear map --- could it make sense to realize it by matrix multiplication? Which would mean to store this matrix with the number field. (Perhaps antic does this.)
Antic does not exactly do linear algebra (I guess that this would be inefficient). The basis is 1, a, a^2, ..., a^{n-1}, and you need to reduce the monomials a^{i+j} coming from a product a^i a^j only when i+j >= n. These reductions are stored in the antic data structure (depending on the degree).
> @videlec I think that your tests are not fair comparisons actually. The C++ code needs to do much more work. When you write z = x*y, the C equivalent would be
>
>     // Use the copy constructor to create a temporary = x
>     renf_elem_t tmp;
>     renf_elem_init(tmp, nf);
>     renf_elem_set(tmp, x, nf);
>     // tmp *= y (omitting checks for parent compatibility)
>     renf_elem_mul(tmp, tmp, y, nf);
>     // z = tmp (omitting setting the parent here)
>     renf_elem_set(z, tmp, nf);
>     renf_elem_clear(tmp);
>
> (at least after applying a small patch that is in #76)
>
> Changing the C benchmark to this, I get 4.84s for C++ and 4.24s for C.
Here I don't understand. Why does z = x * y perform any allocation for z? Isn't the allocation supposed to be done once and for all in the declaration renf_elem_class x(K), y(K), z(K);?
> Here I don't understand. Why does z = x * y perform any allocation for z? Isn't the allocation supposed to be done once and for all in the declaration renf_elem_class x(K), y(K), z(K);?
It does not allocate for z. It allocates for the result of x * y before it assigns to z.

... which is tmp.
> Here I don't understand. Why does z = x * y perform any allocation for z? Isn't the allocation supposed to be done once and for all in the declaration renf_elem_class x(K), y(K), z(K);?
>
> It does not allocate for z. It allocates for the result of x * y before it assigns to z.
Wouldn't it be possible to use z directly for that purpose (for which allocation is already done)?
That's what yap can do. You'd need to know about the semantics of operator= and operator* to know that it's safe to make that optimization.
Is yap smart enough to deal with the standard add-mul z += x * y? And with more complicated arithmetic expressions?
Yes but it's not trivial to set up.
It's essentially an easier version of the example on https://www.boost.org/doc/libs/1_71_0/doc/html/yap.html.
How does GMP deal with the same problem?
From my viewpoint, the most essential operation is a +/-= b*c, where / indicates alternatives. A handy function call for it would be good.
Do we have renf_elem_t renf_elem_class::get_renf_elem_t()?
Since the performance problem has been fixed/mitigated, I moved the proposal about the specialized operator to #209.
As reported by W. Bruns using Normaliz with quadratic numbers, there is a huge performance drop between

and