Improve fmpz_mod a bit and rewrite fmpz_lucas_chain

Fixes #1965 by using a Z/nZ implementation (nmod, mpn_mod or fmpz_mod) instead of direct fmpz arithmetic to compute the Lucas chain.

Listed below is the speedup to run fmpz_is_probabprime_BPSW(p) given p = nextprime(2^(b-1)).

There is a slight regression just above 1024 bits, presumably because fmpz_mod_mul currently has a bunch of overhead it shouldn't have.

I guess this can be made to go even faster if we implement Montgomery reduction.

b      old time   new time   speedup

100    9.66e-06  3.47e-06    2.784
110    1.08e-05  3.88e-06    2.784
121    1.16e-05  4.08e-06    2.843
133    1.98e-05  1.25e-05    1.584
146    2.17e-05  1.33e-05    1.632
160    2.35e-05  1.5e-05    1.567
176    2.61e-05  1.68e-05    1.554
193    3.73e-05  2.27e-05    1.643
212    3.7e-05  2.47e-05    1.498
233    4.09e-05  2.76e-05    1.482
256    4.29e-05  2.61e-05    1.644
281    5.73e-05  3.88e-05    1.477
309    6.36e-05  4.26e-05    1.493
339    8.12e-05  5.58e-05    1.455
372    8.75e-05  6.11e-05    1.432
409    0.000116  7.87e-05    1.474
449    0.000147  9.74e-05    1.509
493    0.00016  0.000108    1.481
542    0.000208  0.000144    1.444
596    0.000252  0.000197    1.279
655    0.000321  0.000247    1.300
720    0.000398  0.000308    1.292
792    0.000497  0.000379    1.311
871    0.000571  0.000468    1.220
958    0.000729  0.00058    1.257
1053    0.00101  0.00106    0.953
1158    0.00132  0.00142    0.930
1273    0.0016  0.00171    0.936
1400    0.00206  0.002    1.030
1540    0.00274  0.0027    1.015
1694    0.00334  0.00317    1.054
1863    0.00429  0.00411    1.044
2049    0.00557  0.00569    0.979
2253    0.00699  0.00655    1.067
2478    0.00891  0.00824    1.081
2725    0.0117  0.0108    1.083
2997    0.0148  0.0136    1.088
3296    0.0193  0.0179    1.078
3625    0.0251  0.0231    1.087
3987    0.0322  0.0301    1.070
4385    0.0419  0.0391    1.072
4823    0.0534  0.0501    1.066
5305    0.07  0.0644    1.087
5835    0.0896  0.0827    1.083
6418    0.117  0.107    1.093
7059    0.148  0.137    1.080

flintlib / flint

Improve fmpz_mod a bit and rewrite fmpz_lucas_chain #1997