Replace parity backend with k256

This PR replaces the parity backend with one based on k256. You can find it here.

In general, this provides a major performance improvement and a better implementation of the underlying arithmetic. The main advantage is that on 64-bit platforms it now uses u64 for the scalar and field arithmetic and so requires fewer operations. The only thing that is slower is single multiplications by G, which are only about 20% slower. This is because k256 doesn't yet have pre-computed multiplication tables for G.
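To illustrate why wider limbs mean fewer operations, here is a rough sketch (not code from k256 or from this PR) that counts the limb-by-limb products in a schoolbook multiplication of two 256-bit numbers, and performs that multiplication on four u64 limbs. The function names `limbs_needed`, `schoolbook_products` and `mul_4x64` are made up for this example, and the real backend's limb layout and reduction strategy may differ.

```rust
/// Number of limbs of `limb_bits` bits needed to hold a `bits`-bit integer.
/// (Hypothetical helper for illustration only.)
fn limbs_needed(bits: u32, limb_bits: u32) -> u32 {
    (bits + limb_bits - 1) / limb_bits
}

/// Limb-by-limb products in a schoolbook multiplication of two
/// integers that each have `limbs` limbs.
fn schoolbook_products(limbs: u32) -> u32 {
    limbs * limbs
}

/// Schoolbook 256 x 256 -> 512 bit multiplication on little-endian u64 limbs.
/// Each inner step is one 64 x 64 -> 128 bit product, so there are 16 in total.
/// The same job on u32 limbs would need 8 * 8 = 64 products.
fn mul_4x64(a: [u64; 4], b: [u64; 4]) -> [u64; 8] {
    let mut out = [0u64; 8];
    for i in 0..4 {
        let mut carry: u128 = 0;
        for j in 0..4 {
            // product + accumulated limb + carry always fits in a u128
            let t = (a[i] as u128) * (b[j] as u128) + (out[i + j] as u128) + carry;
            out[i + j] = t as u64; // keep the low 64 bits
            carry = t >> 64; // pass the high 64 bits along
        }
        out[i + 4] = carry as u64;
    }
    out
}
```

Under these assumptions, `limbs_needed(256, 64)` is 4 and `limbs_needed(256, 32)` is 8, so `schoolbook_products` gives 16 versus 64 partial products for the same 256-bit multiplication.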

It is also much faster in all respects when compiled on stable. Compiling with --features nightly will still give a small performance boost in some areas, but nowhere near as much as before, when it was pretty much mandatory to use nightly to get decent performance.

--features nightly perf change

ecmult/scalar_mul_point:basepoint,secret
                        time:   [57.488 us 57.627 us 57.777 us]
                        change: [+35.394% +36.044% +36.663%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
ecmult/scalar_mul_point:basepoint,public
                        time:   [56.789 us 56.932 us 57.141 us]
                        change: [+28.193% +31.325% +34.086%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 7 outliers among 100 measurements (7.00%)
  3 (3.00%) high mild
  4 (4.00%) high severe
ecmult/scalar_mul_point:normal,secret
                        time:   [57.445 us 57.877 us 58.353 us]
                        change: [-56.774% -55.838% -55.054%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 11 outliers among 100 measurements (11.00%)
  3 (3.00%) high mild
  8 (8.00%) high severe
ecmult/scalar_mul_point:normal,public
                        time:   [57.560 us 57.796 us 58.061 us]
                        change: [-40.680% -40.257% -39.877%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe
ecmult/scalar_mul_point:jacobian,secret
                        time:   [56.949 us 57.093 us 57.252 us]
                        change: [-58.797% -58.457% -58.166%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
ecmult/scalar_mul_point:jacobian,public
                        time:   [59.594 us 60.955 us 62.594 us]
                        change: [-40.363% -39.336% -38.075%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  4 (4.00%) high mild
  5 (5.00%) high severe

double_mul/double_mul:normal,public
                        time:   [96.791 us 97.047 us 97.334 us]
                        change: [-50.696% -50.390% -50.104%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe
double_mul/double_mul:normal,secret
                        time:   [95.982 us 96.342 us 96.708 us]
                        change: [-62.244% -61.975% -61.731%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe
double_mul/double_mul:basepoint,public
                        time:   [100.22 us 100.45 us 100.69 us]
                        change: [-9.7466% -9.1646% -8.6505%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  1 (1.00%) low mild
  4 (4.00%) high mild
  1 (1.00%) high severe
double_mul/double_mul:basepoint,secret
                        time:   [100.48 us 100.68 us 100.91 us]
                        change: [-42.482% -41.706% -41.057%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

stable perf change

tl;dr everything is more than twice as fast as before

ecmult/scalar_mul_point:basepoint,secret
                        time:   [57.133 us 57.321 us 57.565 us]
                        change: [-56.568% -56.391% -56.222%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild
ecmult/scalar_mul_point:basepoint,public
                        time:   [58.203 us 58.690 us 59.218 us]
                        change: [-57.578% -57.316% -57.047%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  1 (1.00%) high mild
  3 (3.00%) high severe
ecmult/scalar_mul_point:normal,secret
                        time:   [57.388 us 58.049 us 58.900 us]
                        change: [-59.145% -58.474% -57.812%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe
ecmult/scalar_mul_point:normal,public
                        time:   [56.891 us 57.060 us 57.222 us]
                        change: [-58.420% -58.021% -57.692%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild
ecmult/scalar_mul_point:jacobian,secret
                        time:   [57.055 us 57.325 us 57.659 us]
                        change: [-58.702% -57.976% -57.266%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) high mild
  7 (7.00%) high severe
ecmult/scalar_mul_point:jacobian,public
                        time:   [57.443 us 57.616 us 57.795 us]
                        change: [-57.807% -57.607% -57.421%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

double_mul/double_mul:normal,public
                        time:   [96.867 us 97.233 us 97.664 us]
                        change: [-64.966% -64.576% -64.184%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  2 (2.00%) high mild
  2 (2.00%) high severe
double_mul/double_mul:normal,secret
                        time:   [96.986 us 97.777 us 98.682 us]
                        change: [-64.695% -64.324% -64.026%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 7 outliers among 100 measurements (7.00%)
  2 (2.00%) high mild
  5 (5.00%) high severe
double_mul/double_mul:basepoint,public
                        time:   [98.092 us 98.674 us 99.399 us]
                        change: [-62.889% -62.533% -62.190%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe
double_mul/double_mul:basepoint,secret
                        time:   [97.568 us 98.128 us 98.809 us]
                        change: [-63.956% -63.677% -63.360%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 14 outliers among 100 measurements (14.00%)
  7 (7.00%) high mild
  7 (7.00%) high severe
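
As a quick sanity check of the tl;dr for stable, Criterion's "change" figure is the relative change in running time, so it can be converted to a speedup factor as old/new = 1 / (1 + change). The sketch below does this for the median change of each stable benchmark listed above; the helper name `speedup_from_change` is made up for this example.

```rust
/// Convert a Criterion-style relative change in time, given in percent
/// (e.g. -56.391 for "-56.391%"), into a speedup factor old_time / new_time.
/// (Hypothetical helper for illustration only.)
fn speedup_from_change(change_percent: f64) -> f64 {
    1.0 / (1.0 + change_percent / 100.0)
}

/// Median "change" values copied from the stable benchmark output above.
const STABLE_CHANGES: [f64; 10] = [
    -56.391, -57.316, -58.474, -58.021, -57.976, -57.607, // scalar_mul_point
    -64.576, -64.324, -62.533, -63.677, // double_mul
];

/// True if every stable benchmark ended up more than twice as fast.
fn all_more_than_twice_as_fast() -> bool {
    STABLE_CHANGES.iter().all(|&c| speedup_from_change(c) > 2.0)
}
```

A change of -50% corresponds to exactly 2x, so every stable result above (all below -56%) is more than twice as fast, while a positive change such as the nightly basepoint regression gives a factor below 1.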

Misc

~During this PR I had to reluctantly move back from subtle-ng to subtle and make sigma_fun depend on a pre-release version of curve25519 dalek. This is because k256 indirectly depends on subtle and this is hard to change without a lot of effort.~

In the end I moved back to subtle-ng by stripping out the elliptic-curve dependency from the backend and only keeping the bits of the arithmetic that we need.

LLFourn / secp256kfun