Secp256k1 library in pure assembly

I suggest speeding up the secp256k1 arithmetic in Kangaroo. I found this library (https://github.com/piggypiggy/secp256k1-x64), which aims to provide the most efficient implementation of secp256k1 curve arithmetic. For example, I suggest using the function secp256k1_sqr_mont (the fastest, according to the developer) when calculating the PubKey. How could this be implemented? All of the functions are implemented in assembly:

Most of below mentioned functions preserve the property of inputs
being fully reduced, i.e. being in [0, modulus) range. Simply put if
inputs are fully reduced, then output is too. Note that reverse is
not true, in sense that given partially reduced inputs output can be
either, not unlikely reduced. And "most" in first sentence refers to
the fact that given the calculations flow one can tolerate that
addition, 1st function below, produces partially reduced result if
multiplications by 2 and 3, which customarily use addition, fully
reduce it. This effectively gives two options: a) addition produces
fully reduced result [as long as inputs are, just like remaining
functions]; b) addition is allowed to produce partially reduced
result, but multiplications by 2 and 3 perform additional reduction
step. Choice between the two can be platform-specific, but it was a)
in all cases so far...
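To make option a) concrete, here is a minimal portable C sketch of a fully reducing addition modulo the secp256k1 field prime. It is an illustration only, not the library's assembly: the names `limb_t`, `add_mod_p`, `limbs_equal` and `add_mod_p_examples` are hypothetical, and the limb layout (four 64-bit limbs, least significant first) is an assumption.

```c
#include <assert.h>
#include <stdint.h>

#define LIMBS 4            /* assumed layout: 4 x 64-bit limbs, least significant first */
typedef uint64_t limb_t;   /* hypothetical stand-in for the library's BN_ULONG */

/* secp256k1 field prime P = 2^256 - 2^32 - 977. */
static const limb_t P[LIMBS] = {
    0xFFFFFFFEFFFFFC2FULL, 0xFFFFFFFFFFFFFFFFULL,
    0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL
};

/* Option a): res = a + b mod P, fully reduced when a and b are in [0, P). */
static void add_mod_p(limb_t res[LIMBS], const limb_t a[LIMBS],
                      const limb_t b[LIMBS])
{
    limb_t sum[LIMBS], diff[LIMBS];
    limb_t carry = 0, borrow = 0, mask;
    int i;

    /* sum = a + b, with the carry out of bit 255 kept in `carry`. */
    for (i = 0; i < LIMBS; i++) {
        limb_t t = a[i] + carry;
        limb_t c = (limb_t)(t < carry);
        sum[i] = t + b[i];
        carry = c | (limb_t)(sum[i] < t);
    }
    /* diff = sum - P, with the borrow out kept in `borrow`. */
    for (i = 0; i < LIMBS; i++) {
        limb_t t = sum[i] - P[i];
        limb_t bw = (limb_t)(sum[i] < P[i]);
        diff[i] = t - borrow;
        borrow = bw | (limb_t)(t < borrow);
    }
    /* Keep diff when a + b >= P: the add carried, or the subtract did not borrow. */
    mask = (limb_t)0 - (carry | (borrow ^ 1));
    for (i = 0; i < LIMBS; i++)
        res[i] = (diff[i] & mask) | (sum[i] & ~mask);
}

/* Returns 1 when x and y hold the same 256-bit value. */
static int limbs_equal(const limb_t x[LIMBS], const limb_t y[LIMBS])
{
    limb_t acc = 0;
    int i;
    for (i = 0; i < LIMBS; i++)
        acc |= x[i] ^ y[i];
    return acc == 0;
}

/* Usage: (P - 1) + 1 wraps to 0, and (P - 1) + (P - 1) gives P - 2. */
static void add_mod_p_examples(void)
{
    const limb_t pm1[LIMBS] = { 0xFFFFFFFEFFFFFC2EULL, 0xFFFFFFFFFFFFFFFFULL,
                                0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL };
    const limb_t pm2[LIMBS] = { 0xFFFFFFFEFFFFFC2DULL, 0xFFFFFFFFFFFFFFFFULL,
                                0xFFFFFFFFFFFFFFFFULL, 0xFFFFFFFFFFFFFFFFULL };
    const limb_t one[LIMBS] = { 1, 0, 0, 0 };
    const limb_t zero[LIMBS] = { 0, 0, 0, 0 };
    limb_t r[LIMBS];

    add_mod_p(r, pm1, one);
    assert(limbs_equal(r, zero));
    add_mod_p(r, pm1, pm1);
    assert(limbs_equal(r, pm2));
}
```

Under option b) the final select step would be dropped from the addition and performed inside the multiply-by-2 and multiply-by-3 routines instead. The mask-based select avoids a data-dependent branch, which is one common way to write this step.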

/* Modular add: res = a+b mod P (asm) */
X64_EXPORT void secp256k1_add(BN_ULONG res[P256_LIMBS], const BN_ULONG a[P256_LIMBS], const BN_ULONG b[P256_LIMBS]);
/* Modular mul by 2: res = 2*a mod P (asm) */
X64_EXPORT void secp256k1_mul_by_2(BN_ULONG res[P256_LIMBS], const BN_ULONG a[P256_LIMBS]);
/* Modular mul by 3: res = 3*a mod P (asm) */
X64_EXPORT void secp256k1_mul_by_3(BN_ULONG res[P256_LIMBS], const BN_ULONG a[P256_LIMBS]);
/* Modular div by 2: res = a/2 mod P (asm) */
X64_EXPORT void secp256k1_div_by_2(BN_ULONG res[P256_LIMBS], const BN_ULONG a[P256_LIMBS]);
/* Modular sub: res = a-b mod P (asm) */
X64_EXPORT void secp256k1_sub(BN_ULONG res[P256_LIMBS], const BN_ULONG a[P256_LIMBS], const BN_ULONG b[P256_LIMBS]);
/* Modular neg: res = -a mod P (asm) */
X64_EXPORT void secp256k1_neg(BN_ULONG res[P256_LIMBS], const BN_ULONG a[P256_LIMBS]);
/* res = a mod P (asm) */
X64_EXPORT void secp256k1_reduce(BN_ULONG res[4], BN_ULONG a[P256_LIMBS]);
/* res = a*w mod P (asm) */
X64_EXPORT void secp256k1_mul_word(BN_ULONG res[P256_LIMBS], const BN_ULONG a[P256_LIMBS], const BN_ULONG w);
/* Montgomery mul: res = a*b*2^-256 mod P (asm) */
X64_EXPORT void secp256k1_mul_mont(BN_ULONG res[P256_LIMBS], const BN_ULONG a[P256_LIMBS], const BN_ULONG b[P256_LIMBS]);
/* Montgomery sqr: res = a*a*2^-256 mod P (asm) */ // - FASTEST
X64_EXPORT void secp256k1_sqr_mont(BN_ULONG res[P256_LIMBS], const BN_ULONG a[P256_LIMBS]);
/* Convert a number from Montgomery domain, by multiplying with 1 (asm) */
X64_EXPORT void secp256k1_from_mont(BN_ULONG res[P256_LIMBS], const BN_ULONG in[P256_LIMBS]);
/* Convert a number to Montgomery domain, by multiplying with 2^512 mod P (asm) */
X64_EXPORT void secp256k1_to_mont(BN_ULONG res[P256_LIMBS], const BN_ULONG in[P256_LIMBS]);
/* Functions that perform constant time access to the precomputed tables (asm) */
X64_EXPORT void secp256k1_scatter_w5(POINT256 *val, const POINT256 *in_t, int idx);
X64_EXPORT void secp256k1_scatter_w7(POINT256_AFFINE *val, const POINT256_AFFINE *in_t, int idx);
/* compare two points, 0 : a = b, -1 : a != b (asm) */
X64_EXPORT int secp256k1_point_cmp(const POINT256 *a, const POINT256 *b);
X64_EXPORT void secp256k1_point_dbl(POINT256 *r, const POINT256 *a);
X64_EXPORT void secp256k1_point_add(POINT256 *r, const POINT256 *a, const POINT256 *b);
X64_EXPORT void secp256k1_point_add_affine(POINT256 *r, const POINT256 *a, const POINT256_AFFINE *b);
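
To show what the Montgomery functions in this list compute, here is a small self-contained C sketch of the same to_mont / mul_mont / sqr_mont / from_mont round trip on a toy scale. It does not call the library: the `toy_*` names and `mont_neg_inv` are hypothetical, and the parameters N = 2^31 - 1 with R = 2^32 are assumptions chosen so everything fits in 64-bit arithmetic, standing in for P and 2^256 from the comments above.

```c
#include <assert.h>
#include <stdint.h>

/* Toy parameters (assumptions for illustration): N = 2^31 - 1, R = 2^32. */
#define MONT_N 2147483647u

/* -N^-1 mod 2^32 via Newton iteration; valid for any odd n. */
static uint32_t mont_neg_inv(uint32_t n)
{
    uint32_t inv = n;   /* n*n = 1 mod 8, so the low 3 bits are already correct */
    int i;
    for (i = 0; i < 5; i++)
        inv = (uint32_t)(inv * (2u - n * inv));   /* doubles the correct bits */
    return (uint32_t)(0u - inv);
}

/* res = a * b * R^-1 mod N for a, b in [0, N): toy analogue of a Montgomery mul. */
static uint32_t toy_mul_mont(uint32_t a, uint32_t b)
{
    uint64_t t, u;
    uint32_t m;

    assert(a < MONT_N && b < MONT_N);
    t = (uint64_t)a * b;
    m = (uint32_t)((uint32_t)t * mont_neg_inv(MONT_N)); /* m = t * (-N^-1) mod R */
    u = (t + (uint64_t)m * MONT_N) >> 32;               /* t + m*N is divisible by R */
    return (uint32_t)(u >= MONT_N ? u - MONT_N : u);    /* final reduction to [0, N) */
}

/* res = a * a * R^-1 mod N: toy analogue of a Montgomery sqr. */
static uint32_t toy_sqr_mont(uint32_t a)
{
    return toy_mul_mont(a, a);
}

/* Into the Montgomery domain: a * R mod N, computed as mul_mont(a, R^2 mod N). */
static uint32_t toy_to_mont(uint32_t a)
{
    uint64_t r = ((uint64_t)1 << 32) % MONT_N;    /* R mod N */
    uint32_t r2 = (uint32_t)((r * r) % MONT_N);   /* R^2 mod N */
    return toy_mul_mont(a, r2);
}

/* Out of the Montgomery domain: multiply by 1. */
static uint32_t toy_from_mont(uint32_t a)
{
    return toy_mul_mont(a, 1u);
}

/* Usage: a plain square mod N via the Montgomery round trip. */
static uint32_t toy_square_mod_n(uint32_t a)
{
    return toy_from_mont(toy_sqr_mont(toy_to_mont(a)));
}
```

The point of the sketch is that a Montgomery square is a field operation whose result stays in the Montgomery domain until the conversion back is applied. On its own it does not produce a public key; computing k*G would presumably combine such field operations through the point doubling and addition routines listed above.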

JeanLucPons / Kangaroo, issue #124