JeanLucPons / Kangaroo

Pollard's kangaroo for SECPK1
GNU General Public License v3.0
246 stars 177 forks source link

Supporting Elliptic Curve Order Multiplication Calculations #112

Open edsky opened 1 year ago

edsky commented 1 year ago

I need to perform order multiplication calculations on elliptic curves in CUDA. Currently, there is a ModMult function in CUDA as shown below, but it performs multiplication mod P, not order N. How should I modify it to calculate multiplication mod order N?

__device__ void _ModMult(uint64_t *r, uint64_t *a, uint64_t *b)
{

    uint64_t r512[8];
    uint64_t t[NBBLOCK];
    uint64_t ah, al;

    r512[5] = 0;
    r512[6] = 0;
    r512[7] = 0;

    // 256*256 multiplier
    UMult(r512, a, b[0]);
    UMult(t, a, b[1]);
    UADDO1(r512[1], t[0]);
    UADDC1(r512[2], t[1]);
    UADDC1(r512[3], t[2]);
    UADDC1(r512[4], t[3]);
    UADD1(r512[5], t[4]);
    UMult(t, a, b[2]);
    UADDO1(r512[2], t[0]);
    UADDC1(r512[3], t[1]);
    UADDC1(r512[4], t[2]);
    UADDC1(r512[5], t[3]);
    UADD1(r512[6], t[4]);
    UMult(t, a, b[3]);
    UADDO1(r512[3], t[0]);
    UADDC1(r512[4], t[1]);
    UADDC1(r512[5], t[2]);
    UADDC1(r512[6], t[3]);
    UADD1(r512[7], t[4]);

    // Reduce from 512 to 320
    UMult(t, (r512 + 4), 0x1000003D1ULL);
    UADDO1(r512[0], t[0]);
    UADDC1(r512[1], t[1]);
    UADDC1(r512[2], t[2]);
    UADDC1(r512[3], t[3]);

    // Reduce from 320 to 256
    UADD1(t[4], 0ULL);
    UMULLO(al, t[4], 0x1000003D1ULL);
    UMULHI(ah, t[4], 0x1000003D1ULL);
    UADDO(r[0], r512[0], al);
    UADDC(r[1], r512[1], ah);
    UADDC(r[2], r512[2], 0ULL);
    UADD(r[3], r512[3], 0ULL);
}
OlivierGaland commented 1 year ago

Hello,

I'm not a developper of the package, I'm not sure why you need to perform modulus using order but this could be a hint to adapt the code :

Check the 0x1000003D1ULL in the function, it is linked to prime with this operation :

hex(2**256-0x1000003D1) '0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f'

You may try to replace that value with hex(2**256-order), it will be slightly bigger so I don't know if the operations with that value will be longer valid (either on the result, or in the cpu time).

edsky commented 1 year ago

Hello,

I'm not a developper of the package, I'm not sure why you need to perform modulus using order but this could be a hint to adapt the code :

Check the 0x1000003D1ULL in the function, it is linked to prime with this operation :

hex(2**256-0x1000003D1) '0xfffffffffffffffffffffffffffffffffffffffffffffffffffffffefffffc2f'

You may try to replace that value with hex(2**256-order), it will be slightly bigger so I don't know if the operations with that value will be longer valid (either on the result, or in the cpu time).

Hello, I obtained a number 'a' between 2^128 and 2^129 by taking the modulus of 2^256. Following this principle, I divided a large number of 512 into two numbers of 256. Then, after multiplying the high bit by 'a', I got a number of 2^385. Then, I recursively applied this multiplication (until the high bit is zero) to return a number within 256, and the final result is the number obtained by adding this to the low bit number.

__device__ void ModMultOrder(uint64_t *r, uint64_t *a)
{
    uint64_t r512[8];
    uint64_t t[NBBLOCK];
    r512[4] = 0;
    r512[5] = 0;
    r512[6] = 0;
    r512[7] = 0;

    // 256*256 multiplier
    UMult(r512, a, r[0]);
    UMult(t, a, r[1]);
    UADDO1(r512[1], t[0]);
    UADDC1(r512[2], t[1]);
    UADDC1(r512[3], t[2]);
    UADDC1(r512[4], t[3]);
    UADD1(r512[5], t[4]);
    UMult(t, a, r[2]);
    UADDO1(r512[2], t[0]);
    UADDC1(r512[3], t[1]);
    UADDC1(r512[4], t[2]);
    UADDC1(r512[5], t[3]);
    UADD1(r512[6], t[4]);
    UMult(t, a, r[3]);
    UADDO1(r512[3], t[0]);
    UADDC1(r512[4], t[1]);
    UADDC1(r512[5], t[2]);
    UADDC1(r512[6], t[3]);
    UADD1(r512[7], t[4]);

    // Reduce from 512 to 257
    Load256(t, (&r512[4]));
    if (!_IsZero(t)) {
        ModMultOrder(t, N);
        ModAdd256Order(t, r512, t);
        Load256(r, t);
    } else {
        Load256(t, r512);
        SubOrder(t);
        if (_IsPositive(t)) {
            Load256(r, t);
        } else {
            Load256(r, r512);
        }
    }
}

As seen in the code, most numbers work normally, but the number of recursions is a bit problematic and needs further optimization.