Speeding up mulE - Githubissues

Bulat-Ziganshin commented 7 years ago

mulE description says return a*exp[b] over GF(2^r) so it's a simple multiplication in GF(2^r) optimizable with PSHUFB. they just precompute logarithms of twiddle factors to speed up their implementation

catid commented 7 years ago

I sent an Email to the authors and they have been responding actually! You're totally right; I missed that the operation looks like a multiplication. But the problem is still that we can't use 4-bit table lookups to perform the operation:

For a normal GF(2^^16) I can do this successfully:

const uint16_t yy = GF256Ctx.GF256_LOG_TABLE[y];
for (int i = 0; i < bytes; ++i)
{
    const uint16_t xx1 = GF256Ctx.GF256_LOG_TABLE[x[i] & 0x0f];
    uint8_t value1 = GF256Ctx.GF256_EXP_TABLE[xx1 + yy];
    const uint16_t xx2 = GF256Ctx.GF256_LOG_TABLE[x[i] & 0xf0];
    uint8_t value2 = GF256Ctx.GF256_EXP_TABLE[xx2 + yy];

    z[i] ^= value1;
    z[i] ^= value2;
}

But the same does not work for mulE:

    const GFSymbol sum1 = static_cast<GFSymbol>(AddModQ(GFLog[a & 0x00ff], z));
    GFSymbol value1 = GFExp[sum1];
    const GFSymbol sum2 = static_cast<GFSymbol>(AddModQ(GFLog[a & 0xff00], z));
    GFSymbol value2 = GFExp[sum2];

    vx[i] ^= value1;
    vx[i] ^= value2;

This causes the decoding to fail... I'm stuck =(

Bulat-Ziganshin commented 7 years ago

it SHOULD work, we just need to pint out the culprit.

instead of checking whether decoding work you can print immediate computation results and compare them for original mulE and any modification. or just check that modified mulE returns the same result as original one for any x,y pair
fast phusfb-based multiplication in GF(65536) was employed by MultiPar, RAR5 and GF-Complete (and described in "screaming fast galois field arithmetic" paper). afair the point is that you split each number into 4 4-bit parts, but you better to look into paper
the first code you provided is clearly GF(256). i can't find bugs in the second code, but i just don't remember details of GF(2^n) operation

Bulat-Ziganshin commented 7 years ago

you can find gf-complete at http://jerasure.org/ , on github and bitbucket

Bulat-Ziganshin commented 7 years ago

i still don't understand - is mulE just a simple multiplication, or something else?

the formula exp[(log[a]+b &mod) + (log[a]+b >>len)] is equivalent to exp( (log(a)+b) mod (size-1)) and it seems proper code for a*exp[b] in GF(size) since 2^(size-1)==1

@SianJhengLin, can you clarify that mulE is indeed a*exp[b] in choosen field? in that case, it doesn't need any optimization on your side - the pshufb technique is directly applicable to this code

SianJhengLin commented 7 years ago

It is true that mulE returns a*exp[b] over GF(2^r). However, the elements of GF(2^r) are represented in a special basis, rather than the monomial basis. Hence the conventional field multiplications cannot be applied directly.

Bulat-Ziganshin commented 7 years ago

even if mathematically it's diferent, algorithmically it seems the same as ordinal multiplication, may be just with different log/exp table - unless i misunderstand something. let's see: for multiplication of non-zero values you use formula

   a*b = exp( log(a) + log(b) )

also, you reduce log(a)+log(b) modulo size-1 since 2^(size-1)=1

i think that it can be implemented using existing GF multiplication code, may be with different exp/log tables

Chris, may be you can try to apply your GF(256) code with their exp/log tables? And start with GF(256) implementation to simplify code transformation (instead of GF(65536))

catid commented 7 years ago

Okay Dr. Lin figured it out. We just have to change the basis from Cantor to monomial:

GFSymbol kGFBasis[kGFBits] = { 1, 2, 4, 8, 16, 32, 64, 128 // 1, 214, 152, 146, 86, 200, 88, 230 };

And now this works:

    const GFSymbol sum = static_cast<GFSymbol>(AddModQ(GFLog[a], z));
    GFSymbol value = GFExp[sum];

    GFSymbol sum1 = static_cast<GFSymbol>(AddModQ(GFLog[a & 0x0f], z));
    GFSymbol value1 = GFExp[sum1];
    if ((a & 0x0f) == 0)
    {
        value1 = 0;
    }
    GFSymbol sum2 = static_cast<GFSymbol>(AddModQ(GFLog[a & 0xf0], z));
    GFSymbol value2 = GFExp[sum2];
    if ((a & 0xf0) == 0)
    {
        value2 = 0;
    }

    LHC_DEBUG_ASSERT(value == (value1 ^ value2));

I'm going to resume work on the codec using normal GF mul ops =)

Bulat-Ziganshin commented 7 years ago

i think you are missed the whole point - we use Cantor basis because it's required for so-called additive FFT. Of course, multiplication will work correctly in monomial basis, but not additive FFT. At least i think so - if it was possible to implement additive FFT in monomial basis, the "Novel basis" paper authors can skip most part of their work, significantly simplifying the entire algorithm and math behind it

catid commented 7 years ago

Cantor basis is working too. The important fix was the two extra if-statements that map LUT[0] to 0.

Bulat-Ziganshin commented 7 years ago

oh, afair avoiding check for zero require to double the table size. it was done in GF(256) code from your libraries but not in the "novel polynomial" code

catid commented 7 years ago

Yeah I forgot about that ^_^;

catid / leopard

Speeding up mulE #1