Improved final exponentiation efficiency

vbuterin commented 4 years ago

Take advantage of the linearity of x -> x**p to greatly improve efficiency of final exponentiation.

pipermerriam commented 4 years ago

Do you have any metrics/measurements on how this effects performance?

vbuterin commented 4 years ago

Quick test on my laptop shows it reduces the time of a final exponentiation from 1.10 sec to 0.34 sec.

Changed to your preferred zip format btw; to answer your performance question there, the difference between list and zip is negligible as the bulk of the complexity is in the field operations and not the scaffolding.

ChihChengLiang commented 4 years ago

Here's what I understand from this pull request so far. I haven't figured out why the linearity works.

TLDR

We are saving the cost from 6438 multiplications to 2000 multiplications with 1 division now.

Why is the final_exponentiate expensive?

In the final_exponentiate, we raise the input FQ12 to a large power of an integer. This requires many multiplications of FQ12.

def final_exponentiate_old(p: FQ12) -> FQ12:
    return p ** ((field_modulus ** 12 - 1) // curve_order)

How many multiplications?

The square-and-multiply strategy is used to raise the FQ12 to a power. Let's say if we want to calculate x^9, we (x^8) x, then we get x^8 by (x^4) (x^4), then get x^4 by (x^2) * (x^2), so on and so forth.

    def __pow__(self: T_FQ, other: int) -> T_FQ:
        if other == 0:
            return type(self)(1)
        elif other == 1:
            return type(self)(self.n)
        elif other % 2 == 0:
            return (self * self) ** (other // 2)
        else:
            return ((self * self) ** int(other // 2)) * self

To get the number of multiplications required for a certain power n, we can turn n into a binary. For every digit 0 in the binary, we need one multiplication for the square; and for every digit 1, we need one for the square and one for the multiply.

def multiplications_required(n):
    return sum(int(i) + 1 for i in bin(n)[2:])

To do the final exponentiate we need 6438 multiplications.

from py_ecc.fields.field_properties import field_properties
from py_ecc.optimized_bls12_381.optimized_curve import curve_order
field_modulus = field_properties["bls12_381"]["field_modulus"]

exp = ((field_modulus ** 12 - 1) // curve_order)
print(multiplications_required(exp))
# 6438

How exp_by_p saves the number of multiplications?

What exp_by_p brings to the table (sorry :)) is it makes raising to a power of field_modulus cheap. Naive square-and-multiply needs 611 multiplications but with a precomputed exptable only 12 multiplications are required now.

print(multiplications_required(field_modulus))
# 611

How to use as many exp_by_p as possible?

Note that x^12 - 1 = (x^6 +1)(x^6 - 1) = (x^2 +1)(x^4 - x^2 + 1)(x^6 - 1)

The variable p2 gets us p**(x^2 +1), the p3 gets us p2 **(x^6 - 1), and the last expression gets p3**(x^4 - x^2 + 1)

def final_exponentiate(p: FQ12) -> FQ12:
    cofactor = (field_modulus ** 4 - field_modulus ** 2 + 1) // curve_order
    p2 = exp_by_p(exp_by_p(p)) * p
    p3 = exp_by_p(exp_by_p(exp_by_p(exp_by_p(exp_by_p(exp_by_p(p2)))))) / p2
    return p3 ** cofactor

How many multiplications now?

We have 8 exp_by_p, so 8 * 12 = 96 multiplications. Raising to the power of the cofactor needs 1901 multiplications, which can be shown by running the script.

cofactor = (field_modulus ** 4 - field_modulus ** 2 + 1) // curve_order
print(multiplications_required(cofactor))
# 1901

We are also having 1 division that's slightly more expensive than 1 multiplication but negligible here. In total, we have almost 2k multiplications.

Do we have a programmable way to verify those number?

Here's a script

from py_ecc.fields.field_properties import field_properties
from py_ecc.fields import optimized_bls12_381_FQ12 as FQ12
from py_ecc.optimized_bls12_381.optimized_curve import curve_order
from py_ecc.optimized_bls12_381.optimized_pairing import final_exponentiate
import py_ecc.optimized_bls12_381.optimized_pairing

field_modulus = field_properties["bls12_381"]["field_modulus"]

def tick():
    tick.counter += 1

# initialize
tick.counter = 0

class WrapFQ12(FQ12):
    def __mul__(self, *args):
        tick()
        return super().__mul__(*args)

    def __rmul__(self, *args):
        tick()
        return super().__rmul__(*args)

py_ecc.optimized_bls12_381.optimized_pairing.exptable = [
    WrapFQ12([0] * i + [1] + [0] * (11 - i)) ** field_modulus for i in range(12)]

# reset
tick.counter = 0

a = WrapFQ12([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])
final_exponentiate(a)
print("Optimized ", tick.counter)

# ------------------------
tick.counter = 0

def final_exponentiate_old(p):
    return p ** ((field_modulus ** 12 - 1) // curve_order)

final_exponentiate_old(a)
print(tick.counter)

vbuterin commented 4 years ago

Great work on the explanation! :rofl: I agree with all of it.

Here's what I understand from this pull request so far. I haven't figured out why the linearity works.

https://en.wikipedia.org/wiki/Freshman%27s_dream

ChihChengLiang commented 4 years ago

Thanks 😂

Here's my understanding of the Freshman's dream

Let's denote a Fq12 element x = (x1, x2, ..., x12), where x1, ..., x12 are Fq we can represent x as a linear combination of the bases

x = x1 * (1, 0, ..., 0) + x2 * (0, 1, 0, ..., 0) + ... + x12 * (0, ...., 0, 1)

or we denote a basis as bi

x = x1 * b1 + x2 * b2 + ... + x12 * b12

Applying Freshman's dream (it's applicable because we have prime order and prime power?)

x^p =  (x1 * b1)^p + (x2 * b2)^p + ... + (x12 * b12)^p
    = x1^p * b1^p + x2^p * b2^p + ... + x12^p * b12^p
    = x1 * b1^p + x2 * b2^p + ... + x12 * b12^p

Where all bi^p are the FQ12 values precomputed in the exptable. We also have xi^p = xi by Fermat's little theorem.

vbuterin commented 4 years ago

Yep! 100% correct.

Except this:

are the FQ2 values

FQ12 values.

ChihChengLiang commented 4 years ago

Ah, nice catch! Fixed.

pipermerriam commented 4 years ago

@vbuterin is there anything else you want to do with this PR prior to merge?

ethereum / py_ecc