This PR generates a separate precomputed table for the first five elements of the vector to optimize the MSM.
The motivation is optimizing go-ethereum use in Verkle Tries. In this case, we have a reasonable amount of commitment calculations that only have non-zero values for the first five elements. This happens because every trie key calculation falls in this case. Saying it differently, we need to do the MSM only for the first five elements of the vector on every trie access.
A wider table focused on these five elements provides a good tradeoff between extra memory (i.e: making this for the 256 elements would be too costly), and CPU speedup since we avoid half-field multiplications.
This has shown a 1.30x speedup in our go-ethereum Verkle Trie benchmark regarding CommitToPoly(...) CPU usage.
Now we generate two tables:
A 16-bit window table for the first five elements of the vector.
A 8-bit window for the rest of the elements. (as we had before)
The serialization and deserialization of the tables were also changed since now we can't make some assumptions about how many points are per table or window size.
I've also improved on some other fronts while being here:
Added more documentation for some public APIs explaining the serialization format.
Tighten up our tests for the precomputed tables for different numbers of points and random commitment checks between serialized and deserialized versions of the same points.
This PR generates a separate precomputed table for the first five elements of the vector to optimize the MSM.
The motivation is optimizing
go-ethereum
use in Verkle Tries. In this case, we have a reasonable amount of commitment calculations that only have non-zero values for the first five elements. This happens because every trie key calculation falls in this case. Saying it differently, we need to do the MSM only for the first five elements of the vector on every trie access.A wider table focused on these five elements provides a good tradeoff between extra memory (i.e: making this for the 256 elements would be too costly), and CPU speedup since we avoid half-field multiplications.
This has shown a 1.30x speedup in our
go-ethereum
Verkle Trie benchmark regardingCommitToPoly(...)
CPU usage.Now we generate two tables:
The serialization and deserialization of the tables were also changed since now we can't make some assumptions about how many points are per table or window size.
I've also improved on some other fronts while being here: