HarryR / ethsnarks

A toolkit for viable zk-SNARKS on Ethereum, Web, Mobile and Desktop

Streaming Prover, reduce memory overhead of prover #3

Open HarryR opened 6 years ago

HarryR commented 6 years ago

See:

There are optimisations which can reduce the amount of in-memory data required for the proving key:

1) Don't use Jacobian coordinates for the proving key while it's in memory (store affine points, dropping the Z coordinate), and allocate the Z coordinate on the stack when performing calculations.
2) Use a directly accessible format which allows the proving key to be mmap'd, and let the kernel figure it out.
3) Anything which is used independently and sequentially means the rest can be discarded:

The proving key consists of five parts (PK_A, PK_B, PK_C, PK_K, PK_H), which are used independently and sequentially.

Load them one at a time, rather than all at once.

Either way, we need to create a benchmark for the prover which tracks peak memory usage.
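A minimal sketch of such a benchmark, assuming Linux-style getrusage() where ru_maxrss is reported in kilobytes; run_prover() is a hypothetical stand-in for the actual proving call:

```cpp
// Minimal sketch of measuring peak memory usage around the prover call.
// run_prover() is a hypothetical placeholder; ru_maxrss is in kilobytes on Linux.
#include <sys/resource.h>
#include <cstdio>

static long peak_rss_kb()
{
    struct rusage ru;
    getrusage(RUSAGE_SELF, &ru);
    return ru.ru_maxrss;
}

void benchmark_prove()
{
    const long before = peak_rss_kb();
    // run_prover();   // hypothetical: load the proving key and generate the proof
    const long after = peak_rss_kb();
    printf("peak RSS: %ld kB (delta %ld kB)\n", after, after - before);
}
```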

Any changes should avoid modifying libsnark directly; we still want to use it as a submodule.


With the mmap approach, if the offsets and sizes of the sections for A, B, C, K and H are known ahead of time, then madvise can be used depending on the access pattern. If access is purely sequential then MADV_SEQUENTIAL can be used; if it's random then a pre-emptive sequential read into memory will offer a performance improvement.

Either way, it's the disk reads vs memory usage trade-off again.
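A minimal sketch of that approach; map_pk_section(), its parameters and the page-alignment handling are illustrative, not existing libsnark or ethsnarks code:

```cpp
// Minimal sketch: map one section of the proving key file read-only and hint
// the kernel about the access pattern. map_pk_section() and the section
// offsets are hypothetical; nothing here exists in libsnark.
#include <sys/mman.h>
#include <sys/types.h>
#include <fcntl.h>
#include <unistd.h>
#include <cstddef>

// Map [offset, offset + length) of `path` and advise the kernel:
// MADV_SEQUENTIAL for sequential access, MADV_WILLNEED to request a
// pre-emptive read-ahead when access will be random.
static const void* map_pk_section(const char* path, off_t offset,
                                  std::size_t length, bool sequential)
{
    const int fd = open(path, O_RDONLY);
    if (fd < 0) return nullptr;

    // mmap requires a page-aligned offset; align down and widen the mapping.
    const long page = sysconf(_SC_PAGESIZE);
    const off_t aligned = offset - (offset % page);
    const std::size_t padded = length + static_cast<std::size_t>(offset - aligned);

    void* base = mmap(nullptr, padded, PROT_READ, MAP_PRIVATE, fd, aligned);
    close(fd);                      // the mapping outlives the descriptor
    if (base == MAP_FAILED) return nullptr;

    madvise(base, padded, sequential ? MADV_SEQUENTIAL : MADV_WILLNEED);
    return static_cast<const char*>(base) + (offset - aligned);
}
```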

HarryR commented 6 years ago

After some testing it seems that loading the proving key from disk is nearly instant; however, when using the standard ostream and istream serialisation it takes a significant amount of time to read from the stringstream into the object.

Part of the problem is that the BINARY_OUTPUT flag in bigint.cc determines whether values are loaded from a string representation or a binary representation. However, this flag already seems to be enabled by default!

I think if this could be sped up then the prove/verify time would be reduced significantly, and more importantly it would scale better with the number of circuits. How can it take 1 minute (after loading all the data from disk) to parse 2.5m lines? With a larger number of circuits this will take even longer...
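For comparison, a minimal sketch of a raw binary load, assuming the file already stores the limbs contiguously; load_limbs() and the uint64_t stand-in for mp_limb_t are purely illustrative:

```cpp
// Minimal sketch (assumption: the file already stores raw limbs contiguously).
// A single bulk read() avoids the per-element stream parsing that appears to
// dominate the load time; uint64_t stands in for mp_limb_t on 64-bit targets.
#include <cstdint>
#include <fstream>
#include <vector>

typedef std::uint64_t limb_t;   // hypothetical stand-in for mp_limb_t

std::vector<limb_t> load_limbs(const char* path, std::size_t count)
{
    std::vector<limb_t> limbs(count);
    std::ifstream in(path, std::ios::binary);
    in.read(reinterpret_cast<char*>(limbs.data()),
            static_cast<std::streamsize>(count * sizeof(limb_t)));
    return limbs;
}
```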

HarryR commented 6 years ago

So:

None of the classes above have any member variables other than those noted.

So, assuming tight packing, an alt_bn128_g1 point should occupy sizeof(mp_limb_t) * n * 3 bytes of memory. Similarly, alt_bn128_g2 should occupy sizeof(mp_limb_t) * n * 6 bytes. This means that a reinterpret_cast<alt_bn128_g1*> of an mp_limb_t[n*3] array should be usable as an alt_bn128_g1 object without problems.
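A minimal sketch of checking those size assumptions at compile time; the libff include path, the num_limbs constant and view_points() are assumed here for illustration:

```cpp
// Minimal sketch of verifying the tight-packing assumption at compile time.
// The include path, Fp_model::num_limbs and view_points() are assumptions,
// not existing ethsnarks code.
#include <libff/algebra/curves/alt_bn128/alt_bn128_pp.hpp>

using libff::alt_bn128_G1;
using libff::alt_bn128_Fq;

// n limbs per base-field element (assumes Fp_model exposes num_limbs)
constexpr mp_size_t n_limbs = alt_bn128_Fq::num_limbs;

static_assert(sizeof(alt_bn128_Fq) == sizeof(mp_limb_t) * n_limbs,
              "Fq is not tightly packed");
static_assert(sizeof(alt_bn128_G1) == sizeof(mp_limb_t) * n_limbs * 3,
              "G1 is not three tightly packed Fq coordinates (X, Y, Z)");

// If the asserts hold, a region of an mmap'd proving key can be viewed as an
// array of points without copying or re-parsing.
inline const alt_bn128_G1* view_points(const mp_limb_t* limbs)
{
    return reinterpret_cast<const alt_bn128_G1*>(limbs);
}
```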

If these assumptions are true, then it should be possible to dump the proving key in a format which can be mmap'd. There are two complications: the compiler may adjust for alignment (assuming mp_limb_t is the native pointer size, it shouldn't), and {A,B,C}_query are of Boost::sparse_vector type while {H,K}_query are of std::vector type.

Need to investigate the density of the {A,B,C}_query sparse vectors.

HarryR commented 6 years ago

With the LongsightF gadget this seems like less of a priority, as the serialised circuit is quite small compared to the full SHA256 Merkle tree proof gadget.