AztecProtocol / barretenberg

Memory issue with ACIR + Goblin #819

Open ledwards2225 opened 8 months ago

ledwards2225 commented 8 months ago

Background: As part of our 2023 goals, we hooked up Goblin to ACIR. This essentially meant constructing and verifying Goblin Ultra Honk (GUH) proofs over acir-generated circuits, and also constructing and verifying ECCVM and Translator proofs for arbitrary ECC ops that were unrelated to the circuit in question. (This latter component was essentially there to work out the interfaces and have a proof of concept.) This was encapsulated in a new method proveAndVerifyGoblin. At the end of 2023, PR #3636 had things working for only a small subset of the acir tests (only one of which was run on CI). A follow-on PR #3757 made all of the acir tests pass; however, we observed intermittent and non-repeatable failures seemingly related to some kind of memory bug.

The failures were reproducible within the same environment (mainframe or CI) but not across environments, and they depended on print statements and on whether or not the tests were run in sequence. This latter point was particularly odd, since the manner in which the tests are run should make them completely isolated from one another (as opposed to running several tests in the same process in gtest, for example). The typical error was "Trying to invert zero in the field", anecdotally in ZeroMorph (ZM) for the Translator. A failure was never observed when running any test in isolation. An example stack trace from a failing test is provided at the bottom of this description.

The workaround was simply to remove the ECCVM/Translator portions from the testing. This is actually natural, since in practice these Goblin components only come into play for recursion, not for single proof construction and verification. Also, the ops being processed by ECCVM/Translator in each test were completely arbitrary. At the time of writing, we simply run all of the acir tests for Ultra Plonk and Goblin Ultra Honk (GUH).

Backtrace from a failing test: (Note: for a given code configuration the same test failed consistently, but which test failed would change seemingly arbitrarily with any unrelated code change. I would not expect the failure to be reproducible on this test in particular.)

Testing signed_arithmetic...
(lldb) target create "/mnt/user-data/luke/aztec-packages/barretenberg/cpp/build/bin/bb"
Current executable set to '/mnt/user-data/luke/aztec-packages/barretenberg/cpp/build/bin/bb' (x86_64).
(lldb) settings set -- target.run-args  "prove_and_verify_goblin" "-c" "/mnt/user-data/luke/.bb-crs" "-b" "./target/acir.gz"
(lldb) run
GUH verification SUCCEEDED
ECCVM: create_verifier
ECCVM: verify_proof
Translator: create_verifier
Translator: verify_proof
terminate called after throwing an instance of 'std::runtime_error'
  what():  Trying to invert zero in the field
Process 3939334 stopped and restarted: thread 1 received signal: SIGCHLD
Process 3939334 stopped and restarted: thread 1 received signal: SIGCHLD
Process 3939334 stopped
* thread #1, name = 'bb', stop reason = signal SIGABRT
    frame #0: 0x00007ffff7a0300b libc.so.6`__GI_raise(sig=<unavailable>) at raise.c:51:1
Process 3939334 launched: '/mnt/user-data/luke/aztec-packages/barretenberg/cpp/build/bin/bb' (x86_64)
(lldb) bt
* thread #1, name = 'bb', stop reason = signal SIGABRT
  * frame #0: 0x00007ffff7a0300b libc.so.6`__GI_raise(sig=<unavailable>) at raise.c:51:1
    frame #1: 0x00007ffff79e2859 libc.so.6`__GI_abort at abort.c:79:7
    frame #2: 0x00007ffff7dedee6 libstdc++.so.6`___lldb_unnamed_symbol7360 + 96
    frame #3: 0x00007ffff7dfff8c libstdc++.so.6`___lldb_unnamed_symbol7814 + 12
    frame #4: 0x00007ffff7dffff7 libstdc++.so.6`std::terminate() + 23
    frame #5: 0x000055555561afee bb`__clang_call_terminate + 14
    frame #6: 0x0000555555743a39 bb`barretenberg::field<barretenberg::Bn254FqParams>::invert(this=0x00007ffffffe4620) const at field_impl.hpp:370:24
    frame #7: 0x00005555556fbc47 bb`barretenberg::group_elements::element<barretenberg::field<barretenberg::Bn254FqParams>, barretenberg::field<barretenberg::Bn254FrParams>, barretenberg::Bn254G1Params>::operator barretenberg::group_elements::affine_element<barretenberg::field<barretenberg::Bn254FqParams>, barretenberg::field<barretenberg::Bn254FrParams>, barretenberg::Bn254G1Params>(this=0x00007ffffffe45e0) const at element_impl.hpp:65:18
    frame #8: 0x00005555556fb967 bb`barretenberg::group_elements::affine_element<barretenberg::field<barretenberg::Bn254FqParams>, barretenberg::field<barretenberg::Bn254FrParams>, barretenberg::Bn254G1Params> operator*<barretenberg::field<barretenberg::Bn254FqParams>, barretenberg::field<barretenberg::Bn254FrParams>, barretenberg::Bn254G1Params>(base=0x00005555584d6780, exponent=0x0000555558e379c0) at element.hpp:155:73
    frame #9: 0x0000555556ed803a bb`proof_system::honk::pcs::zeromorph::ZeroMorphVerifier_<curve::BN254>::batch_mul_native(points=size=260, scalars=size=260) at zeromorph.hpp:621:43
    frame #10: 0x0000555556ed7b99 bb`proof_system::honk::pcs::zeromorph::ZeroMorphVerifier_<curve::BN254>::compute_C_Z_x(f_commitments=size=94, g_commitments=size=86, C_q_k=size=15, rho=proof_system::honk::pcs::zeromorph::ZeroMorphVerifier_<curve::BN254>::FF @ 0x00007ffffffefd00, batched_evaluation=proof_system::honk::pcs::zeromorph::ZeroMorphVerifier_<curve::BN254>::FF @ 0x00007ffffffefd20, x_challenge=proof_system::honk::pcs::zeromorph::ZeroMorphVerifier_<curve::BN254>::FF @ 0x00007ffffffefd40, u_challenge=size=15, concatenation_groups_commitments=size=4) at zeromorph.hpp:609:20
    frame #11: 0x0000555556e8921d bb`std::array<barretenberg::group_elements::affine_element<barretenberg::field<barretenberg::Bn254FqParams>, barretenberg::field<barretenberg::Bn254FrParams>, barretenberg::Bn254G1Params>, 2ul> proof_system::honk::pcs::zeromorph::ZeroMorphVerifier_<curve::BN254>::verify<RefVector<barretenberg::group_elements::affine_element<barretenberg::field<barretenberg::Bn254FqParams>, barretenberg::field<barretenberg::Bn254FrParams>, barretenberg::Bn254G1Params>>, RefVector<barretenberg::group_elements::affine_element<barretenberg::field<barretenberg::Bn254FqParams>, barretenberg::field<barretenberg::Bn254FrParams>, barretenberg::Bn254G1Params>>, RefVector<barretenberg::field<barretenberg::Bn254FrParams>>, RefVector<barretenberg::field<barretenberg::Bn254FrParams>>, std::vector<barretenberg::field<barretenberg::Bn254FrParams>, std::allocator<barretenberg::field<barretenberg::Bn254FrParams>>>, std::shared_ptr<proof_system::honk::BaseTranscript>>(unshifted_commitments=0x00007fffffff41a8, to_be_shifted_commitments=0x00007fffffff4190, unshifted_evaluations=0x00007fffffff4178, shifted_evaluations=0x00007fffffff4160, multivariate_challenge=size=15, transcript=std::__shared_ptr<proof_system::honk::BaseTranscript, __gnu_cxx::_S_atomic>::element_type @ 0x0000555558596780, concatenation_group_commitments=size=4, concatenated_evaluations=size=4) at zeromorph.hpp:685:28
    frame #12: 0x0000555556e72922 bb`proof_system::honk::GoblinTranslatorVerifier::verify_proof(this=0x00007fffffffc3c0, proof=0x00007fffffffcad0) at goblin_translator_verifier.cpp:262:9
    frame #13: 0x00005555557500cb bb`barretenberg::Goblin::verify_for_acir(this=0x00007fffffffd600, proof=0x00007fffffffcaa0) const at goblin.hpp:236:70
    frame #14: 0x00005555556f9cf7 bb`barretenberg::Goblin::verify_proof(this=0x00007fffffffd600, proof=0x00007fffffffcca8) const at goblin.hpp:283:32
    frame #15: 0x00005555556f879c bb`acir_proofs::AcirComposer::verify_goblin_proof(this=0x00007fffffffcd00, proof=size=142392) at acir_composer.cpp:164:19
    frame #16: 0x000055555556d90d bb`proveAndVerifyGoblin(bytecodePath="./target/acir.gz", witnessPath="./target/witness.gz", recursive=false) at main.cpp:162:35
    frame #17: 0x0000555555570583 bb`main(argc=6, argv=0x00007fffffffe598) at main.cpp:468:20
    frame #18: 0x00007ffff79e4083 libc.so.6`__libc_start_main(main=(bb`main at main.cpp:434), argc=6, argv=0x00007fffffffe598, init=<unavailable>, fini=<unavailable>, rtld_fini=<unavailable>, stack_end=0x00007fffffffe588) at libc-start.c:308:16
    frame #19: 0x000055555556ae3e bb`_start + 46
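
For readers unfamiliar with where this exception comes from: frames #6 to #8 show a scalar multiplication result being converted from a projective element to an affine element, and that conversion calling field inversion. The sketch below illustrates that general pattern on a toy prime field. It is not barretenberg's implementation: the modulus 101, the names Fp, Projective, Affine, mul, invert and to_affine, and the Jacobian-style coordinate convention are all assumptions made for illustration. It only shows that a z-coordinate that is zero (or that reads as zero, for instance because of corrupted or uninitialized memory) would surface as exactly this kind of error at the inversion step.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>

// Toy prime field F_p with p = 101 (hypothetical; NOT barretenberg's field type).
constexpr std::uint64_t P = 101;

struct Fp {
    std::uint64_t v;
};

// Modular multiplication; operands are expected to be in reduced form.
Fp mul(Fp a, Fp b)
{
    assert(a.v < P && b.v < P);
    return Fp{ (a.v * b.v) % P };
}

// Inversion via Fermat's little theorem: a^(p-2) mod p.
// Throws on zero, using the same message that appears in the backtrace.
Fp invert(Fp a)
{
    if (a.v % P == 0) {
        throw std::runtime_error("Trying to invert zero in the field");
    }
    Fp result{ 1 };
    Fp base{ a.v % P };
    std::uint64_t e = P - 2;
    while (e > 0) {
        if (e & 1) {
            result = mul(result, base);
        }
        base = mul(base, base);
        e >>= 1;
    }
    return result;
}

// Assumed Jacobian-style coordinates: (X, Y, Z) represents affine (X/Z^2, Y/Z^3).
struct Projective {
    Fp x, y, z;
};

struct Affine {
    Fp x, y;
};

// Projective-to-affine conversion; the inversion of z is the step that can throw.
Affine to_affine(const Projective& p)
{
    Fp z_inv = invert(p.z); // throws if z == 0
    Fp z_inv2 = mul(z_inv, z_inv);
    Fp z_inv3 = mul(z_inv2, z_inv);
    return Affine{ mul(p.x, z_inv2), mul(p.y, z_inv3) };
}
```

In this toy setting, calling to_affine on a point with z equal to zero throws std::runtime_error, while any nonzero z converts normally. Whether the failing run actually had a zero z-coordinate, and why, is not established by the backtrace alone.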
ledwards2225 commented 8 months ago

Collecting more information that is possibly relevant: I saw this CI failure (trying to invert zero in the Translator composer tests) on an entirely unrelated branch. Re-running the test made it pass. This could be a coincidence but it is notable that the failure was again in the Translator and that it was non-repeatable.