argumentcomputer / zk-light-clients

A collection of ZK light client libraries for various blockchains.

EOF error in `proof-server` benchmarks #18

Open winston-h-zhang opened 2 months ago

winston-h-zhang commented 2 months ago

What I did

I followed the instructions to download the Groth16 artifacts. Then, in aptos/proof-server, I ran:

RUST_LOG="info" GROTH16=1 RUN_SERIAL=1 RUSTFLAGS="-C target-cpu=native --cfg tokio_unstable" PRIMARY_ADDR="127.0.0.1:8080" SECONDARY_ADDR="127.0.0.1:8081" cargo +nightly bench --bench proof_server

Error

The bench runs for ~10 minutes and then fails with:

[sp1] groth16 artifacts already seem to exist at /home/winston/.sp1/circuits/groth16/9f43e920. if you want to re-download them, delete the directory
[2024-06-12T21:47:19Z INFO  tracing::span] wrap_groth16;
21:48:13 DBG constraint system solver done nbConstraints=31672801 took=33397.754915
21:48:35 DBG prover done acceleration=none backend=groth16 curve=bn254 nbConstraints=31672801 took=21582.373202
21:48:35 DBG verifier done backend=groth16 curve=bn254 took=1.455834
ignoring uninitialized slice: Vars []frontend.Variable
ignoring uninitialized slice: Vars []frontend.Variable
ignoring uninitialized slice: Vars []frontend.Variable
21:48:35 DBG verifier done backend=groth16 curve=bn254 took=1.218702
[2024-06-12T21:48:35Z INFO  server_secondary] Proof generated. Serializing
[2024-06-12T21:48:35Z INFO  server_secondary] Sending proof to the primary server
[2024-06-12T21:48:35Z INFO  server_secondary] Proof sent
[2024-06-12T21:48:35Z INFO  server_primary] Proof received. Sending it to the primary server
[2024-06-12T21:48:35Z INFO  server_primary] Proof sent
Error: unexpected end of file

I believe the error is happening here: https://github.com/lurk-lab/zk-light-clients/blob/69890d34dd39ab13b15dfc2b5d8370ee08aa1637/aptos/proof-server/benches/proof_server.rs#L204

Machine

GCP sphinx-test-1

tchataigner commented 2 months ago

I believe @storojs72 ran into the same issue while testing Groth16. @wwared might know better, but I think it had to do with the build assets for Groth16 proof generation being overwritten by the two proof processes.

wwared commented 2 months ago

That's a good point, but I think @winston-h-zhang already had the sp1 artifacts generated and placed in the right directory (the log includes the "[sp1] groth16 artifacts already seem to exist ..." message). That should prevent the issue, since it won't try to regenerate the artifacts. He also ran with RUN_SERIAL=1, which should prevent the provers from running in parallel. It could still have to do with the groth16 artifacts, but I think it's something else (not sure what).

My current hunch is that the OS may be closing the TCP connection prematurely: the proof takes a long time to finish, and with no traffic flowing through the connection, a timeout/keepalive check could kick in. I'm not really sure, though, because all the connections (the client and both servers) are on localhost, which I think should prevent these issues (e.g. there are no routers or NAT that could aggressively prune inactive connections).
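To illustrate how a prematurely closed connection could surface as an EOF error, here is a minimal sketch using only the Rust standard library. It is not the bench's actual code: the function name `read_from_closing_peer` and the 4-byte length prefix are assumptions made for illustration. A local peer accepts a connection and closes it without sending anything, and the reader then fails with `ErrorKind::UnexpectedEof`, the error kind whose description is "unexpected end of file".

```rust
use std::io::{self, Read};
use std::net::{TcpListener, TcpStream};
use std::thread;

/// Connects to a local peer that closes the connection without sending
/// anything, then tries to read a 4-byte length prefix from it.
/// (Hypothetical helper; the length-prefix framing is an assumption.)
fn read_from_closing_peer() -> io::Result<[u8; 4]> {
    // Port 0 lets the OS pick a free port, so this does not clash
    // with the 8080/8081 addresses used by the proof servers.
    let listener = TcpListener::bind("127.0.0.1:0")?;
    let addr = listener.local_addr()?;

    // The peer accepts one connection and drops it immediately,
    // standing in for a server that exits or panics mid-request.
    let peer = thread::spawn(move || {
        let _ = listener.accept();
    });

    let mut stream = TcpStream::connect(addr)?;
    peer.join().expect("peer thread panicked");

    // The peer has closed its end, so the read hits end-of-stream
    // before the buffer is filled and returns an UnexpectedEof error.
    let mut len_prefix = [0u8; 4];
    stream.read_exact(&mut len_prefix)?;
    Ok(len_prefix)
}
```

Calling `read_from_closing_peer()` returns an `Err` whose `kind()` is `io::ErrorKind::UnexpectedEof`. If the bench reads a response in a similar way, any early close of the socket (by the OS or by a crashed server) would look the same from the client side.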

@winston-h-zhang can you run the following on the GCP sphinx-test-1 machine?

cat /proc/sys/net/ipv4/tcp_keepalive_time
cat /proc/sys/net/ipv4/tcp_keepalive_intvl

The default values of 7200 and 75 should not cause this issue, since Linux keepalive probing only kicks in after 7200 seconds (2 hours) of inactivity.
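For reference, the arithmetic behind that claim can be sketched as below. This is a rough illustration, not part of the repository: the helper names are hypothetical, and it also reads `tcp_keepalive_probes` (a standard Linux sysctl, default 9, not mentioned above). Keepalive only applies to sockets that have `SO_KEEPALIVE` enabled.

```rust
use std::fs;
use std::io;

/// Parses the contents of a /proc/sys/net/ipv4/tcp_keepalive_* file.
fn parse_sysctl(contents: &str) -> Option<u64> {
    contents.trim().parse().ok()
}

/// Reads one keepalive sysctl by name, e.g. "tcp_keepalive_time".
fn read_sysctl(name: &str) -> io::Result<u64> {
    let contents = fs::read_to_string(format!("/proc/sys/net/ipv4/{name}"))?;
    parse_sysctl(&contents)
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidData, "not an integer"))
}

/// Seconds of silence before the kernel gives up on an unresponsive peer:
/// the idle time before the first probe, plus one interval per probe.
fn seconds_until_dropped(time: u64, intvl: u64, probes: u64) -> u64 {
    time + intvl * probes
}

/// Reads the three sysctls from the running machine and combines them.
fn keepalive_budget() -> io::Result<u64> {
    Ok(seconds_until_dropped(
        read_sysctl("tcp_keepalive_time")?,
        read_sysctl("tcp_keepalive_intvl")?,
        read_sysctl("tcp_keepalive_probes")?,
    ))
}
```

With the defaults (7200, 75, and 9 probes), the first probe goes out after 2 hours and a dead peer is only dropped after 7200 + 75 * 9 = 7875 seconds, far longer than the ~10 minute bench run.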

winston-h-zhang commented 2 months ago

I have the default values of 7200 and 75.

I just updated to most recent dev and tried to run again, but this time I'm getting an earlier error on the primary server:

stderr: thread '<unnamed>' panicked at src/main.rs:93:10:
stderr: verify_by_hash: could not verify proof: Root hash mismatch. Expected root hash: c8e73c36e426b1cb9e732a40b51ad14c44f0397b27795ffa882748845dc525d0. Computed root hash: 59975e09cfcf6289a231e6e1dfcce0741d9796d8da19ec10ab41cb0cb5e2797d

Do I have to update some artifacts?

cc/ @wwared and @tchataigner

wwared commented 2 months ago

@winston-h-zhang: I managed to replicate both the EOF issue and the verify_by_hash issue, though I think the EOF issue might have been due to the verify_by_hash error and a leftover proof server.

Can you also try running pkill server_primary and pkill server_secondary, to make sure there are no leftover servers running on your machine?

@tchataigner is looking into the verify_by_hash failure, and I'll help debug the EOF issue if I can replicate it after we fix the verify_by_hash failure.
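One way to check for a leftover server before starting the bench is to test whether its address is already bound. The sketch below is a hypothetical helper, not something in the repository; it assumes the servers listen on the `PRIMARY_ADDR` and `SECONDARY_ADDR` values from the original command (127.0.0.1:8080 and 127.0.0.1:8081).

```rust
use std::io;
use std::net::{SocketAddr, TcpListener};

/// Returns true if binding to `addr` fails because the address is
/// already in use, which suggests another process is listening there.
/// The probe listener is dropped right away, so a free port stays free.
fn addr_in_use(addr: SocketAddr) -> bool {
    match TcpListener::bind(addr) {
        Ok(_) => false,
        Err(e) => e.kind() == io::ErrorKind::AddrInUse,
    }
}

/// Checks the two addresses from the bench command and returns the
/// ones that appear to be taken.
fn leftover_server_addrs() -> Vec<SocketAddr> {
    ["127.0.0.1:8080", "127.0.0.1:8081"]
        .iter()
        .map(|s| s.parse::<SocketAddr>().expect("valid socket address"))
        .filter(|addr| addr_in_use(*addr))
        .collect()
}
```

If `leftover_server_addrs()` returns a non-empty list before the bench starts, something is still bound to those ports. This only tells you the port is taken, not which process holds it, so `pkill` is still the direct fix.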

wwared commented 2 months ago

@winston-h-zhang: #50 should have updated the .bcs files, fixing that issue. Can you try again with the code from the wwared/docs branch? (Or wait a while until we merge #49 into dev.)

winston-h-zhang commented 2 months ago

I updated to newest dev and updated all the .bcs files. The newer errors have been fixed, but now I'm running into the original error again:

[2024-06-25T03:38:22Z WARN  sphinx_core::utils] fixed log2 rows can be potentially reduced: got 394238, expected 1048576
[sp1] plonk bn254 artifacts already seem to exist at /home/winston/.sp1/circuits/plonk_bn254/v1.0.0. if you want to re-download them, delete the directory
03:44:38 DBG constraint system solver done nbConstraints=56218745 took=13270.964617
03:47:50 DBG prover done backend=plonk curve=bn254 nbConstraints=56218745 took=239528.232008
03:47:50 DBG verifier done backend=plonk curve=bn254 took=2.093459
ignoring uninitialized slice: Vars []frontend.Variable
ignoring uninitialized slice: Vars []frontend.Variable
ignoring uninitialized slice: Vars []frontend.Variable
03:47:50 DBG verifier done backend=plonk curve=bn254 took=1.4661

Error: unexpected end of file
error: bench failed, to rerun pass `--bench proof_server`

Caused by:
  process didn't exit successfully: `/home/winston/example-zk-light-clients-internal/aptos/target/release/deps/proof_server-3cf9236cb2d41cf0 --bench` (exit status: 1)

cc/ @wwared