Closed: @laser closed this issue 5 years ago.
The data points below can currently be collected using two different scripts:

- `filbase --benchy` collects information on ZigZag (rust-filbase)
- `cargo run --bin micro -p fil-proofs-tooling --release` collects the micro benchmarks (rust-fil-proofs)

The current solution can store the information in Prometheus. This might or might not be a good idea. An alternative that is possibly much simpler and more flexible in the long run is to store the data in a SQL or NoSQL database.
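To make the database route concrete, here is a minimal Rust sketch of what one stored observation might look like, mirroring the Prometheus labels in the output below. The struct fields, the `benchmarks` table name, and the `to_insert_sql` helper are all illustrative assumptions, not an existing schema.

```rust
/// One benchmark observation, mirroring the Prometheus labels used below.
/// The field set and the "benchmarks" table name are illustrative assumptions.
pub struct BenchRecord {
    pub metric: String,       // e.g. "replication_time_ms"
    pub data_size_bytes: u64, // e.g. 1048576
    pub hasher: String,       // e.g. "pedersen"
    pub layers: u32,          // e.g. 10
    pub value: f64,           // the observed measurement
}

impl BenchRecord {
    /// Render the record as a SQL INSERT statement.
    /// Sketch only: no escaping or parameter binding.
    pub fn to_insert_sql(&self) -> String {
        format!(
            "INSERT INTO benchmarks (metric, data_size_bytes, hasher, layers, value) \
             VALUES ('{}', {}, '{}', {}, {});",
            self.metric, self.data_size_bytes, self.hasher, self.layers, self.value
        )
    }
}
```

A real implementation would use a driver with bound parameters rather than string formatting; this only shows that the label set maps cleanly onto a flat row.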
From rust-filbase root:

```sh
cargo build --release --features benchy \
&& ./target/release/filbase benchy zigzag --size 1024
```
Replication: total time: 14.9740s
Replication: time per byte: 14.2800us
Vanilla proving: 533.4730us
Avg verifying: 0.3421s
Total proving: 0.0000s
# HELP circuit_num_constraints Number of constraints of the circuit
# TYPE circuit_num_constraints gauge
circuit_num_constraints{data_size_bytes="1048576",expansion_degree="8",hasher="pedersen",layers="10",m="5",partitions="1",samples="5",sloth_iter="0"} 0
# HELP circuit_num_inputs Number of inputs to the circuit
# TYPE circuit_num_inputs gauge
circuit_num_inputs{data_size_bytes="1048576",expansion_degree="8",hasher="pedersen",layers="10",m="5",partitions="1",samples="5",sloth_iter="0"} 0
# HELP replication_time_ms Total replication time
# TYPE replication_time_ms gauge
replication_time_ms{data_size_bytes="1048576",expansion_degree="8",hasher="pedersen",layers="10",m="5",partitions="1",samples="5",sloth_iter="0"} 14974
# HELP replication_time_ns_per_byte Replication time per byte
# TYPE replication_time_ns_per_byte gauge
replication_time_ns_per_byte{data_size_bytes="1048576",expansion_degree="8",hasher="pedersen",layers="10",m="5",partitions="1",samples="5",sloth_iter="0"} 14280
# HELP vanilla_proving_time_us Vanilla proving time
# TYPE vanilla_proving_time_us gauge
vanilla_proving_time_us{data_size_bytes="1048576",expansion_degree="8",hasher="pedersen",layers="10",m="5",partitions="1",samples="5",sloth_iter="0"} 533
# HELP vanilla_verification_time_us Vanilla verification time
# TYPE vanilla_verification_time_us gauge
vanilla_verification_time_us{data_size_bytes="1048576",expansion_degree="8",hasher="pedersen",layers="10",m="5",partitions="1",samples="5",sloth_iter="0"} 324979
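If we do move away from Prometheus, the exposition lines above are easy to consume directly. Here is a minimal Rust sketch of parsing one such line into a name, label pairs, and a value; this is plain string splitting (no escaping, no label-free lines), not how filbase or Prometheus tooling actually parses the format.

```rust
/// Parse a single Prometheus exposition line such as
///   replication_time_ms{hasher="pedersen",layers="10"} 14974
/// into (metric name, label pairs, value).
/// Minimal sketch: no escape handling, requires a {...} label section.
pub fn parse_metric_line(line: &str) -> Option<(String, Vec<(String, String)>, f64)> {
    let open = line.find('{')?;
    let close = line.find('}')?;
    let name = line[..open].to_string();
    let labels = line[open + 1..close]
        .split(',')
        .filter(|s| !s.is_empty())
        .map(|pair| {
            let mut it = pair.splitn(2, '=');
            let key = it.next().unwrap_or("").to_string();
            let val = it.next().unwrap_or("").trim_matches('"').to_string();
            (key, val)
        })
        .collect();
    let value = line[close + 1..].trim().parse().ok()?;
    Some((name, labels, value))
}
```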
From rust-proofs root:

```sh
cargo build --release --all \
&& cargo run --bin micro -p fil-proofs-tooling --release
```
# HELP time_gauge_us time gauge help
# TYPE time_gauge_us gauge
time_gauge_us{name="bytes-32-to-fr"} 0.049158
time_gauge_us{name="encode-node/blake2s/10"} 0.86665
time_gauge_us{name="encode-node/blake2s/3"} 0.53398
time_gauge_us{name="encode-node/blake2s/5"} 0.59715
time_gauge_us{name="encode-node/pedersen/10"} 0.84397
time_gauge_us{name="encode-node/pedersen/3"} 0.38849
time_gauge_us{name="encode-node/pedersen/5"} 0.48761
time_gauge_us{name="encode-node/sha256/10"} 0.8845599999999999
time_gauge_us{name="encode-node/sha256/3"} 0.39442
time_gauge_us{name="encode-node/sha256/5"} 0.50918
time_gauge_us{name="fr-to-bytes-32"} 0.092348
time_gauge_us{name="hash-blake2s-circuit/create-proof"} 307720
time_gauge_us{name="hash-blake2s-circuit/synthesize"} 0.58429
time_gauge_us{name="hash-blake2s/non-circuit/32"} 0.13493
time_gauge_us{name="hash-blake2s/non-circuit/320"} 0.47986
time_gauge_us{name="hash-blake2s/non-circuit/64"} 0.12398
time_gauge_us{name="hash-pedersen-circuit/create-proof"} 37799
time_gauge_us{name="hash-pedersen-circuit/synthesize"} 1434.7
time_gauge_us{name="hash-pedersen/non-circuit/32"} 18.722
time_gauge_us{name="hash-pedersen/non-circuit/320"} 397.31
time_gauge_us{name="hash-pedersen/non-circuit/64"} 34.457
time_gauge_us{name="hash-sha256-circuit/create-proof"} 288300
time_gauge_us{name="hash-sha256-circuit/synthesize"} 30697
time_gauge_us{name="hash-sha256/non-circuit/32"} 0.34006000000000003
time_gauge_us{name="hash-sha256/non-circuit/320"} 1.6986
time_gauge_us{name="hash-sha256/non-circuit/64"} 0.61475
time_gauge_us{name="kdf/blake2s/10"} 0.75461
time_gauge_us{name="kdf/blake2s/3"} 0.30211
time_gauge_us{name="kdf/blake2s/5"} 0.41220999999999997
time_gauge_us{name="kdf/pedersen/10"} 0.7060700000000001
time_gauge_us{name="kdf/pedersen/3"} 0.25906999999999997
time_gauge_us{name="kdf/pedersen/5"} 0.38075
time_gauge_us{name="kdf/sha256/10"} 0.8754500000000001
time_gauge_us{name="kdf/sha256/3"} 0.38471
time_gauge_us{name="kdf/sha256/5"} 0.5203099999999999
time_gauge_us{name="merkletree/blake2s/1024"} 413.28
time_gauge_us{name="merkletree/blake2s/128"} 146.95
time_gauge_us{name="merkletree/pedersen/1024"} 17903
time_gauge_us{name="merkletree/pedersen/128"} 2210.7000000000003
time_gauge_us{name="parents in a loop/Blake2s/10"} 124.61
time_gauge_us{name="parents in a loop/Blake2s/1000"} 11443
time_gauge_us{name="parents in a loop/Blake2s/50"} 517.15
time_gauge_us{name="parents in a loop/Pedersen/10"} 144.03
time_gauge_us{name="parents in a loop/Pedersen/1000"} 9291.800000000001
time_gauge_us{name="parents in a loop/Pedersen/50"} 577.53
time_gauge_us{name="parents in a loop/Sha256/10"} 160.39
time_gauge_us{name="parents in a loop/Sha256/1000"} 8450.1
time_gauge_us{name="parents in a loop/Sha256/50"} 611.89
time_gauge_us{name="preprocessing/write_padded + unpadded/1024000"} 18303
time_gauge_us{name="preprocessing/write_padded + unpadded/128"} 465.47
time_gauge_us{name="preprocessing/write_padded + unpadded/2048000"} 31778
time_gauge_us{name="preprocessing/write_padded + unpadded/256"} 463.9
time_gauge_us{name="preprocessing/write_padded + unpadded/256000"} 5998.2
time_gauge_us{name="preprocessing/write_padded + unpadded/512"} 419.26
time_gauge_us{name="preprocessing/write_padded + unpadded/512000"} 9211.2
time_gauge_us{name="preprocessing/write_padded/1024000"} 6636.3
time_gauge_us{name="preprocessing/write_padded/128"} 230.69
time_gauge_us{name="preprocessing/write_padded/2048000"} 12681
time_gauge_us{name="preprocessing/write_padded/256"} 233.09
time_gauge_us{name="preprocessing/write_padded/256000"} 1849.1
time_gauge_us{name="preprocessing/write_padded/512"} 238.95
time_gauge_us{name="preprocessing/write_padded/512000"} 3240.6000000000004
time_gauge_us{name="sloth/decode-circuit-create_proof"} 5531.799999999999
time_gauge_us{name="sloth/decode-circuit-synthesize_circuit"} 1.3837
time_gauge_us{name="sloth/decode-non-circuit"} 0.005585
time_gauge_us{name="sloth/encode-non-circuit"} 0.004937400000000001
time_gauge_us{name="xor-circuit/create-proof"} 20208
time_gauge_us{name="xor-circuit/synthesize"} 490.56
time_gauge_us{name="xor/non-circuit/32"} 0.3122
time_gauge_us{name="xor/non-circuit/320"} 2.4207
time_gauge_us{name="xor/non-circuit/64"} 0.52199
I'm going to put diffs to the list above here. I will update this comment over time. @laser @dignifiedquire
In general, we may need a name-negotiation pass. I'm not going to fixate on getting all naming perfect first.
Not needed:

- `sloth_iter` is not needed. Sloth is dead.

Needed:

- `wall-clock-sealing-time`: total time (end - start) for sealing, disregarding CPU.
- `vector-commitment-time`: should be CPU time, not wall-clock time.
- `max-memory`: i.e. max resident set size of the process throughout its life.
- `layer-challenges`: how many challenges were actually performed on each layer (conceptually: tuples of layer-index, challenge count).
- `vector-commitment-parallelism`: how many cores were used for vector commitment (merkle tree) generation.
- `circuit-proving-parallelism`: how many cores were used for circuit proving.

We can probably just use the CPU's core count for the parallelism numbers above, although that's not quite right. For example, since we parallelize replication and merkle-tree generation, the tree generation (except the final tree) can't use all cores. So, aspirationally, we should capture this accurately, even if we don't initially.
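The naive "just use the core count" approach is a one-liner with std. The function name is mine; note that reporting this for every phase is exactly the over-count described above, since intermediate tree generation cannot use every core.

```rust
use std::thread;

/// Naive stand-in for vector-commitment-parallelism and
/// circuit-proving-parallelism: report the host's logical core count.
/// Sketch only; this over-counts for phases that can't use every core.
pub fn assumed_parallelism() -> usize {
    thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1) // fall back to 1 if the count can't be queried
}
```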
For hash-function microbenchmarks, we also need circuit information:

- `circuit-time`: combined synthesis and circuit proving time attributable to the function.
- `num-constraints`: number of constraints used in the circuit.

We need to capture some configuration, for example whether or not `MAXIMIZE_CACHING` is true. We will also need to be able to control the values of such configuration when running benchmarks. This will matter more if/when configuration complexity increases. Another such value (not yet present in configuration) is the pedersen hashing window size (see #736).
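One plausible way to make such toggles controllable per benchmark run is an environment variable, read once at startup. This is a sketch of that idea, not the project's existing configuration mechanism; only the `MAXIMIZE_CACHING` name comes from the discussion above.

```rust
use std::env;

/// Read a boolean configuration flag (e.g. MAXIMIZE_CACHING) from the
/// environment, defaulting to false when unset or unrecognized.
/// Sketch only: the real configuration mechanism may differ.
pub fn bool_flag(name: &str) -> bool {
    match env::var(name) {
        Ok(v) => matches!(v.as_str(), "1" | "true" | "TRUE"),
        Err(_) => false,
    }
}
```

Recording the resolved flag values alongside each benchmark result would let runs with different configurations be compared honestly.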
Wherever 'cycles' appears, we mean 'pseudocycles': elapsed time multiplied by clock speed. So, for example, 1 second at 1GHz would be 1B pseudocycles. The idea is to get a quantity which can be used at least somewhat meaningfully to compare performance on different machines. It's not intended to measure actual processor cycles.
For initial work, it's probably easiest to ignore these numbers and instead report everything in seconds. As long as we also have the clock speed of the processor (which should be captured), we can calculate. NOTE: this will get more complicated if/when we introduce GPU to the timings.
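If pseudocycles are reported later, the arithmetic is a single multiplication of the captured quantities. A minimal helper (the function name is mine):

```rust
/// Pseudocycles: elapsed wall-clock time multiplied by clock frequency.
/// 1 second at 1 GHz => 1_000_000_000 pseudocycles.
/// A cross-machine comparison heuristic, not real processor cycles.
pub fn pseudocycles(elapsed_secs: f64, clock_hz: f64) -> f64 {
    elapsed_secs * clock_hz
}
```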
@dignifiedquire - I have moved on to some of the infra/ops stuff (getting benchmarks running on the `master` build, queuing benchmarks on Packet, etc.). I'm going to assign this story to you, since you're going to be adding additional output (e.g. circuit stuff).
Even though not everything got done, it seems the core issues are resolved. Closing.