filecoin-project / rust-fil-proofs

Proofs for Filecoin in Rust

Get realistic benchmarks into uber-calc (by Aug 2) #761

Closed laser closed 5 years ago

laser commented 5 years ago

Description

Notes:

Acceptance criteria

Risks + pitfalls

Where to begin

dignifiedquire commented 5 years ago

Overview

Data collection

The data points below can currently be collected using two different scripts.

Data storage

The current solution can store the information in Prometheus. This might or might not be a good idea. An alternative that is possibly much simpler and more flexible in the long run is to store the data in a SQL or NoSQL database.
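As a sketch of the SQL alternative, each benchmark run could be flattened into one row per metric. The table name, columns, and record shape below are hypothetical illustrations (mirroring the labels benchy emits), not an agreed-upon schema:

```rust
// Hypothetical record for one benchmark measurement. Field names mirror
// the Prometheus labels emitted by benchy, but this is only a sketch of
// what a SQL-backed store could look like.
struct BenchRecord {
    metric: &'static str,
    data_size_bytes: u64,
    hasher: &'static str,
    layers: u32,
    value: f64,
}

impl BenchRecord {
    // Render the record as a SQL INSERT statement. This builds the string
    // only; no database driver is assumed here.
    fn to_sql_insert(&self) -> String {
        format!(
            "INSERT INTO benchmarks (metric, data_size_bytes, hasher, layers, value) \
             VALUES ('{}', {}, '{}', {}, {});",
            self.metric, self.data_size_bytes, self.hasher, self.layers, self.value
        )
    }
}

fn main() {
    let rec = BenchRecord {
        metric: "replication_time_ms",
        data_size_bytes: 1_048_576,
        hasher: "pedersen",
        layers: 10,
        value: 14974.0,
    };
    println!("{}", rec.to_sql_insert());
}
```

One row per metric keeps queries simple (filter by label columns, aggregate over time), at the cost of repeating label values across rows.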

Reporting

Configuration

Execution

Data Points needed

ZigZag

Parameters

Measurements

Micro Benchmarks

Other Details to collect

laser commented 5 years ago

Notes

Using Benchy with ZigZag

Running

From rust-filbase root:

cargo build --release --features benchy \
  && ./target/release/filbase benchy zigzag --size 1024

Output

Replication: total time: 14.9740s
Replication: time per byte: 14.2800us
Vanilla proving: 533.4730us
Avg verifying: 0.3421s
Total proving: 0.0000s
# HELP circuit_num_constraints Number of constraints of the circuit
# TYPE circuit_num_constraints gauge
circuit_num_constraints{data_size_bytes="1048576",expansion_degree="8",hasher="pedersen",layers="10",m="5",partitions="1",samples="5",sloth_iter="0"} 0
# HELP circuit_num_inputs Number of inputs to the circuit
# TYPE circuit_num_inputs gauge
circuit_num_inputs{data_size_bytes="1048576",expansion_degree="8",hasher="pedersen",layers="10",m="5",partitions="1",samples="5",sloth_iter="0"} 0
# HELP replication_time_ms Total replication time
# TYPE replication_time_ms gauge
replication_time_ms{data_size_bytes="1048576",expansion_degree="8",hasher="pedersen",layers="10",m="5",partitions="1",samples="5",sloth_iter="0"} 14974
# HELP replication_time_ns_per_byte Replication time per byte
# TYPE replication_time_ns_per_byte gauge
replication_time_ns_per_byte{data_size_bytes="1048576",expansion_degree="8",hasher="pedersen",layers="10",m="5",partitions="1",samples="5",sloth_iter="0"} 14280
# HELP vanilla_proving_time_us Vanilla proving time
# TYPE vanilla_proving_time_us gauge
vanilla_proving_time_us{data_size_bytes="1048576",expansion_degree="8",hasher="pedersen",layers="10",m="5",partitions="1",samples="5",sloth_iter="0"} 533
# HELP vanilla_verification_time_us Vanilla verification time
# TYPE vanilla_verification_time_us gauge
vanilla_verification_time_us{data_size_bytes="1048576",expansion_degree="8",hasher="pedersen",layers="10",m="5",partitions="1",samples="5",sloth_iter="0"} 324979
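If we move away from Prometheus for storage, the exposition-format lines above can still be parsed mechanically. A minimal stdlib-only sketch (ignoring escaped quotes inside label values and the # HELP/# TYPE comment lines, which a real parser must handle):

```rust
// Parse one Prometheus exposition line, e.g.
//   replication_time_ms{layers="10",hasher="pedersen"} 14974
// into (metric name, label pairs, value). Label values containing commas
// or escaped quotes are not handled in this sketch.
fn parse_metric_line(line: &str) -> Option<(String, Vec<(String, String)>, f64)> {
    let open = line.find('{')?;
    let close = line.rfind('}')?;
    let name = line[..open].to_string();
    let value: f64 = line[close + 1..].trim().parse().ok()?;
    let labels = line[open + 1..close]
        .split(',')
        .filter_map(|pair| {
            let (k, v) = pair.split_once('=')?;
            Some((k.to_string(), v.trim_matches('"').to_string()))
        })
        .collect();
    Some((name, labels, value))
}

fn main() {
    let line = r#"replication_time_ms{data_size_bytes="1048576",hasher="pedersen"} 14974"#;
    let (name, labels, value) = parse_metric_line(line).unwrap();
    println!("{} {:?} {}", name, labels, value);
}
```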

Capturing Micro-Benchmarks

Running

From rust-proofs root:

cargo build --release --all \
  && cargo run --bin micro -p fil-proofs-tooling --release

Output

# HELP time_gauge_us time gauge help
# TYPE time_gauge_us gauge
time_gauge_us{name="bytes-32-to-fr"} 0.049158
time_gauge_us{name="encode-node/blake2s/10"} 0.86665
time_gauge_us{name="encode-node/blake2s/3"} 0.53398
time_gauge_us{name="encode-node/blake2s/5"} 0.59715
time_gauge_us{name="encode-node/pedersen/10"} 0.84397
time_gauge_us{name="encode-node/pedersen/3"} 0.38849
time_gauge_us{name="encode-node/pedersen/5"} 0.48761
time_gauge_us{name="encode-node/sha256/10"} 0.8845599999999999
time_gauge_us{name="encode-node/sha256/3"} 0.39442
time_gauge_us{name="encode-node/sha256/5"} 0.50918
time_gauge_us{name="fr-to-bytes-32"} 0.092348
time_gauge_us{name="hash-blake2s-circuit/create-proof"} 307720
time_gauge_us{name="hash-blake2s-circuit/synthesize"} 0.58429
time_gauge_us{name="hash-blake2s/non-circuit/32"} 0.13493
time_gauge_us{name="hash-blake2s/non-circuit/320"} 0.47986
time_gauge_us{name="hash-blake2s/non-circuit/64"} 0.12398
time_gauge_us{name="hash-pedersen-circuit/create-proof"} 37799
time_gauge_us{name="hash-pedersen-circuit/synthesize"} 1434.7
time_gauge_us{name="hash-pedersen/non-circuit/32"} 18.722
time_gauge_us{name="hash-pedersen/non-circuit/320"} 397.31
time_gauge_us{name="hash-pedersen/non-circuit/64"} 34.457
time_gauge_us{name="hash-sha256-circuit/create-proof"} 288300
time_gauge_us{name="hash-sha256-circuit/synthesize"} 30697
time_gauge_us{name="hash-sha256/non-circuit/32"} 0.34006000000000003
time_gauge_us{name="hash-sha256/non-circuit/320"} 1.6986
time_gauge_us{name="hash-sha256/non-circuit/64"} 0.61475
time_gauge_us{name="kdf/blake2s/10"} 0.75461
time_gauge_us{name="kdf/blake2s/3"} 0.30211
time_gauge_us{name="kdf/blake2s/5"} 0.41220999999999997
time_gauge_us{name="kdf/pedersen/10"} 0.7060700000000001
time_gauge_us{name="kdf/pedersen/3"} 0.25906999999999997
time_gauge_us{name="kdf/pedersen/5"} 0.38075
time_gauge_us{name="kdf/sha256/10"} 0.8754500000000001
time_gauge_us{name="kdf/sha256/3"} 0.38471
time_gauge_us{name="kdf/sha256/5"} 0.5203099999999999
time_gauge_us{name="merkletree/blake2s/1024"} 413.28
time_gauge_us{name="merkletree/blake2s/128"} 146.95
time_gauge_us{name="merkletree/pedersen/1024"} 17903
time_gauge_us{name="merkletree/pedersen/128"} 2210.7000000000003
time_gauge_us{name="parents in a loop/Blake2s/10"} 124.61
time_gauge_us{name="parents in a loop/Blake2s/1000"} 11443
time_gauge_us{name="parents in a loop/Blake2s/50"} 517.15
time_gauge_us{name="parents in a loop/Pedersen/10"} 144.03
time_gauge_us{name="parents in a loop/Pedersen/1000"} 9291.800000000001
time_gauge_us{name="parents in a loop/Pedersen/50"} 577.53
time_gauge_us{name="parents in a loop/Sha256/10"} 160.39
time_gauge_us{name="parents in a loop/Sha256/1000"} 8450.1
time_gauge_us{name="parents in a loop/Sha256/50"} 611.89
time_gauge_us{name="preprocessing/write_padded + unpadded/1024000"} 18303
time_gauge_us{name="preprocessing/write_padded + unpadded/128"} 465.47
time_gauge_us{name="preprocessing/write_padded + unpadded/2048000"} 31778
time_gauge_us{name="preprocessing/write_padded + unpadded/256"} 463.9
time_gauge_us{name="preprocessing/write_padded + unpadded/256000"} 5998.2
time_gauge_us{name="preprocessing/write_padded + unpadded/512"} 419.26
time_gauge_us{name="preprocessing/write_padded + unpadded/512000"} 9211.2
time_gauge_us{name="preprocessing/write_padded/1024000"} 6636.3
time_gauge_us{name="preprocessing/write_padded/128"} 230.69
time_gauge_us{name="preprocessing/write_padded/2048000"} 12681
time_gauge_us{name="preprocessing/write_padded/256"} 233.09
time_gauge_us{name="preprocessing/write_padded/256000"} 1849.1
time_gauge_us{name="preprocessing/write_padded/512"} 238.95
time_gauge_us{name="preprocessing/write_padded/512000"} 3240.6000000000004
time_gauge_us{name="sloth/decode-circuit-create_proof"} 5531.799999999999
time_gauge_us{name="sloth/decode-circuit-synthesize_circuit"} 1.3837
time_gauge_us{name="sloth/decode-non-circuit"} 0.005585
time_gauge_us{name="sloth/encode-non-circuit"} 0.004937400000000001
time_gauge_us{name="xor-circuit/create-proof"} 20208
time_gauge_us{name="xor-circuit/synthesize"} 490.56
time_gauge_us{name="xor/non-circuit/32"} 0.3122
time_gauge_us{name="xor/non-circuit/320"} 2.4207
time_gauge_us{name="xor/non-circuit/64"} 0.52199
porcuquine commented 5 years ago

I'm going to put diffs to the list above here. I will update this comment over time. @laser @dignifiedquire

In general, we may need a name-negotiation pass. I'm not going to fixate on getting all naming perfect first.

Not needed:

Needed:

We can probably just use the CPU's core count for the parallelism numbers above, although that's not quite right. For example, since we parallelize replication and merkle-tree generation, tree generation (except for the final tree) can't use all cores. So, aspirationally, we should capture this accurately, even if we don't initially.
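For the first pass, std's view of available parallelism is one proxy for the core count (with the caveat above that not every phase can use all of it):

```rust
use std::thread;

fn main() {
    // std::thread::available_parallelism reports the number of hardware
    // threads this process can use. As noted above, this overstates what
    // per-layer tree generation can actually exploit, so it is only a
    // starting point for the parallelism numbers.
    let cores = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    println!("parallelism: {}", cores);
}
```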

For hash-function microbenchmarks, we also need circuit information:

We need to capture some configuration, for example whether MAXIMIZE_CACHING is true. We will also need to be able to control such configuration values when running benchmarks. This will matter more if/when configuration complexity increases. Another such value (not yet present in the configuration) is the Pedersen hashing window size (see #736).


Wherever 'cycles' appears, we mean 'pseudocycles', which is elapsed time multiplied by clock speed. So, for example, 1 second at 1GHz would be 1B pseudocycles. The idea is to get a quantity which can be used at least somewhat meaningfully to compare performance on different machines. It's not intended to measure actual processor cycles.

For initial work, it's probably easiest to ignore these numbers and instead report everything in seconds. As long as we also have the clock speed of the processor (which should be captured), we can calculate. NOTE: this will get more complicated if/when we introduce GPU to the timings.
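The conversion itself is just arithmetic; a small sketch matching the 1-second-at-1GHz example above:

```rust
// Convert a wall-clock duration to "pseudocycles": elapsed seconds times
// clock frequency in Hz. These are not actual retired processor cycles,
// just a machine-normalized quantity for cross-machine comparison.
fn pseudocycles(seconds: f64, clock_hz: f64) -> f64 {
    seconds * clock_hz
}

fn main() {
    // 1 second at 1 GHz = 1B pseudocycles, as in the example above.
    println!("{}", pseudocycles(1.0, 1e9));
}
```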


laser commented 5 years ago

@dignifiedquire - I have moved on to some of the infra/ops stuff (getting benchmarks running on master build, queuing benchmarks on Packet, etc.). I'm going to assign this story to you since you're going to be adding additional output (e.g. circuit stuff).

dignifiedquire commented 5 years ago

Even though not everything got done, the core issues seem resolved; closing.