Benchmark + optimize aggregate and proof validation

dapplion commented 3 years ago

Metrics from Prater on one of our Contabo VPS S size nodes show that when the node is synced, 80% of CPU time is spent validating aggregate and proof gossip messages. To keep up, we also drop 35% of all received messages.

Screenshot from 2021-03-31 09-21-52

The average job duration is 20-30 ms. In stable Prater conditions, all attestation target states should be in the cache and cost nothing to retrieve. The only big cost is then signature validation, which covers 3 signatures (selection proof + aggregator sig + aggregate attestation sig). A BLS signature verification costs between 1 and 2 ms, so the 3 sigs would cost 3-6 ms if verified one by one. Since they are verified in a batch, which should give a discount of ~50%, the total job time should be between 1.5 and 3 ms.
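The estimate above can be written out as a small calculation. This is only a sketch of the arithmetic: the function name `estimateBatchVerifyMs` and its parameters are hypothetical, and the inputs are the rough figures quoted in this comment, not new measurements.

```typescript
// Back-of-the-envelope estimate of the expected validation job time.
// All inputs are the approximate figures from the comment above.
interface JobEstimate {
  minMs: number;
  maxMs: number;
}

// Hypothetical helper: cost of verifying `sigCount` signatures in one batch,
// assuming batching saves a fixed fraction `batchDiscount` of the total cost.
function estimateBatchVerifyMs(
  sigCount: number,
  perSigMinMs: number,
  perSigMaxMs: number,
  batchDiscount: number
): JobEstimate {
  const factor = 1 - batchDiscount;
  return {
    minMs: sigCount * perSigMinMs * factor,
    maxMs: sigCount * perSigMaxMs * factor,
  };
}

// 3 signatures, 1-2 ms each, ~50% batch discount.
const expected = estimateBatchVerifyMs(3, 1, 2, 0.5);

// Observed average job duration reported in the metrics: 20-30 ms.
const observed: JobEstimate = {minMs: 20, maxMs: 30};

// How much slower the observed job is than the estimate, best and worst case.
const slowdown = {
  best: observed.minMs / expected.maxMs,
  worst: observed.maxMs / expected.minMs,
};

console.log(expected, slowdown);
```

Under these assumptions the observed jobs are roughly 7x to 20x slower than the signature cost alone would explain.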

We should investigate the performance of that validation since there is significant room for improvement.

twoeths commented 3 years ago

There are almost 600 different aggregate and proof items per slot. Each validateAggregateAttestation call takes 6 ms on average (some exceptional ones may take up to 60 ms due to GC), and most of that time is spent on batch signature verification. For this, we should consider increasing our job queue size.
There are some duplicate attestations (with different aggregators). Validating the later ones is redundant since, in the end, the attestation and proof db uses the attestation root as key.
We should create a TreeBacked value for gossip AggregateAndProof, since we need to call struct_hashTreeRoot in a couple of places (getAggregateAndProofSignatureSet, getIndexedAttestationSignatureSet and AggregateAndProofRepository.add).
There are no other heavy operations besides signature verification; there are simply too many aggregate and proof items to validate.
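The duplicate-attestation point could be handled with a small "seen" cache checked before signature verification. The sketch below is not Lodestar's implementation: `SeenAttestationRoots`, `validateOnce` and the `verifySignatures` callback are hypothetical names, the root is assumed to be already computed and hex-encoded, and it assumes that ignoring a later aggregate with an already-validated attestation root is acceptable.

```typescript
type Slot = number;
type RootHex = string;

// Hypothetical cache of attestation roots that already passed validation,
// grouped by slot so old entries can be pruned cheaply.
class SeenAttestationRoots {
  private readonly rootsBySlot = new Map<Slot, Set<RootHex>>();

  isKnown(slot: Slot, root: RootHex): boolean {
    const roots = this.rootsBySlot.get(slot);
    return roots !== undefined && roots.has(root);
  }

  add(slot: Slot, root: RootHex): void {
    let roots = this.rootsBySlot.get(slot);
    if (roots === undefined) {
      roots = new Set<RootHex>();
      this.rootsBySlot.set(slot, roots);
    }
    roots.add(root);
  }

  // Drop every slot older than `lowestSlotToKeep`.
  prune(lowestSlotToKeep: Slot): void {
    for (const slot of Array.from(this.rootsBySlot.keys())) {
      if (slot < lowestSlotToKeep) this.rootsBySlot.delete(slot);
    }
  }
}

type Result = "accepted" | "ignored" | "rejected";

// Run the expensive signature check only for roots not seen before.
// `verifySignatures` stands in for the batch BLS verification.
function validateOnce(
  seen: SeenAttestationRoots,
  slot: Slot,
  attestationRoot: RootHex,
  verifySignatures: () => boolean
): Result {
  if (seen.isKnown(slot, attestationRoot)) return "ignored";
  if (!verifySignatures()) return "rejected";
  seen.add(slot, attestationRoot);
  return "accepted";
}

// Usage: two aggregates carrying the same attestation root in the same slot.
let verifyCalls = 0;
const fakeVerify = (): boolean => {
  verifyCalls++;
  return true;
};
const seen = new SeenAttestationRoots();
const first = validateOnce(seen, 10, "0xabc", fakeVerify);
const second = validateOnce(seen, 10, "0xabc", fakeVerify);
console.log(first, second, verifyCalls);
```

Only the first aggregate pays for signature verification here; roots are recorded after a successful check so that an invalid message cannot cause a later valid one to be skipped.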

prater_validate_aggregate_attestation_0405.cpuprofile.zip

dapplion commented 2 years ago

Already addressed with https://github.com/ChainSafe/lodestar/pull/2760 and https://github.com/ChainSafe/lodestar/pull/2801

ChainSafe / lodestar

Benchmark + optimize aggregate and proof validation #2306