celestiaorg / test-infra

Testing infrastructure for the Celestia Network
Apache License 2.0

Write and run a test of the consensus reactor with 134MB (512x512 ODS) blocks across 100 validators #225

Open musalbas opened 1 year ago

musalbas commented 1 year ago

The purpose of this test is to (1) verify that the CometBFT consensus reactor can handle 134MB blocks across 100 validators, and (2) establish the minimum timeout_commit value usable at that block size. We do not need to run any celestia-node nodes for the scope of this test. We should also disable mempool tx gossiping and generate transactions locally, since the scope of this test is the consensus reactor only.
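A minimal sketch of the relevant config.toml overrides, assuming standard CometBFT config keys (`broadcast` under `[mempool]` disables tx gossiping to peers; `timeout_commit` under `[consensus]` is the value to sweep downward). The starting timeout value is an arbitrary placeholder, not a recommendation:

```toml
# Illustrative overrides for this test; keys follow CometBFT's config.toml layout.

[mempool]
# Disable mempool tx gossiping; transactions are generated locally on each validator.
broadcast = false

[consensus]
# Placeholder starting point for the sweep; lower it until 134MB blocks no longer commit.
timeout_commit = "11s"
```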

Setup

Test

Bidon15 commented 1 year ago

To achieve this we need to keep two areas in mind:

Infrastructure for Setup

To provide 16 vCPUs per validator node, the node instances for the k8s cluster should be at least c5.4xlarge (16 vCPUs), or c5.9xlarge (36 vCPUs). I'd rather start with c5.4xlarge (we have them by default right now) and initially request 14-15 vCPUs for the validator container each pod serves, leaving headroom for system daemons. This ensures we can scale to 100 AWS node instances at a 1:1 validator-to-instance ratio.
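As a sketch, the per-pod resource request on a c5.4xlarge node might look like the following. The pod/image names and the memory figure are illustrative assumptions, not the actual test-infra manifests:

```yaml
# Illustrative validator pod spec (not from the repo) for a 1:1 pod-to-instance ratio.
apiVersion: v1
kind: Pod
metadata:
  name: validator-0
spec:
  containers:
    - name: validator
      image: validator:latest   # placeholder image name
      resources:
        requests:
          cpu: "14"        # leave ~2 of the 16 vCPUs for kubelet/system daemons
          memory: "24Gi"   # assumed; tune to observed usage
        limits:
          cpu: "15"
```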

According to recent test runs with the full validator set and QGB, the current infrastructure implementation should not be the bottleneck.

Testing Environment

Code base

  1. Per the @celestiaorg/celestia-core team's historical usage, it's better to either branch test-infra and strip the celestia-node portion of the test code base, or fork it into a canonical test-infra-consensus repo
  2. Most of the setup of validators is complete
  3. The config.toml changes we need should be straightforward to accommodate; the same goes for genesis.json if we need to modify anything there

Network setup

Validators should have a realistic network latency setup

We already have the prerequisites to start experimenting with 0/100/200/300+ ms latencies, applied per validator to make the network more 'realistic'. Unfortunately, latency is not dynamically changeable during test execution.

Still, I would recommend kickstarting with no bandwidth or latency limitations and watching the validator monitoring to see unrestricted per-pod/per-validator network figures in the Grafana dashboards.

Txsim

We already have a docker image of txsim that we can pull into the Dockerfile that testground builds and runs.

This means we can just add another CLI call in the Go test code and point each validator's celestia address as the master account for txsim to produce big blob submissions:

--blob 500 --blob-sizes 1000000-1000000 --blob-amounts 1-1 --feegrant true
evan-forbes commented 1 year ago

ref https://github.com/celestiaorg/celestia-core/issues/945 and https://github.com/celestiaorg/celestia-app/issues/2033