celestiaorg / test-infra

Testing infrastructure for the Celestia Network
Apache License 2.0

Write and run a test of the consensus reactor with 134MB (512x512 ODS) blocks across 100 validators #225

Open musalbas opened 1 year ago

musalbas commented 1 year ago

The purpose of this test is to (1) verify that the CometBFT consensus reactor can handle 134MB blocks across 100 validators, and (2) establish the minimum timeout_commit value usable at that block size. We do not need to run any celestia-node nodes for the scope of this test. We should also disable mempool tx gossiping and generate transactions locally, since the scope of this test is the consensus reactor only.
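A minimal sketch of the relevant config.toml overrides, assuming standard CometBFT config keys (`broadcast` under `[mempool]` disables tx gossiping to peers; `timeout_commit` under `[consensus]` is the value to sweep downward). The starting timeout value is an arbitrary placeholder, not a recommendation:

```toml
# Illustrative overrides for this test; keys follow CometBFT's config.toml layout.

[mempool]
# Disable mempool tx gossiping; transactions are generated locally on each validator.
broadcast = false

[consensus]
# Placeholder starting point for the sweep; lower it until 134MB blocks no longer commit.
timeout_commit = "11s"
```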

Setup

Test

Bidon15 commented 1 year ago

To achieve this we need to keep two areas in mind:

Infrastructure for Setup

To provide 16 vCPUs per validator node, the node instances for the k8s cluster should be at least c5.4xlarge (16 vCPUs), or c5.9xlarge (36 vCPUs). I'd rather start with c5.4xlarge (we have them by default right now) and initially request 14-15 vCPUs for the validator container each pod serves, leaving headroom for system daemons. This ensures we can scale to 100 AWS node instances at a 1:1 validator-to-instance ratio.
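As a sketch, the per-pod resource request on a c5.4xlarge node might look like the following. The pod/image names and the memory figure are illustrative assumptions, not the actual test-infra manifests:

```yaml
# Illustrative validator pod spec (not from the repo) for a 1:1 pod-to-instance ratio.
apiVersion: v1
kind: Pod
metadata:
  name: validator-0
spec:
  containers:
    - name: validator
      image: validator:latest   # placeholder image name
      resources:
        requests:
          cpu: "14"        # leave ~2 of the 16 vCPUs for kubelet/system daemons
          memory: "24Gi"   # assumed; tune to observed usage
        limits:
          cpu: "15"
```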

According to recent test runs with the full validator set and QGB, the current infrastructure implementation should not be the bottleneck.

Testing Environment

Code base

  1. Per the @celestiaorg/celestia-core team's historical usage, it's better to either branch test-infra and strip the celestia-node portion of the test code base, or fork it into a canonical test-infra-consensus repo
  2. Most of the setup of validators is complete
  3. The config.toml changes we need should be straightforward to accommodate; the same goes for genesis.json if we need to modify anything there

Network setup

Validators should have a realistic network latency setup

We already have the prerequisites to start experimenting with 0/100/200/300+ ms latencies, applied per validator to make the network more 'realistic'. Unfortunately, latency is not dynamically changeable during test execution.

Still, I would recommend kickstarting with no bandwidth or latency limitations and watching the validator monitoring to see unrestricted per-pod/per-validator network figures in the Grafana dashboards.

Txsim

We already have a docker image of txsim that we can pull into the Dockerfile that testground builds and runs.

This means we can just add another CLI call in the Go test code and point each validator's celestia address as the master account for txsim to produce big blob submissions:

--blob 500 --blob-sizes 1000000-1000000 --blob-amounts 1-1 --feegrant true
evan-forbes commented 1 year ago

ref https://github.com/celestiaorg/celestia-core/issues/945 and https://github.com/celestiaorg/celestia-app/issues/2033