celestiaorg / celestia-app

PoS application for the consensus portion of the Celestia network. Built using celestia-core (fork of CometBFT) and the cosmos-sdk
https://celestia.org
Apache License 2.0

Comprehensive Bandwidth Utilization Analysis Across Network Experiments (Knuu and Testground) #3576

Open staheri14 opened 1 week ago

staheri14 commented 1 week ago

Problem

Previously, when debugging bandwidth utilization in Testground, we often narrowed our focus to a single height of the experiment, or even to a specific node or pair of nodes. However, comparing the performance of two experiments (i.e., sanity-checking Testground against Knuu) requires a broader view. We need statistics that describe bandwidth utilization across the entire experiment, covering all heights and all nodes.

Proposed Solution

Recall that we aim to answer the question, "Is Testground the reason why Comet can’t utilize its bandwidth effectively?" One way to assess this, which is the approach pursued in this proposal, is by examining the best performance of P2P connections to determine if Testground performs as well as Knuu.

I have developed a method to summarize the bandwidth utilization of the entire network when nodes are at their highest performance. It gives an overview of bandwidth utilization across the network using the data traced in the received bytes table (for additional context: received bytes captures the size of the messages a node receives from all of its connections across all channel IDs). The process involves:

  1. Identifying the Best Connection: For each node in the experiment, I identify the best connection, defined as the peer whose top 10% of traffic-rate samples outperforms those of all the node's other connections.

    • Alternatively, we could focus on the block propagation time periods. Specifically, for each height, we would identify when a node starts receiving block parts from its connections and capture the received rate for that period only. This approach targets the busiest period of the consensus state machine. While feasible, it requires cross-referencing multiple traced tables, which could increase the likelihood of errors. Analyzing the top 10% (or another percentage) of received rates is likely just as effective for gaining insights into the maximum performance of P2P connections.
  2. Calculating Statistics: For the identified best connection of each node, the following statistics are calculated for the top 10% of the times:

    • Average
    • Standard Deviation
    • Minimum
    • Maximum
    • Total number of samples that make up the top 10% of traffic. Each sample represents 1 second.
  3. Summarizing for All Nodes: The above calculations are performed for all nodes in the experiment.

  4. Extracting and Comparing Data: A table is extracted to summarize and compare data from Knuu and Testground, considering the nodes' degree (number of connections). The statistics of nodes with similar degrees are compared between the two platforms. We expect to see similar performance, i.e., received traffic rate, across nodes with similar degrees in both backends.
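The steps above can be sketched roughly as follows. This is a minimal illustration, not the script used for the experiments. It assumes the per-second receive rates are already available as a dictionary keyed by peer ID, that "outperforms" means a higher mean over the top 10% of samples, and that a node's degree equals the number of peers that appear in its trace. All function and field names are hypothetical.

```python
import statistics
from collections import defaultdict


def top_samples(rates, top_percent=10):
    """Return the highest `top_percent` percent of samples (at least one)."""
    if not rates:
        return []
    k = max(1, len(rates) * top_percent // 100)
    return sorted(rates, reverse=True)[:k]


def best_connection_stats(rates_by_peer, top_percent=10):
    """Steps 1 and 2 for a single node.

    rates_by_peer maps peer_id -> list of bytes received per 1-second sample.
    Assumption: the best connection is the peer with the highest mean
    over its top samples.
    """
    best_peer, best_top, best_mean = None, [], None
    for peer, rates in rates_by_peer.items():
        top = top_samples(rates, top_percent)
        if not top:
            continue
        mean = statistics.fmean(top)
        if best_mean is None or mean > best_mean:
            best_peer, best_top, best_mean = peer, top, mean
    if best_peer is None:
        return None
    return {
        "best_connection": best_peer,
        "average": best_mean,
        "std": statistics.pstdev(best_top),
        "min": min(best_top),
        "max": max(best_top),
        "total_samples": len(best_top),
        # Assumption: degree = number of peers seen in this node's trace.
        "degree": len(rates_by_peer),
    }


def summarize(experiment):
    """Step 3: experiment maps node_id -> rates_by_peer."""
    rows = {}
    for node, rates_by_peer in experiment.items():
        stats = best_connection_stats(rates_by_peer)
        if stats is not None:
            rows[node] = stats
    return rows


def average_by_degree(rows):
    """Step 4 helper: mean of the per-node averages, grouped by degree."""
    groups = defaultdict(list)
    for stats in rows.values():
        groups[stats["degree"]].append(stats["average"])
    return {degree: statistics.fmean(v) for degree, v in sorted(groups.items())}
```

Running `summarize` on the traces of each backend and then comparing the two `average_by_degree` results side by side would give one possible form of the comparison described in step 4.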

Action Items

staheri14 commented 1 week ago

This comment is intended to illustrate how the results look and does not provide a full comparison between Knuu and Testground. I will compile a document with the actual comparison.

Results from Knuu: a 45-node experiment in which each validator has a send and receive rate of 5 MB/s. The total duration of the experiment is about 38 minutes, i.e., 2330 seconds. Each row represents the statistics for one node, with the node in the first column. The peer ID of its best connection is listed under best_connection. The table has fewer than 45 rows because the traced data for some validators was problematic (it couldn't be parsed), so no results could be reported for them.

The receive traffic rate is calculated by summing the sizes of the messages received from the best_connection in each second; each one-second sum is a sample in the table. total_samples is the number of samples in the top 10%, i.e., the samples that contributed to the reported statistics. For example, a value of 233 indicates that the node and its best connection were engaged for a total of 2330 seconds, and that the statistics were computed over the top 10% of those samples. This count is used to assess the reliability of the reported statistics.
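As a rough illustration of how the per-second samples and total_samples could be derived, here is a small sketch. The record layout (timestamp in seconds, peer ID, message size in bytes) is a hypothetical simplification of the received bytes table, and the sketch assumes that only seconds in which at least one message arrived count as samples.

```python
import math
from collections import defaultdict


def per_second_rates(records, peer_id):
    """Sum message sizes received from `peer_id` into 1-second buckets.

    records: iterable of (timestamp_seconds, peer_id, size_bytes) tuples
    (a hypothetical layout, not the actual traced schema).
    Returns one value per second with traffic, in chronological order.
    """
    buckets = defaultdict(int)
    for timestamp, peer, size in records:
        if peer == peer_id:
            buckets[math.floor(timestamp)] += size
    return [buckets[second] for second in sorted(buckets)]


def total_samples(num_samples, top_percent=10):
    """Number of samples that fall in the top `top_percent` percent."""
    return max(1, num_samples * top_percent // 100) if num_samples else 0
```

With 2330 one-second samples, `total_samples(2330)` evaluates to 233, matching the example above.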

[Image: knuu_bps]

staheri14 commented 1 week ago

cc: @evan-forbes