namn-grg opened 7 months ago
Hey @namn-grg, what we usually do to obtain tx latencies in CometBFT experiments is to create each tx with an id and a timestamp in its content, instead of just putting random data. Then we can query the timestamp of each committed block, take each tx in the block, and compare the block timestamp to the timestamp encoded in the tx.

This is the original implementation: https://github.com/tendermint/tendermint/pull/9351

And this is the latest QA report with some latency plots as an example: https://github.com/cometbft/cometbft/blob/main/docs/references/qa/CometBFT-QA-v1.md#latencies
If I understand correctly, you want to analyse the logs to calculate latencies. We haven't tried that approach, though I imagine the logs would be huge, considering that our experiments have millions of txs.
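For illustration, here is a minimal Go sketch of that id+timestamp approach, assuming a simple fixed-layout payload. It is not the actual loadtime payload format; see the linked PR for the real implementation:

```go
// Minimal sketch, not the actual loadtime implementation: embed an id and a send
// timestamp in each tx, then recover the latency from the committed block's time.
package main

import (
	"encoding/binary"
	"fmt"
	"time"
)

// makeTx builds a payload laid out as [8-byte id | 8-byte unix-nano send time | padding].
func makeTx(id uint64, size int) []byte {
	tx := make([]byte, size)
	binary.BigEndian.PutUint64(tx[0:8], id)
	binary.BigEndian.PutUint64(tx[8:16], uint64(time.Now().UnixNano()))
	return tx
}

// latency compares the block timestamp against the send time encoded in the tx.
func latency(tx []byte, blockTime time.Time) time.Duration {
	sent := time.Unix(0, int64(binary.BigEndian.Uint64(tx[8:16])))
	return blockTime.Sub(sent)
}

func main() {
	tx := makeTx(42, 256)
	// In a real experiment blockTime comes from querying the committed block header.
	fmt.Println(latency(tx, time.Now().Add(300*time.Millisecond)))
}
```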
Thanks for the help @hvanz, I think I figured out what to do.
QQ - is the latency output from test/loadtime/cmd/report the E2E latency? i.e. mean( time[tx_committed_inside_a_block] - time[tx_sent_from_client] )
That's right. That's the tool we use for computing latencies.
@hvanz This is a test we ran on a CometBFT network. Do you think the results are reasonable, i.e. did we calculate them appropriately? Any other views you have would also be helpful.
```
SUMMARY:
 CONFIG:
  Faults: 0 nodes
  Committee size: 4 nodes
  Input rate: 1,000 tx/s
  Transaction size: 256 B
  Execution time: 96 s
 RESULTS:
  Consensus TPS: 933 tx/s
  Consensus BPS: 238,917 B/s
  Consensus latency: 289 ms
  End-to-end TPS: 958 tx/s
  End-to-end BPS: 245,280 B/s
  End-to-end latency: 951 ms
```
Yes, those numbers seem ok for a simple local network.
Goal
Calculate the E2E latency for the CometBFT consensus mechanism under a load test scenario.
How is E2E Latency calculated?
It is calculated by taking the mean of the time difference between when the tx is sent from the client side and when it is committed inside a block:

E2E Latency = mean( time[tx_committed_inside_a_block] - time[tx_sent_from_client] )
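As a concrete (hypothetical) instance of that formula, assuming we already have send and commit timestamps paired per tx:

```go
// Minimal sketch of the E2E latency formula above; txRecord and its fields are
// illustrative, not taken from the loadtime tool.
package main

import (
	"fmt"
	"time"
)

type txRecord struct {
	sentAt      time.Time // when the client sent the tx
	committedAt time.Time // timestamp of the block that included it
}

// meanE2ELatency averages (committedAt - sentAt) over all recorded txs.
func meanE2ELatency(records []txRecord) time.Duration {
	var total time.Duration
	for _, r := range records {
		total += r.committedAt.Sub(r.sentAt)
	}
	return total / time.Duration(len(records))
}

func main() {
	now := time.Now()
	records := []txRecord{
		{sentAt: now, committedAt: now.Add(900 * time.Millisecond)},
		{sentAt: now, committedAt: now.Add(time.Second)},
	}
	fmt.Println(meanE2ELatency(records)) // 950ms
}
```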
Problem 1
The CometBFT client (load-test) sends txs in batches, and the node logs don't show the time at which a particular tx was received.
Problem 2
The random (key = value) data generated by the client at each node sometimes leads to two different nodes generating the same value, which results in that tx not being included in the mempool (log: "tx already exists in the mempool"). This leads to incorrect results, because the expected tx rate is not actually achieved: a tx that is supposed to get committed isn't, due to the duplicates.
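As a rough sanity check on why duplicates show up at all, the birthday bound says that k values drawn uniformly from a space of size N collide with probability about 1 - exp(-k²/2N). The numbers below (100,000 txs, 2^32 effective values) are pure assumptions for illustration:

```go
// Back-of-the-envelope birthday-bound calculation; k and N are assumed values,
// not measurements from the load-test tool.
package main

import (
	"fmt"
	"math"
)

// collisionProb approximates the chance of at least one duplicate among k
// uniformly random values drawn from a space of size n.
func collisionProb(k, n float64) float64 {
	return 1 - math.Exp(-k*k/(2*n))
}

func main() {
	// e.g. 100,000 txs whose random part has ~2^32 possible values
	fmt.Printf("%.1f%%\n", 100*collisionProb(1e5, math.Pow(2, 32)))
}
```

With these illustrative numbers the collision probability comes out near 69%, so duplicates are expected unless each payload carries enough unique entropy.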
Possible Solution
Output an additional debug log on the client side after every specific set of txs is sent. Debug log: f"Tx {tx_num} in batch {batch_num} successfully sent". This will give us the timestamp of that particular tx number.
The tx data is generated randomly. We can add the identifier specified below at the start, which will allow us to catch it in the node logs. (This won't work for small data sizes, since it depends on the identifier length.)
```
identifier = node_id + batch_num + tx_num   // make it hex
data_string = identifier + "f" * (length - len(identifier))
```
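A runnable Go version of this pseudocode might look like the sketch below. The field widths and helper names are assumptions, and the hex fields are concatenated (not arithmetically summed), since fixed-width concatenation guarantees that distinct (node, batch, tx) triples can never produce the same identifier:

```go
// Hypothetical sketch of the identifier scheme above; nodeID, batchNum, txNum,
// and the debug-log format are assumptions, not existing loadtime code.
package main

import (
	"fmt"
	"log"
	"strings"
)

// makePayload builds a tx whose prefix uniquely identifies node, batch, and tx,
// padded with "f" up to the configured tx size.
func makePayload(nodeID, batchNum, txNum, size int) string {
	// Fixed-width hex fields keep identifiers unique and easy to grep for.
	identifier := fmt.Sprintf("%02x%06x%06x", nodeID, batchNum, txNum)
	if len(identifier) > size {
		log.Fatalf("tx size %d is too small for identifier of length %d", size, len(identifier))
	}
	return identifier + strings.Repeat("f", size-len(identifier))
}

func main() {
	nodeID, batchNum, txNum := 1, 7, 42
	payload := makePayload(nodeID, batchNum, txNum, 64)
	// The proposed client-side debug log, so the send time can be recovered later.
	log.Printf("Tx %d in batch %d successfully sent: %s...", txNum, batchNum, payload[:14])
}
```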
Result
Using the newly added client debug log, we will know the specific identifier to look for in the node logs. We then match the identifier in the node logs and read off the two timestamps.
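A rough sketch of that matching step, assuming each log line starts with an RFC3339 timestamp (the actual CometBFT log format may differ, so the parsing would need to be adapted):

```go
// Hypothetical log-matching sketch; file names, line layout, and timestamp
// format are assumptions for illustration.
package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
	"time"
)

// findTimestamp returns the timestamp of the first log line containing the
// identifier, assuming the timestamp is the first whitespace-separated field.
func findTimestamp(path, identifier string) (time.Time, error) {
	f, err := os.Open(path)
	if err != nil {
		return time.Time{}, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, identifier) {
			return time.Parse(time.RFC3339Nano, strings.Fields(line)[0])
		}
	}
	return time.Time{}, fmt.Errorf("identifier %q not found in %s", identifier, path)
}

func main() {
	id := "0100000700002a" // hex identifier recovered from the client debug log
	sent, err := findTimestamp("client.log", id)
	if err != nil {
		fmt.Println(err)
		return
	}
	committed, err := findTimestamp("node.log", id)
	if err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("e2e latency:", committed.Sub(sent))
}
```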