staheri14 opened this issue 6 months ago
@staheri14 It could be due to knuu's internal timeout handler. Have you tried increasing the timeout here?
Simply set this env var before that line:

```go
// Raise knuu's internal timeout before knuu is initialized.
if err := os.Setenv("KNUU_TIMEOUT", "360m"); err != nil {
	return nil, err
}
```
Note: in the coming refactor, the env var mechanism will be removed, which should hopefully make this easier for users.
I was able to run the following tests on the branch smuu/celestiaorg-celestia-app:smuu/improvements-to-big-block-tests, which is based on the branch celestiaorg/celestia-app:sanaz/big-block-test:
See the PR: https://github.com/celestiaorg/celestia-app/pull/3493
I made some fixes to the tests and some improvements to speed up the process. While doing that, I observed the following.
- Observed the following for all tests > 8MiB:
  INF Timed out dur=8945.650186 height=23 module=consensus round=0 step=1
- Observed the following for all LargeNetwork tests:
  ERR rejected valid incoming transaction; mempool is full err="mempool is full: number of txs 268 (max: 5000), total txs bytes 1072647220 (max: 1073741824)" tx=754F73CFD941C4D1AF82BE111B06F6F87E74BC8BC4D54C474472913E02D8C40C
- Each block does not have more than 8 txs; sometimes I even saw just 2 txs per block:
  INF finalizing commit of block hash={} height=59 module=consensus num_txs=8 root=E5A106FF9549C30137FFABDCA454741A26645072C5D0E4BB81E4E7A55363D938
- Reading the blockchain takes very long (more than 10m). Is there a way to speed that up?
- Creating the port forwards for all the validators takes some time. We are working on an improvement for that.
- With that number of instances, we cause client-side throttling against the Kubernetes API when running the tests. This does not make the tests fail, but it slows them down. We are working on an improvement for that (see the sketch after this list).
- Currently, the LargeNetwork tests are done with 50 validators. We will do testing to increase that number further.
- To support tests with 100 validators, I expect improvements will be needed in both knuu and the tests themselves.
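On the throttling point above: client-side throttling typically comes from the Kubernetes Go client's default rate limits. Below is a sketch of raising them, assuming the tests go through client-go's rest.Config; the kubeconfig loading and the values are illustrative, not the actual knuu setup:

```go
package main

import (
	"fmt"
	"log"

	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig the usual way; path handling is simplified here.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	// client-go defaults are low (QPS 5, Burst 10); raising them reduces
	// client-side throttling when managing many instances at once.
	cfg.QPS = 50
	cfg.Burst = 100
	fmt.Println(cfg.QPS, cfg.Burst)
}
```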
Thanks a lot @smuu for your great work on this! I also ran the tests, and they worked for me as well! Regarding the observed issues:
- Observed the following for all tests > 8MiB: INF Timed out dur=8945.650186 height=23 module=consensus round=0 step=1
I suppose this was flaky behaviour, or not?
- Observed the following for all LargeNetwork tests: ERR rejected valid incoming transaction; mempool is full err="mempool is full: number of txs 268 (max: 5000), total txs bytes 1072647220 (max: 1073741824)" tx=754F73CFD941C4D1AF82BE111B06F6F87E74BC8BC4D54C474472913E02D8C40C
It is interesting: none of the indicated values exceeds the reported limits, i.e., 268 < 5000 and 1072647220 < 1073741824. Why does it fail? (One possible explanation is sketched below.)
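One possible explanation, as an assumption about the mempool's behaviour rather than something verified against the code: the byte cap may be enforced against the totals after admitting the incoming tx, so a tx can be rejected even though the currently reported totals are still below the limits. A minimal sketch of such a check, with the logged numbers plugged in:

```go
package main

import "fmt"

// wouldExceedCap is a hypothetical admission check: reject the tx if
// admitting it would push the pool past the byte cap.
func wouldExceedCap(poolBytes, txBytes, maxBytes int64) bool {
	return poolBytes+txBytes > maxBytes
}

func main() {
	// Numbers from the log: pool at 1072647220 bytes, cap at 1073741824
	// (1 GiB); an incoming ~1.2 MiB tx is roughly 1258291 bytes.
	// 1072647220 + 1258291 = 1073905511 > 1073741824, so the tx is rejected.
	fmt.Println(wouldExceedCap(1072647220, 1258291, 1073741824)) // true
}
```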
- Each block does not have more than 8 txs; sometimes I even saw just 2 txs per block: INF finalizing commit of block hash={} height=59 module=consensus num_txs=8 root=E5A106FF9549C30137FFABDCA454741A26645072C5D0E4BB81E4E7A55363D938
Well, it depends on the block size. All the submitted txs are of size 1.2 MiB, so a block of size 8 MiB cannot accommodate more than 8 txs. However, for block sizes of 32 and 64 MiB, it should ideally go up to 26 and 53 txs, respectively (a quick way to estimate this is sketched below). For which test did you see this issue?
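For reference, a back-of-the-envelope way to get those figures, assuming uniformly sized txs and ignoring per-block overhead (a sketch, not the test's actual accounting):

```go
package main

import "fmt"

// maxTxsPerBlock is a hypothetical helper: integer division gives the floor.
func maxTxsPerBlock(blockBytes, txBytes int64) int64 {
	return blockBytes / txBytes
}

func main() {
	const mib int64 = 1 << 20
	txSize := int64(1258291) // ~1.2 MiB per tx
	fmt.Println(maxTxsPerBlock(32*mib, txSize)) // 26
	fmt.Println(maxTxsPerBlock(64*mib, txSize)) // 53
}
```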
- Reading the blockchain takes very long (more than 10m). Is there a way to speed that up?
Yes, I witnessed that too. We need to investigate. In my tests, I made the block height range smaller just to fetch a portion of the blocks and check that the network is operating. However, in the long term, we might want to speed up the reading process.
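As an illustration of the narrowed-range approach, here is a sketch using the CometBFT RPC client; the endpoint, the height window, and the client constructor signature are assumptions (the latter varies across CometBFT versions) rather than the tests' actual code:

```go
package main

import (
	"context"
	"fmt"
	"log"

	rpchttp "github.com/cometbft/cometbft/rpc/client/http"
)

func main() {
	// Placeholder endpoint; in the tests this would be a forwarded RPC port.
	c, err := rpchttp.New("tcp://localhost:26657", "/websocket")
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()
	// Fetch only a small window of heights instead of reading the whole chain.
	for h := int64(50); h <= 60; h++ {
		b, err := c.Block(ctx, &h)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("height %d: %d txs\n", b.Block.Height, len(b.Block.Txs))
	}
}
```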
Observed the following for all tests > 8MiB
I suppose this was flaky behaviour, or not?
If this is the propose timeout (timeout_propose), then we should not expect to reach consensus on 8MB blocks within that period.
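For context, that timeout is configurable. A sketch of bumping it via CometBFT's Go config follows; the field names come from the cometbft config package, and the value is illustrative rather than a recommendation:

```go
package main

import (
	"fmt"
	"time"

	cmtcfg "github.com/cometbft/cometbft/config"
)

func main() {
	cfg := cmtcfg.DefaultConfig()
	// Illustrative: give validators more time to receive and process a large
	// (>8 MiB) proposal before the propose step times out.
	cfg.Consensus.TimeoutPropose = 10 * time.Second
	fmt.Println(cfg.Consensus.TimeoutPropose)
}
```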
ERR rejected valid incoming transaction; mempool is full err="mempool is full: number of txs 268 (max: 5000), total txs bytes 1072647220 (max: 1073741824)" tx=754F73CFD941C4D1AF82BE111B06F6F87E74BC8BC4D54C474472913E02D8C40C
Woah, why do we have over a gigabyte of txs in the mempool? How many sequences/txsim instances are we running in total?
Is there anything else the DevOps team can do to support closing this issue?
Thanks @smuu and DevOps for your great help with this issue!
I am trying to run a 100-node test (100 validators + 100 txsims) but haven't been successful. Is it actually possible to run a test at this scale, or should we stick to a 50-node test (50 validators + 50 txsims)?
I also tried running the tests with 50 nodes but couldn't get a successful run. The issue is that the first txclient or validator fails to start:
2024/06/24 15:27:36 failed to run the benchmark test: failed to start testnet: node val0 failed to start: timeout while waiting for instance 'val0-4f0b03bb' to be running
2024/06/24 15:37:53 failed to run the benchmark test: failed to start testnet: txsim txsim0 failed to start: timeout while waiting for instance 'txsim0-e6158a5a' to be running
When I attempt to access the logs, the nodes seem to have already been torn down, so I am not sure what the root cause is.
I also tried running tests with a smaller network, and the same issue occurs, so maybe there is something off with the cluster?
I'll keep you posted if I find something new.
Running a large-network e2e test consisting of 100 Knuu instances is currently experiencing some issues, causing the tests to fail halfway through. The primary problems are that images cannot be reused, requiring the creation of a new Docker image for each instance, and that tests fail to deploy mid-process for various reasons. The former is to be resolved in the new release of knuu (here is another issue tracking the integration of the new release candidate); however, the latter still needs investigation.
Below are some of the errors observed when tests failed to complete:
Main Reasons for Failures with Sample Error Logs:
The following errors happen nondeterministically and usually disappear from one run to another.
Replicate the issue