staheri14 opened this issue 6 months ago
@staheri14 It could be due to knuu's internal timeout handler. Have you tried increasing the timeout here?
Simply set this env var before that line:

```go
// Raise knuu's internal timeout before knuu is initialized.
if err := os.Setenv("KNUU_TIMEOUT", "360m"); err != nil {
	return nil, err
}
```
Note: in the coming refactor, the env var mechanism will be removed, which should hopefully make this easier for users.
I was able to run the following tests on the branch smuu/celestiaorg-celestia-app:smuu/improvements-to-big-block-tests, which is based on the branch celestiaorg/celestia-app:sanaz/big-block-test:
See the PR: https://github.com/celestiaorg/celestia-app/pull/3493
I made some fixes to the tests and some improvements to speed up the process. While doing that, I observed the following.
- Observed the following for all tests > 8MiB:
  INF Timed out dur=8945.650186 height=23 module=consensus round=0 step=1
- Observed the following for all LargeNetwork tests:
  ERR rejected valid incoming transaction; mempool is full err="mempool is full: number of txs 268 (max: 5000), total txs bytes 1072647220 (max: 1073741824)" tx=754F73CFD941C4D1AF82BE111B06F6F87E74BC8BC4D54C474472913E02D8C40C
- Each block does not have more than 8 txs; sometimes I even saw just 2 txs per block:
  INF finalizing commit of block hash={} height=59 module=consensus num_txs=8 root=E5A106FF9549C30137FFABDCA454741A26645072C5D0E4BB81E4E7A55363D938
- Reading the blockchain takes very long (more than 10m). Is there a way to speed that up?
- Creating the port forwards for all the validators takes some time. We are working on an improvement for that.
- With that number of instances, we cause client-side throttling against the Kubernetes API when running the tests. This does not make the tests fail, but it slows them down. We are working on an improvement for that (see the sketch after this list).
- Currently, the LargeNetwork tests are done with 50 validators. We will do testing to increase that number further.
- To support tests with 100 validators, I expect improvements will be needed in both knuu and the tests themselves.
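On the throttling point above: client-side throttling typically comes from the Kubernetes Go client's default rate limits. Below is a sketch of raising them, assuming the tests go through client-go's rest.Config; the kubeconfig loading and the values are illustrative, not the actual knuu setup:

```go
package main

import (
	"fmt"
	"log"

	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Load kubeconfig the usual way; path handling is simplified here.
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		log.Fatal(err)
	}
	// client-go defaults are low (QPS 5, Burst 10); raising them reduces
	// client-side throttling when managing many instances at once.
	cfg.QPS = 50
	cfg.Burst = 100
	fmt.Println(cfg.QPS, cfg.Burst)
}
```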
Thanks a lot @smuu for your great work on this! I also ran the tests, and they worked for me as well! Regarding the observed issues:
- Observed the following for all tests > 8MiB: INF Timed out dur=8945.650186 height=23 module=consensus round=0 step=1
I suppose this was flaky behaviour, or not?
- Observed the following for all LargeNetwork tests: ERR rejected valid incoming transaction; mempool is full err="mempool is full: number of txs 268 (max: 5000), total txs bytes 1072647220 (max: 1073741824)" tx=754F73CFD941C4D1AF82BE111B06F6F87E74BC8BC4D54C474472913E02D8C40C
It is interesting: none of the indicated values exceeds the reported limits, i.e., 268 < 5000 and 1072647220 < 1073741824. Why does it fail? (One possible explanation is sketched below.)
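One possible explanation, as an assumption about the mempool's behaviour rather than something verified against the code: the byte cap may be enforced against the totals after admitting the incoming tx, so a tx can be rejected even though the currently reported totals are still below the limits. A minimal sketch of such a check, with the logged numbers plugged in:

```go
package main

import "fmt"

// wouldExceedCap is a hypothetical admission check: reject the tx if
// admitting it would push the pool past the byte cap.
func wouldExceedCap(poolBytes, txBytes, maxBytes int64) bool {
	return poolBytes+txBytes > maxBytes
}

func main() {
	// Numbers from the log: pool at 1072647220 bytes, cap at 1073741824
	// (1 GiB); an incoming ~1.2 MiB tx is roughly 1258291 bytes.
	// 1072647220 + 1258291 = 1073905511 > 1073741824, so the tx is rejected.
	fmt.Println(wouldExceedCap(1072647220, 1258291, 1073741824)) // true
}
```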
- Each block does not have more than 8 txs; sometimes I even saw just 2 txs per block: INF finalizing commit of block hash={} height=59 module=consensus num_txs=8 root=E5A106FF9549C30137FFABDCA454741A26645072C5D0E4BB81E4E7A55363D938
Well, it depends on the block size. All the submitted txs are of size 1.2 MiB, so a block of size 8 MiB cannot accommodate more than 8 txs. However, for block sizes of 32 and 64 MiB, it should ideally go up to 26 and 53 txs, respectively (a quick way to estimate this is sketched below). For which test did you see this issue?
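For reference, a back-of-the-envelope way to get those figures, assuming uniformly sized txs and ignoring per-block overhead (a sketch, not the test's actual accounting):

```go
package main

import "fmt"

// maxTxsPerBlock is a hypothetical helper: integer division gives the floor.
func maxTxsPerBlock(blockBytes, txBytes int64) int64 {
	return blockBytes / txBytes
}

func main() {
	const mib int64 = 1 << 20
	txSize := int64(1258291) // ~1.2 MiB per tx
	fmt.Println(maxTxsPerBlock(32*mib, txSize)) // 26
	fmt.Println(maxTxsPerBlock(64*mib, txSize)) // 53
}
```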
- Reading the blockchain takes very long (more than 10m). Is there a way to speed that up?
Yes, I witnessed that too. We need to investigate. In my tests, I made the block height range smaller just to fetch a portion of the blocks and check that the network is operating. However, in the long term, we might want to speed up the reading process.
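As an illustration of the narrowed-range approach, here is a sketch using the CometBFT RPC client; the endpoint, the height window, and the client constructor signature are assumptions (the latter varies across CometBFT versions) rather than the tests' actual code:

```go
package main

import (
	"context"
	"fmt"
	"log"

	rpchttp "github.com/cometbft/cometbft/rpc/client/http"
)

func main() {
	// Placeholder endpoint; in the tests this would be a forwarded RPC port.
	c, err := rpchttp.New("tcp://localhost:26657", "/websocket")
	if err != nil {
		log.Fatal(err)
	}
	ctx := context.Background()
	// Fetch only a small window of heights instead of reading the whole chain.
	for h := int64(50); h <= 60; h++ {
		b, err := c.Block(ctx, &h)
		if err != nil {
			log.Fatal(err)
		}
		fmt.Printf("height %d: %d txs\n", b.Block.Height, len(b.Block.Txs))
	}
}
```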
Observed the following for all tests > 8MiB
I suppose this was flaky behaviour, or not?
If this is the propose timeout (timeout_propose), then we should not expect to reach consensus on 8MB blocks within that period.
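For context, that timeout is configurable. A sketch of bumping it via CometBFT's Go config follows; the field names come from the cometbft config package, and the value is illustrative rather than a recommendation:

```go
package main

import (
	"fmt"
	"time"

	cmtcfg "github.com/cometbft/cometbft/config"
)

func main() {
	cfg := cmtcfg.DefaultConfig()
	// Illustrative: give validators more time to receive and process a large
	// (>8 MiB) proposal before the propose step times out.
	cfg.Consensus.TimeoutPropose = 10 * time.Second
	fmt.Println(cfg.Consensus.TimeoutPropose)
}
```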
ERR rejected valid incoming transaction; mempool is full err="mempool is full: number of txs 268 (max: 5000), total txs bytes 1072647220 (max: 1073741824)" tx=754F73CFD941C4D1AF82BE111B06F6F87E74BC8BC4D54C474472913E02D8C40C
Woah, why do we have over a gigabyte of txs in the mempool? How many sequences/txsim instances are we running in total?
Is there anything else the DevOps team can do to support closing this issue?
Thanks @smuu and DevOps for your great help with this issue!
I am trying to run a 100-node test (100 validators + 100 txsims) but haven't been successful. Is it actually possible to run a test at this scale, or should we stick to a 50-node test (50 validators + 50 txsims)?
I also tried running the tests with 50 nodes but couldn't get a successful run. The issue is that the first txclient or validator fails to start:
2024/06/24 15:27:36 failed to run the benchmark test: failed to start testnet: node val0 failed to start: timeout while waiting for instance 'val0-4f0b03bb' to be running
2024/06/24 15:37:53 failed to run the benchmark test: failed to start testnet: txsim txsim0 failed to start: timeout while waiting for instance 'txsim0-e6158a5a' to be running
When I attempt to access the logs, the nodes seem to have already been torn down, so I am not sure what the root cause is.
I also tried running tests with a smaller network, and the same issue occurs, so maybe there is something off with the cluster?
I'll keep you posted if I find something new.
Running a large-network e2e test consisting of 100 Knuu instances is currently experiencing some issues, causing the tests to fail halfway through. The primary problems are that images cannot be reused, requiring the creation of a new Docker image for each instance, and that tests fail to deploy mid-process for various reasons. The former is to be resolved in the new release of knuu (here is another issue tracking the integration of the new release candidate); however, the latter still needs investigation.
Below are some of the errors observed when tests failed to complete:
Main Reasons for Failures with Sample Error Logs:
The following errors happen nondeterministically and usually disappear from one run to another.
Replicate the issue