celestiaorg / celestia-app

Celestia consensus node
https://celestiaorg.github.io/celestia-app/
Apache License 2.0

Fix flakes in e2e tests #4028

Open evan-forbes opened 2 weeks ago

evan-forbes commented 2 weeks ago

Currently, the e2e tests can be quite flaky. Some flakes are due to the e2e test logic itself; others might be due to knuu-related issues.

There were a bunch of txsim-related ones, but I think those have been fixed, so I'm not including them here for the time being. I'm also not including any from MajorUpgradeToV3, as that is a known issue tracked in https://github.com/celestiaorg/celestia-app/issues/4023

Observed Flakes:

1)

ERROR E2ESimple
test-e2e2024/10/08 17:03:33 --- ERROR E2ESimple: expected at least 10 transactions, got 0

2)

2024/10/08 14:48:10 failed to wait for height: post failed: Post "http://151.115.12.124:80/val3-26657": context deadline exceeded
exit status 1

3)

RUN MinorVersionCompatibility
2024/10/02 14:51:35 failed to upgrade node: error waiting for instance 'val3' to be running: error checking if instance 'val3' is running: failed to get pod val3: replicasets.apps "val3" not found
exit status 1

4)

MinorVersionCompatibility
2024/10/08 12:45:45 Failed to start testnet: failed to start node val0: error getting status

5) This failure is likely due to not skipping v1.8.0, which was retracted IIRC. I'm not sure whether skipping it fixes the others as well.

{"level":"debug","RPC Address":"http://151.115.12.124:80/val2-26657","time":"2024-10-08T14:46:26Z","message":"Creating HTTP client for node"}
test-e2e2024/10/08 14:46:26 Upgrading node node 3 version v1.5.0
{"level":"debug","RPC Address":"http://151.115.12.124:80/val3-26657","time":"2024-10-08T14:46:50Z","message":"Creating HTTP client for node"}
test-e2e2024/10/08 14:46:51 Upgrading node node 4 version v1.8.0 
2024/10/08 14:48:10 failed to wait for height: post failed: Post "http://151.115.12.124:80/val3-26657": context deadline exceeded
exit status 1
make: *** [Makefile:147: test-e2e] Error 1
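
If the retracted v1.8.0 really is the cause here, the fix could be as small as filtering the version list before the upgrade loop. A rough sketch of that idea follows; `filterRetracted` and the `retracted` set are hypothetical helpers, and the input list is illustrative rather than the real release list.

```go
package main

import "fmt"

// retracted lists versions the test should never upgrade to.
// v1.8.0 comes from the issue text; the set itself is a hypothetical helper.
var retracted = map[string]bool{"v1.8.0": true}

// filterRetracted returns versions in their original order, minus retracted ones.
func filterRetracted(versions []string, retracted map[string]bool) []string {
	kept := make([]string, 0, len(versions))
	for _, v := range versions {
		if retracted[v] {
			continue // skip versions that were pulled after release
		}
		kept = append(kept, v)
	}
	return kept
}

func main() {
	// Illustrative input only, not the actual list the test builds.
	versions := []string{"v1.5.0", "v1.8.0", "v1.9.0"}
	fmt.Println(filterRetracted(versions, retracted)) // prints [v1.5.0 v1.9.0]
}
```

An explicit skip list keeps the decision visible in the test code, at the cost of needing an update whenever another release is retracted.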

6)

MinorVersionCompatibility
2024/10/08 14:48:10 failed to wait for height: post failed: Post "http://151.115.12.124:80/val3-26657": context deadline exceeded
exit status 1

7)

No tests were able to be started

time="2024-10-02T14:38:45Z" level=info msg="Pod statuses" file="k8s/pod_status.go:100" pod_statuses="Pending: 4 , Running: 6 "
time="2024-10-02T14:39:45Z" level=warning msg="Pods pending for too long" file="k8s/pod_status.go:99" pending_pods="knuu-preloader-5b27e367-mgl9f, knuu-preloader-5b27e367-sc8ht, knuu-preloader-5b27e367-vvgwk, val3-6664ee73-cwqkt"
time="2024-10-02T14:39:45Z" level=info msg="Pod statuses" file="k8s/pod_status.go:100" pod_statuses="Running: 6 , Pending: 4 "
2024/10/02 14:39:55 failed to upgrade node: error waiting for instance to be running: error checking if instance 'val3-6664ee73' is running: failed to get pod val3-6664ee73: replicasets.apps "val3-6664ee73" not found
exit status 1

8)

New state sync test 
2024/11/03 18:22:12 failed to get header: error in json rpc client, with http response metadata: (Status: 503 Service Unavailable, Protocol HTTP/1.1). error unmarshalling: invalid character 'o' in literal null (expecting 'u')
exit status 1