ChainSafe / lodestar

🌟 TypeScript Implementation of Ethereum Consensus
https://lodestar.chainsafe.io
Apache License 2.0

Flaky tests #6358

Open nazarhussain opened 9 months ago

nazarhussain commented 9 months ago

Describe the bug

There are some flaky tests that need to be fixed.

Expected behavior

All tests should be as stable as possible.

Steps to reproduce

Additional context

During CI runs on different PRs we found a few

Operating system

Linux

Lodestar version or commit hash

unstable

twoeths commented 9 months ago

A lot of the failed tests happen with useWorker=true. I think it has something to do with the fact that vitest also runs in a worker thread. For gossipsub, I suggest running only in main-thread mode for CI in https://github.com/ChainSafe/lodestar/pull/6368; we could do the same thing in the req/resp tests too.

I was previously able to stabilize the e2e tests in the "n-historical" state branch by using only useWorker=false.
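
One way to express the suggestion above is to derive the useWorker value from the environment, so that tests run on the main thread in CI and keep the worker elsewhere. This is only a sketch: `resolveUseWorker` is a hypothetical helper, not an existing Lodestar function, and it assumes the CI provider sets `CI=true` (GitHub Actions does).

```typescript
// Hypothetical helper: decides whether network tests should spawn a worker.
type Env = Record<string, string | undefined>;

function resolveUseWorker(env: Env): boolean {
  // Assumption: CI runners export CI=true.
  const isCi = env.CI === "true";
  // Main thread only in CI; keep the worker everywhere else.
  return !isCi;
}

// Example: the result would be passed as the useWorker option of the test's network setup.
const useWorker = resolveUseWorker({CI: "true"});
```

Taking the environment as a parameter, rather than reading it globally, keeps the helper itself easy to test.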

nazarhussain commented 9 months ago

A lot of the failed tests happen with useWorker=true. I think it has something to do with the fact that vitest also runs in a worker thread

For this reason, we already run e2e tests with forks, not threads: https://github.com/ChainSafe/lodestar/blob/d6a7a3982b3a0dea9abda3ed8cb6e459d8620c31/vitest.base.e2e.config.ts#L11
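
As a rough illustration of the forks-versus-threads choice, Vitest selects how test files are isolated through its `pool` option. The following is a minimal sketch, not the contents of the linked vitest.base.e2e.config.ts; the variable name `e2eConfig` and the `include` glob are assumptions made for illustration.

```typescript
// Minimal sketch of a Vitest config object; the name and glob are illustrative.
// In a real vitest.config.ts this object would be the default export,
// typically wrapped in defineConfig() from "vitest/config".
const e2eConfig = {
  test: {
    // Run test files in child processes (node:child_process)
    // instead of worker threads (node:worker_threads).
    pool: "forks" as const,
    // Hypothetical glob for e2e test files.
    include: ["**/test/e2e/**/*.test.ts"],
  },
};
```

With `pool: "forks"`, code under test that spawns its own worker threads no longer runs inside a Vitest worker thread.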

wemeetagain commented 9 months ago

Sometimes the sim tests still fail on unrelated changes:

jeluard commented 9 months ago

Not directly a sim test error, but somewhat related: Error: Failed to CreateArtifact: Received non-retryable error: Failed request: (409) Conflict: an artifact with this name already exists on the workflow run

nflaig commented 9 months ago

Not directly a sim test error, but somewhat related: Error: Failed to CreateArtifact: Received non-retryable error: Failed request: (409) Conflict: an artifact with this name already exists on the workflow run

This has been happening since we merged https://github.com/ChainSafe/lodestar/pull/6410; the CI failed on that PR as well.

nflaig commented 8 months ago

Sim merge tests have kept getting stuck and timing out after 6h since we merged https://github.com/ChainSafe/lodestar/pull/6344 (as I already noted in the PR); see recent runs on the unstable branch (1, 2).

nflaig commented 7 months ago

The browser tests keep failing for different reasons. This one looks like it might be a race condition (failed run): the file has a test suite defined, and it passes most of the time.

@lodestar/utils: ⎯⎯⎯⎯⎯⎯ Failed Suites 1 ⎯⎯⎯⎯⎯⎯⎯
@lodestar/utils: FAIL test/unit/err.test.ts [ test/unit/err.test.ts ]
@lodestar/utils: Error: No test suite found in file /home/runner/actions-runner/_work/lodestar/lodestar/packages/utils/test/unit/err.test.ts
@lodestar/utils: ⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯⎯[1/1]⎯

I would suggest we disable the browser tests for now, as they don't provide any value in this state and are just annoying to deal with.

nflaig commented 7 months ago

test/unit/utils/clock.test.ts > util / Clock > getCurrentSlot > 'should return next slot after 11.5s'

Interestingly, I have not seen that one in a while; maybe it is related to updating vitest? Vitest seems to have some internal issues with timings in general and fails to execute tests deterministically. I hope they can improve this in the future.
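
One common way to make a timing-sensitive check like this deterministic is to inject the time source instead of depending on real or faked timers. The sketch below is an assumption-laden illustration, not Lodestar's Clock implementation: `slotAt`, `TimeSource`, the 12-second slot, the 500 ms tolerance and the genesis timestamp are all hypothetical values chosen to mirror the "next slot after 11.5s" case.

```typescript
// Hypothetical helper, not Lodestar's Clock implementation.
// A TimeSource returns milliseconds since the Unix epoch.
type TimeSource = () => number;

function slotAt(
  genesisTimeSec: number,
  secondsPerSlot: number,
  now: TimeSource,
  toleranceMs = 0
): number {
  // Shift "now" forward by the tolerance so that a time just before a
  // slot boundary already counts as the next slot.
  const elapsedMs = now() + toleranceMs - genesisTimeSec * 1000;
  if (elapsedMs < 0) return 0; // before genesis
  return Math.floor(elapsedMs / (secondsPerSlot * 1000));
}

// Illustrative genesis time and a fixed time source at a given offset from it.
const genesisTimeSec = 1_600_000_000;
const at = (offsetMs: number): TimeSource => () => genesisTimeSec * 1000 + offsetMs;

// 11.5s after genesis with a 500ms tolerance is evaluated without any timers.
const slotAfter11_5s = slotAt(genesisTimeSec, 12, at(11_500), 500);
```

Because the time source is a plain function, the assertion does not depend on how the test runner schedules timers.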

wemeetagain commented 3 weeks ago

@nazarhussain is this still an issue?

nazarhussain commented 3 weeks ago

@wemeetagain Not all of them, but I have seen e2e: lightclient api > getOptimisticUpdate() fail sometimes.

I will need to review each case individually in past runs and check these.