JoinColony / colonyNetwork

Colony Network smart contracts
https://colony.io/
GNU General Public License v3.0
438 stars 106 forks source link

Improve test reliability #1273

Closed area closed 1 month ago

area commented 1 month ago

After the merge of #1254, we got a big improvement in the time our tests take to run, but that improvement came at a cost of the suite as a whole becoming unreliable, as race conditions that could previously be ignored reared their ugly head. This PR fixes two tests, and seems to get us back to stability - four out of five builds on Circle of this branch were green, with the fifth failing due to a 502 HTTP error from Github during the build (see those builds here), which hardly seems like something we can blame ourselves for.

Two tests were touched here. The first was the test around the oracle syncing with the reputation state correctly. With a faster hardhat, sometimes the oracleUpdated promise resolved at the first check, not hitting the else branch in the promise to forwardTime(1, this). In that scenario, without that block being mined, the reputation miner client was still waiting for another block before updating, and so the oracle ended up ahead of the miner and the states did not match at the end of the test.

The second test touched here was the classic "Losing a race shouldn't prevent a miner from continuing". I'm less certain about what was happening here, but I think it was to do with the miner still processing when a mineBlock was called, and then there not being another block to trigger the updates. This is the sort of thing that should be handled by endDoBlockChecks calling doBlockChecks when another block has been seen, however, so not entirely sure what to make of this. Perhaps I've stumbled on to a correct solution for whatever the 'real' underlying problem is; I'd be interested in some insight on that front.

area commented 1 month ago

Jumped the gun here, back to draft for now.

area commented 1 month ago

Still not done, but while I remember:

area commented 1 month ago

screenshot-2024-07-08_09-36-17

Each of those jobs represents 13 runs of the client-auto-functionality tests, which were the only ones that were consistently failing. This collection of changes demonstrably represents a massive improvement in test reliability compared to the current develop, but even develop prior to #1254 being merged:

Previous develop: ~15% failure rate based on last 20 builds on develop before we merged #1254 Current develop: ~60% error rate based on 5 builds on develop and hardhat20 branch (large error bars) This branch: ~1% failure rate based on 643 builds(!) (i.e. more than in that screenshot).

In addition to the changes described above, there are the following:

The last point is related to some investigation that I did while trying to squeeze out that last 1% of flakiness, but the rest of the changes I explored aren't included here. That last 1% (aside from random connectivity issues) seems to be due to ECONNRESET and other side closed errors, which are all coming out of the same package.

pnpm why web3-core-requestmanager says we have that package because of @truffle/contract and @nomiclabs/hardhat-truffle5, so it's possible that 'truly' moving to hardhat would solve this problem, and/or moving to Ethers contracts everywhere. It's also entirely possible that the fault lies with hardhat node; there are definitely issues with undici outstanding (see https://github.com/NomicFoundation/hardhat/issues/3136), which is where other side closed is coming from. In either case, I feel like they are outside the scope of what I want to get in to with this PR, and I haven't figured out why it's only these tests that seem to cause them.

The biggest thing I tried to solve this, not included here, was using bun as the runner for these tests, but I still got the issues. But it's good to know that it works, at least, if something incredibly bad crops up in Node?

Unrelated to spottiness, this PR also includes: