Improve test reliability

area commented 1 month ago

After the merge of #1254, we got a big improvement in the time our tests take to run, but that improvement came at a cost of the suite as a whole becoming unreliable, as race conditions that could previously be ignored reared their ugly head. This PR fixes two tests, and seems to get us back to stability - four out of five builds on Circle of this branch were green, with the fifth failing due to a 502 HTTP error from Github during the build (see those builds here), which hardly seems like something we can blame ourselves for.

Two tests were touched here. The first was the test around the oracle syncing with the reputation state correctly. With a faster hardhat, sometimes the oracleUpdated promise resolved at the first check, not hitting the else branch in the promise to forwardTime(1, this). In that scenario, without that block being mined, the reputation miner client was still waiting for another block before updating, and so the oracle ended up ahead of the miner and the states did not match at the end of the test.

The second test touched here was the classic "Losing a race shouldn't prevent a miner from continuing". I'm less certain about what was happening here, but I think it was to do with the miner still processing when a mineBlock was called, and then there not being another block to trigger the updates. This is the sort of thing that should be handled by endDoBlockChecks calling doBlockChecks when another block has been seen, however, so not entirely sure what to make of this. Perhaps I've stumbled on to a correct solution for whatever the 'real' underlying problem is; I'd be interested in some insight on that front.

area commented 1 month ago

Jumped the gun here, back to draft for now.

area commented 1 month ago

Still not done, but while I remember:

When a miner is working out whether it can confirm a hash or otherwise respond to a dispute, it's possible for the cycle to be confirmed 'out from under it' - in that scenario, a call to getMySubmissionRoundAndIndex would return with [-1, -1], which are not valid arguments to various other functions expecting uint256. As it was, the miner would crash; I've changed it so it will account for that possibility, which is entirely possible.
ECONNRESET can happen if there's a network hiccup even with a good endpoint, so the miner should not crash in such a scenario and continue mining. Even if this happens during e.g. syncing, we should be able to recover. I've also mitigated against this in the tests by making the default provider a RetryProvider, which I've refactored out so it can be used more freely.
We were using the wrong stage in the require for confirmJustificationRootHash in the contracts, whereas the miningclients were using the correct one.

area commented 1 month ago

screenshot-2024-07-08_09-36-17

Each of those jobs represents 13 runs of the client-auto-functionality tests, which were the only ones that were consistently failing. This collection of changes demonstrably represents a massive improvement in test reliability compared to the current develop, but even develop prior to #1254 being merged:

Previous develop: ~15% failure rate based on last 20 builds on develop before we merged #1254 Current develop: ~60% error rate based on 5 builds on develop and hardhat20 branch (large error bars) This branch: ~1% failure rate based on 643 builds(!) (i.e. more than in that screenshot).

In addition to the changes described above, there are the following:

If an error occurred in doBlockChecks before repCycle had been set, the error handler would fail, as it was trying to access .address on undefined when attempting to get the code at the repCycle address (NB that this behaviour is now outdated with the selfdestruct changes, and should be updated).
Updated the 'forward to' timestamps in the dispute resolution test; if we go to a timestamp that's the exact start of a window, nothing is allowed to respond; I think in the past, when we were slower, the subsequent mineBlocks ended up causing the timestamp to tick forward anyway, but that no longer seems to be the case, consistently.
Correctly deal with removing listeners in the auto-functionality tests; removeListener on an event is an async function, and so must be awaited, and I also added some additional cleanup where other listeners were being used.
Reworked the logic in the test to check that we don't invalidate our own submission; it definitely had some at least suboptimal logic (await forwardTimeTo(parseInt(goodEntry.lastResponseTimestamp, 10)); being a especially large offender).
Wait for ReputationMinerClient2 to finish what it's doing in losing-race-continue test before killing it, so we do so cleanly
Use the Ethers reputation mining cycle everywhere where I can in those tests.

The last point is related to some investigation that I did while trying to squeeze out that last 1% of flakiness, but the rest of the changes I explored aren't included here. That last 1% (aside from random connectivity issues) seems to be due to ECONNRESET and other side closed errors, which are all coming out of the same package.

pnpm why web3-core-requestmanager says we have that package because of @truffle/contract and @nomiclabs/hardhat-truffle5, so it's possible that 'truly' moving to hardhat would solve this problem, and/or moving to Ethers contracts everywhere. It's also entirely possible that the fault lies with hardhat node; there are definitely issues with undici outstanding (see https://github.com/NomicFoundation/hardhat/issues/3136), which is where other side closed is coming from. In either case, I feel like they are outside the scope of what I want to get in to with this PR, and I haven't figured out why it's only these tests that seem to cause them.

The biggest thing I tried to solve this, not included here, was using bun as the runner for these tests, but I still got the issues. But it's good to know that it works, at least, if something incredibly bad crops up in Node?

Unrelated to spottiness, this PR also includes:

It would seem that after an image update on Circle's side, our security job fails. I've fixed it by adding a sudo to the command to install pnpm, though obviously not wildly happy about that solution.
We had hardhat-waffle installed for... no reason? So I've removed it. Same with hardhat-web3.

JoinColony / colonyNetwork

Improve test reliability #1273