Open markspanbroek opened 2 weeks ago
Another option to speed up tests: pass number of workers as a cli param (3 for tests, 1 for testnet). More work of course...
Another option to speed up tests: pass number of workers as a cli param. More work of course...
This is just a temporary fix. We should increase the number of workers as soon as we have threading for creating proofs.
I can't seem to get the integration tests to run stably. I fixed all the obvious cases where things could go wrong, and still they fail in unexpected places (while submitting a request, while running unit tests). I guess we really need to fix the issue with poll
in nim-ethers before we can continue with this PR :frowning_face:
I guess we really need to fix the issue with poll in nim-ethers before we can continue with this PR ☹️
Hmm now I am not sure which one exactly? The one @emizzle found related to Future being cancelled?
Hmm now I am not sure which one exactly? The one @emizzle found related to Future being cancelled?
Yes, I think so. It seems that subscription handlers not being called is causing the unpredictable test failures. When using only websockets the tests actually work quite reliably up to the point that a test takes longer than 5 minutes. When using http and polling the tests fail at random points. My suspicion is that this is caused by what Eric noticed; that an exception is being raised in the poll
function, that silently stops the polling.
I see, there is the https://github.com/codex-storage/nim-codex/pull/985 which might show the error. It could be worth rebase this PR on top of that to test your hypothesis. It is unfortunately not finished, but it should be working, there are just some polishing needed in order to merge it. I will try to finish those tomorrow, I won't make it today unfortunately.
So I've got somewhere on a solution for this, but it's not complete. The first part of the solution is to use the then
util to attach callbacks to poll
, allowing us to catch exceptions and then propagate them to the synchronous closure.
The second part, which I'm not sure about yet, is understanding whether or not we need to reconnect the RpcHttpClient
when this exception is raised (this particular exception only or all exceptions)?
The third part is trying to understand if we need this reconnection logic from part 2 in any other areas of ethers (including websockets).
@markspanbroek here's my work on keeping the polling loop alive: https://github.com/codex-storage/nim-ethers/tree/fix/keep-polling-alive.
asyncSpawn
ing or await
ing them. This is where the .then/.catch
comes in: it's effectively an asyncSpawn
but does not raise a Defect
and instead exposes the exceptions.RpcPostError
is encounteredDefect
on other errors, but i'm not so sure this is the right approach. We might need to handle all possible errors differently.then.nim
addition was taken from codex and was simplified a bit. It needs to be pulled into its own lib and used in both codex and ethers. The version in this branch is the most recent and will be used for the lib.here's my work on keeping the polling loop alive
Thanks @emizzle, I think I've now got a version of http polling that works based on these ideas. I also found 4 other bugs that made the tests unstable.
I'll try to get PRs for all of this in tomorrow.
On Linux and MacOS the tests run stable now. The Windows tests are running so slow that they are even failing on the fairly simple contract tests that do not involve proofs.
I'm installing a Windows VM to see if I can reproduce this on my machine.
Moved most of the fixes into #993, and rebased this PR on top of it.
I'm not sure whether it still makes sense to limit the amount of slot queue workers now that I've done some measurements on proof generation. It doesn't take that much time (~3 seconds).
reason: proof generation does not happen in separate threads yet, so to avoid taking too long to provide initial storage proofs we only pick up one slot at a time