codex-storage / nim-codex

Decentralized Durability Engine
https://codex.storage
Apache License 2.0
69 stars 25 forks source link

limit amount of slot queue workers to 1 #988

Open markspanbroek opened 2 weeks ago

markspanbroek commented 2 weeks ago

reason: proof generation does not happen in separate threads yet, so to avoid taking too long to provide initial storage proofs we only pick up one slot at a time

emizzle commented 2 weeks ago

Another option to speed up tests: pass number of workers as a cli param (3 for tests, 1 for testnet). More work of course...

markspanbroek commented 2 weeks ago

Another option to speed up tests: pass number of workers as a cli param. More work of course...

This is just a temporary fix. We should increase the number of workers as soon as we have threading for creating proofs.

markspanbroek commented 2 weeks ago

I can't seem to get the integration tests to run stably. I fixed all the obvious cases where things could go wrong, and still they fail in unexpected places (while submitting a request, while running unit tests). I guess we really need to fix the issue with poll in nim-ethers before we can continue with this PR :frowning_face:

AuHau commented 1 week ago

I guess we really need to fix the issue with poll in nim-ethers before we can continue with this PR ☹️

Hmm now I am not sure which one exactly? The one @emizzle found related to Future being cancelled?

markspanbroek commented 1 week ago

Hmm now I am not sure which one exactly? The one @emizzle found related to Future being cancelled?

Yes, I think so. It seems that subscription handlers not being called is causing the unpredictable test failures. When using only websockets the tests actually work quite reliably up to the point that a test takes longer than 5 minutes. When using http and polling the tests fail at random points. My suspicion is that this is caused by what Eric noticed; that an exception is being raised in the poll function, that silently stops the polling.

AuHau commented 1 week ago

I see, there is the https://github.com/codex-storage/nim-codex/pull/985 which might show the error. It could be worth rebase this PR on top of that to test your hypothesis. It is unfortunately not finished, but it should be working, there are just some polishing needed in order to merge it. I will try to finish those tomorrow, I won't make it today unfortunately.

emizzle commented 1 week ago

So I've got somewhere on a solution for this, but it's not complete. The first part of the solution is to use the then util to attach callbacks to poll, allowing us to catch exceptions and then propagate them to the synchronous closure.

The second part, which I'm not sure about yet, is understanding whether or not we need to reconnect the RpcHttpClient when this exception is raised (this particular exception only or all exceptions)?

The third part is trying to understand if we need this reconnection logic from part 2 in any other areas of ethers (including websockets).

emizzle commented 1 week ago

@markspanbroek here's my work on keeping the polling loop alive: https://github.com/codex-storage/nim-ethers/tree/fix/keep-polling-alive.

NOTES

  1. We cannot consume futures like we were without asyncSpawning or awaiting them. This is where the .then/.catch comes in: it's effectively an asyncSpawn but does not raise a Defect and instead exposes the exceptions.
  2. We only restart the polling loop when an RpcPostError is encountered
  3. We raise a Defect on other errors, but i'm not so sure this is the right approach. We might need to handle all possible errors differently.
  4. There may be other situations in ethers where this happens?
  5. The then.nim addition was taken from codex and was simplified a bit. It needs to be pulled into its own lib and used in both codex and ethers. The version in this branch is the most recent and will be used for the lib.
markspanbroek commented 1 week ago

here's my work on keeping the polling loop alive

Thanks @emizzle, I think I've now got a version of http polling that works based on these ideas. I also found 4 other bugs that made the tests unstable.

I'll try to get PRs for all of this in tomorrow.

markspanbroek commented 1 week ago

On Linux and MacOS the tests run stable now. The Windows tests are running so slow that they are even failing on the fairly simple contract tests that do not involve proofs.

I'm installing a Windows VM to see if I can reproduce this on my machine.

markspanbroek commented 1 week ago

Moved most of the fixes into #993, and rebased this PR on top of it.

I'm not sure whether it still makes sense to limit the amount of slot queue workers now that I've done some measurements on proof generation. It doesn't take that much time (~3 seconds).