tank: Use healthcheck that relies on RPC

willcl-ark commented 7 months ago

Currently using the PID can still result in a race where the process has started but the RPC is not actually online.

I think this should fix some spurious onion test failures, like this one: https://github.com/bitcoin-dev-project/warnet/actions/runs/6516835067/job/17700750888?pr=93#step:6:2278

vercel[bot] commented 7 months ago

The latest updates on your projects. Learn more about Vercel for Git ↗︎

Name	Status	Preview	Comments	Updated (UTC)
warnet	✅ Ready (Inspect)	Visit Preview	💬 Add feedback	Oct 14, 2023 9:52am

willcl-ark commented 7 months ago

This does obviously put slightly more workload on the bitcoind process (handling RPC requests), perhaps I should also decrease the frequency? The thing I want to avoid is excessively slow startup times, but perhaps guaranteed correct startup is worth it...

pinheadmz commented 7 months ago

Code change makes sense -- who checks Tank health though? Nothing depends_on the tanks, right? Although wait_for_all_tanks_status() could probably use it in the tests since right now it just waits for status:running on the container.

willcl-ark commented 7 months ago

My thinking was that the addpeeraddress and things could wait for proper status, then we could get rid of the (dodgy as ****) exponential_backoff() decorator.

Not sure it's worth another monitoring RPC call though...

Personally I'd still like to rip out the exponential backoff decorator, or at least seriously reduce it's usage in the codebase.

pinheadmz commented 7 months ago

yeah actually in the tests we use wait_for... which isn't exponential but just retries up to a timeout. The tests also have two rpc call methods, one with retry and one without. I agree with you for adding network edges we could just use a 2-second retry on something like getblockcount (like bitcoin functional tests) and once that succeeds, proceed with whatever commands we expect to work

willcl-ark commented 7 months ago

I just feel like having an RPC call method decorated with an auto-retry (ever) is... quite the footgun

pinheadmz commented 7 months ago

well whatever wait_for_rpc_connection() does in bitcoin functional tests

https://github.com/bitcoin-dev-project/warnet/blob/8ee3a85242e644f91c4069e1ad7361146f86c108/src/test_framework/test_node.py#L225-L290

willcl-ark commented 7 months ago

well whatever wait_for_rpc_connection() does in bitcoin functional tests

That seems fine for scenarios.

This healthcheck is for docker daemon to poll the health of each container, and I'm unsure that having it call an rpc on n tanks every m seconds would be insignificant enough not to worry about... Is what I meant in second post above.

Perhaps if we set it to some high number like 60 seconds, it would low enough impact. But then it would give slow startup times, which is worse than current behaviour which just occasionally shows an RPC error in the logs and retrys.

From my previous research docker compose V2 had additional startup params for healthchecks, specifically a few options related to startup like --health-start-period (and a few more, seemingly undocumented now on docker website) to not wait the full period before performing the first healthcheck on startup.

Whereas V3, which we are using, didn't support those at all, which makes setting high values in healthcheck undesirable IMO.

Perhaps the decorator is simpler, and we should just close this out, eh?

bitcoin-dev-project / warnet

tank: Use healthcheck that relies on RPC #143