CI: `blockchain::esplora::bdk_blockchain_tests::test_sync_stop_gap_20` occationally fails

evanlinjin commented 2 years ago

Describe the bug

CI: blockchain::esplora::bdk_blockchain_tests::test_sync_stop_gap_20 occationally fails.

The command that fails

$ cargo test --no-default-features --features test-esplora,use-esplora-ureq,verify esplora::bdk_blockchain_tests

Occurences

https://github.com/bitcoindevkit/bdk/runs/7163681328?check_suite_focus=true

rajarshimaitra commented 2 years ago

I have been witnessing few occasional failures in blockchain tests in my local too.. Seems like some kind of resource blocking thingy happening somewhere?? I was seeing "resource unavailable" errors few days ago when running blockchian tests all together.. Which I can't seem to reproduce anymore.. Could be something related??

From the error message it seems something got timed out while waiting for some stuffs..

evanlinjin commented 2 years ago

I have been witnessing few occasional failures in blockchain tests in my local too.. Seems like some kind of resource blocking thingy happening somewhere?? I was seeing "resource unavailable" errors few days ago when running blockchian tests all together.. Which I can't seem to reproduce anymore.. Could be something related??

From the error message it seems something got timed out while waiting for some stuffs..

Same here. Would it make sense to implement some retry logic/exponential backoff?

We can keep track of the errors and where they occur here.

rajarshimaitra commented 2 years ago

Same here. Would it make sense to implement some retry logic/exponential backoff?

We do have an exponential pole at electrum calls https://github.com/bitcoindevkit/bdk/blob/063d51fd7587815c733f24365a4de02fed072903/src/testutils/blockchain_tests.rs#L345-L359

To me it seems for some reason the electrum calls isn't returning and some resources are not being free??

The error seems to disappear in single test runs.. Only when I run all the blockchian tests together..

afilini commented 2 years ago

This looks like a bug in electrsd:

[2022-07-02T18:30:55Z ERROR electrsd] early exit with: ExitStatus(unix_wait_status(134))

Maybe it doesn't handle too much load well. I'll see if I can somehow attach a debugger and reproduce.

evanlinjin commented 2 years ago

This is a one-off failure, not sure if it is related.

failures:

---- blockchain::esplora::bdk_blockchain_tests::test_double_spend stdout ----
thread 'blockchain::esplora::bdk_blockchain_tests::test_double_spend' panicked at 'called `Result::unwrap()` on an `Err` value: JSON(Error("expected value", line: 1, column: 1))', src/testutils/blockchain_tests.rs:65:56
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

failures:
    blockchain::esplora::bdk_blockchain_tests::test_double_spend

afilini commented 2 years ago

Maybe it's related to this? https://github.com/bitcoindevkit/rust-electrum-client/issues/67

evanlinjin commented 2 years ago

failures:

---- blockchain::esplora::bdk_blockchain_tests::test_sync_address_index_should_not_decrement stdout ----
thread 'blockchain::esplora::bdk_blockchain_tests::test_sync_address_index_should_not_decrement' panicked at 'called `Result::unwrap()` on an `Err` value: EarlyExit(ExitStatus(unix_wait_status(134)))', src/testutils/blockchain_tests.rs:36:75
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

failures:
    blockchain::esplora::bdk_blockchain_tests::test_sync_address_index_should_not_decrement

test result: FAILED. 256 passed; 1 failed; 5 ignored; 0 measured; 0 filtered out; finished in 6.41s

evanlinjin commented 2 years ago

 failures:
error: test failed, to rerun pass '--lib'
---- blockchain::esplora::bdk_blockchain_tests::test_tx_chain stdout ----
thread 'blockchain::esplora::bdk_blockchain_tests::test_tx_chain' panicked at 'called `Result::unwrap()` on an `Err` value: EarlyExit(ExitStatus(unix_wait_status(134)))', src/testutils/blockchain_tests.rs:36:75
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
failures:
    blockchain::esplora::bdk_blockchain_tests::test_tx_chain

evanlinjin commented 2 years ago

---- blockchain::electrum::test::bdk_blockchain_tests::test_sync_bump_fee_basic stdout ----
thread 'blockchain::electrum::test::bdk_blockchain_tests::test_sync_bump_fee_basic' panicked at 'called `Result::unwrap()` on an `Err` value: IOError(Os { code: 104, kind: ConnectionReset, message: "Connection reset by peer" })', src/testutils/blockchain_tests.rs:65:56
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

evanlinjin commented 2 years ago

This one is the craziest yet: https://gist.github.com/evanlinjin/3079251115e64edc355f46d0d63edb1e

rajarshimaitra commented 2 years ago

Yes.. I have been seeing this one in my local..

evanlinjin commented 2 years ago

Yes.. I have been seeing this one in my local..

I mean if the error states "Resource TEMPORARILY unavailable", I think we can safely wrap the calls with exponential back-off logic. I did see a function in the tests for exponential back-off, so maybe we can reuse that.

afilini commented 2 years ago

My guess: the tcp send buffer gets filled up because core can't process all those requests at once. Increasing the number of threads would probably help.

Source https://linux.die.net/man/2/send:

When the message does not fit into the send buffer of the socket, send() normally blocks, unless the socket has been placed in nonblocking I/O mode. In nonblocking mode it would fail with the error EAGAIN or EWOULDBLOCK in this case

notmandatory commented 2 years ago

I ran into this issue testing bitcoindevkit/bdk-cli#102 with the new regtest-bitcoin feature which starts up a local bitcoind and connects to it via the RPC client, but the error only seemed to happen on my MacOS system, the CI tests all pass on Ubuntu Linux. While debugging I noticed in the bitcoind logs a list of RPC requests to "Update existing descriptor: raw(...)" seems to be the default 100 address look ahead, so I think the requests are all getting successfully delivered but the RPC client connection disconnects always after 5 seconds and throws the SocketError "Resource temporarily unavailable". I'll try making the bitcoind server responded faster (as @afilini suggested with more threads), and separately see if I can make the RPC client timeout longer and report back.

notmandatory commented 2 years ago

I re-tried my test with more rpcthreads set on bitcoind with conf set to via "-rpcthread=", I tired 20 and 50, didn't seem to make a difference. I haven't find a way yet to change the RPC call socket timeout.

notmandatory commented 2 years ago

Probably best to disregard my above RPC problems since the title of this issue is really about esplora. I'll open a new issue for my RPC macos problem.

notmandatory commented 1 year ago

Could this issue be related to the default number for file descriptors running out? This is mentioned for core here: JSON-RPC-interface limitations.

rajarshimaitra commented 1 year ago

As per https://docs.oracle.com/cd/E19476-01/821-0505/file-descriptor-requirements.html, the default limit for file descriptors per process is 1024. I am not fully sure, but I think we are not hitting that many RPC requests.

Another pointer from LDK team was, it should just redo the call again if it hits the error. https://github.com/lightningdevkit/rust-lightning/pull/2033#issuecomment-1432018010

I will try with this on the new RPC example and report back..

bitcoindevkit / bdk

CI: `blockchain::esplora::bdk_blockchain_tests::test_sync_stop_gap_20` occationally fails #650