fedimint / fedimint

Federated E-Cash Mint
https://fedimint.org/
MIT License
536 stars 209 forks source link

Timing tweaks in shell scripts #838

Closed okjodom closed 1 year ago

okjodom commented 1 year ago

Starting the gateway CLN plugin runs the gateway. On run, the gateway attempts to register with a federation. If the federation is not ready, the gateway registration call fails, which in turn would fail the CLN plugin start.

The recent change to encrypt configs at rest (#808) marginally slows down the federation start, which then leads into the situation described above. As a temporary fix, this PR tweaks our sleep timers in the shell scripts we use to orchestrate federation and gateway start by:

Follow Ups:

dpc commented 1 year ago

The negative part about fixed delays is that they slow everything down.

Would adding some bounded retries to the code doing connection be a possibility? I think we are going to need them anyway.

okjodom commented 1 year ago

Would adding some bounded retries to the code doing connection be a possibility? I think we are going to need them anyway.

Making an attempt now. It's true we'd need a retry/backoff on this registration even when we move it out of the critical path

dpc commented 1 year ago

Making an attempt now. It's true we'd need a retry/backoff on this registration even when we move it out of the critical path

In my experience most of the time a system needs a highest level retry mechanism (https://en.wikipedia.org/wiki/End-to-end_principle) for correctness and lowest-level retry mechanism for performance and practical reasons (tolerating short disruptions without causing the whole process to fail and be retried at the high level).

BTW. the original paper is short and a fun read https://web.mit.edu/Saltzer/www/publications/endtoend/endtoend.pdf

okjodom commented 1 year ago

making an attempt now

Some quick lessons:

@dpc, I wan't to defer implementation of retry for now, and do a follow up later. Do it proper Fow now, I'll jump back to other high pri WIs, if we can take a temporary hit from these timing tweaks to slow things down a bit

dpc commented 1 year ago

The @okjodom the simplest retry of a future is a loop in a timeout:

https://github.com/fedimint/fedimint/pull/834/files#diff-518eaa39776f4437cb64e868e4a31b14a31449a3e6e864e2e4048b8b8a591454R456

This link seems to not work very well:

    pub async fn await_consensus_block_height(
        &self,
        block_height: u64,
    ) -> std::result::Result<u64, task::Elapsed> {
        task::timeout(Duration::from_secs(30), async {
            self.await_consensus_block_height_inner(block_height).await
        })
        .await
    }

    async fn await_consensus_block_height_inner(&self, block_height: u64) -> u64 {
        loop {
            match self.context.api.fetch_consensus_block_height().await {
                Ok(height) if height >= block_height => return height,
jkitman commented 1 year ago

It's true we'd need a retry/backoff on this registration even when we move it out of the critical path

@okjodom Once it's out of the start-up path, we might want it to fail fast if the fed isn't available (since we should expect it to be). We could potentially call await_block_sync in the shell instead of sleeping to ensure the fed has started successfully.

okjodom commented 1 year ago

we might want it to fail fast if the fed isn't available (since we should expect it to be).

Interesting. A register call that fails might indicate to the Gateway that the fed is not a reliable business partner. On the other hand, the Gateway probably has done some 'checks' before-hand, and is now convinced of the Feds reliability, such that it willing to retry the registration.

okjodom commented 1 year ago

We could potentially call await_block_sync in the shell instead of sleeping to ensure the fed has started successfully.

Seems like a big hack in this case. I see the usage of await_block_sync and all those seem to be in places where we care about the blockchain state. The gateway should not dictate any other kind of federation state (other than 'aliveness') before it can register itself with with the federation.

dpc commented 1 year ago

(since we should expect it to be).

Networks are always unreliable. Mobile users switch from wifi to GSM. A 10 or 15s retry&timeout seems a good default in almost all circumstances.

okjodom commented 1 year ago

@okjodom the simplest retry of a future is a loop in a timeout: https://github.com/fedimint/fedimint/pull/834/files#diff-518eaa39776f4437cb64e868e4a31b14a31449a3e6e864e2e4048b8b8a591454R456

Good pointer, though I couldn't find a way of declaring my timeout for the registration 'task'. As an aside, I like the task based pattern proposed in that PR.

@dpc , I went ahead and implemented a generic retry utility. However, I'm still not able to rely on it to resolve my gateway registration task because I haven't found a pairing for wait Duration and max_retry count that ensures the gateway registration task finds a ready federation. I suspect it's because CLN plugin will crap out on any std write (different problem if true) but really challenging to debug.

Some that I will look at later:

# Successful CLN start + GW plugin start, BUT with unsuccessful registration to federation because fed is not ready

$ lightningd --dev-fast-gossip --dev-bitcoind-poll=1 --network regtest --bitcoin-rpcuser=bitcoin --bitcoin-rpcpassword=bitcoin --lightning-dir=$FM_LN1_DIR --addr=127.0.0.1:9000 --plugin=$FM_BIN_DIR/ln_gateway --fedimint-cfg=$FM_CFG_DIR
2022-10-25T01:21:59.099Z INFO    lightningd: Creating configuration directory /tmp/nix-shell.bw0kZE/tmp.dNfCDkRPe8/ln1/regtest
2022-10-25T01:21:59.455Z INFO    lightningd: Creating database
2022-10-25T01:21:59.551Z UNUSUAL hsmd: HSM: created new hsm_secret file
2022-10-25T01:21:59.668Z INFO    plugin-bcli: bitcoin-cli initialized and connected to bitcoind.
2022-10-25T01:21:59.671Z UNUSUAL lightningd: Waiting for initial block download (this can take a while!)
2022-10-25T01:21:59.715Z INFO    lightningd: --------------------------------------------------
2022-10-25T01:21:59.715Z INFO    lightningd: Server started with public key 03b453b004c698a1800b045ba9bf367e47346c9f615fe6a06b7f54a260588a2981, alias SOMBERSPORK-v0.11.1 (color #03b453) and lightningd v0.11.1
2022-10-25T01:21:59.907Z INFO    plugin-ln_gateway: request; peer=0
2022-10-25T01:21:59.908Z INFO    plugin-ln_gateway: request; peer=1
thread 'main' panicked at 'Failed to register with federation: MintApiError(RpcError(Transport(Error when opening the TCP socket: Connection refused (os error 111))))', /home/okj/fedimint/fedimint/gateway/ln-gateway/src/lib.rs:400:22
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
2022-10-25T01:21:59.909Z INFO    plugin-ln_gateway: request; peer=2
2022-10-25T01:21:59.909Z INFO    plugin-ln_gateway: request; peer=3
2022-10-25T01:21:59.910Z **BROKEN** plugin-ln_gateway: unable to connect to server err=Networking or low-level protocol error: Error when opening the TCP socket: Connection refused (os error 111)
2022-10-25T01:21:59.911Z **BROKEN** plugin-ln_gateway: unable to connect to server err=Networking or low-level protocol error: Error when opening the TCP socket: Connection refused (os error 111)
2022-10-25T01:21:59.912Z **BROKEN** plugin-ln_gateway: unable to connect to server err=Networking or low-level protocol error: Error when opening the TCP socket: Connection refused (os error 111)
2022-10-25T01:22:04.392Z UNUSUAL lightningd: Bitcoin backend now synced.
2022-10-25T01:22:04.874Z INFO    lightningd: Peer says it sees our address as: 127.0.0.1:53466
2022-10-25T01:22:06.403Z INFO    03c2079f5edc33e62087be02b06da0950f5a83407f5b7afb8abcdc10d44489a897-chan#1: State changed from CHANNELD_AWAITING_LOCKIN to CHANNELD_NORMAL
# Successful CLN start + GW plugin start, and registration to federation

$ lightningd --dev-fast-gossip --dev-bitcoind-poll=1 --network regtest --bitcoin-rpcuser=bitcoin --bitcoin-rpcpassword=bitcoin --lightning-dir=$FM_LN1_DIR --addr=127.0.0.1:9000 --plugin=$FM_BIN_DIR/ln_gateway --fedimint-cfg=$FM_CFG_DIR
2022-10-25T01:04:34.490Z INFO    lightningd: Creating configuration directory /tmp/nix-shell.bw0kZE/tmp.OBqQiOwvoN/ln1/regtest
2022-10-25T01:04:34.858Z INFO    lightningd: Creating database
2022-10-25T01:04:35.038Z UNUSUAL hsmd: HSM: created new hsm_secret file
2022-10-25T01:04:35.426Z INFO    plugin-bcli: bitcoin-cli initialized and connected to bitcoind.
2022-10-25T01:04:35.483Z INFO    lightningd: --------------------------------------------------
2022-10-25T01:04:35.483Z INFO    lightningd: Server started with public key 02f32fcc9e50d220e36019a8cf3d59258d54c0d100d466fb81ab6749e018adbf94, alias REDPENGUIN-v0.11.1 (color #02f32f) and lightningd v0.11.1
2022-10-25T01:04:35.740Z INFO    plugin-ln_gateway: request; peer=0
2022-10-25T01:04:35.740Z INFO    plugin-ln_gateway: request; peer=1
2022-10-25T01:04:35.740Z INFO    plugin-ln_gateway: request; peer=2
2022-10-25T01:04:35.740Z INFO    plugin-ln_gateway: request; peer=3
2022-10-25T01:04:35.740Z INFO    plugin-ln_gateway: Connection established to target: Target { sockaddrs: [], host: \"127.0.0.1\", host_header: \"127.0.0.1:4021\", _mode: Plain, path_and_query: \"/\" }
2022-10-25T01:04:35.740Z INFO    plugin-ln_gateway: Connection established to target: Target { sockaddrs: [], host: \"127.0.0.1\", host_header: \"127.0.0.1:4031\", _mode: Plain, path_and_query: \"/\" }
2022-10-25T01:04:35.740Z INFO    plugin-ln_gateway: Connection established to target: Target { sockaddrs: [], host: \"127.0.0.1\", host_header: \"127.0.0.1:4001\", _mode: Plain, path_and_query: \"/\" }
2022-10-25T01:04:35.740Z INFO    plugin-ln_gateway: Connection established to target: Target { sockaddrs: [], host: \"127.0.0.1\", host_header: \"127.0.0.1:4011\", _mode: Plain, path_and_query: \"/\" }
2022-10-25T01:04:45.132Z INFO    lightningd: Peer says it sees our address as: 127.0.0.1:59024
2022-10-25T01:04:47.446Z INFO    02c21e35641d0812046668cbd3355f17a7b71983856fcfbf051e5c7f16ce49c211-chan#1: State changed from CHANNELD_AWAITING_LOCKIN to CHANNELD_NORMAL
...
jkitman commented 1 year ago

Networks are always unreliable.

True, but that should be handled at the network layer. The client will establish a wss connection then wait for a success from a threshold of peers. If we cannot establish a connection to at least the threshold number of peers then likely the federation is down or the network is down. Better to give the gateway user fast feedback when they cannot register their gateway IMO.

Since we are dealing with a federation there's also different strategies we can choose depending on the request. See https://github.com/fedimint/fedimint/blob/master/client/client-lib/src/api.rs#L282-L289

We should eventually make it configurable for clients to pass in different strategies to the same API call depending on the use. That's how we could handle a retry in the future.

okjodom commented 1 year ago

@jkitman , @dpc , I have a WIP pr which attempts retries from within the gateways context. I'll catch up later on what direction we want to take with the retry pattern.

For now, I am proposing we proceed with the timing tweaks to unblock higher pri GW work items. Note: This PR slows down tmuxinator shell script by 15s

dpc commented 1 year ago

Engaged auto-merge, and restarted failing build.

codecov-commenter commented 1 year ago

Codecov Report

Base: 58.23% // Head: 58.22% // Decreases project coverage by -0.01% :warning:

Coverage data is based on head (ed8963d) compared to base (67b8510). Patch has no changes to coverable lines.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #838 +/- ## ========================================== - Coverage 58.23% 58.22% -0.02% ========================================== Files 101 101 Lines 16293 16293 ========================================== - Hits 9489 9487 -2 - Misses 6804 6806 +2 ``` | [Impacted Files](https://codecov.io/gh/fedimint/fedimint/pull/838?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=fedimint) | Coverage Δ | | |---|---|---| | [gateway/ln-gateway/src/lib.rs](https://codecov.io/gh/fedimint/fedimint/pull/838/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=fedimint#diff-Z2F0ZXdheS9sbi1nYXRld2F5L3NyYy9saWIucnM=) | `38.72% <ø> (ø)` | | | [fedimint-server/src/net/peers.rs](https://codecov.io/gh/fedimint/fedimint/pull/838/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=fedimint#diff-ZmVkaW1pbnQtc2VydmVyL3NyYy9uZXQvcGVlcnMucnM=) | `90.02% <0.00%> (-1.47%)` | :arrow_down: | | [fedimint-api/src/core.rs](https://codecov.io/gh/fedimint/fedimint/pull/838/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=fedimint#diff-ZmVkaW1pbnQtYXBpL3NyYy9jb3JlLnJz) | `72.45% <0.00%> (-0.28%)` | :arrow_down: | | [fedimint-server/src/net/connect.rs](https://codecov.io/gh/fedimint/fedimint/pull/838/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=fedimint#diff-ZmVkaW1pbnQtc2VydmVyL3NyYy9uZXQvY29ubmVjdC5ycw==) | `95.26% <0.00%> (+0.24%)` | :arrow_up: | | [client/client-lib/src/query.rs](https://codecov.io/gh/fedimint/fedimint/pull/838/diff?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=fedimint#diff-Y2xpZW50L2NsaWVudC1saWIvc3JjL3F1ZXJ5LnJz) | `94.73% <0.00%> (+3.15%)` | :arrow_up: | Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=fedimint). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=fedimint)

:umbrella: View full report at Codecov.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.