Start9Labs / community-services

Do not display crash warnings on restart if reason for crash known #9

Open elvece opened 2 years ago

elvece commented 2 years ago

This issue is most apparent with the bitcoin stack. On Embassy restart, all services are started at once. Since bitcoind is not yet "ready", the services that depend on it (lnd/c-lightning/spark) crash.

A known solution is to update the configurator/manager/entrypoint with a looping function around the service binary to account for this bootup delay. This solution should be implemented for the following services, since they are known to crash in this situation:

P1 Tasks

Example solution with lnd:

In a while loop, query bitcoin-cli to see bitcoind's state, i.e. whether it is still verifying blocks. If bitcoind is not ready yet, stay in the loop, i.e. don't run the lnd binary yet. Keep looping up to some predetermined limit. If bitcoind is syncing / rolling forward, keep looping indefinitely. The health checks will report "starting" for this duration; decide on a point at which they should fail with a descriptive error message (e.g. "bitcoind is verifying blocks").
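
A minimal sketch of the loop described above, not the actual entrypoint code. The state names, `probe_bitcoind`, `wait_for_bitcoind`, and the default limits are all hypothetical. It assumes `bitcoin-cli getblockchaininfo` prints JSON with an `initialblockdownload` field on success, and that a warm-up failure mentions RPC code -28 on stderr (as in the lnd crash message below).

```python
import json
import subprocess
import time
from typing import Callable

# Illustrative state names; nothing here comes from the real services.
READY, WARMING_UP, SYNCING, UNREACHABLE = "ready", "warming_up", "syncing", "unreachable"


def probe_bitcoind(cli_args: list[str]) -> str:
    """Classify bitcoind's state from a single bitcoin-cli call."""
    proc = subprocess.run(
        ["bitcoin-cli", *cli_args, "getblockchaininfo"],
        capture_output=True,
        text=True,
    )
    if proc.returncode != 0:
        # Assumption: warm-up errors ("Verifying blocks...") mention code -28.
        return WARMING_UP if "-28" in proc.stderr else UNREACHABLE
    info = json.loads(proc.stdout)
    return SYNCING if info.get("initialblockdownload") else READY


def wait_for_bitcoind(
    probe: Callable[[], str],
    max_attempts: int = 60,
    delay: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Return once probe() reports READY.

    Warm-up and unreachable results count toward max_attempts;
    syncing results do not, so a syncing node is waited on indefinitely.
    """
    failures = 0
    while True:
        state = probe()
        if state == READY:
            return
        if state != SYNCING:
            failures += 1
            if failures >= max_attempts:
                raise TimeoutError(
                    f"bitcoind not ready after {failures} checks (last state: {state})"
                )
        sleep(delay)
```

The entrypoint would call `wait_for_bitcoind(lambda: probe_bitcoind([...]))` with its own RPC connection flags and only then start the lnd binary. The `TimeoutError` message is the place to surface a descriptive reason to the health check.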

Additional tasks (P2):

Example crash messages:

lnd - Service Crashed
The service lnd has crashed with the following exit code: 1 Details: unable to create chain control: -28: Verifying blocks...

spark-wallet - Service Crashed
The service spark-wallet has crashed with the following exit code: 1 Details: uncaughtException, stopping process Error: connect ECONNREFUSED /mnt/c-lightning/shared/lightning-rpc at PipeConnectWrap.afterConnect [as oncomplete] (node:net:1157:16) 

c-lightning - Service Crashed
The service c-lightning has crashed with the following exit code: 1 Details: Could not connect to bitcoind using bitcoin-cli. Is bitcoind running? Make sure you have bitcoind running and that bitcoin-cli is able to connect to bitcoind. You can verify that your Bitcoin Core installation is ready for use by running: $ bitcoin-cli -rpcconnect=btc-rpc-proxy.embassy -rpcport=8332 -rpcuser=... -rpcpassword=... echo 'hello world' The Bitcoin backend died. 

c-lightning - Service Crashed
... crashed with the following exit code: 134 Details:
Serving RPC on 0.0.0.0:8080
Generating a RSA private key
..................................+++++
.............+++++
writing new private key to './certs/key.tmp.pem'
-----
writing RSA key
bitcoin-cli -rpcconnect=btc-rpc-proxy.embassy -rpcport=8332 -rpcuser=... -rpcpassword=... getblockhash 719503 exited 1 (after 1 other errors) 'error: Could not connect to the server btc-rpc-proxy.embassy:8332 (error code 0 - "timeout reached")
Make sure the bitcoind server is running and that you are connecting to the correct RPC port.
'; we have been retrying command for --bitcoin-retry-timeout=60 seconds; bitcoind setup or our --bitcoin-* configs broken?
The Bitcoin backend died.
lightningd: FATAL SIGNAL 6 (version v0.10.2-modded)
0xaaaade44feb7 send_backtrace common/daemon.c:33
0xaaaade44ff57 crashdump common/daemon.c:46
0xffffa3555a6b ??? ???:0
0xffffa3555af4 ??? ???:0
Log dumped in crash.log.20220119203932
Lost connection to the RPC socket.

ProofOfKeags commented 2 years ago

amazing ticket.

chrisguida commented 2 years ago

This is... quite involved. Are all of these tasks considered P1?

ProofOfKeags commented 2 years ago

> This is... quite involved. Are all of these tasks considered P1?

CC @MattDHill

MattDHill commented 2 years ago

Yes. The goal is to avoid a barrage of support inquiries and screenshots of crashed service notifications. Even though the errors themselves are benign, the optics are not.

chrisguida commented 2 years ago

I was more asking about the "additional tasks" section. The top section that mentions looping in the entrypoint is sufficient to stop the error messages.

MattDHill commented 2 years ago

Agree. Let's make the top portion P1 and the additional tasks P2

chrisguida commented 2 years ago

c-lightning

ProofOfKeags commented 2 years ago

Also increase the debounce interval.

ProofOfKeags commented 2 years ago

When the 3 PRs associated with the first checklist are merged, this ticket can be de-escalated to P2.

MattDHill commented 2 years ago

Still valid?

dr-bonez commented 2 years ago

The P2s here are still relevant, though I made some changes, now that external bitcoin core connections are no longer an option in proxy.

MattDHill commented 2 years ago

The P2s seem like a subset of a larger ticket for making dependents care about the health checks of their dependencies. I think we had discussed multiple services that should depend on Bitcoin's health checks. Seems like an audit is in order.