Open elvece opened 2 years ago
amazing ticket.
This is... quite involved. Are all of these tasks considered P1?
CC @MattDHill
Yes. The goal is to avoid a barrage of support inquiries and screenshots of crashed service notifications. Even though the errors themselves are benign, the optics are not.
I was more asking about the "additional tasks" section. The top section that mentions looping in the entrypoint is sufficient to stop the error messages.
Agree. Let's make the top portion P1 and the additional tasks P2
c-lightning
Also increase the debounce interval.
When the 3 PRs associated with the first checklist are merged, this ticket can be de-escalated to P2.
Still valid?
The P2s here are still relevant, though I made some changes, now that external bitcoin core connections are no longer an option in proxy.
The P2s seem like a subset of a larger ticket for making dependents care about the health checks of their dependencies. I think we had discussed multiple services that should depend on Bitcoin's health checks. Seems like an audit is in order.
This issue is most apparent with the bitcoin stack. On Embassy restart, all services are started at once. Since bitcoind is not "ready" yet, the services that depend on it (lnd/c-lightning/spark) crash.
A known solution is to update the configurator/manager/entrypoint with a looping function around the service binary to account for this bootup delay. This solution should be implemented for the following services, since they are known to crash in this situation:
P1 Tasks
Example solution with lnd:
In a while loop, query bitcoin-cli for bitcoind's state (e.g., whether it is still verifying blocks). If bitcoind is not ready yet, stay in the loop, i.e., don't run the lnd binary yet. Keep looping up to some predetermined limit. If bitcoind is syncing / rolling forward, keep looping indefinitely. The health checks will report "starting" for this duration; decide on a point at which they should fail with a descriptive error message (e.g., "bitcoind is verifying blocks").
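A rough sketch of that loop, in Python for illustration only (the actual entrypoints may be written in something else). The function names, state labels, time limit, and poll interval are all hypothetical. The sketch assumes `bitcoin-cli getblockchaininfo` exits non-zero while bitcoind is still warming up, and it treats the `initialblockdownload` field as the "syncing" signal.

```python
import json
import subprocess
import time
from typing import Callable, Tuple

# Hypothetical state labels for this sketch; not taken from any service wrapper.
READY, SYNCING, NOT_READY = "ready", "syncing", "not_ready"


def probe_bitcoind(cli: str = "bitcoin-cli") -> Tuple[str, str]:
    """Query bitcoind once and return (state, detail)."""
    try:
        result = subprocess.run(
            [cli, "getblockchaininfo"],
            capture_output=True, text=True, timeout=30,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        return NOT_READY, str(exc)
    if result.returncode != 0:
        # Assumption: during warm-up the error text (e.g. "Verifying blocks...")
        # is printed on stderr and the exit status is non-zero.
        return NOT_READY, result.stderr.strip()
    try:
        info = json.loads(result.stdout)
    except ValueError:
        return NOT_READY, "unparseable response from bitcoin-cli"
    if info.get("initialblockdownload"):
        return SYNCING, "bitcoind is syncing"
    return READY, ""


def wait_for_bitcoind(
    probe: Callable[[], Tuple[str, str]] = probe_bitcoind,
    limit_s: float = 300.0,      # hypothetical limit for the "not ready" phase
    interval_s: float = 5.0,     # hypothetical poll interval
    sleep: Callable[[float], None] = time.sleep,
) -> None:
    """Block until bitcoind is ready; give up only if it stays unreachable."""
    deadline = time.monotonic() + limit_s
    while True:
        state, detail = probe()
        if state == READY:
            return
        # Syncing / rolling forward: keep waiting with no deadline.
        # Not ready at all: fail with a descriptive message once the limit passes.
        if state == NOT_READY and time.monotonic() >= deadline:
            raise TimeoutError(f"bitcoind is not ready: {detail}")
        sleep(interval_s)
```

An entrypoint would call `wait_for_bitcoind()` first and start the lnd binary only after it returns. The `TimeoutError` message is what a health check could surface as the descriptive failure.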
Additional tasks (P2):
Example crash messages: