Ensure signer doesn't exit on connection change

roeierez commented 1 month ago

We noticed that when the network changes while Greenlight (via the Breez SDK) is running, we are no longer able to create an invoice. Digging into this, I found that the signer exits with these logs:

[2024-10-06 14:26:41.256 DEBUG tonic::codec::decode:204] decoder inner stream error: Status { code: Unknown, message: "error reading a body from connection: address not available", source: Some(hyper::Error(Body, Error { kind: Io(Kind(AddrNotAvailable)) })) }
[2024-10-06 14:26:41.256 DEBUG gl_client::signer:855] Got an error from the scheduler status: Unknown, message: "error reading a body from connection: address not available", details: [], metadata: MetadataMap { headers: {} }
[2024-10-06 14:26:41.256 ERROR gl_client::signer:807] Scheduler signer loop exited unexpectedly: Err(Scheduler stream error Status { code: Unknown, message: "error reading a body from connection: address not available", source: Some(hyper::Error(Body, Error { kind: Io(Kind(AddrNotAvailable)) })) })
[2024-10-06 14:26:41.257 INFO gl_client::signer:812] Exiting the signer loop
[2024-10-06 14:26:41.257 INFO breez_sdk_core::greenlight::node_api:1220] signer exited gracefully

This PR ensures the signer won't exit and will indeed run forever as intended. The same approach used in run_forever_inner was applied to run_forever_scheduler.
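As a rough illustration of the "log the error and retry instead of exiting" idea, here is a minimal, synchronous, std-only sketch. It is not the code from this PR: `run_forever`, `StreamError`, the `should_stop` callback and the fixed `backoff` are all hypothetical names chosen for the example, and the real gl_client loop is async.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical error type standing in for a scheduler stream failure.
#[derive(Debug)]
pub struct StreamError(pub String);

/// Calls `attempt` repeatedly until `should_stop` returns true.
/// An error from `attempt` is logged and followed by another attempt,
/// so a dropped connection never ends the loop on its own.
/// Returns the number of attempts made.
pub fn run_forever<F, S>(mut attempt: F, mut should_stop: S, backoff: Duration) -> u32
where
    F: FnMut() -> Result<(), StreamError>,
    S: FnMut() -> bool,
{
    let mut attempts = 0;
    while !should_stop() {
        attempts += 1;
        match attempt() {
            // The stream ended without an error: reconnect anyway.
            Ok(()) => {}
            // The stream failed (e.g. after a network change): log and retry.
            Err(e) => eprintln!("signer loop error, retrying: {:?}", e),
        }
        // Pause between attempts (assumed fixed delay) to avoid a tight retry loop.
        sleep(backoff);
    }
    attempts
}
```

In this sketch only the explicit `should_stop` signal can end the loop; an error such as "address not available" just triggers another attempt after the pause.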

Note: while this prevents the signer from exiting and allows the app to recover, there is still a gap of ~3 minutes before the newly connected signer starts receiving messages. Since I can clearly see that the local signer reconnects successfully after the fix, I wonder if some keep-alive settings on the backend side make the scheduler "think" the old signer is still connected...

JssDWt commented 1 month ago

Might fix https://github.com/breez/breez-sdk-greenlight/issues/1090
Might fix https://github.com/Blockstream/greenlight/issues/521

cdecker commented 1 month ago

Nice change, thanks @roeierez :+1:

> Note: while this prevents the signer from exiting and allows the app to recover, there is still a gap of ~3 minutes before the newly connected signer starts receiving messages. Since I can clearly see that the local signer reconnects successfully after the fix, I wonder if some keep-alive settings on the backend side make the scheduler "think" the old signer is still connected...

We buffer all pending requests and redundantly redeliver them to any signer that connects, exactly for this case where a signer dies silently. So if the signer isn't immediately getting the pending requests on reconnect, we need to take a look at it. Generally speaking, we let the client drive as much as possible, since it most likely has the most information on its network and connectivity state (power save mode, network switch, etc.), whereas the server side is just always on.
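To make the buffering behaviour described above concrete, here is a minimal in-memory sketch. It is only an assumption about the general shape of such a buffer, not the actual Greenlight backend: `Request`, `PendingBuffer` and its methods are hypothetical names invented for the example.

```rust
use std::collections::BTreeMap;

/// Hypothetical signing request, identified by a unique id.
#[derive(Debug, Clone, PartialEq)]
pub struct Request {
    pub id: u64,
    pub payload: String,
}

/// Holds every request that has not been answered yet.
#[derive(Default)]
pub struct PendingBuffer {
    pending: BTreeMap<u64, Request>,
}

impl PendingBuffer {
    /// Store a new request until some signer answers it.
    pub fn enqueue(&mut self, req: Request) {
        self.pending.insert(req.id, req);
    }

    /// Called whenever a signer connects: hand it a copy of every pending
    /// request. Nothing is removed here, so a signer that dies silently
    /// does not cause requests to be lost.
    pub fn on_signer_connect(&self) -> Vec<Request> {
        self.pending.values().cloned().collect()
    }

    /// Called when a signer responds: drop the request from the buffer.
    /// Returns false if the id was already answered (a redundant response).
    pub fn on_response(&mut self, id: u64) -> bool {
        self.pending.remove(&id).is_some()
    }
}
```

Because requests are removed only when answered, two signers connecting in a row both receive the same pending set, and a duplicate response for an already-answered id is simply ignored.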

Blockstream / greenlight

Ensure signer doesn't exit on connection change #524