Blockstream / greenlight

Build apps using self-custodial lightning nodes in the cloud
https://blockstream.github.io/greenlight/getting-started/
MIT License

User can't send payment, always pending #482

Open roeierez opened 4 months ago

roeierez commented 4 months ago

A Relai user tried to send a payment and it is stuck in pending. The logs show a timeout while waiting for the channel to re-establish. Full logs are attached as well.

1721292295613.app.log

[BreezSdk] {INFO} (2024-07-18T08:45:36.502607Z) : node-logs: UNUSUAL plugin-gl-plugin-internal: Failing payment with: Timeout while waiting for channels to re-establish

Stack backtrace:
   0: <gl_plugin_internal::node::InternalNodeServer<R> as cln_grpc::pb::node_server::Node>::pay::{{closure}}
   1: <<cln_grpc::pb::node_server::NodeServer<T> as tower_service::Service<http::request::Request<B>>>::call::PaySvc<T> as tonic::server::service::UnaryService<cln_grpc::pb::PayRequest>>::call::{{closure}}
   2: <cln_grpc::pb::node_server::NodeServer<T> as tower_service::Service<http::request::Request<B>>>::call::{{closure}}
   3: <gl_plugin::node::rpcwait::RpcWaitService<S> as tower_service::Service<http::request::Request<hyper::body::body::Body>>>::call::{{closure}}
   4: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
   5: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
   6: <tower::util::map_response::MapResponseFuture<F,N> as core::future::future::Future>::poll
   7: <tonic::transport::service::router::RoutesFuture as core::future::future::Future>::poll
   8: <gl_plugin::node::SignatureContextService<S> as tower_service::Service<http::request::Request<hyper::body::body::Body>>>::call::{{closure}}
   9: <tower::util::either::Either<A,B> as core::future::future::Future>::poll
  10: <tonic::transport::server::recover_error::ResponseFuture<F> as core::future::future::Future>::poll
  11: <tonic::transport::server::SvcFuture<F> as core::future::future::Future>::poll
  12: <hyper::proto::h2::server::H2Stream<F,B> as core::future::future::Future>::poll
  13: tokio::runtime::task::core::Core<T,S>::poll
  14: tokio::runtime::task::harness::Harness<T,S>::poll
  15: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  16: tokio::runtime::scheduler::multi_thread::worker::Context::run
  17: tokio::runtime::context::set_scheduler
  18: tokio::runtime::context::runtime::enter_runtime
  19: tokio::runtime::scheduler::multi_thread::worker::run
  20: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  21: tokio::runtime::task::core::Core<T,S>::poll
  22: tokio::runtime::task::harness::Harness<T,S>::poll
  23: tokio::runtime::blocking::pool::Inner::run
  24: std::sys_common::backtrace::__rust_begin_short_backtrace
  25: core::ops::function::FnOnce::call_once{{vtable.shim}}
  26: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/boxed.rs:2015:9
  27: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/boxed.rs:2015:9
  28: std::sys::unix::thread::Thread::new::thread_start
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys/unix/thread.rs:108:17
  29: start_thread
  30: clone
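For context, this is how the failure surfaces to the caller: the pay RPC errors out with the message above and the payment stays pending on the client side. A minimal client-side sketch of retrying only this specific failure while the root cause is investigated; `try_pay`, `PayError`, and the substring matching are hypothetical stand-ins, not the Breez SDK or greenlight API:

```rust
use std::time::Duration;

// Hypothetical error type and pay wrapper; the real call would go through the
// greenlight / Breez SDK pay RPC.
#[derive(Debug)]
struct PayError(String);

async fn try_pay(_bolt11: &str) -> Result<(), PayError> {
    // Placeholder: issue the actual pay request here.
    Err(PayError(
        "Timeout while waiting for channels to re-establish".into(),
    ))
}

// Retry only the "channels to re-establish" timeout, with exponential backoff;
// any other error is returned immediately.
async fn pay_with_retry(bolt11: &str, attempts: u32) -> Result<(), PayError> {
    let mut last_err = None;
    for attempt in 0..attempts {
        match try_pay(bolt11).await {
            Ok(()) => return Ok(()),
            Err(e) if e.0.contains("waiting for channels to re-establish") => {
                tokio::time::sleep(Duration::from_secs(1u64 << attempt)).await;
                last_err = Some(e);
            }
            Err(e) => return Err(e),
        }
    }
    Err(last_err.expect("attempts must be > 0"))
}
```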
cdecker commented 4 months ago

Yes, we used to treat this as a fallible pre-flight check, but since the error that eventually came back was the usual "no route found" error, which did not really tell us much, we made the check infallible, i.e., it is now required to pass before we attempt the payment.

The usual causes for this are a slow signer, a CLN <> VLS desync issue, or a slow peer.

Failing here is correct, by the way: we do not stand a chance of completing the payment without enough capacity back online. This error just tells us more than the catch-all "no route found", so it is a symptom of us getting closer to the actual root cause :-)
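To make the fallible vs. infallible distinction concrete, here is a rough sketch of the behaviour described above; the `channels_re_established` helper, the `must_pass` flag, and the timeout handling are hypothetical stand-ins, not the actual gl-plugin code:

```rust
use std::time::Duration;

// Hypothetical: resolves once all channels have re-established with their peers.
async fn channels_re_established() {}

// Pre-flight check before attempting a payment. With `must_pass = false` (the
// old, fallible behaviour) a timeout is tolerated and the payment is attempted
// anyway, typically failing later with the catch-all "no route found". With
// `must_pass = true` (the current, infallible behaviour) the timeout is
// surfaced directly as the more specific error seen in the logs above.
async fn preflight(timeout: Duration, must_pass: bool) -> Result<(), String> {
    match tokio::time::timeout(timeout, channels_re_established()).await {
        Ok(()) => Ok(()),
        Err(_elapsed) if !must_pass => Ok(()),
        Err(_elapsed) => Err("Timeout while waiting for channels to re-establish".into()),
    }
}
```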

cdecker commented 4 months ago

Tracing through the logs we can reconstruct the following timeline:

So from what we can see here, we either fail to detect a channel becoming active, or we are not re-activating the channel correctly.

I'll dig further.

cdecker commented 4 months ago

If desired we can make the pre-flight fallible again for this node and see if that helps. If it does, that would mean the pre-flight checks are not detecting channel activation correctly.

Longer term we want to make the channel status (active / inactive) an explicit part of listpeerchannels so we can skip some of the guesswork.
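For illustration, this is the kind of guesswork the pre-flight currently has to do, approximating "active" from existing listpeerchannels fields. The JSON field names below (`peer_connected`, `state`, `short_channel_id`) match current CLN releases, but treat this as an assumed sketch rather than a guaranteed greenlight interface:

```rust
use serde_json::Value;

// Approximate "usable right now" from existing listpeerchannels fields: the
// peer must be connected and the channel must be in the normal operating state.
fn usable_channels(listpeerchannels: &Value) -> Vec<String> {
    let Some(channels) = listpeerchannels["channels"].as_array() else {
        return Vec::new();
    };
    channels
        .iter()
        .filter(|c| {
            c["peer_connected"].as_bool().unwrap_or(false)
                && c["state"].as_str() == Some("CHANNELD_NORMAL")
        })
        .filter_map(|c| c["short_channel_id"].as_str().map(|s| s.to_string()))
        .collect()
}
```

An explicit active/inactive status in listpeerchannels would let the plugin drop this kind of heuristic entirely.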

roeierez commented 4 months ago

@cdecker I am in favor of anything that will help us improve here. We can let Relai know that we want to do that for this node if you want to go for it.