Blockstream / greenlight

Build apps using self-custodial lightning nodes in the cloud
https://blockstream.github.io/greenlight/getting-started/
MIT License

User can't send payment, always pending #482

Open roeierez opened 3 months ago

roeierez commented 3 months ago

A Relai user tries to send a payment and it stays pending. The logs show a timeout while waiting for the channel to re-establish. Full logs are also attached.

1721292295613.app.log

[BreezSdk] {INFO} (2024-07-18T08:45:36.502607Z) : node-logs: UNUSUAL plugin-gl-plugin-internal: Failing payment with: Timeout while waiting for channels to re-establish

Stack backtrace:
   0: <gl_plugin_internal::node::InternalNodeServer<R> as cln_grpc::pb::node_server::Node>::pay::{{closure}}
   1: <<cln_grpc::pb::node_server::NodeServer<T> as tower_service::Service<http::request::Request<B>>>::call::PaySvc<T> as tonic::server::service::UnaryService<cln_grpc::pb::PayRequest>>::call::{{closure}}
   2: <cln_grpc::pb::node_server::NodeServer<T> as tower_service::Service<http::request::Request<B>>>::call::{{closure}}
   3: <gl_plugin::node::rpcwait::RpcWaitService<S> as tower_service::Service<http::request::Request<hyper::body::body::Body>>>::call::{{closure}}
   4: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
   5: <futures_util::future::future::map::Map<Fut,F> as core::future::future::Future>::poll
   6: <tower::util::map_response::MapResponseFuture<F,N> as core::future::future::Future>::poll
   7: <tonic::transport::service::router::RoutesFuture as core::future::future::Future>::poll
   8: <gl_plugin::node::SignatureContextService<S> as tower_service::Service<http::request::Request<hyper::body::body::Body>>>::call::{{closure}}
   9: <tower::util::either::Either<A,B> as core::future::future::Future>::poll
  10: <tonic::transport::server::recover_error::ResponseFuture<F> as core::future::future::Future>::poll
  11: <tonic::transport::server::SvcFuture<F> as core::future::future::Future>::poll
  12: <hyper::proto::h2::server::H2Stream<F,B> as core::future::future::Future>::poll
  13: tokio::runtime::task::core::Core<T,S>::poll
  14: tokio::runtime::task::harness::Harness<T,S>::poll
  15: tokio::runtime::scheduler::multi_thread::worker::Context::run_task
  16: tokio::runtime::scheduler::multi_thread::worker::Context::run
  17: tokio::runtime::context::set_scheduler
  18: tokio::runtime::context::runtime::enter_runtime
  19: tokio::runtime::scheduler::multi_thread::worker::run
  20: <tokio::runtime::blocking::task::BlockingTask<T> as core::future::future::Future>::poll
  21: tokio::runtime::task::core::Core<T,S>::poll
  22: tokio::runtime::task::harness::Harness<T,S>::poll
  23: tokio::runtime::blocking::pool::Inner::run
  24: std::sys_common::backtrace::__rust_begin_short_backtrace
  25: core::ops::function::FnOnce::call_once{{vtable.shim}}
  26: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/boxed.rs:2015:9
  27: <alloc::boxed::Box<F,A> as core::ops::function::FnOnce<Args>>::call_once
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/alloc/src/boxed.rs:2015:9
  28: std::sys::unix::thread::Thread::new::thread_start
             at /rustc/07dca489ac2d933c78d3c5158e3f43beefeb02ce/library/std/src/sys/unix/thread.rs:108:17
  29: start_thread
  30: clone

cdecker commented 3 months ago

Yes, we used to treat these as a fallible pre-flight test, but since the error being returned was the usual "no route found" error, which did not really tell us much, we made it infallible, i.e. it is now required to pass.

The usual cause for these is a slow signer, a CLN <> VLS desync issue, or a slow peer.

Failing here is correct, by the way, as we don't stand a chance of completing the payment without enough capacity back online. This just happens to tell us more than the catch-all "no route found" error, so it is a symptom of us getting closer to the actual root cause :-)
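For illustration only (this is not the actual gl-plugin code), here is a minimal sketch of the kind of pre-flight step that produces the error in the attached log: the pay handler polls the node's channel list until at least one channel looks usable again, and gives up after a fixed deadline. The helper names, polling interval, and deadline below are assumptions.

```rust
use std::time::Duration;
use tokio::time::{sleep, timeout};

// Hypothetical view of a channel as reported by the node; the real pre-flight
// works off CLN's listpeerchannels output.
struct ChannelInfo {
    peer_connected: bool,
    state: String, // e.g. "CHANNELD_NORMAL"
}

// Placeholder for querying the node; in the real plugin this is an RPC call.
async fn list_peer_channels() -> Vec<ChannelInfo> {
    Vec::new() // assumption: wired up to the node in the real code
}

// Poll until at least one channel is re-established, or give up after `deadline`.
async fn wait_for_active_channels(deadline: Duration) -> Result<(), String> {
    let poll = async {
        loop {
            let usable = list_peer_channels()
                .await
                .iter()
                .any(|c| c.peer_connected && c.state == "CHANNELD_NORMAL");
            if usable {
                return;
            }
            sleep(Duration::from_millis(500)).await;
        }
    };
    timeout(deadline, poll)
        .await
        // This is the condition that surfaces as the UNUSUAL log line above.
        .map_err(|_| "Timeout while waiting for channels to re-establish".to_string())
}

#[tokio::main]
async fn main() {
    // With no channels reported, this fails after the deadline, mirroring the log.
    if let Err(e) = wait_for_active_channels(Duration::from_secs(5)).await {
        eprintln!("Failing payment with: {e}");
    }
}
```

In a setup like this, a slow signer, a CLN <> VLS desync, or a slow peer all delay the moment a channel becomes usable again, which is how the causes listed above translate into this timeout.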

cdecker commented 3 months ago

Tracing through the logs we can reconstruct the following timeline:

So from what we can see here, we are either failing to detect a channel becoming active, or we are not re-activating the channel correctly.

I'll dig further.

cdecker commented 3 months ago

If desired we can make the pre-flight fallible again for this node and see if that helps. If it does, that would mean the pre-flight checks are not detecting channel activation correctly.
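As a rough sketch of what that toggle means (reusing a check like the one sketched above; the function names and the 90-second deadline are made up for illustration): a fallible pre-flight logs the failure and lets the payment attempt proceed, while the infallible variant turns the same condition into a hard error on the pay call.

```rust
use std::time::Duration;

// Stub standing in for a pre-flight check like the one sketched earlier;
// here it always fails, to show the two behaviours.
async fn wait_for_active_channels(_deadline: Duration) -> Result<(), String> {
    Err("Timeout while waiting for channels to re-establish".to_string())
}

// `require_pass = true` matches the current behaviour (the check must pass),
// `false` matches the older best-effort behaviour discussed here.
async fn preflight(require_pass: bool) -> Result<(), String> {
    match wait_for_active_channels(Duration::from_secs(90)).await {
        Ok(()) => Ok(()),
        Err(e) if !require_pass => {
            eprintln!("pre-flight failed, attempting the payment anyway: {e}");
            Ok(())
        }
        // Propagated to the caller and fails the pay call up front.
        Err(e) => Err(e),
    }
}

#[tokio::main]
async fn main() {
    assert!(preflight(false).await.is_ok()); // fallible: payment still attempted
    assert!(preflight(true).await.is_err()); // infallible: pay fails immediately
}
```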

Longer term we want to make the channel status (active / inactive) an explicit part of listpeerchannels so we can skip some of the guesswork.
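To illustrate the guesswork: today a consumer of listpeerchannels has to combine several fields, roughly peer connectivity plus the channel state machine's state, to decide whether a channel is usable, whereas an explicit active/inactive flag would reduce this to a single lookup. A simplified sketch of that inference, with made-up types standing in for the real RPC response:

```rust
// Simplified stand-in for one entry of CLN's listpeerchannels response;
// field names are illustrative, not the exact generated types.
struct PeerChannel {
    peer_connected: bool,
    state: String, // e.g. "CHANNELD_NORMAL", "CHANNELD_AWAITING_LOCKIN", ...
    spendable_msat: u64,
}

// Today "active" has to be inferred from several fields; an explicit
// status field in listpeerchannels would make this a single lookup.
fn looks_active(c: &PeerChannel) -> bool {
    c.peer_connected && c.state == "CHANNELD_NORMAL"
}

fn main() {
    let ch = PeerChannel {
        peer_connected: true,
        state: "CHANNELD_NORMAL".to_string(),
        spendable_msat: 1_000_000,
    };
    // A pre-flight would also want enough spendable capacity for the payment.
    println!("usable: {}", looks_active(&ch) && ch.spendable_msat > 0);
}
```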

roeierez commented 2 months ago

@cdecker I am in favor of anything that will help us improve here. We can let Relai know that we want to do that for this node if you want to go for it.