RPC for settling or canceling intercepted HTLCs after gateway processing

okjodom commented 1 year ago

Improve upon GatewayLightning subscribe_intercept_htlcs by defining a corresponding complete-htlc client streaming rpc for responding to intercepted and processed htlcs

For every new federation connected, the gateway sends a subscribe_intercept_htlcs RPC to request for HTLC interception. This opens a stream over which intercepted HTLCs are sent to the client:
```
// example
let subscription = SubscribeInterceptHtlcsRequest { short_channel_id: u64 };
let stream = client.subscribe_intercept_htlcs(Request::new(subscription)).await?;
```

For every HTLC intercepted and processed by the gateway:

Use complete_htlcs RPC to send a request to settle the HTLC if HTLC processing was successful

// example
let settle = CompleteHtlcsRequest { action: Some(Action::Settle(Settle { preimage: [] })) };
let stream = client.complete_htlcs(Request::new(stream::iter(vec![settle])).await?;

Use complete_htlcs RPC to send a request to cancel the HTLC if HTLC processing failed, timed-out or was rejected for some other reason

// example
let cancel = CompleteHtlcsRequest { action: Some(Action::Cancel(Cancel { reason: "" })) };
let stream = client.complete_htlcs(Request::new(stream::iter(vec![cancel])).await?;

Use complete_htlcs RPC in batch mode

// example
let settle = CompleteHtlcsRequest { action: Some(Action::Settle(Settle { preimage: [] })) };
let cancel = CompleteHtlcsRequest { action: Some(Action::Cancel(Cancel { reason: "" })) };
let stream = client.complete_htlcs(Request::new(stream::iter(vec![settle, cancel])).await?;

justinmoon commented 1 year ago

- Use complete_htlcs RPC to send a request to cancel the HTLC if HTLC processing failed, timed-out or was rejected for some other reason
  // example
  let cancel = CompleteHtlcsRequest { action: Some(Action::Cancel(Cancel { reason: "" })) };
  let stream = client.complete_htlcs(Request::new(stream::iter(vec![cancel])).await?;

Did you mean cancel_htlcs instead of complete_htlcs?

elsirion commented 1 year ago

I haven't looked at this PR in particular yet, but could we implement something compatible with LND?

https://github.com/lightningnetwork/lnd/blob/ef8cf73d4ce4b485d47ded5fa219992574c928c1/lnrpc/routerrpc/router.proto#L134-L142

That would avoid needing different proto specs and thus client impls in the GW for LND and CLN.

Even if not possible, we should at least think about how we unify the two proto specs into one rust trait.

okjodom commented 1 year ago

I haven't looked at this PR in particular yet, but could we implement something compatible with LND?

https://github.com/lightningnetwork/lnd/blob/ef8cf73d4ce4b485d47ded5fa219992574c928c1/lnrpc/routerrpc/router.proto#L134-L142

Yes, we actually abstract over this LND HtlcInterceptor RPC, adding some filtering feature so which allows us to subscribe to htlcs for multiple federations.

That would avoid needing different proto specs and thus client impls in the GW for LND and CLN. Even if not possible, we should at least think about how we unify the two proto specs into one rust trait.

Filtering happens server side within the gateway-<lnrpc>-extension so each implementation does it's own mapping into our own RPC spec defined in this PR

As for the contents of this PR vs previous proposal in #1218 , I'm convinced we need one subscribe_intercept_htlcs RPC method for intercepting and a corresponding complete_htlcs RPC method for responding to the node. This separation or intercept and respond RPC methods is because gateway-<lnrpc>-extension delegates processing of intercepted HTLC over the wire to gatewayd via lnrpc-client, and it should not block on waiting for a response from the client

In contrast, LND HtlcInterceptor is a bi-directional streaming RPC, because we can serially manipulate each intercepted htlc and send back responses to LND over a channel. No filtering on the raw interceptor.

Here is an example of intercept and respond to HTLCs built with inspiration from prior art by @justinmoon

okjodom commented 1 year ago

- Use complete_htlcs RPC to send a request to cancel the HTLC if HTLC processing failed, timed-out or was rejected for some other reason
// example
let cancel = CompleteHtlcsRequest { action: Some(Action::Cancel(Cancel { reason: "" })) };
let stream = client.complete_htlcs(Request::new(stream::iter(vec![cancel])).await?;

Did you mean cancel_htlcs instead of complete_htlcs?

complete_htlcs is the rpc method name. You can send a cancel or settle action request over it, specifying the htlc to cancel or settle

Here is a complete implementation of the client: https://github.com/fedimint/fedimint/pull/1198/commits. These two commits for how the lnrpc client uses complete_htlcs:

justinmoon commented 1 year ago

This separation or intercept and respond RPC methods is because gateway--extension delegates processing of intercepted HTLC over the wire to gatewayd via lnrpc-client, and it should not block on waiting for a response from the client

For CLN with cln_plugin we have to block, right? We're just inside a function and need to return a value. I guess we could just println the JSON-RPC response to CLN if we can get our hands on the JSON-RPC request ID. But blocking isn't a problem if we do it in a tokio task so it doesn't block anything else.

Filtering happens server side within the gateway--extension so each implementation does it's own mapping into our own RPC spec defined in this PR

Filtering could also happen in gatewayd, right? Not sure which is better, but strictly speaking it can happen in either one ...

okjodom commented 1 year ago

This separation or intercept and respond RPC methods is because gateway--extension delegates processing of intercepted HTLC over the wire to gatewayd via lnrpc-client, and it should not block on waiting for a response from the client

For CLN with cln_plugin we have to block, right? We're just inside a function and need to return a value.

Ah yes. CLN implementation has to block! It'd be interesting to implement and interceptor for CLN which can handle this situation:

1. Intercept HTLC A for federation A. Send over to gatewayd for processing
2. Intercept HTLC B for federation B. Send over to gatewayd for processing
3. Receive outcome of HTLC A from gatewayd and respond to CLN pay
4. Receive outcome of HTLC B from gatewayd and respond to CLN pay

Desirable properties of a good interceptor might include:

intercept and delegate processing of multiple HTLCs for the same federation in parallel
intercept and delegate processing of multiple HTLCs for different federations in parallel
auto cancel HTLCs delegated to gatewayd for processing if there's no response after some expiry

I guess we could just println the JSON-RPC response to CLN if we can get our hands on the JSON-RPC request ID. But blocking isn't a problem if we do it in a tokio task so it doesn't block anything else.

Trying this out here: https://github.com/fedimint/fedimint/pull/1224/commits

Filtering happens server side within the gateway--extension so each implementation does it's own mapping into our own RPC spec defined in this PR

Filtering could also happen in gatewayd, right? Not sure which is better, but strictly speaking it can happen in either one ...

Filtering within gateway-<lnrpc>-extension allows gatewayd to be dumb about the HTLCs type definition, which vary per LN implementation. Additionally, since we already map HTLCs into some Fedimint gateway known types, we might as well pre-filter within the interceptor and only send relevant HTLCs to gatewayd for processing, while immediately continuing [or canceling] irrelevant HTLCs

justinmoon commented 1 year ago

Filtering within gateway--extension allows gatewayd to be dumb about the HTLCs type definition, which vary per LN implementation

You could just have the subscription event include the short channel id. Wouldn't need to know anything else AFAIK.

justinmoon commented 1 year ago

If we used the same proto definitions as LND (and had the gateway filter HTLCs) would there be any need for the LND extension? Couldn't gatewayd just connect directly to LND?

But having LND define this interface this might not work for very long. For instance, the extension will eventually need a subscription to intercept p2p messages requesting a bolt12 invoice from offer. I imagine CLN will support such a hook much sooner than LND will ...

elsirion commented 1 year ago

Having a service run between LND and gatewayd seems like a lot of hassle (additional config) and another point of failure. It's bad enough we already need it for CLN.

So what if we had a trait in gatewayd that specifies the interface we need and then implement that trait both for the CLN protoc client and the LND protoc client? That way we'd avoid the additional service in the LND case. That rust trait impl could still do filtering etc.

okjodom commented 1 year ago

Filtering within gateway--extension allows gatewayd to be dumb about the HTLCs type definition, which vary per LN implementation

You could just have the subscription event include the short channel id. Wouldn't need to know anything else AFAIK.

For CLN, you'd get HtlcAccepted data on intercept, append short short channel id to this, and then send it over network to gatewayd. For LND, you'd get ForwardHtlcInterceptResponse data on intercept, append short short channel id to this, and then send it over network to gatewayd.

send it over network to gatewayd already requires type coercion because we don't want to send raw HtlcAccepted or ForwardHtlcInterceptResponse types over network wire to gatewayd. Doesn't scale well across LN implementations.
even if we could send it over network to gatewayd without type coercion, we'd then need gatewayd to understand HtlcAccepted or ForwardHtlcInterceptResponse or other type X defined by a specific node implementation
send it over network to gatewayd without filtering doesn't scale well with volume of HTLCs intercepted. imagine sending 100 HTLCs to gatewayd for filtering and processing, only to realize just 10 of them are relevant to gatewayd federation. gatewayd federation needs to process those. Yet still it needs to send responses to 100 HTLCs with actions like settle, cancel or continue as the outcome of the round trip

Pre-filtering HTLCs within the gateway-<lnrpc>-extension service fixes this :)

okjodom commented 1 year ago

But having LND define this interface this might not work for very long. For instance, the extension will eventually need a subscription to intercept p2p messages requesting a bolt12 invoice from offer. I imagine CLN will support such a hook much sooner than LND will ...

Yeah it's best we as Fedimint own the rpc spec. We then should aim to stabilize it so any great implementation of the gateway-<lnrpc>-extension can be relatively stable and evenually take on a slower upgrade cadence to gatewayd and the rest of Fedimint.

This fully decouples service upgrade cadence of the node from that of gatewayd
Democratizes implementations of gateway-<lnrpc>-extension

Example, a node runner can opt to deploy their gateway-<lnrpc>-extension implementation with extremely efficient htlc intercept and filter on behalf of federaions. Eclair experts can implement gateway-eclair-extension independent of our work at improving gatewayd, the contracts, features and security or providing gateway services

okjodom commented 1 year ago

Having a service run between LND and gatewayd seems like a lot of hassle (additional config) and another point of failure.

Is this not the regular pain of microservice architecture? Except in this case we get tremendous benefits from having gatewayd and gateway-<lnrpc>-extension be two independent, interoperating microservices

It's bad enough we already need it for CLN.

I don't see any 'additional badness' for LND other than the pain we were already going to meet even with . The architecture of LND requires a service running parallel and outside of lnd binary process.

So what if we had a trait in gatewayd that specifies the interface we need and then implement that trait both for the CLN protoc client and the LND protoc client?...

Not sure I understand this first part of the new suggestion

That way we'd avoid the additional service in the LND case. That rust trait impl could still do filtering etc.

and so the second part is lost to me as well...

justinmoon commented 1 year ago

I don't see any 'additional badness' for LND other than the pain we were already going to meet even with . The architecture of LND requires a service running parallel and outside of lnd binary process.

LND requires us to run 1 additional process. We're planning to run 2 additional processes (gatewayd and extension).

Not sure I understand this first part of the new suggestion

The suggestion is to have gatewayd do direct RPC with CLN and LND. To accomplish this, have a trait (like our current LnRpc) which we can impl for CLN and LND. Instead of doing it over the network.

m1sterc001guy commented 1 year ago

I think I might have missed some context. Is the main purpose of this new extension process to filter HTLCs before forwarding to the gateway? And this is required for both CLN and LND right?

okjodom commented 1 year ago

More generally, maybe we should

I don't see any 'additional badness' for LND other than the pain we were already going to meet even with . The architecture of LND requires a service running parallel and outside of lnd binary process.

LND requires us to run 1 additional process. We're planning to run 2 additional processes (gatewayd and extension).

Not sure I understand this first part of the new suggestion

The suggestion is to have gatewayd do direct RPC with CLN and LND. To accomplish this, have a trait (like our current LnRpc) which we can impl for CLN and LND. Instead of doing it over the network.

Okay, this is the design approach we were on before #1103 . I would not recommend designing LND intergration any different from CLN integration pattern

Formally, here are the two designs we have:

gatewayd with embedded LnRpc trait calling into a node
- There is an LnRpc trait implementation for each LN implementation supported
- Each LN implementation build a variant of gatewayd to embed an appropriate LnRpc trait implementation
- gatewayd is the extension to a node.
- For CLN, it's deployed as a plugin
- LND, it's necessarily deployed as an extension service
gatewayd with unified rpc client calling into a gateway-<lnrpc>-extension
- There is a gateway-<lnrpc>-extension for each LN implementation supported
- gateway-<lnrpc>-extension is the extension to a node.
- For CLN, it's deployed as a plugin
- LND, it's necessarily deployed as an extension service

gatewayd is a standalone service that calls into any such gateway-<lnrpc>-extension

justinmoon commented 1 year ago

I think I might have missed some context. Is the main purpose of this new extension process to filter HTLCs before forwarding to the gateway? And this is required for both CLN and LND right?

The main purpose is to make it so that the gateway is agnostic to the specific flavor on RPC it is speaking to the lightning node (e.g. which node implementation, which version of that implementation). The interface between the gateway and the extension would be defined by us, and all node implementation details would live in the extension. Filtering HTLCs in the extension microservice is an optimization. This approach would work without doing that.

The debate is really around where we translate operations fedimint gateway wants a lightning node to do -- pay invoice, get node pubkey, subscribe to HTLCs -- into lightning node-specific RPC requests -- JSON-RPC for CLN, gRPC for LND.

One option is to do it in a Rust trait in gatewayd. It would look pretty similar to our existing LnRpc trait, but would need a way to establish subscriptions.

The other option would be to do it in a microservice. This is more complex to deploy and makes the code a little harder to understand, but would have better separation of concerns (gatewayd doesn't need to know anything about lightning node RPC interfaces) and more deployment flexibility.

We just did a call and decided to proceed with the microservice option.

fedimint / fedimint

RPC for settling or canceling intercepted HTLCs after gateway processing #1240