feature request: decommission

m-schmoock commented 4 years ago

Description

Add a new pair of lightning-cli RPC commands that allow the permanent shutdown of a node. The commands may be called decommission and recommission. This was discussed in #3499.

These commands can be used to close all channels and prevent acceptance of new funded channels by remote peers. Since this is a basic operation (any node goes offline eventually) and also security related (the operator may decide its host may be compromised), we should not have this as an optional Python plugin but maybe as an integrated C plugin.

Decommissioning a node for operational reasons should focus on closing channels cooperatively to keep a low on-chain footprint. Decommissioning a node in a hurry for security reasons requires a shorter force-close timeout, an external address and likely higher on-chain fees.

`lightning-cli decommission [address_or_xpub] [timeout] [destination]`

Immediately

- If [destination] is given, try to send out the available channel liquidity to the [destination] pubkey via keysend for a given (short) timeout, e.g. 5 minutes.
- Close all responsive channels.
- Send any immediately available on-chain funds to an optional [address_or_xpub], if specified.
- Start rejecting channel opens by the operator, with a proper error message indicating the decommissioned state of this node.
- Stop accepting newly funded channels from remote peers.

In the meantime

- Force-close all uncooperative or offline channels after the given timeout, defaulting to a slow decommission timeout of 24hr to give remote counterparts a chance to get online in the meantime for a cooperative close. If the user is in a hurry, they can repeat the command with a shorter timeout value.
- Redirect any settled and released on-chain funds resulting from closed channels to an optional [address_or_xpub], if specified.
- Redirect other received on-chain funds to [address_or_xpub], if specified.
- If we are in slow mode, we can wait and aggregate a batch on-chain transaction to save on fees.

When done

- Move the hsm_secret file to a backup location, so the node can't accidentally be started after the decommission completes.
- Stop the node.

`lightning-cli recommission`

Cancels an ongoing decommission process by:

- Restoring the moved hsm_secret backup, if the prior decommission had finished.
- Allowing the operator to create new channels again.
- Accepting remote funding again.
- No longer redirecting released funds to the external wallet.

`lightning-cli commissionstate`

Shows the user the state and progress of an ongoing decommissioning operation:

- state
- redirect destination (if set)
- channels closed
- channels remaining
- time remaining until force-close timeout
- ...

Technical Requirements and Stepstones

[ ] integrated C plugin with json rpc commands and descriptions
[ ] database variables for commission_state and [address_or_xpub] (if we go for xpub we also need to count key index)
[ ] local fundchannel hook or hook rpc_command('fundchannel')
[x] make openchannel hook chainable. If multiple plugins return close_to, take first one and warn about it.
[ ] hook openchannel for remote fundings, if decommission is running, reject them
[ ] wallet hook to redirect funds
[ ] also add the commissioned/decommissioning/decommissioned state to the RPC getinfo output.
[ ] some unit tests
[ ] a nice manpage
[ ] a beer at the winchester

Open Questions and Points of Discussion

As @ZmnSCPxj mentioned (https://github.com/ElementsProject/lightning/issues/3550#issuecomment-657363693), as an alternative to database variables, which are currently not easily usable from plugins, we could put the stateful information in a local file called e.g. 'decommissioned', which just contains the slow/fast mode, the timeout, the address_or_xpub information and so on. The user could also simply and safely delete the file in order to stop the decommission.

ZmnSCPxj commented 4 years ago

Would it be useful to provide an xpub or ypub or zpub to generate addresses to? Or maybe an array of addresses? Because some of the funds will take a while to recover (unilateral closes) and it would be preferable to transfer onchain funds now, then transfer in-channel funds as soon as they become available. Even mutual closes are not atomic, as it involves waiting for HTLCs to resolve or fail, then negotiating, then doing the actual close.

The "decommissioned" state should probably persist, should it not? So on restart of the daemon it is still decommissioned and it will still reject new incoming fundchannels and outgoing fundchannels and so on. So maybe a variable in the database as well.

m-schmoock commented 4 years ago

Yes, I have thought about having x/y/zpubs as an alternative to a fixed external_address for obvious reasons. Maybe we can allow both and check the syntax.
Yes this state must persist and we need a database variable for this.

cdecker commented 4 years ago

As a further step the plugin could also reject any incoming connections that do not match a channel that is to be closed, that way we truly take it offline, while still giving us a chance to close collaboratively.

m-schmoock commented 4 years ago

> As a further step the plugin could also reject any incoming connections that do not match a channel that is to be closed, that way we truly take it offline, while still giving us a chance to close collaboratively.

Hm, did I get this right? If we already unilaterally closed any offline or uncooperative channels in the first place, are peers that come online afterwards (while being force-closed) still able to do a collaborative close? Or do you mean we should set the default timeout for force-close to a higher value, e.g. 24hr, so we give them a sufficient timeframe to come online before we force-close?

cdecker commented 4 years ago

Yes, I was thinking about trying our best to close collaboratively, and only after a long timeout actually force-close. So in my mind the sequence should be:

1. Start rejecting connections and requests to open channels.
2. Try reconnecting with peers that have an active channel for 24h. If we are able to connect, or get an incoming connection from a peer with a channel, we immediately initiate a collaborative close.
3. After 24h of trying to play nice, start force-closing and thus reject all connections.
4. Wait for all the force-closes and pending HTLCs to settle.
5. Withdraw all to the specified address.
6. Move the hsm_secret file to a backup location, so the node can't accidentally be started after the decommission completes.
7. Stop the node.
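The per-channel decision in the middle of that sequence could be sketched roughly as follows. This is only an illustration of the ordering described above: `Action`, `next_action` and `GRACE_PERIOD_SECONDS` are hypothetical names, not part of lightningd or any existing plugin.

```python
from enum import Enum

# Hypothetical names for illustration only; nothing here is an existing API.
GRACE_PERIOD_SECONDS = 24 * 60 * 60  # the 24h "play nice" window from the list above


class Action(Enum):
    COOPERATIVE_CLOSE = "cooperative_close"
    KEEP_RECONNECTING = "keep_reconnecting"
    FORCE_CLOSE = "force_close"


def next_action(elapsed_seconds, peer_connected, grace_period=GRACE_PERIOD_SECONDS):
    """Decide what to do with one still-open channel during decommission."""
    if elapsed_seconds >= grace_period:
        # Grace period is over: stop being nice and close unilaterally.
        return Action.FORCE_CLOSE
    if peer_connected:
        # Peer is reachable within the window: close collaboratively right away.
        return Action.COOPERATIVE_CLOSE
    # Peer is offline but we still have time: keep trying to reconnect.
    return Action.KEEP_RECONNECTING
```

A caller would evaluate this for every remaining channel whenever a peer connects or a timer fires.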

This minimizes our onchain footprint and fees :wink:

ZmnSCPxj commented 4 years ago

This suggests a third API as well: getcommissionstatus, giving a commissionstatus field in the result:

- "commissioned" - Currently allowing all incoming connections and channel fundings.
- "decommissioning" - Still has channels open or not completely removed (closing transactions not yet deeply confirmed), some on-chain funds still held, but rejecting (most) incoming connections and channel fundings.
- "decommissioned" - No channels remaining, no funds remaining, decommissioned.
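As a rough sketch, the three states could be derived from a few counters like this. The function and parameter names are made up for illustration, and how lightningd would actually track them is left open.

```python
from enum import Enum


class CommissionStatus(Enum):
    COMMISSIONED = "commissioned"
    DECOMMISSIONING = "decommissioning"
    DECOMMISSIONED = "decommissioned"


def commission_status(decommission_requested, channels_remaining, funds_remaining_msat):
    """Map node state onto the three proposed status strings (hypothetical helper)."""
    if not decommission_requested:
        return CommissionStatus.COMMISSIONED
    if channels_remaining > 0 or funds_remaining_msat > 0:
        # Channels not fully gone, or funds still held by this node.
        return CommissionStatus.DECOMMISSIONING
    return CommissionStatus.DECOMMISSIONED


def accepts_channel_fundings(status):
    """Only a fully commissioned node accepts new channel fundings."""
    return status is CommissionStatus.COMMISSIONED
```

The `.value` of each member is the string that would go into the commissionstatus field.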

I think a "fast" decommission would want to push as much of the funds as possible to the target address(es) immediately; you might want this in case you suspect key material replication by an unauthorized party. A "slow" decommission, in contrast, wants low fees by waiting for everything to come on-chain and then sending all the funds in a single tx to the target address, for example if you decide to burn the server you were using as a natural part of replacing servers. But implementing two forms of decommission is... twice as complicated as implementing just one. I would think the "fast" decommission, where we send funds to the target address(es) ASAP, is better, since upgrading your server is not as much of an emergency as recovering from a suspected key material replication.

m-schmoock commented 4 years ago

@ZmnSCPxj It's all about what to use as the default timeout value. I thought about using a sufficiently high value to do a slow decommission by default, BUT telling the user via the JSON-RPC response that the slow (default) timeout was chosen, and that if they are in a hurry they should repeat the command with a low timeout value.

Additionally we can offer a fastdecommission command (bad name, suggestions welcome) that is the same command with a quite low timeout value.

cdecker commented 4 years ago

I don't quite see the complexity of switching between the two modes: we can just have a withdrawable_amount function that returns 0 if we can't withdraw yet. It just needs to return 0 almost always in slow mode, and it returns any on-chain amount larger than dust in fast mode. By subscribing to all notifications triggered by any fund change we can withdraw as we go or defer until everything is settled.

cdecker commented 4 years ago

Fwiw I also prefer shorter RPC names in favor of (optional) command parameters.

m-schmoock commented 4 years ago

> I don't quite see the complexity of switching between the two modes: we can just have a withdrawable_amount function that returns 0 if we can't withdraw yet. It just needs to return 0 almost always in slow mode, and it returns any on-chain amount larger than dust in fast mode. By subscribing to all notifications triggered by any fund change we can withdraw as we go or defer until everything is settled.

@cdecker can you try to explain this in other words?

Also, for fast decommission, I think we might want to have an option to send LN balances to a safe node first if possible, as waiting for closures and settlement introduces risks we want to avoid in fast mode. In fact, if you think about it, fast mode and slow mode are two different operations: one with speed and security in mind, the other with cost, footprint and operational continuity in mind.

m-schmoock commented 4 years ago

Note: I updated and rewrote parts of the mission statement in the description of this issue

cdecker commented 4 years ago

> > I don't quite see the complexity of switching between the two modes: we can just have a withdrawable_amount function that returns 0 if we can't withdraw yet. It just needs to return 0 almost always in slow mode, and it returns any on-chain amount larger than dust in fast mode. By subscribing to all notifications triggered by any fund change we can withdraw as we go or defer until everything is settled.
>
> @cdecker can you try to explain this in other words?

Just a bit of rambling about where we could differentiate the two modes. TL;DR: I don't think we need two completely distinct entrypoints into the decommission flow, we just need to differentiate when it comes to actually executing the withdraw. In fast mode we'd trigger a withdraw every time we have sufficient funds (>> dust), while in economic mode we'd wait for everything to be settled. That's the only difference imho.
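A minimal sketch of that idea, assuming a single helper that is re-evaluated on every fund-change notification. The parameter names and the default threshold are assumptions for illustration; in practice the threshold would sit well above dust, as noted above.

```python
# Assumption: 546 sat is used here purely as an illustrative dust threshold;
# a real implementation would likely pick a much larger minimum (">> dust").
DUST_LIMIT_SAT = 546


def withdrawable_amount(mode, onchain_sat, everything_settled,
                        min_withdraw_sat=DUST_LIMIT_SAT):
    """Return how many sat to withdraw right now, or 0 to defer.

    mode is "fast" or "slow" (hypothetical labels for the two flows).
    """
    if onchain_sat <= min_withdraw_sat:
        # Not worth a transaction in either mode.
        return 0
    if mode == "slow" and not everything_settled:
        # Economical mode: wait so that everything can go out in one batch.
        return 0
    # Fast mode, or slow mode once all channels and HTLCs have settled.
    return onchain_sat
```

For example, `withdrawable_amount("slow", 100_000, False)` defers with 0, while the same call with `"fast"` returns the full 100000 sat.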

> Also, for fast decommission, I think we might want to have an option to send LN balances to a safe node first if possible, as waiting for closures and settlement introduces risks we want to avoid in fast mode. In fact, if you think about it, fast mode and slow mode are two different operations: one with speed and security in mind, the other with cost, footprint and operational continuity in mind.

I think those are two distinct operations: draining channels, followed by a decommission. Draining should be done independently of whether we are fast or economical, since it minimizes the number of outputs we need to withdraw and allows us to spend more on on-chain fees (pushing our transaction faster in fast mode, and making things more economical in eco mode).

ZmnSCPxj commented 4 years ago

See also: https://arxiv.org/pdf/2007.00764.pdf, especially section 4.2, for material on closing channels and redirecting the funds back to the owner's cold wallet.

ZmnSCPxj commented 4 years ago

openchannel is currently not a chained hook. So a builtin plugin hooking into openchannel would prevent user plugins from using the openchannel hook as well.

Our options are:

Make openchannel a chained hook.
- What do we do if multiple plugins return { 'result': 'continue', 'close_to': 'random' } with different close_to addresses? We know decommission will not do that, but multiple user plugins might.
Give a separate hook just for decommission.
- A bit bleah though, since the logic is exactly the same, we just want to add a hook for decommission that does not prevent user plugins from hooking into openchannel as well.
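For the first option, one possible resolution (the one noted in the checklist of the issue description: take the first close_to and warn) might look like the sketch below. `merge_openchannel_results` is a hypothetical helper operating on plain dicts, not lightningd code, and the early return on a non-continue result is an assumption about how a chained hook would terminate.

```python
import warnings


def merge_openchannel_results(results):
    """Combine the responses of several plugins chained on openchannel.

    Hypothetical helper: the first non-"continue" response wins outright,
    and among "continue" responses the first close_to is kept.
    """
    merged = {"result": "continue"}
    for res in results:
        if res.get("result") != "continue":
            # Assumption: a rejection from any plugin ends the chain.
            return res
        close_to = res.get("close_to")
        if close_to is None:
            continue
        if "close_to" not in merged:
            merged["close_to"] = close_to
        elif merged["close_to"] != close_to:
            # A later plugin disagrees: keep the first address and warn.
            warnings.warn(
                "multiple plugins returned different close_to values; "
                f"keeping {merged['close_to']!r}, ignoring {close_to!r}"
            )
    return merged
```

With this rule a decommission plugin that only ever rejects never conflicts with user plugins that set close_to.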

ZmnSCPxj commented 4 years ago

> In fast mode we'd trigger a withdraw every time we have sufficient funds (>> dust), while in economic mode we'd wait for everything to be settled. That's the only difference imho.

We should probably also consider using feerate=urgent for fast mode, and feerate=normal or even feerate=slow for economical mode.

Rather than put decommissioning status in the database (which either requires the entire commissioning flow to be in lightningd rather than a plugin, or expose some commands that allow most of the commissioning flow to be done in a plugin except for the database access commands which would be nasty footguns), how about an optional file decommissioned? decommission would create this file, recommission would delete this file, and a user could create/delete the file themselves and the decommissioning flow would react to that because of magic inotify. The file could contain the string "slow" for economical decommissions, and everything else would be fast decommission.
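A small sketch of how a plugin might interpret such a file, combined with the feerate choice above. The helper names are hypothetical, the file is simply read on demand (no inotify here), and "normal" is picked for the economical mode although "slow" would be equally in line with the suggestion.

```python
from pathlib import Path

# Assumption: mapping taken from the suggestion above; "slow" could also be
# used as the feerate for economical decommissions.
FEERATE_BY_MODE = {"fast": "urgent", "slow": "normal"}


def decommission_mode(state_file):
    """Return None if not decommissioning, else "slow" or "fast".

    The file's existence means "decommission in progress"; the content
    "slow" selects the economical flow, anything else means fast.
    """
    path = Path(state_file)
    if not path.exists():
        return None
    return "slow" if path.read_text().strip() == "slow" else "fast"


def feerate_for(mode):
    """Pick the feerate keyword to pass along when withdrawing funds."""
    return FEERATE_BY_MODE[mode]
```

Deleting the file makes `decommission_mode` return None again, which is the "recommission by hand" path described above.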

m-schmoock commented 4 years ago

@ZmnSCPxj I addressed making the openchannel hook chainable in https://github.com/ElementsProject/lightning/pull/3960/commits/aecba6fae87b22e4f07bc4bd467c74531dbb19af. I will open a dedicated PR for this once I have added tests.
