ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance

Node announcement propagation difficulties after IP address change #5037

Open whitslack opened 2 years ago

whitslack commented 2 years ago

Issue and Steps to Reproduce

  1. Have lots of active channels with nodes that you can't connect to. (They have to connect to you.)
  2. Have your ISP change your IP address.
  3. Have many channels force-closed on you because your peers can no longer find you in the network.

My network's public IPv4 address changed on Tuesday of this week. I immediately restarted my node and verified that the IPv4 address shown in the address list of its getinfo response was indeed my new address. Fast-forward to today (more than three days later), and I still have 516 channel peers that have not yet reconnected to me. I am connected with 636 channel peers, though over 300 of those reconnected immediately since I am able to connect outbound to them. This means that less than half of my channel peers that can only connect to me have done so after more than three days since my IP address changed. This is obviously not healthy. Furthermore, I have lost almost 40 channels to unilateral closures so far. This incurs a nontrivial monetary expense.

There appears to be a significant problem in node announcement gossip propagation across the Lightning Network. This may be due to a design flaw in the BOLT specifications, or there may be an implementation fault in C-Lightning. I will note that 1ML.com started showing my new IPv4 address practically immediately after the change, whereas Amboss.space took about a day to notice the new announcement, and LightningNetwork.plus still isn't showing my new address.

I usually run with dev-fast-gossip enabled, so I experimented with disabling that yesterday and also changed my node's RGB color to try to force a new node announcement to go out, but that achieved only very limited success in getting more of my channel peers to find me.

I am finding it impossible to discern from the debug logs how often my node_announcement is being transmitted to my connected peers, but it seems like it's not going out to enough of them.

getinfo output

I'm on version 0.10.2, and my correct public IPv4 address is listed in the address array.

whitslack commented 2 years ago

I re-enabled dev-fast-gossip in my config and restarted C-Lightning yet again, and now I have 879 connected and 276 disconnected channel peers, and LightningNetwork+ does now show the correct addresses for my node. I guess propagation of my node announcement is happening; it's just very slow.

whitslack commented 2 years ago

I seem to have plateaued at around 900 connected channels and 250 disconnected channels. ☹️

viaj3ro commented 2 years ago

I seem to have plateaued at around 900 connected channels and 250 disconnected channels. ☹️

Yeah, it looks like it's going exactly the same as with my node: a slow climb and then a plateau. The only thing that helped was the alias change.

rustyrussell commented 2 years ago

I can believe this. It's hard to tell you're missing a node_announcement, whereas it's easy to tell if you're missing a channel_announcement. We have heuristics to try to determine if we should ask for more node_announcements, but I'm not surprised that they don't propagate very well across the network in general.

whitslack commented 2 years ago

We have heuristics to try to determine if we should ask for more node_announcements, but I'm not surprised that they don't propagate very well across the network in general.

I don't understand why it doesn't work like transaction propagation in the Bitcoin network. One node says to each of its connected peers, "I have an inventory object with ID xyz." If any peer is unaware of that object, it replies, "Please send me inventory object with ID xyz." It shouldn't take days or even hours to propagate a new node or channel announcement. It should take seconds, like a Bitcoin transaction. Clearly that's not how announcements propagate across the Lightning Network, but I have to wonder why not.
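The inv/getdata flooding described here can be sketched in a few lines, for contrast with Lightning's batched gossip. The class and method names below are illustrative only, not Bitcoin Core's actual P2P code:

```python
# Illustrative sketch of Bitcoin-style inventory flooding: announce an ID,
# and any peer that lacks the object requests it and relays it onward.

class Peer:
    def __init__(self, name):
        self.name = name
        self.peers = []
        self.known = {}  # object id -> object payload

    def recv_inv(self, sender, obj_id):
        # "I have an inventory object with ID xyz."
        if obj_id in self.known:
            return
        # "Please send me inventory object with ID xyz."
        self.known[obj_id] = sender.known[obj_id]
        self.announce(obj_id)  # relay onward immediately

    def announce(self, obj_id):
        for peer in self.peers:
            peer.recv_inv(self, obj_id)

def connect(a, b):
    a.peers.append(b)
    b.peers.append(a)

# A new announcement floods a three-node line in one pass, in seconds.
a, b, c = Peer("a"), Peer("b"), Peer("c")
connect(a, b)
connect(b, c)
a.known["xyz"] = b"node_announcement payload"
a.announce("xyz")
```

The `if obj_id in self.known` check is what terminates the flood; each object crosses each link at most once in each direction.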

rustyrussell commented 2 years ago

We have a potentially much larger set than the bitcoind mempool. See the million channels project.

Anyway, you're supposed to do an even more naive job than that: every 60 seconds you blast all the gossip you've received. The problem is if you're offline, what to do when you come back. There are various heuristics, but in general it's hard to tell if you're missing node_announcements (since they're optional).
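The periodic "blast" described here can be sketched roughly as follows. The interval comes from the comment above; the class and method names are invented for illustration, not CLN's actual gossipd internals:

```python
# Sketch of batched gossip relay: queue everything received, then flush the
# whole queue to all peers on a timer instead of relaying immediately.
import time

class GossipStore:
    FLUSH_INTERVAL = 60.0  # seconds between "blasts"

    def __init__(self, peers):
        self.peers = peers        # each peer modeled as a list (its rx queue)
        self.pending = []         # gossip received since the last flush
        self.last_flush = time.monotonic()

    def receive(self, msg):
        self.pending.append(msg)  # queue; do not relay immediately

    def maybe_flush(self, now=None):
        now = time.monotonic() if now is None else now
        if now - self.last_flush < self.FLUSH_INTERVAL:
            return 0
        for peer in self.peers:   # blast everything queued to every peer
            peer.extend(self.pending)
        sent = len(self.pending)
        self.pending.clear()
        self.last_flush = now
        return sent

peer1, peer2 = [], []
store = GossipStore([peer1, peer2])
for msg in ("update_1", "update_2", "update_3"):
    store.receive(msg)
sent = store.maybe_flush(now=store.last_flush + 61)  # one interval later
```

Note the contrast with the inv/getdata model: nothing here lets a node discover that it is *missing* an optional message, which is the gap minisketch reconciliation is meant to close.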

This is why I'm mentoring Alex Myers on the idea of using minisketch for gossip reconciliation, which will fix this properly.

whitslack commented 2 years ago

The problem is if you're offline, what to do when you come back.

At the very least, wouldn't you blast out all of your own announcements to every peer you connect to, first thing? Or are you saying even this would not lead expeditiously to full propagation?

I haven't looked into your implementation, but your recent commit suggests to me that you're not treating the local node's announcements any differently than you're treating the gossip that has come in from other peers. If so, then that seems suboptimal.

minisketch for gossip reconciliation, which will fix this properly.

Awesome. I'll look forward to that.

rustyrussell commented 2 years ago

Oh, we always send our own node_announcements, channel_announcements and channel_updates, whether filtered or not. There was a bug when we shuffled our node_announcements where we didn't do this, but it doesn't affect this case.

m-schmoock commented 2 years ago

This should work now, as we added IP discovery by default (assuming you have the port forwarded to your node). @whitslack Can you check again?

m-schmoock commented 2 years ago

(If you use a recent master or a v0.11.0rc pre-release.)

whitslack commented 2 years ago

@m-schmoock: Thanks for the tickle. I can't test this on demand, as it could be months to years before my IP address changes again.

m-schmoock commented 2 years ago

@whitslack Also, you should not be using --dev-fast-gossip on mainnet, as you get rate limited and others may start to ignore your updates.

whitslack commented 2 years ago

Also, you should not be using --dev-fast-gossip on mainnet as you get rate limited and others may start to ignore your updates.

I don't know how else to get Core Lightning to actually push my fee rate updates. I do passive channel rebalancing by adjusting my fee rates in response to changes in channel balance. Each minute, I select the channel whose current fees are the most in error relative to the fees that the channel's balance of liquidity implies, and I adjust the fees on that channel. Given that I have over 1200 channels, it would take 20 hours to update all of their fees, doing just one per minute, and that's assuming no repeats. If I were to turn off --dev-fast-gossip, then Core Lightning wouldn't even send out my fee updates, which seems broken by design.
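A minimal sketch of the scheme described here, assuming a hypothetical target-fee curve (whitslack's actual formulae are not given in the thread, and the numbers below are invented for illustration):

```python
# Hypothetical passive-rebalancing sketch: derive a target fee from each
# channel's outbound-liquidity share, then once a minute adjust only the
# channel whose current fee is furthest (relatively) from its target.

def target_ppm(our_share, base_ppm=200):
    # Invented curve: scarce outbound liquidity -> higher fee,
    # plentiful outbound liquidity -> lower fee.
    return max(1, round(base_ppm * 2 * (1 - our_share)))

def most_in_error(channels):
    # channels: dicts with 'id', current 'fee_ppm', and 'our_share' (0..1)
    def rel_error(ch):
        target = target_ppm(ch["our_share"])
        return abs(ch["fee_ppm"] - target) / target
    return max(channels, key=rel_error)

channels = [
    {"id": "chan_a", "fee_ppm": 200, "our_share": 0.5},  # already on target
    {"id": "chan_b", "fee_ppm": 50,  "our_share": 0.1},  # badly underpriced
]
worst = most_in_error(channels)  # -> chan_b (largest relative fee error)
```

In the cron-job setup described above, each run would then issue a single `setchannel`-style fee change for `worst` and exit.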

m-schmoock commented 2 years ago

Hm, if you send that many updates, you get rate limited for sure and the network will start to ignore you. Yes, you send them, but it's like you never sent them ;)

Try using the feeadjuster plugin ( https://github.com/lightningd/plugins/tree/master/feeadjuster ) and disable fast gossip. It's just for testing for a reason.

whitslack commented 2 years ago

I know about the fee adjuster plugin. I was consulted during its development. It still doesn't do what I want, though, so I'll stick with my own solution. I will run without --dev-fast-gossip for a week and compare my fee revenues.

m-schmoock commented 2 years ago

@whitslack I'm curious about how you handle fee updates per channel. Can we improve the feeadjuster plugin in a way that suits you better?

whitslack commented 2 years ago

I will run without --dev-fast-gossip for a week and compare my fee revenues.

My fee revenue nearly doubled from the previous week. This is surprising to me because essentially none of my channel announcements are being sent out in any reasonable amount of time, so essentially all of my channel announcements out in the network are stale, which logically should mean that attempts to route payments via my channels should usually fail due to incorrect fees. Yet, the reality appears to be that more payments are successfully routing through my node, so I'm flabbergasted.

I'm curious about how you handle fee updates per channel. Can we improve the feeadjuster plugin in a way that suits you better?

@m-schmoock: What I didn't like about the feeadjuster plugin is that it's reactive, generating two new channel announcements upon every successfully forwarded payment. That seemed egregiously spammy to me, given that my node is currently forwarding over 35,000 successful payments per week. That'd be an awful lot of channel announcement spam. Instead, I use a cronjob that runs once a minute and finds the one channel whose current fees are the furthest in error (relatively) from the fees that my formulae say the channel ought to have and adjusts that channel's fees. This means I'm generating 10,080 fee changes per week, as opposed to ~70,000 that the feeadjuster plugin would generate.
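The update-volume arithmetic above checks out directly; the only assumption is the stated two channel updates per forwarded payment:

```python
# Verifying the update-rate comparison from the comment above.
per_minute_updates = 60 * 24 * 7   # one fee change per minute, for a week
reactive_updates = 2 * 35_000      # feeadjuster: two updates per forward,
                                   # at ~35,000 forwards per week

assert per_minute_updates == 10_080
assert reactive_updates == 70_000
```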

m-schmoock commented 2 years ago

@whitslack Yes, I'm considering updating this. You can currently disable the reactive mode and do a "lightning-cli feeadjust" call via cron as well, but that will iterate over all channels, so no real benefit. Maybe a bit better, depending on how often (or how rarely) you call it.

Your way of doing it seems feasible; maybe I'll add something like it to the code.

How many channels does your node have? I have 35 or so and do about 30 forwards a day...

m-schmoock commented 2 years ago

About --dev-fast-gossip: it's better for sure not to have this on mainnet ;)

whitslack commented 2 years ago

How many channels does your node have?

@m-schmoock: "num_active_channels": 1240

whitslack commented 2 years ago

My fee revenue nearly doubled from the previous week.

May have been a statistical anomaly. In the Unix week that ended a couple hours ago, my fee revenue was back to where it was prior to my eliminating --dev-fast-gossip. It's no worse, though, so I'm happy about that.

ZmnSCPxj commented 2 years ago

My fee revenue nearly doubled from the previous week. This is surprising to me because essentially none of my channel announcements are being sent out in any reasonable amount of time, so essentially all of my channel announcements out in the network are stale, which logically should mean that attempts to route payments via my channels should usually fail due to incorrect fees. Yet, the reality appears to be that more payments are successfully routing through my node, so I'm flabberghasted.

If you get a lot of forwarding attempts, channel announcements are being sent out --- payment failures due to insufficient fees return the latest channel_announcement to the payer. The payer may very well help you propagate the change remotely, especially if it is connected directly to nodes which do not throttle gossip (i.e. currently anything not Core Lightning). This can lead to higher feerate propagation in practice (and may explain why node info propagation seems slow compared to feerate propagation).
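The payer-side half of this mechanism can be sketched as follows: a channel_update carried back in a failure message replaces the payer's stale view whenever its timestamp is newer. The field names and graph shape are simplified stand-ins, not the BOLT wire format:

```python
# Sketch of a payer merging a channel_update learned from a
# fee-insufficient payment failure into its local gossip view.

def apply_channel_update(graph, update):
    """Merge an update into the local view if it is newer; return True if applied."""
    key = (update["short_channel_id"], update["direction"])
    current = graph.get(key)
    if current is None or update["timestamp"] > current["timestamp"]:
        graph[key] = update
        return True   # payer learned the new feerate and may retry the route
    return False      # stale or duplicate; ignored

# Payer's stale view: it still believes the channel charges 100 ppm.
graph = {
    ("700000x1x0", 0): {"short_channel_id": "700000x1x0", "direction": 0,
                        "timestamp": 1000, "fee_ppm": 100},
}

# A forwarding node rejects with fee_insufficient, attaching its latest update.
error_update = {"short_channel_id": "700000x1x0", "direction": 0,
                "timestamp": 2000, "fee_ppm": 250}
applied = apply_channel_update(graph, error_update)
```

Because the payer may also relay this update onward as ordinary gossip, a frequently-routed-through node effectively gets its fee changes propagated past rate-limiting peers, as described above.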

Further, if you changed from a higher fee to a lower fee, and then somebody still routes through you successfully, you will silently accept the higher fee. If there was a burst of activity and some alternate routes were congested so that payers end up using your apparently-high-fee channels (which you have recently set to low fee, but that has not propagated out) then you still get paid at the higher feerate.

whitslack commented 2 years ago

@ZmnSCPxj: Thank you for your comments.

payment failures due to insufficient fees return the latest channel_announcement to the payer.

I'm aware of this, though — correct me if I'm wrong — it's only been rather recently that Core Lightning started actually consuming the channel announcements returned in error messages and considering them in routing decisions. My assumption has been that a payer will not consider a failed channel for a second time (with a higher fee rate) in the course of trying to find a usable route for any given payment, and it may not even incorporate the corrected fee information into its local view of the channel graph.

Further, if you changed from a higher fee to a lower fee, and then somebody still routes through you successfully, you will silently accept the higher fee.

I'm aware of this, too, but the problem is the inverse: when a payer selects my channel because the payer believes the channel has a low fee, but my node rejects the forwarding attempt due to insufficient fee. The payer probably will just rule out my channel altogether rather than putting it back into its pool of candidate channels with a revised fee rate.

Maybe I'm being too pessimistic in my assumptions about the current gap between the theoretically ideal implementation and the current, actual state of real-world implementations.

ZmnSCPxj commented 2 years ago

I'm aware of this, though — correct me if I'm wrong — it's only been rather recently that Core Lightning started actually consuming the channel announcements returned in error messages and considering them in routing decisions.

Wrong, it has been there for a long time, before I wrote the first pay implementation. In the past, the updates were fed from the sendpay implementation in lightningd directly into gossipd. Now they are no longer fed from the sendpay implementation, instead the pay implementation handles them, then uses addgossip to feed channel updates into gossipd (much the same behavior, but arguably better architected this way). The docs for addgossip weasel out and say it is "best effort", but if you look at the code, the channel updates are indeed fed into gossipd (though other announcements are not, last time I checked, which is why the weaseling).

The payer probably will just rule out my channel altogether rather than putting it back into its pool of candidate channels with a revised fee rate.

IIRC the old pay implementations did not rule out fee-insufficient errors (other errors would cause the channel to be banned from getroute but I think there was an exception for this error, because it updated fees), though my memory may be fuzzy at this point. The behavior may or may not have been preserved through the various pay iterations, though.

In any case, it is likely to propagate faster than the rated "once a day" limit, due to such fee-insufficient errors effectively leaking out past nodes with ratelimiting. Other implementations might also not rule out fee-insufficient errors outright.