ZmnSCPxj / clboss

Automated Core Lightning Node Manager
MIT License

The bigger the node, the bigger the channels, the smaller the peers. #81

Status: Closed (hoplaaa closed 2 years ago)

hoplaaa commented 2 years ago

Once it reaches ~35Msat of local capacity, clboss starts to open larger channels, but it also chooses smaller peers. As a result, a large share of total liquidity is allocated to very small peers. For example, the latest automatic channel creation was 6M+ to a peer with 28M total capacity.

Expected behavior: take the peer's capacity into account when sizing the channel.

Steps to reproduce:

  1. Create a new node
  2. Fund 3Msat per day
  3. Once it has ~35M local capacity, the plugin starts to open big channels (5M+) with relatively small peers.

ZmnSCPxj commented 2 years ago

This is deliberate (i.e. it is a feature, not a bug), as large nodes are already large and adding more capacity to them makes them even larger.

Consider: If we were always biased towards larger nodes, then larger nodes become even larger, which increases our bias towards them, which makes them larger, which increases our bias towards them, which makes them larger. Then you might as well centralize Lightning around LNBIG and forget about the rest of the network.

ChannelFinderByPopularity looks at the size of nodes ("popularity") for selection. It has this bit of code: https://github.com/ZmnSCPxj/clboss/blob/c29f323613b01d3d3a68cf3ff568df110515c912/Boss/Mod/ChannelFinderByPopularity.cpp#L219-L223

Basically, once ChannelFinderByPopularity sees you have some number of channels, it will reduce its participation in the finding process. Normally it will look for 5 nodes and propose those. But once you have more than 4 channels, it will just give 1 proposed node at a time.
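
The rule described above can be sketched roughly as follows. This is an illustrative Python sketch, not the C++ code at the linked lines; the function name and parameter names are hypothetical, and only the numbers (5 proposals normally, 1 once there are more than 4 channels) come from the comment.

```python
def proposals_per_round(num_channels: int,
                        channel_threshold: int = 4,
                        full_batch: int = 5,
                        reduced_batch: int = 1) -> int:
    """How many nodes the popularity-based finder proposes in one round."""
    # "More than 4 channels" switches the finder to its reduced mode.
    if num_channels > channel_threshold:
        return reduced_batch
    return full_batch


for n in (0, 4, 5, 20):
    print(n, proposals_per_round(n))
```

With these assumed defaults, a node with up to 4 channels gets 5 proposals per round, and any node with more channels gets 1.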

Other channel finders then kick in:

The policies each one has are:

So yeah --- basically ChannelFinderByPopularity already does what you want, but is programmed to reduce its participation and let smaller nodes have a chance to be connected to your node, because otherwise we risk becoming a centralization pressure on Lightning.

The logic here is that we need good connectivity to the network first, so we start out looking at large popular nodes, but once we have a few good connections, we should stop making large nodes even larger or else we accelerate Lightning centralization.

btweenthebars commented 2 years ago

Is the reason why a node is on the recommended list recorded in the log or the db? If it was found by ChannelFinderByEarnedFee, I want to know. One puzzling thing: I have a tiny private channel to Lightning.Watch for the service they provide, which alerts when a node is down. I got a few recommendations on the clboss list for nodes that have only 1 small channel, to Lightning.Watch, with a score of 24. They are taking up space on the list. I had to make them go away by adding them to the unmanaged list.

Btw, yes, I have seen this too: clboss opened 13M to a 25-peer node and 7M to a 15-peer low-capacity node. I see your point about not moving towards centralization, but if a node is too small, it likely can't utilize the given liquidity, which is a waste of sats.

Also, don't you consider well-connected nodes that consistently charge low fees? It could mean more traffic, more income. (ChannelFinderByLowFee)

ZmnSCPxj commented 2 years ago

Is the reason why a node is on the recommended list recorded in the log or the db? If it was found by ChannelFinderByEarnedFee, I want to know

Not yet, it probably should be recorded.

score of 24

These scores are based only on uptime --- this pool is for investigating uptime, not how well they are connected or how "good" they are by other metrics. Determining whether a node is a good candidate based on its connectivity, fees, etc. just requires CPU time, but checking its uptime continuously is time-consuming --- just because a node is online now does not mean it is online continuously. That is why the score saturates at "24": on average, nodes in that pool get checked about once an hour, so 24 means they have been checked for the last 24 hours (approximately; there is a good bit of randomization here) and were online during that time. I should probably rename it to "uptimeness" instead of "score" because people get confused.
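
To make the saturation concrete, here is a minimal Python sketch of an uptime score of this kind. It assumes the score goes up by one on each successful check and down by one on each failed check; only the cap of 24 and the roughly hourly checks come from the comment above, and the function name and the penalty for a failed check are hypothetical.

```python
MAX_SCORE = 24  # saturation point mentioned in the thread


def update_score(score: int, online: bool, max_score: int = MAX_SCORE) -> int:
    """Apply one (roughly hourly) uptime check to a candidate's score."""
    if online:
        return min(score + 1, max_score)
    # Assumption: a failed check costs one point; the real penalty may differ.
    return score - 1


score = 0
for _ in range(30):  # 30 consecutive successful checks
    score = update_score(score, online=True)
print(score)  # 24: the score has saturated
```

Under these assumptions a score of 24 says only that the node answered the last day or so of checks, nothing about its connectivity or fees.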

Also, don't you consider well-connected nodes that consistently charge low fees? It could mean more traffic, more income. (ChannelFinderByLowFee)

Could modify ChannelFinderByPopularity to do that. I consider feerates to be easily gameable, some nodes set 0 fees on their node for example. Which implies that those low-fee nodes are getting compensated by other means --- they might be surveilling payments that pass through them and selling that data. Since ChannelFinderByPopularity reduces its participation later, that is the safest point to allow being affected by feerates.

ZmnSCPxj commented 2 years ago

but if a node is too small, it likely can't utilize the given liquidity, which is wasteful of sat.

How small is "too small"?

Note that every candidate has a "patron", one of its peers which we think is a good reason to channel with the candidate. We measure the capacity between the candidate and its patron and use it as basis for channel size to the candidate. Capacity of direct channels between them is halved, routes of two hops between them get 1/3 capacity, and that is used as the basis of the capacity towards the candidate. This assumed available capacity could be reduced.
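
As a rough illustration of the sizing basis just described, the following Python sketch sums half of each direct candidate-patron channel and one third of each two-hop route. It is not the actual Dowser code: the function name is hypothetical, and taking the capacity of a two-hop route as the smaller of its two channels is an assumption made here.

```python
from typing import Iterable, Tuple


def dowse_basis(direct_channels_sat: Iterable[int],
                two_hop_routes_sat: Iterable[Tuple[int, int]]) -> int:
    """Estimate usable capacity (in sats) between a candidate and its patron."""
    total = 0
    # Direct channels between candidate and patron count for half.
    for capacity in direct_channels_sat:
        total += capacity // 2
    # Two-hop routes count for one third; the route capacity is assumed
    # to be limited by its smaller channel.
    for first_hop, second_hop in two_hop_routes_sat:
        total += min(first_hop, second_hop) // 3
    return total


print(dowse_basis([1_000_000, 2_000_000], [(900_000, 3_000_000)]))  # 1800000
```

Because the contributions are added up, a candidate with several channels or routes to its patron can get a basis larger than its single biggest channel.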

Fund 3Msat per day

Well, there is your problem. I designed CLBOSS on the assumption that you would want to keep your costs down, meaning you would make one big deposit at the start, because one onchain action is cheaper than multiple smaller daily ones. If you had made one large transfer to CLBOSS at the start, then much of that capacity would be to popular, high-connectivity nodes, because the start is when ChannelFinderByPopularity is most aggressive. As-is, you gave it 3M at the start, and by policy if CLBOSS sees you have 0 channels, it will make two (cannot forward with just one, right?), so the initial 3M got split into approximately 1.5M to two popular nodes. After that, it notices it has no incoming capacity (cannot forward without that, right?) and swaps offchain funds for onchain funds, getting incoming capacity that way and getting back approximately 1.5M, which it puts to a third popular node. There goes the 3 channels out of the 4 that would then disable ChannelFinderByPopularity.

To handle this usage pattern, ChannelFinderByPopularity would need to check for sudden percentage increases in the local node available funds (all funds, in both channels and onchain) and re-increase its participation in case of sudden jumps.

btweenthebars commented 2 years ago

How small is "too small"?

Size isn't the right word. If one goal of clboss is to put the sats to use, it should try to find a busy node. Busy could be thought of as traffic / number of peers and capacity. Although traffic information is private, a bot should have the advantage of collecting data / probing channels and analyzing them over time. I'm not sure clboss does that very well, because when I turned it on and threw money at it, it opened channels right away. I would like to see it observe the network for quite some time, until it is really sure that a new channel it opens will put my sats to work and, of course, break even.

hoplaaa commented 2 years ago

This is deliberate (i.e. it is a feature, not a bug), as large nodes are already large and adding more capacity to them makes them even larger.

Thanks for the detailed answer about how new peers are chosen. I'm fine with that.

Well, there is your problem. I designed CLBOSS on the assumption that you would want to keep your costs down, meaning you would make one big deposit at the start, because one onchain action is cheaper than multiple smaller daily ones.

Yes, you are right, this corresponds well to the philosophy of this plugin. I think it deserves a more explicit explanation in the main readme to prevent users from making the same mistake.

Btw, yes, I have seen this too: clboss opened 13M to a 25-peer node and 7M to a 15-peer low-capacity node. I see your point about not moving towards centralization, but if a node is too small, it likely can't utilize the given liquidity, which is a waste of sats.

Thanks for the feedback. Intuitively I would also assign less liquidity.

How small is "too small"?

Let's say here that the peer size is not the issue. You already explained how connecting to edges is important. The idea I was raising was to somehow cap the ratio of the channel size / peer size.

Note that every candidate has a "patron", one of its peers which we think is a good reason to channel with the candidate. We measure the capacity between the candidate and its patron and use it as basis for channel size to the candidate. Capacity of direct channels between them is halved, routes of two hops between them get 1/3 capacity, and that is used as the basis of the capacity towards the candidate. This assumed available capacity could be reduced.

That part is not clear enough for me yet. I can say that I got a channel which was 2 times bigger than the biggest channel of the peer and was my new biggest channel with about 15% of my local capacity.

ZmnSCPxj commented 2 years ago

Although traffic information is private, a bot should have the advantage of collecting data / probing channels and analyzing them over time. I'm not sure clboss does that very well, because when I turned it on and threw money at it, it opened channels right away. I would like to see it observe the network for quite some time, until it is really sure that a new channel it opens will put my sats to work and, of course, break even.

The ChannelFinderByEarnedFee does precisely that: if one of our peers has been getting a lot of forwards from us, then possibly it or one of its peers is a popular destination and deserves more capacity. So it looks at the peers of the peer and puts that as candidate, and the peer itself as patron.

Unfortunately, in order to actually get that information in the first place, it has to go connect to the network. That means opening channels immediately, so it can get on the network and start observing. If the node is off the network, it can observe nothing and gather no information. This is why CLBOSS makes channels immediately, and uses the popularity as basis.

Possibly what could be done would be to hold off on some of the capacity at the start, instead of immediately trying to put all onchain funds into channels. Not quite sure how to do that yet, any ideas?

I think it deserves a more explicit explanation in the main readme to prevent users from making the same mistake.

Ideally, it should not require explanation. This is why I proposed that ChannelFinderByPopularity should "perk up" if it sees a sudden jump in local owned capacity (say >25% increase) and start participating more. After all, if you thought to do that, that is evidence that future users might do that as well, so adapting to it should be what CLBOSS does, not add more stuff in documentation that never gets read anyway.

ZmnSCPxj commented 2 years ago

I can say that I got a channel which was 2 times bigger than the biggest channel of the peer and was my new biggest channel with about 15% of my local capacity.

You can check the dowser using clboss-dowser, candidate is the first arg, patron is the second arg. The dowser is the algorithm that guesses a good size for channels. Even if the biggest channel of a candidate node is small, the candidate may still have two or more channels with the patron, which means it can get a much larger dowser result than its biggest channel. Or its channel with its patron could have gotten closed since the channel was opened.

The Real Problem is that I never actually show the patron anywhere, possibly not even in logs, sigh. I should probably expose that.

The Other Real Problem is that the dowser algorithm is biased against nodes that have lots of peers. Consider that if a candidate has only two peers. One of those peers has to be the patron. In that case, the Dowser will consider its capacity with its patron, and that capacity represents half of the total capacity of the candidate node. Consider the alternate case if a candidate has two hundred peers. One of those peers has to be the patron. In that case, the Dowser will consider the capacity of the candidate with its patron, but that capacity is just 1/200 of the total capacity of the candidate node. So maybe the Dowser needs to be modified to consider the total capacity of the candidate node in addition to the capacity with its patron.

btweenthebars commented 2 years ago

These scores are based only on uptime --- this pool is for investigating uptime, not how well they are connected or how "good" they are by other metrics.

Should we have another score field that's based on how frequently clboss recommends a node? For example, if a node is discovered by ChannelFinderByEarnedFee, it could be one-time luck. If clboss keeps detecting it for months, it's likely a really good candidate.

Also, if a node is recommended by both ChannelFinderByDistance and ChannelFinderByEarnedFee, it should get a higher score, etc.

ZmnSCPxj commented 2 years ago

Should we have another score field that's based on how frequently clboss recommends a node? For example, if a node is discovered by ChannelFinderByEarnedFee, it could be one-time luck. If clboss keeps detecting it for months, it's likely a really good candidate.

Good idea. Suggest putting in a separate issue.

hoplaaa commented 2 years ago

Ideally, it should not require explanation. This is why I proposed that ChannelFinderByPopularity should "perk up" if it sees a sudden jump in local owned capacity (say >25% increase) and start participating more. After all, if you thought to do that, that is evidence that future users might do that as well, so adapting to it should be what CLBOSS does, not add more stuff in documentation that never gets read anyway.

25% seems arbitrary, we have to know in advance to top up by at least 25% to trigger it. Following the given logic, channel finder could start with "Am I well connected?" no -> use ChannelFinderByPopularity, yes -> use other methods. Of course we would need to define "Am I well connected?".

hoplaaa commented 2 years ago

You can check the dowser using clboss-dowser, candidate is the first arg, patron is the second arg. The dowser is the algorithm that guesses a good size for channels.

Nice, thanks. Very useful. Results are ~10 times bigger than what seems logical to me. I never reach such sizes at channel creation because I never have enough free capacity to fulfill these numbers. I have to say that I still don't get the logic of the dowser: the code doesn't clearly match your previous explanations, and I get lost between Amount, capacity, capacity of a single route hop, capacity of multiple route hops, theoretical_capacity, ... I need to dig more. I will. Not understanding the code, I guess I'm wrong, but haven't you forgotten to divide by dowser_limit when returning the amount here? That would be the bug we have been circling around since the beginning. https://github.com/ZmnSCPxj/clboss/blob/54af2785c4b324c2f1ba4b735d0d26f1d9a60a80/Boss/Mod/Dowser.cpp#L113-L116

The Real Problem is that I never actually show the patron anywhere, possibly not even in logs, sigh. I should probably expose that.

I have it in the INFO-level logs. The lines look like this, which is enough here given that we can call clboss-dowser:

INFO plugin-clboss: ChannelFinderByEarnedFee: Proposing peers of peer PatronID, who earned us X.000000 msat/day.
INFO plugin-clboss: ChannelCandidatesPreinvestigator: Proposing CandidateID (patron PatronID)

ZmnSCPxj commented 2 years ago

I guess I'm wrong, but haven't you forgotten to divide by dowser_limit when returning the amount here?

No, we are not looking at an average, we are looking at the total, so division is not done.

hoplaaa commented 2 years ago

No, we are not looking at an average, we are looking at the total, so division is not done.

Ok, thanks.

I went back to my logs. I have an example of a Candidate (~10 peers, <10M total capacity) paired with a very big Patron. clboss-dowser now gives me ~500K, which makes sense to me, but a >5M channel was opened to that node. That Candidate hasn't had meaningful changes in the last 2 weeks.

That channel was created in a batch of 2 channels a few seconds after a channel creation failure: 'ChannelCreator: all channels failed to construct.'. The other channel created in that batch has the same problem: ~5M at channel creation versus ~500K now.

More interesting, the channel sizes of the previous batch (3 channels: ~10M, ~1M, ~1M) perfectly match what I get from clboss-dowser now. That batch had failed about 10 seconds earlier.

I can provide more logs in a more private way if you need.

ZmnSCPxj commented 2 years ago

Note that ChannelCreator will avoid making channels smaller than 5mBTC (500,000sats). A dowsing result of less than 5mBTC will cause ChannelCreator to reject the node outright and ask the investigator to forget the candidate. Probably borderline?

multifundchannel should not attempt to merge channel sizes at all, so it is not that. How long ago was the Candidate channel created?

If the CLBOSS is new enough, however, it is possible that channel finders simply do not have enough candidates yet. In that case, ChannelCreator will take all the available funds and split them up among the available proposals in addition to the dowsing result. So it is possible the 5M-0.5M was added due to being split up among the only two available candidates at that time:

https://github.com/ZmnSCPxj/clboss/blob/00cdded89cd5eac01ee5c6bb4675ecc7fa20778f/Boss/Mod/ChannelCreator/Planner.cpp#L124-L155

The reasoning for this is that it is cheaper to build multiple channels now that are larger than justified, than to leave behind a substantial amount that will be used to build a new channel later in a subsequent command (i.e. subsequent onchain transaction).

However, that may need to be modified; it may be better to just not build any channels at all in the case where this dividing of the remainder may create a channel that is far too large compared to the dowsing results.
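
A simplified Python sketch of the behaviour described in this comment, for illustration only: it is not the Planner.cpp code linked above, the function name and the handling of unaffordable candidates are hypothetical, and it ignores any per-channel maximum. Only the 500,000 sat minimum and the splitting of leftover funds among the accepted proposals come from the comment.

```python
from typing import Dict

MIN_CHANNEL_SAT = 500_000  # 5 mBTC minimum mentioned in the thread


def plan_channels(available_sat: int, dowsed_sat: Dict[str, int]) -> Dict[str, int]:
    """Assign channel sizes: dowsing result plus an equal share of leftovers."""
    plan: Dict[str, int] = {}
    remaining = available_sat
    for node, amount in dowsed_sat.items():
        if amount < MIN_CHANNEL_SAT:
            continue  # below the minimum: candidate is rejected
        if amount > remaining:
            continue  # not affordable; simply skipped in this sketch
        plan[node] = amount
        remaining -= amount
    # Leftover funds are divided among the accepted proposals.
    if plan and remaining > 0:
        share = remaining // len(plan)
        for node in plan:
            plan[node] += share
    return plan


print(plan_channels(12_000_000, {"cand_a": 600_000, "cand_b": 700_000, "cand_c": 400_000}))
```

With roughly 12M onchain and only two acceptable candidates dowsed at 600K and 700K, each ends up with about 6M in this sketch, which is the pattern reported earlier in the thread.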

hoplaaa commented 2 years ago

Results are ~10 times bigger than what seems logical to me. I never reach such sizes at channel creation because I never have enough free capacity to fulfill these numbers.

I was wrong previously; the dowser is good. For these tests I used Candidates in the top 100-200 by capacity and Patrons in the top 0-10 by capacity. In such cases the dowser gives results like ~50Msat, more than my entire local capacity. This is normal. That value is capped here: https://github.com/ZmnSCPxj/clboss/blob/03c0bd2b34dbc8222c12dc24954ed3db0ce9e06a/Boss/Mod/ChannelCreator/Manager.cpp#L28-L30 In my case max_amount seems slightly too big, 5M sounds better than 16M. For a bigger node, 16M might be too small. Ideally max_amount could be smaller by default but increase with the node capacity, something like max(5M, 10% of node capacity).
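
The proposed rule can be sketched in Python as below. This illustrates the suggestion only and is not existing clboss behaviour; the function names are hypothetical, and the 16,777,215 sat figure is the pre-wumbo channel size limit that the current 16M cap corresponds to.

```python
PRE_WUMBO_MAX_SAT = 16_777_215  # 2**24 - 1 sats, the pre-wumbo channel limit


def proposed_max_amount(node_capacity_sat: int,
                        floor_sat: int = 5_000_000,
                        percent: int = 10) -> int:
    """Suggested cap: max(5M, 10% of node capacity), in sats."""
    return max(floor_sat, node_capacity_sat * percent // 100)


def cap_channel(dowsed_sat: int, max_amount_sat: int) -> int:
    """Clamp a dowsing result to the per-channel maximum."""
    return min(dowsed_sat, max_amount_sat)


# A ~35M node with a ~50M dowsing result:
print(cap_channel(50_000_000, PRE_WUMBO_MAX_SAT))                # current cap
print(cap_channel(50_000_000, proposed_max_amount(35_000_000)))  # proposed cap
```

For a node with about 35M of capacity the proposed cap stays at the 5M floor, while a 200M node would be allowed channels of up to 20M.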

Note that ChannelCreator will avoid making channels smaller than 5mBTC (500,000sats). A dowsing result of less than 5mBTC will cause ChannelCreator to reject the node outright and ask the investigator to forget the candidate. Probably borderline?

Both are dowsed in the 500Ksat-700Ksat range and somehow got ~half of the ~12M fund available onchain.

Using dowser calls on my ~30 Candidates (channel_candidates from clboss-status):

  1. ~10 have 0 capacity to their Patron
  2. ~10 have < 500K capacity to their Patron
  3. ~10 have > 500K capacity to their Patron

In conclusion, 2/3 of my candidates are wasted. ChannelCandidatesPreinvestigator could drop Candidates which have too little capacity to their Patron (<500Ksat).
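
A minimal Python sketch of that filtering idea, assuming some callable that returns the dowsing result in sats for a (candidate, patron) pair. The `dowse` parameter, the function name, and the sample data are hypothetical stand-ins, not clboss APIs.

```python
from typing import Callable, Iterable, List, Tuple

MIN_CHANNEL_SAT = 500_000  # minimum channel size mentioned in the thread


def drop_undersized(candidates: Iterable[Tuple[str, str]],
                    dowse: Callable[[str, str], int],
                    min_sat: int = MIN_CHANNEL_SAT) -> List[Tuple[str, str]]:
    """Keep only (candidate, patron) pairs whose dowsing result is large enough."""
    return [(cand, patron) for cand, patron in candidates
            if dowse(cand, patron) >= min_sat]


# Hypothetical dowsing results, in sats.
results = {("c1", "p1"): 0, ("c2", "p1"): 300_000, ("c3", "p2"): 700_000}
kept = drop_undersized(results.keys(), lambda c, p: results[(c, p)])
print(kept)  # [('c3', 'p2')]
```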

How long ago was the Candidate channel created?

Node was created in mid-late October. Candidate proposal was in mid November. Channel batch creation was at the end of November.

If the CLBOSS is new enough, however, it is possible that channel finders simply do not have enough candidates yet. In that case, ChannelCreator will take all the available funds and split them up among the available proposals in addition to the dowsing result. So it is possible the 5M-0.5M was added due to being split up among the only two available candidates at that time:

Yes, this looks like what happened. I think it's better to keep the funds onchain until we have a full clean batch of candidates. By clean I mean with sizing matching dowser results. Storing dowser results with candidates could be useful for that check. The final dowser value can still be reassessed at channel creation.

That said, we can try to understand why there were not enough candidates. The first reason is that most of the candidates are 'fake', not meeting the capacity-to-patron requirement (see the previous proposal). The second reason is that a channel creation batch can fail. In that case, some (if not all?) candidates are dropped, and a new channel creation batch is immediately initiated with the leftover candidates. This might not be a good idea, as we already tried our best candidates. It might be more strategic to wait until we have a better candidate list. The third reason can be the situation described in #79: when starting clboss for the first time with funds onchain, the first 2 candidates each get ~half of the onchain funds, whatever their dowser sizing.

ZmnSCPxj commented 2 years ago

That said, we can try to understand why there were not enough candidates.

Zeroth: if the node is new, it has no gossip_store, and has to download the whole gossip map. This takes a surprisingly long time (2 to 3 days for a full map), since a lot of implementations limit how much of the map you can get from them per unit time (because gossiping is a bandwidth hog). Since the main channel finder at the start is ChannelFinderByPopularity, and that requires looking at the map, it cannot find a lot of good candidates at the start. Worse, there is no standard order in which you are given the map, so you might, by chance, be given only nodes that are not very popular, but because those nodes are all you know, selecting the most popular ones among those still selects some nodes that are not at all popular. You might also get told of a popular node and only one or two of its channels, with the rest of its channels still to be sent later on but the sending gets delayed because of the bandwidth limiting each node imposes, and since you only know it has one or two channels, you might misjudge it as unpopular.

ZmnSCPxj commented 2 years ago

In my case max_amount seems slightly too big, 5M sounds better than 16M.

max_amount here is motivated more by the fact that most nodes (at the time the code was written) were not wumbo-enabled yet; the 16777215sats is the pre-wumbo limit.

ZmnSCPxj commented 2 years ago

25% seems arbitrary, we have to know in advance to top up by at least 25% to trigger it.

No, the user does not have to know about this.

If the user has the same pattern, i.e. they send X amount every day for N days, then on day 2 CLBOSS will see a +100% bump in owned funds and will re-trigger this, on day 3 CLBOSS will see a +50% bump in owned funds and will re-trigger this, on day 4 CLBOSS will see a +33% bump in owned funds and will re-trigger this, on day 5 CLBOSS will see a +25% bump in owned funds and will re-trigger this. Surely that is enough to fill up the candidates pool with popular nodes. Yes, more complex behavior from users can fail to trigger this, but if you are doing something complicated with the node, you are no longer actually letting CLBOSS manage it but instead managing it yourself.
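
The percentages above follow from simple arithmetic, sketched here in Python. The function names are hypothetical, and treating an increase of exactly 25% as enough to re-trigger is an assumption that matches the day-5 case in this comment.

```python
from typing import List, Tuple


def daily_bumps(deposit_sat: int, days: int) -> List[Tuple[int, float]]:
    """Percentage increase in owned funds on each day after the first."""
    bumps = []
    for day in range(2, days + 1):
        before = (day - 1) * deposit_sat  # funds owned before today's deposit
        bumps.append((day, round(100 * deposit_sat / before, 1)))
    return bumps


def should_retrigger(previous_sat: int, current_sat: int,
                     threshold_percent: int = 25) -> bool:
    """True if owned funds grew by at least threshold_percent."""
    return (current_sat - previous_sat) * 100 >= threshold_percent * previous_sat


print(daily_bumps(3_000_000, 6))
print(should_retrigger(12_000_000, 15_000_000))  # day 5: +25%
print(should_retrigger(15_000_000, 18_000_000))  # day 6: +20%
```

With equal daily deposits the bump on day n is 100/(n-1) percent, so the trigger fires on days 2 through 5 and stops from day 6 onward.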

hoplaaa commented 2 years ago

*closed by mistake, reopen

hoplaaa commented 2 years ago

I think it's better to keep the funds onchain until we have a full clean batch of candidates. By clean I mean with sizing matching dowser results.

That's the solution of this issue, do you agree ?

hoplaaa commented 2 years ago

That said, we can try to understand why there were not enough candidates.

Zeroth: if the node is new, it has no gossip_store, and has to download the whole gossip map. This takes a surprisingly long time (2 to 3 days for a full map), since a lot of implementations limit how much of the map you can get from them per unit time (because gossiping is a bandwidth hog). Since the main channel finder at the start is ChannelFinderByPopularity, and that requires looking at the map, it cannot find a lot of good candidates at the start. Worse, there is no standard order in which you are given the map, so you might, by chance, be given only nodes that are not very popular, but because those nodes are all you know, selecting the most popular ones among those still selects some nodes that are not at all popular. You might also get told of a popular node and only one or two of its channels, with the rest of its channels still to be sent later on but the sending gets delayed because of the bandwidth limiting each node imposes, and since you only know it has one or two channels, you might misjudge it as unpopular.

For the 0th (starting clboss on a new node) and 3rd (starting clboss on an existing node), being more patient before creating channels would help improve the quality of the candidate list. "ChannelCreator: " can wait for 24h of clboss uptime before being allowed to run. May I open a dedicated issue for that subject?

For the 1st (2/3 of candidates are fake), may I open a dedicated issue for that subject?

For the 2nd (a creation failure removes the best candidates from the list), there is no real problem here. If I understand correctly, all candidates on the list can be considered good enough to be added. What happened in my case is that there were no candidates.

hoplaaa commented 2 years ago

In my case max_amount seems slightly too big, 5M sounds better than 16M.

max_amount here is motivated more by the fact that most nodes (at the time the code was written) were not wumbo-enabled yet; the 16777215sats is the pre-wumbo limit.

Ok for now, that's not a priority.

ZmnSCPxj commented 2 years ago

Yes, please open new issues.

ZmnSCPxj commented 2 years ago

"ChannelCreator: " can wait for 24h of clboss uptime before being allowed to run

Do you mean continuous uptime? What uptime? Just CLBOSS running or it has to have an Internet connection? Does it have to have connected peers continuously?

ZmnSCPxj commented 2 years ago

I think it's better to keep the funds onchain until we have a full clean batch of candidates. By clean I mean with sizing matching dowser results.

That's the solution of this issue, do you agree ?

It seems better to me to be able to connect to the network quickly, so that we can start using ChannelFinderByEarnedFee and others?

For this particular issue the root is the pattern of how the funds were given, which #82 fixes. Giving candidates far more than their dowsing results seems a different issue.

hoplaaa commented 2 years ago

I think it's better to keep the funds onchain until we have a full clean batch of candidates. By clean I mean with sizing matching dowser results.

That's the solution of this issue, do you agree ?

It seems better to me to be able to connect to the network quickly, so that we can start using ChannelFinderByEarnedFee and others?

I may have confused you with my first message, but please do not focus on the initialization of the node. My issue didn't happen in the first days of the node. The node was already more than 1 month old and had more than 20 channels. ChannelFinderByEarnedFee had been called 18 times before that event. The issue I'm trying to point out is that 2 big channels were created to 2 small peers about 1 month after the node creation.

ZmnSCPxj commented 2 years ago

The issue I'm trying to point out is that 2 big channels were created to 2 small peers about 1 month after the node creation.

Then please file a separate issue for this particular bit.