ElementsProject / lightning

Core Lightning — Lightning Network implementation focusing on spec compliance and performance

Pruning problem #5310

Closed kroese closed 5 months ago

kroese commented 2 years ago

I did a fresh installation of bitcoind with pruning set to 100 GB. I waited until it was completely synced. Then I installed CLN and connected it to the bitcoind node.

The problem is that for the last hour it keeps requesting the same block from 2018 every second in a loop:

UNUSUAL plugin-bcli: /usr/bin/bitcoin-cli -datadir=/data/.bitcoin -rpcconnect=172.17.0.2 -rpcport=8332 -rpcuser=... -rpcpassword=... getblock 00000000000000000005f7a06bd4efe545999aba00eeff9a49747a3cd1f3c9df 0 exited with status 1

I don't understand why it keeps on trying this same block, since it should realize it's not available after trying only once. I think it heard of this block through channel gossip, since I don't have any channels yet myself.

Besides the problem with CLN getting stuck on this block, I think I will have another problem.

Namely that my graph will miss all channels created more than a year ago? I thought a pruned node would be fully functional, but if I miss all the old channels it is a big downside.

So is my mistake that I should have already started CLN while Bitcoin was still syncing the chain? That way CLN would have had access to the blocks from 2018 that are now pruned. Or is there no solution?

getinfo output

{
   "id": "xxxx",
   "alias": "xxxx",
   "color": "xxxxxx",
   "num_peers": 1,
   "num_pending_channels": 0,
   "num_active_channels": 0,
   "num_inactive_channels": 0,
   "address": [
      {
         "type": "ipv4",
         "address": "xx.xx.xx.xx",
         "port": 9760
      }
   ],
   "binding": [
      {
         "type": "ipv4",
         "address": "0.0.0.0",
         "port": 9735
      }
   ],
   "version": "v0.10.2",
   "blockheight": 739674,
   "network": "bitcoin",
   "msatoshi_fees_collected": 0,
   "fees_collected_msat": "0msat",
   "lightning-dir": "/data/.lightning/bitcoin"
}
vincenzopalazzo commented 2 years ago

Can you check whether your bitcoin instance has the block 00000000000000000005f7a06bd4efe545999aba00eeff9a49747a3cd1f3c9df? For pruning we have better alternatives like https://github.com/clightning4j/btcli4j or the other backends listed in https://github.com/lightningd/plugins

In addition, I think the two problems you have are related. In particular, I think the blockchain grew by more than 100 GB in the last year.

kroese commented 2 years ago

@vincenzopalazzo No, this block is from 2018 and I only have blocks from the last year.

I am just trying to understand:

I would rather not switch to another backend. I thought pruning was fully supported as long as you make sure C Lightning does not get behind too far.

kroese commented 2 years ago

I did some more research and it seems indeed that the mistake was to wait until the IBD was completed, before starting C-Lightning. I should have let them run together while syncing. But this introduces other problems as C-Lightning syncs slower than Bitcoind and can get behind too far.

The best solution would be if C-Lightning just implemented the getblockfrompeer RPC call that was recently added to Bitcoin.

So now my only option is to connect C-Lightning to an external full node (without pruning) to let it validate all the channels in the graph.

That leads me to the final question:

Is it safe to switch C-Lightning from the unpruned node back to the pruned node after it has validated all the channels? And how do I know it has finished validating every channel in the graph, so that I have the guarantee it will never need an old block again?

jb55 commented 2 years ago

The best solution would be if C-Lightning just implemented the getblockfrompeer RPC call that was recently added to Bitcoin

this is an interesting idea as a fallback to getblock. cln on pruned nodes has always been a huge pain.

vincenzopalazzo commented 2 years ago

this is an interesting idea as a fallback to getblock. cln on pruned nodes has always been a huge pain.

I am working on translating it into a compiled language (really compiled)

vincenzopalazzo commented 2 years ago

Is it safe to switch C-Lightning from the unpruned node back to the pruned node after it has validated all the channels? And how do I know it has finished validating every channel in the graph, so that I have the guarantee it will never need an old block again?

I think that if you have old channels you need to verify them, so a 10-year-old channel could be a problem. However, I'm not 100% sure about that.

cc @cdecker

kristapsk commented 2 years ago

So is my mistake that I should have already started CLN while Bitcoin was still syncing the chain? That way CLN would have had access to the blocks from 2018 that are now pruned. Or is there no solution?

My approach is to start bitcoind and then start CLN while bitcoind is still syncing. I noticed that bitcoind prunes faster than CLN processes blocks, so I also keep this script constantly running as a workaround: https://github.com/kristapsk/cln-scripts/blob/master/cln-prune-protector.sh. It temporarily disables bitcoind network activity if CLN is falling too far behind. It has locking, so just add this line to crontab:

* * * * * /home/cln/cln-prune-protector.sh 10000 >> /home/cln/cln-prune-protector.log 2>&1
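The idea behind such a protector can be sketched in a few lines of Python. This is not the linked script, only a minimal illustration under assumptions: `max_lag` and `set_network_active` are hypothetical names, the two heights are assumed to come from `bitcoin-cli getblockcount` and the `blockheight` field of `lightning-cli getinfo`, and the callback is assumed to wrap `bitcoin-cli setnetworkactive true|false`.

```python
from typing import Callable


def protect_from_pruning(
    bitcoind_height: int,
    cln_height: int,
    max_lag: int,
    set_network_active: Callable[[bool], None],
) -> bool:
    """Pause bitcoind's P2P activity while CLN lags more than max_lag blocks.

    Returns the network state that was requested (True = active).
    """
    lag = bitcoind_height - cln_height
    # With the network paused, bitcoind stops downloading (and therefore
    # pruning) new blocks, which gives CLN time to catch up.
    active = lag <= max_lag
    set_network_active(active)
    return active
```

Run once a minute (as the crontab line does for the real script), this keeps toggling the network until the lag drops back under the threshold.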

The best solution would be if C-Lightning just implemented the getblockfrompeer RPC call that was recently added to Bitcoin.

Kinda sounds right, but from my experience it will make CLN sync a lot slower, as for most of the sync time it will ask for every block that way.

jb55 commented 2 years ago

slow is better than broken. there have been ideas thrown around in the past about using keep-blocks, but then you run into disk space back pressure and might run out of space. I see your script is turning the network on and off... seems a bit extreme but it's an interesting approach.

kroese commented 2 years ago

@kristapsk Yes, I saw your script and really liked it. But since I am running both Bitcoin and C-Lightning in separate docker containers, I would need to heavily modify the script to be able to use it from the host machine.

Also, I am not sure the script makes the process 100% watertight, because that would require a guarantee that CLN received all channel gossip before reaching the related blocks. If it receives an additional old channel after that, it will still fail to get the block. I don't know if there is a way to be sure that you have received all gossip about every channel ever created. And even if there is, there is always the possibility that someone broadcasts a new channel with a very old funding transaction.

kristapsk commented 2 years ago

I would need to heavily modify the script to be able to use it from the host machine.

Not sure about that. What you need is both bitcoin-cli and lightning-cli working in the CLN container. And CLN itself depends on a working bitcoin-cli, right?

The script was actively turning the network on and off during IBD; afterwards it hasn't turned it off (but it would if, for example, the CLN service were not running). I have prune=20000 in bitcoin.conf on the specific VPS where I use it.

wtogami commented 2 years ago

prune=anynumber is unsafe with CLN for reasons you identified above.

$ bitcoin-cli help pruneblockchain
pruneblockchain height

Arguments:
1. height    (numeric, required) The block height to prune up to. May be set to a discrete height, or to a UNIX epoch time
             to prune blocks whose block time is at least 2 hours older than the provided timestamp.

Result:
n    (numeric) Height of the last block pruned

Examples:
> bitcoin-cli pruneblockchain 1000
> curl --user myusername --data-binary '{"jsonrpc": "1.0", "id": "curltest", "method": "pruneblockchain", "params": [1000]}' -H 'content-type: text/plain;' http://127.0.0.1:8332/

The dependent app (in this case CLN) should instead be driving bitcoind's pruning with this RPC. With CLN in control of pruning you are never at risk of bitcoind pruning too far ahead.
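A minimal sketch of what "CLN driving the pruning" could look like, assuming bitcoind runs in manual pruning mode (prune=1) so that it does not prune on its own. The names `rpc` and `keep_blocks` are hypothetical: `rpc` stands for any callable that forwards a method name and parameters to bitcoind, and `keep_blocks` is a safety margin chosen by the operator.

```python
from typing import Any, Callable


def prune_behind_cln(
    cln_height: int,
    keep_blocks: int,
    rpc: Callable[..., Any],
) -> int:
    """Prune bitcoind only up to a height that CLN has already processed.

    Returns the height passed to pruneblockchain, or 0 if nothing was pruned.
    """
    target = max(0, cln_height - keep_blocks)
    if target > 0:
        # pruneblockchain takes the block height to prune up to.
        rpc("pruneblockchain", target)
    return target
```

Because the target is derived from CLN's own block height, bitcoind can never prune ahead of what CLN has processed. It does not help with gossip that refers to blocks older than the retained window.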

kroese commented 2 years ago

You are right. But even though I made the mistake of letting Bitcoin sync first, it is still a bug that CLN tried to request the same block for hours in a loop.

It would have made much more sense to skip the blocks and ignore the related channels, instead of going into a death loop.

ghost commented 2 years ago

I've been running in pruned mode successfully, but periodically it hits this bug. It's weird, and it possibly appears because of malicious gossip, because it always references a block from years ago, when lightning channels were only a glimmer in a nerd's eye.

I've found a workaround, because on its own it seems to get stuck in a loop requesting a block that doesn't exist, and all the other node activity slows down. There are a couple of plugins that are meant to make running on a pruned node more reliable*. Although I have never been able to get btcli4j configured to sync properly (it seems unable to fetch blocks), just starting up clightning with that plugin clears the queue for fetching that block and then allows me to start up normally again.

wtogami commented 2 years ago

Sounds like a redundant fallback lookup for old blocks would be a perfect plugin.

kristapsk commented 2 years ago

https://github.com/clightning4j/btcli4j/tree/ecacb049d41e2282c5595e84a6f9db6a601c3bc3

I get "This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository."

ghost commented 2 years ago

https://github.com/clightning4j/btcli4j/tree/ecacb049d41e2282c5595e84a6f9db6a601c3bc3

I get "This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository."

It is outside this repository - it is from the list of community plugins : https://github.com/lightningd/plugins

vincenzopalazzo commented 2 years ago

@kristapsk @AutonomousOrganization Just use the master branch https://github.com/clightning4j/btcli4j

There are a couple of plugins that are meant to make running on a pruned node more reliable*. Although I have never been able to get btcli4j configured to sync properly (it seems unable to fetch blocks)

I put all my effort into keeping my tool alive and maintained, but I cannot guess the bugs that people run into. If you open an issue, I can help you configure it.

Disclaimer: there isn't really any configuration :) just a flag to run in pruning mode

bubelov commented 1 year ago

Just noticed the same issue on my new pruned node. Is there a reason why the official docs on pruning don't mention this issue? It looks like a common situation with non-negligible negative consequences.

Is it considered a bug or wontfix? Is it safe to ignore it, assuming bitcoind and lightningd agree on a current block height and it's up-to-date?

djmuhlestein commented 1 year ago

Just to add to the last comment... most comments in this thread suggest you should start lightningd when you start bitcoind. Is starting a lightning node never allowed for someone that has already got a bitcoin client up and running? I tried starting both daemons at the same time but ran into the issue anyway for reasons I think have already been discussed. Additionally though, for running a pruned node I found I could just download a prune snapshot and start from that rather than waiting to sync the entire blockchain. For both cases, it seems lightningd needs to work around this issue.

kroese commented 1 year ago

One year has passed, and this issue is still not fixed :(

Every few weeks I still run into this endless loop of getblock calls, and restarting does not fix it. I have to point clightning to a non-pruned node to fetch that block, and revert back to using the pruning node immediately afterwards.

I think what happens is that sometimes it hears about a very old block through gossip, which triggers the endless loop.

It would be so easy to fix this: just ignore blocks that fail to fetch after X tries. Or otherwise add an option where you can specify the maximum block age, and don't even try to fetch older blocks. Or the best solution: use the getblockfrompeer RPC call to automatically fetch the missing block from a peer when getblock fails.

@vincenzopalazzo @rustyrussell @cdecker Can one of you please look into this? Any of these three solutions is simple to implement and would solve this issue.
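The first suggestion (give up after X tries) can be illustrated with a small Python sketch. It is not CLN code; `BoundedBlockFetcher`, `fetch` and `max_tries` are hypothetical names, and `fetch` is assumed to return None whenever the underlying getblock call fails.

```python
from typing import Callable, Optional, Set


class BoundedBlockFetcher:
    """Give up on a block after max_tries failed fetches instead of looping forever."""

    def __init__(self, fetch: Callable[[str], Optional[bytes]], max_tries: int = 3):
        self.fetch = fetch                # hypothetical wrapper around getblock
        self.max_tries = max_tries
        self.given_up: Set[str] = set()   # hashes that will not be requested again

    def get(self, block_hash: str) -> Optional[bytes]:
        if block_hash in self.given_up:
            return None                   # already known to be unavailable
        for _ in range(self.max_tries):
            block = self.fetch(block_hash)
            if block is not None:
                return block
        # Remember the failure so the caller can skip the related channel.
        self.given_up.add(block_hash)
        return None
```

A caller that gets None back would ignore the channel announcement that needed the block, rather than retrying every second.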

vincenzopalazzo commented 1 year ago

Or the best solution: use the getblockfrompeer RPC call to automatically fetch the missing block from a peer when getblock fails.

I will look into this, thanks

Can one of you please look into this? Any of these three solutions is simple to implement and would solve this issue.

I will!

kroese commented 1 year ago

The problem with this RPC call is that you have to specify the index of the peer (for example the first peer) and you cannot say that you want it from ANY peer (in case the first peer is also pruned just like yourself). So either you have to implement logic to try all peers, or gamble that your first peer is not pruned.

But even if it just tries the first peer, I would be happy already, because in 90 percent of the cases it will work fine.

EDIT: I added a feature request ( https://github.com/bitcoin/bitcoin/issues/27652 ) to make this possible, but until that is implemented just trying the first peer would be fine.
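The "try all peers" logic could look roughly like the Python sketch below. It is an illustration under assumptions, not an existing implementation: `rpc` is a hypothetical callable that forwards a method name and parameters to bitcoind and raises RuntimeError when the call fails. The sketch relies on the `id` and `servicesnames` fields returned by getpeerinfo and skips peers that do not advertise the NETWORK service, since those are likely pruned themselves.

```python
from typing import Any, Callable, Optional


def request_block_from_any_peer(
    block_hash: str,
    rpc: Callable[..., Any],
) -> Optional[int]:
    """Ask peers one by one for a block until one request is accepted.

    Returns the id of the peer that accepted the request, or None.
    """
    for peer in rpc("getpeerinfo"):
        # Peers without NETWORK only serve recent blocks, so skip them.
        if "NETWORK" not in peer.get("servicesnames", []):
            continue
        try:
            rpc("getblockfrompeer", block_hash, peer["id"])
        except RuntimeError:
            continue  # this peer could not be asked, try the next one
        return peer["id"]
    return None
```

An accepted request only means the fetch was scheduled; the caller would still need to retry getblock afterwards to see whether the block actually arrived.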

vincenzopalazzo commented 1 year ago

My intention is to add something experimental to the plugin https://github.com/coffee-tools/folgore. Once we reach a consensus, we can try to integrate it with CLN. The plugin is a good place to experiment.

The original idea is to completely bypass Bitcoin Core if the block is out of range, and fetch the block directly from the network.

kroese commented 1 year ago

There are already multiple plugins that can work around this issue (like your own btcli4j, for example), but I use CLN via a prebuilt Docker container, so I cannot install any plugins.

So my hope was that the issue could be fixed in CLN itself, not by using a different backend through a plugin.

Because it is basically a possible DoS attack: someone can deliberately send me a channel funding message that refers to a very old block and send my node into an endless loop. Switching to a different backend is more like avoiding the problem than fixing it.

kroese commented 1 year ago

Ran into this issue again today, really getting tired of it..

I really don't understand why this won't get fixed:

And I really appreciate that @vincenzopalazzo is willing to look into this, but bypassing Bitcoin Core via a plugin seems to be complete overkill.

bubelov commented 1 year ago

Yep, pruned mode shouldn't be advertised if it isn't working, it gives the users wrong expectations and breaks their nodes

benjaminchodroff commented 1 year ago

I worked around this issue with CLN using a pruned bitcoind in a Docker environment by using btc-rpc-proxy, which is available as the Docker image blockstream/btc-rpc-proxy:latest.

I have exposed the btc-rpc-proxy Docker port 8331 and mounted a config directory at /data with the following config.toml in it:

bitcoind_user = "hello"
bitcoind_password = "world"
bind_address = "0.0.0.0"
bind_port = 8331
bitcoind_address = "192.168.1.160"
bitcoind_port = 8332

[user.clnuser]
password = "clnpassword"
allowed_calls = [
    "createrawtransaction",
    "decoderawtransaction",
    "decodescript",
    "echo",
    "estimatefee",
    "estimatepriority",
    "estimatesmartfee",
    "estimatesmartpriority",
    "getbestblockhash",
    "getblock",
    "getblockchaininfo",
    "getblockcount",
    "getblockhash",
    "getblockheader",
    "getchaintips",
    "getdifficulty",
    "getinfo",
    "getmempoolinfo",
    "getnetworkinfo",
    "getrawmempool",
    "getrawtransaction",
    "gettxout",
    "gettxoutproof",
    "gettxoutsetinfo",
    "sendrawtransaction",
    "verifytxoutproof"
]

You can then update the CLN to point to this proxy on 8331 with the clnuser and clnpassword, and it should work with your pruned node while pulling blocks p2p when required.

bubelov commented 5 months ago

Is the bcli plugin active by default?

If so, shall we close this issue due to https://github.com/ElementsProject/lightning/pull/7240 being merged?

vincenzopalazzo commented 5 months ago

Correct @bubelov

Fixed by https://github.com/ElementsProject/lightning/pull/7240