Node corrupt after power outage

leilerg commented 5 years ago

Hi, yesterday I suffered a power outage on my bitcoin/lightning node, and after that was resolved clightning could not start again. Was using v0.7.2 on a RasPi3. As such, reproducing may be hard...

I tried rebooting, restarting clightning separately, uninstalling and re-installing clightning, and building v0.7.1 to downgrade; none of it worked. Something is preventing the node from starting, so my config tries to restart it indefinitely.

The debug info (log-level=info) is pretty obscure to me:

2019-09-22T00:18:26.964Z **BROKEN** lightning_gossipd(9250): gossip_store_compact: bad version
2019-09-22T00:18:26.965Z UNUSUAL lightning_gossipd(9250): Gossip store version 0 not 7: removing
2019-09-22T00:18:26.965Z INFO lightning_gossipd(9250): We seem to be missing gossip messages

This is all I get, over and over. Or better, what I used to get. Eventually it stopped producing all three lines and is now only logging the last one, We seem to be missing gossip messages.

hsm_secret is in place, with timestamp when I first set the node up, about a year ago.

lightningd.sqlite3 is also in place, but its timestamp is from whenever I last tried running the node, so recent.

Any ideas what I could do to restore the node? Should I try removing (with a local backup) the two files above and restarting? I'm not too sure what is crucial and what isn't to avoid losing funds, so I haven't touched any of that so far. If that doesn't work, how could I just recover the funds? (Though the problem remains: I still want to run lightning.)

ZmnSCPxj commented 5 years ago

Is there a gossip_store that is valid?

Your filesystem might be broken. Could you run fsck on the filesystem holding your lightningdir?

leilerg commented 5 years ago

There is a gossip_store, don't know if valid... how can I tell? For me it's a ~1kb file, empty... cat gossip_store returns nothing. It's timestamped at whenever I try running c-lightning.

There may be a problem with the filesystem... but I'm stuck in my attempts to resolve it. I'm running the RasPi with / on an external hdd. Both bitcoind and lightningd are on the same disk, so cannot unmount to fsck. Just running sudo fsck gives me:

e2fsck 1.43.3 (04-Sep-2016)
/dev/mmcblk0p2: clean, 36796/913920 files, 303311/3877760 blocks
e2fsck 1.43.3 (04-Sep-2016)
/dev/sda1 is mounted.
e2fsck: Cannot continue, aborting.

fsck.fat 3.0.27 (2014-11-12)
0x41: Dirty bit is set. Fs was not properly unmounted and some data may be corrupt.
1) Remove dirty bit
2) No action
?

If I press 1, I get:

Leaving filesystem unchanged.
/dev/mmcblk0p1: 146 files, 42839/83951 clusters

If I press 2 instead, I get:

There are differences between boot sector and its backup.
This is mostly harmless. Differences: (offset:original/backup)
  65:01/00
1) Copy original to backup
2) Copy backup to original
3) No action

No matter which option I pick here, I end up with the same message as above, the one I get after choosing option 1 to remove the dirty bit.

Tried running fsck on boot via sudo touch /forcefsck, and I think it runs (not sure cos' I'm headless...) but doesn't fix anything.

Running just sudo fsck /dev/sda1 complains because the disk is mounted, and I cannot unmount it.

Guess the question now is how to force a file system check and fix on boot? I googled, but the suggestions are basically the ones I did above already. Any other ideas?

ZmnSCPxj commented 5 years ago

If it is truly empty, xxd gossip_store will print nothing; otherwise it will hexdump the actual data in it. If the gossip_store is, for example, filled with 0x00 bytes, cat will appear to print nothing at all, but xxd will reveal the truth. (Incidentally, this also helps when tools mysteriously complain that an input file is bad even though it looks correctly formatted in a text editor. I once hit a weird case where a compiler kept complaining about invalid characters in a source code file, and no amount of staring at the code in a text editor revealed any problem. It turned out a Unicode byte-order mark had been inserted at the start of the file: the text editor hid it because it understands Unicode, but the compiler was absolutely flummoxed by it.)
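For anyone without xxd at hand, the same distinction (truly empty, all zero bytes, or real data) can be sketched in a few lines of Python. This is only an illustration: the function name `describe_bytes` is made up here, and the sketch knows nothing about the gossip_store format beyond reading raw bytes.

```python
from pathlib import Path


def describe_bytes(path):
    """Say whether a file is empty, all zero bytes, or holds other data."""
    data = Path(path).read_bytes()
    if not data:
        return "empty"
    if all(b == 0 for b in data):
        return f"{len(data)} zero bytes"
    # Show the first bytes in hex, similar to the left column of xxd.
    return f"{len(data)} bytes, starts with {data[:16].hex(' ')}"
```

Calling `describe_bytes("gossip_store")` from inside the lightning directory gives a quick answer without printing raw bytes to the terminal.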

You could try rm gossip_store, or mv gossip_store gossip_store.back.20190923, then rerun. If it persists, you may need to do more drastic filesystem checks with the filesystem unmounted, meaning booting on an alternate boot device, or extracting your boot device from the computer and mounting it on a reliable computer. fsck -ck would scan for bad sectors, for example.

leilerg commented 5 years ago

Learning new stuff with every comment... super appreciated! xxd gossip_store gives me

0000000: 07                                       .

including the odd . at the end.

In the debug message above it was saying gossip store version 0 not 7, but now it seems to be 7 to me. Maybe that's why I'm not getting the same debug info anymore?

I suspect you're right, will need to fsck with the disk attached to my laptop... hopefully that does something!

ZmnSCPxj commented 5 years ago

This seems to be a valid gossip_store actually, though maybe @rustyrussell can confirm. Is it still crashing afterwards? Maybe startup is just slow? It has to recover the gossip_store by redownloading gossip from the network after all, so maybe slow startup only?

leilerg commented 5 years ago

OK, looks like my drive is fairly corrupt overall. It suffers from some major problems, not just bad blocks. Possibly even the control board is affected, since the drive doesn't seem to power up properly. You can hear the platters trying to spin up, but struggling. Not sure why.

I did manage to power it on and save the contents of the .lightning directory of my clightning user. Is this enough to restore the node on a different machine? If I just copy everything and rebuild/set up a new node, will this be enough?

ZmnSCPxj commented 5 years ago

You only really need the db file lightningd.sqlite3 and the private keys hsm_secret, though do note that if your drive is in really bad shape both may already be corrupted.
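Before moving those two files to a new machine, it may be worth a quick sanity check on the copies. The sketch below is a minimal, read-only check and not a c-lightning tool: `check_backup` is a hypothetical helper, it only verifies that hsm_secret exists and is non-empty, and it relies on SQLite's generic `PRAGMA integrity_check` rather than on any knowledge of the lightningd schema. Passing it does not prove the channel state is current.

```python
import sqlite3
from pathlib import Path


def check_backup(lightning_dir):
    """Return a list of problems found in the two essential files."""
    base = Path(lightning_dir)
    problems = []

    secret = base / "hsm_secret"
    if not secret.is_file():
        problems.append("hsm_secret is missing")
    elif secret.stat().st_size == 0:
        problems.append("hsm_secret is empty")

    db = base / "lightningd.sqlite3"
    if not db.is_file():
        problems.append("lightningd.sqlite3 is missing")
        return problems

    try:
        # Open read-only so the check cannot modify the copy.
        conn = sqlite3.connect(db.resolve().as_uri() + "?mode=ro", uri=True)
        try:
            rows = conn.execute("PRAGMA integrity_check").fetchall()
        finally:
            conn.close()
        if rows != [("ok",)]:
            problems.append(f"integrity_check reported: {rows}")
    except sqlite3.DatabaseError as exc:
        problems.append(f"lightningd.sqlite3 is unreadable: {exc}")
    return problems
```

An empty list means only that both files are present and SQLite considers the database structurally sound.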

leilerg commented 4 years ago

Hi again, so I finally rebuilt my node and got clightning running as well, I connected to another node. I tried to import (i.e. copy/paste) the hsm_secret and lightningd.sqlite3 files into my .lightning but it didn't work. Error I got was

wallet_blocks_rollback: FOREIGN KEY constraint failed

Note, I did get clightning to run before doing it, I connected to another node successfully.

I then thought it may be a problem with the files (corrupted?), so I tried doing the same with two new files, the ones which were generated automatically by clightning. (I backed them up before deleting them.) Got the same message. I also deleted everything from .lightning, keeping just the two files; still the same error.

Question now is, how do I restore my LN node with just the two files above?
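To see which rows are behind a FOREIGN KEY constraint failed error, SQLite itself can list them. The following is a minimal sketch that opens a copy of the database read-only and runs the generic `PRAGMA foreign_key_check`. The helper name `foreign_key_violations` is an assumption for illustration, and nothing here depends on or repairs the actual lightningd schema.

```python
import sqlite3
from pathlib import Path


def foreign_key_violations(db_path):
    """Return SQLite's report rows: (table, rowid, parent table, fk index)."""
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)  # read-only: nothing is changed
    try:
        return conn.execute("PRAGMA foreign_key_check").fetchall()
    finally:
        conn.close()
```

An empty result means SQLite found no dangling references. A non-empty one names the tables involved, which is useful detail to include in a bug report. Run it on a copy of lightningd.sqlite3, never on the only one.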

gdistasi commented 4 years ago

How much did you lose?

darosior commented 4 years ago

Hi @leilerg, were you able to recover your database? If not, have your channels been closed (funding transaction spent)?

leilerg commented 4 years ago

I did manage to set up another node, and tried to run it by simply copying lightningd.sqlite3 and hsm_secret over, but that didn't seem to do the trick. It then dawned on me that just doing that could trigger a penalty channel close, so I didn't insist. Now I'm waiting for my peers to close the channels since the node is inactive, and has been inactive for ~1 month+ now. But doesn't seem like people are monitoring channels too actively...

Any suggestions on how I can go about closing those channels? I found this to retrieve my funds, but the channels must be closed first, no? @ZmnSCPxj

gdistasi commented 4 years ago

That's exactly why the "static channel backup" approach used elsewhere is not OK. Your peer could remain inactive forever and you would never recover your funds (because you don't have the most recent state).

leilerg commented 4 years ago

Well, at this point I don't really have much of an option, so I'll wait. I didn't "put 4BTC in LN", either, so it's not a big deal for me to wait.

I'm monitoring my node using 1ml.com (any other ways?), and some channels have been closed. I guess people will eventually close them.

cdecker commented 4 years ago

Could be that a newer version of c-lightning can recover the database (we added a couple of DB connection startup options that should ensure foreign key constraints don't break). If that's the case I'd suggest starting the node with the --offline option, which will allow you to monitor the state of the channels, but the node will not accept incoming connections, or open outgoing connections, so it'll just make sure to extract the funds from the closes and track them until they are buried. At that point you can decommission the node after withdrawing all the funds on the node :-)

mandelbit commented 3 years ago

I had a similar catastrophic node failure involving a power outage and simultaneous RAID 1 failure.

I spent some time researching the process for recovering funds from a c-lightning node when all is lost but the hsm_secret and found it fairly time-consuming to piece the information together, so I wrote a quick guide that makes use of lots of great resources and docs written by the many legends of lightning active on here. My hope is that it might save someone a bit of time/research in the future (and also act as a reference for myself in case something similar happens in the future and I forget something). Hopefully it may be of use to some people that land here.

I have found that in general there is quite a lot less information/resources/discussion (besides the Lightning Docs) on subjects like this for c-lightning in comparison to LND.

It is primarily intended as a pragmatic guide to recovery in practice, as opposed to a complete theoretical/technical treatment of the recovery process.

cdecker commented 3 years ago

Awesome, thanks @mandelbit for taking the time to write this up. You're right that there are fewer resources than LND, but then again we also have a smaller user-base :-) It's people like you going the extra mile that allows us to improve the situation. So just wanted to thank you for this ^^

mandelbit commented 3 years ago

Not at all, the least I can do after benefiting from all of your work. Huge thanks to you guys for creating such a great piece of software. Very excited about harnessing this stack, and contributing wherever I can.

ElementsProject / lightning

Node corrupt after power outage #3083