Chia-Network / chia-blockchain

Chia blockchain python implementation (full node, farmer, harvester, timelord, and wallet)
Apache License 2.0

[BUG] `pragma synchronous=OFF` causes database corruption on crash #8715

Closed stonemelody closed 2 years ago

stonemelody commented 3 years ago

Version 1.2.8/9 began using pragma synchronous=OFF every time a connection to the sqlite database is opened. This is unsafe if the machine crashes and will result in users having to resync the blockchain/wallet from scratch. This can cause users to lose days of farming time while they wait for the database to resync, as well as waste network bandwidth and possibly eat into a capped data plan. The sqlite pragma documentation details why this is unsafe (sorry, you'll have to search for "synchronous" as I can't link directly to it). Namely, with synchronous=OFF sqlite does not wait for the kernel to confirm that data has been persisted to stable storage, only that it has been handed off to the kernel.

To Reproduce

  1. launch any chia component that writes data to the sqlite database (full node, wallet), preferably something that needs to sync
  2. wait for chia to begin syncing data
  3. randomly unplug/hard shutdown the machine without waiting for chia to properly stop
  4. restart chia, database may be reported as corrupted

As the above depends on the precise timing of when the machine was powered off/crashed relative to I/O submitted by sqlite, this may require several attempts to see a corrupted database.

Expected behavior

Database should not become corrupted if the machine crashes. Users should not have to resync the entire blockchain and their wallet if their machine crashes.

Desktop

Any machine that writes data to any sort of storage device

Additional context

While this doesn't directly explain the numerous database corruption issues people have seen recently (including some in #8694, as those were not caused by a computer crash), there's a good chance that this, either alone or combined with some transaction management issue, could cause database corruption if chia is stopped improperly.

The changelog for 1.2.8 indicates that this pragma change was added to improve disk performance. If faster transaction commits are desired, multiple operations should be batched into a single transaction instead of completely disabling the crash consistency mechanism of the database. Since sqlite is run in WAL mode, sqlite could be set to use pragma synchronous=NORMAL, which the documentation indicates would provide sufficient crash consistency while still being faster than pragma synchronous=FULL.
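For illustration, opening the connection with WAL plus synchronous=NORMAL would look something like this (a minimal sqlite3 sketch, not the actual chia connection code or database path):

```python
import sqlite3

# Minimal sketch, not the actual chia code: open a connection in WAL mode
# with synchronous=NORMAL. In WAL mode, NORMAL still syncs the log at the
# critical moments, so a power loss may drop the most recent commits but
# should not corrupt the database file; OFF skips those syncs entirely.
conn = sqlite3.connect("blockchain.sqlite")  # illustrative path
conn.execute("pragma journal_mode=wal")
conn.execute("pragma synchronous=NORMAL")    # instead of synchronous=OFF
```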

stonemelody commented 3 years ago

@bdowne01 has also had some personal experience related to blockchain db corruption that is likely related to this

bdowne01 commented 3 years ago

I should note that a physical power issue is only one possibility; a SIGKILL on the full_node process can cause this too. The corruption on my end was caused by a kill -9 while debugging the chia_full_node proc for other reasons.

jack60612 commented 3 years ago

Listen, I unplugged my chia node drive for QA testing and my db was fine.

bdowne01 commented 3 years ago

Listen, I unplugged my chia node drive for QA testing and my db was fine.

Unplugging the drive will cause the OS kernel to do a hot-remove on SATA/SAS buses. Not the same as a power crash.

jack60612 commented 3 years ago

Listen, I unplugged my chia node drive for QA testing and my db was fine.

Unplugging the drive will cause the OS kernel to do a hot-remove on SATA/SAS buses. Not the same as a power crash.

Let me clarify what I meant: I killed the power to the drive, as it was powered separately.

jack60612 commented 3 years ago

A corrupted db recovery system is coming soon TM anyway

bdowne01 commented 3 years ago

I killed the power to the drive as it was powered separately

Thanks for the clarification. To replicate this issue, you must kill power to the machine running the operating system kernel, not the drive.

jack60612 commented 3 years ago

From the testing and careful deliberation that we did, they decided that having it set to OFF is the best course of action, due to the insane performance improvements.

stonemelody commented 3 years ago

Saying that this was done purely for performance reasons and then ignoring or writing off all the problems it is causing is frankly disappointing. I know that the chia team has been working hard to improve performance, but that doesn't mean that it's alright to throw caution to the wind. As a software developer as well, I know that writing good, performant code is difficult, especially with a codebase this size. As a software developer who has done lots of work with crash consistency and who understands quite well the performance tradeoffs that crash consistency causes, I really can't buy the argument that an insane performance increase is worth a decent chance of corrupting the entire database if a crash happens. Sure, running anything without crash consistency will make it faster. But that doesn't mean that you should run all of your computers without ext4 journaling or the equivalent, because otherwise you're going to have lots of problems.

Apart from that, this update unfortunately also has a good possibility of excluding users who do not have stable power delivery as they cannot guarantee that their machine will not crash at the wrong time. They will be stuck in an endless loop of trying to sync the blockchain fresh, their machine losing power and crashing, and then having to start over once again.

If the chia devs would still like to allow running in the new sqlite mode, adding a config flag that defaults to a crash-consistent pragma statement would suffice. That would make sure the majority of users didn't have to deal with random database corruption, while those who know what they're doing or want the best performance could change it. If the chia devs don't have time for this, then I'll find the time to make a PR for it myself.
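To make the suggestion concrete, the flag could be as simple as something like this (a sketch only; `db_sync` is a made-up config key here, not an existing chia option):

```python
import sqlite3

# Sketch of a config-driven pragma. "db_sync" is a hypothetical config key;
# the default is the crash-consistent setting.
DB_SYNC_MODES = {"full": "FULL", "normal": "NORMAL", "off": "OFF"}

def open_db(path: str, db_sync: str = "normal") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    conn.execute("pragma journal_mode=wal")
    # Fall back to NORMAL if the config value is unrecognized.
    conn.execute(f"pragma synchronous={DB_SYNC_MODES.get(db_sync, 'NORMAL')}")
    return conn
```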

jack60612 commented 3 years ago

Saying that this was done purely for performance reasons and then ignoring or writing off all the problems it is causing is frankly disappointing. [...] If the chia devs don't have time for this then I'll find the time to make a PR for it myself

https://github.com/Chia-Network/chia-blockchain/pull/8319 for more info

stonemelody commented 3 years ago

I am unsure what you are trying to point me to with that. I already know very well that removing fdatasync and fsync calls will speed up applications because it won't wait for I/O completions. The pragma that you added removes them from the strace output because it literally tells sqlite not to issue them. As I have said, crash consistency is one of the things I am very familiar with and I've spent a lot of time reasoning about the minimal number of sync calls needed to make my own little kv-store crash consistent and efficient.

Sweeping this under the rug with "it will be faster to sync from scratch" or "we will back up the db file" still doesn't fix the problem, and it also ignores the fact that the db file is quite large. A backup would take up an extra 20+ GB now, which means it could easily be hundreds of GB in a few years if no light clients come out and no further improvements are made. A similar argument applies when one thinks about syncing from scratch on a network plan with a data cap.

This also ignores the possibility of sqlite not catching the database corruption, which means that a node could have incorrect data that is not discovered until much, much later, when something goes catastrophically wrong for seemingly unknown reasons. sqlite does not verify that every entry in the entire database is properly formatted, so depending on when power was lost, things can get very strange.

jack60612 commented 3 years ago

I am unsure what you are trying to point me to with that. [...] sqlite does not verify that every entry in the entire database is properly formatted, so depending on when power was lost, things can get very strange

A corrupted db recovery system is coming soon TM anyway

this

stonemelody commented 3 years ago

A corrupted DB recovery system relies on:

  1. being able to actually detect the corruption quickly
  2. sqlite actually allowing one to open a corrupted database file
  3. the corruption being limited only to application data and not affecting sqlite metadata at all

The first point is usually implemented by checksumming the entire file, as that's much quicker than checking each individual entry. The second is up to sqlite; if sqlite does not allow opening a corrupted db file, then the user must sync from scratch. The third is probably never going to be true, because both application data and sqlite metadata must be updated on pretty much every write transaction. There are probably a few exceptions to this, but it is more likely than not that sqlite metadata and/or application data will be corrupted.
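For what it's worth, the most sqlite itself gives you for point 1 is an integrity check, and even that only helps if the file can still be opened at all. A rough sketch:

```python
import sqlite3

def db_looks_intact(path: str) -> bool:
    """Best-effort corruption check; returns False if sqlite reports problems.

    Note: this scans the file (slow on a multi-GB db) and cannot catch
    application data that is well-formed but logically wrong.
    """
    try:
        with sqlite3.connect(path) as conn:
            # quick_check is a cheaper variant of integrity_check that still
            # catches most structural corruption.
            (result,) = conn.execute("pragma quick_check").fetchone()
            return result == "ok"
    except sqlite3.DatabaseError:
        # e.g. "file is not a database" -- the header itself is damaged
        return False
```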

jack60612 commented 3 years ago

And in addition, I'm also not the right guy to talk to, so yep.

bdowne01 commented 3 years ago

Also #8514

That was the PR that switched it to OFF; it was set to NORMAL in the PR jack mentioned.

athena9 commented 3 years ago

I thought it was something I was doing on my end, then found out this is by design? WTF.

emlowe commented 3 years ago

There was much internal debate about this. I'll note that the sqlite docs clearly say that program crashes will not cause corruption in this mode, which was one rationale for including this: if the application running SQLite crashes, the data will be safe - this would include kill -9. We will continue testing, but if you can consistently cause DB corruption with such a kill -9, please let us know.

Yes, we are aware that a power loss in this case could cause corruption. We have been watching our normal channels for corruption issues, and we haven't seen a noticeable increase in reports at this time.

It's entirely possible that we need to move back to NORMAL, but we were enticed by the clear performance wins for OFF.

stonemelody commented 3 years ago

yea, no... as someone who's spent quite a bit of time on crash consistency stuff, turning off things like this is pretty much always more trouble than it's worth. There are so many additional edge and error cases that need to be checked, because you basically can't assume that any of the data you have is correct anymore. One would hope that the data for transaction n - 1 is safe because it's not in transaction n and was committed before transaction n, and the power failure happened while committing transaction n. But if they were both written close to each other (i.e. within the default 5min flush window for Linux), that may actually not be the case...

It's also quite possible that you don't see as many reports because people may reach out to pool chats (assuming they are pooling) instead of the channels that y'all monitor. At least in the pool discord I'm a part of, we know about this, and we'll just tell people the unfortunate news that they need to sync from scratch if their database is corrupted

arvidn commented 3 years ago

I get the impression that there is broad agreement on the basic fact that using synchronous=OFF increases the risk of data loss (corruption of the DB), specifically in the event of a power loss, a yanked or loose USB hard-drive connection, or a kernel panic.

I think it's safe to say that the performance improvement of doing this is worth it where those events are very unlikely (e.g. the DB is on an internal drive, the computer is located in a place where power outages are very rare, and you run a reliable operating system). One important factor here is that the data that may be lost isn't unique. It can be recovered from the network, and so the failure isn't catastrophic.

In situations where those events are more likely, this trade-off may not be worth it. (As the author of that PR, I was a bit surprised it landed so easily; I had expected some push back on making it configurable.) I think it's important to recognize that the likelihood of data loss isn't black and white: even with synchronous=FULL there may be additional caching layers in low-end drives that are not resilient to power loss, but those risks are probably in the margins compared to this main issue.

I think at the very least this needs to be configurable, but I think we can do better. I'm very interested in hearing what people think, especially about what kinds of setups make synchronous=OFF problematic.

The setting could have 3 values, on, off and auto, defaulting to auto. In the auto mode we could have a heuristic to try to make an educated guess, just based on the system we're running on. e.g.

How does that sound?
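To make it concrete, a very rough sketch of the plumbing (the setting names and the heuristic below are placeholders for illustration, not a worked-out proposal):

```python
# Rough sketch only: a three-value db-sync setting. "auto" is where an
# educated guess about the system would go; the heuristic below is a stub.
def choose_synchronous(setting: str) -> str:
    if setting == "on":
        return "NORMAL"   # crash-consistent in WAL mode
    if setting == "off":
        return "OFF"      # fastest, but unsafe on power loss or kernel panic
    # "auto": guess based on the environment, e.g. whether the DB sits on an
    # external/USB drive or the machine is likely to lose power.
    return "NORMAL" if _looks_risky() else "OFF"

def _looks_risky() -> bool:
    # Placeholder heuristic; a real implementation would inspect the system.
    return True
```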

athena9 commented 3 years ago

I appreciate the trade-offs between reliability and performance, but having seen the impact of bad data hitting the mempool in testnet, is this definitely production ready?

Any change that decreases stability needs to be matched with safety mechanisms.

stonemelody commented 3 years ago

"Catastrophic" is a very subjective view. Sure, the data can be fetched from the blockchain again, but that ignores the time lost actually farming and attempting to either submit blocks to the chain or submit partials on a pool. A pooler in the pool I'm a part of had their database trashed and missed out on a lucky day yesterday, I'd say that's unpleasant/unfortunate at best. The whole "just resync the chain again" also still ignores the fact that the database is big and will keep getting bigger. I don't want to have to downloads 10s-100s GB of blockchain data on a semi-regular basis just because something went wrong with my power for a second.

As for low-end drives having extra caching, sqlite is built to take care of most of that already. The WAL in sqlite is a standard way to handle crash consistency on devices whose atomic powerfail write unit is smaller than the total data that needs to be written for a transaction. There are fsyncs in sqlite to ensure the WAL is actually persisted before it starts mucking with the actual transaction execution, so that it can recover if things go wrong. The only time this would fail on a low-end device is if the USB controller flat-out lied about completing an fsync. In that case, the person should probably get a better device, as it's not just sqlite that would have problems in the event of a crash. Drives themselves have gotten much better at actually honoring the low-level commands that fsync triggers, so if there's a problem, it's going to be the USB controller, not the drive itself.

For problematic setups, I think a simple question would be: is synchronous=OFF safe when Windows update decides to restart a machine randomly? Apart from that, you'll end up with a lot of disparate answers as something as simple as a BSOD or power failure can break the database. There does not seem to be one single "right" hardware that people in the community use to run chia, so I don't think there will be many patterns in this.

As far as heuristics go, the bulk of it could probably be covered by running with synchronous=OFF until 1/2 or 3/4 of the blockchain is synced, and after that just running with synchronous=NORMAL. synchronous=OFF is most beneficial when there are lots of transactions going into the database anyway, which is mostly the case during the initial sync. If normal operation also causes load, then like I said earlier, other routes like better transaction management should be used to increase performance.
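Something along these lines (sketch only; the helper and the threshold are made up for illustration):

```python
# Illustrative only: use synchronous=OFF for the initial bulk sync, then drop
# back to NORMAL once most of the chain is in the database.
SYNC_THRESHOLD = 0.75  # fraction of the chain after which to switch

def synchronous_for_progress(synced_height: int, peak_height: int) -> str:
    """Pick a synchronous level from sync progress (hypothetical helper)."""
    if peak_height > 0 and synced_height / peak_height < SYNC_THRESHOLD:
        return "OFF"     # bulk-insert phase; a crash just restarts the sync
    return "NORMAL"      # steady state; keep crash consistency
```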

With the technical stuff aside, this should absolutely be a config option if the chia team wants to keep pushing for the use of synchronous=OFF, so that those who don't want to worry about it don't have to patch the codebase every version. From a UX perspective, the default should be some crash-consistent option like synchronous=NORMAL because many people 1. won't change the defaults, 2. probably won't have a good enough understanding of the intimate workings of storage to know why synchronous=OFF is unsafe, and 3. will be very unhappy/may leave the project if their database files appear to randomly corrupt themselves. All of the above are the same reasons why all semi-recent major file systems ship with some sort of crash consistency mechanism enabled by default. It turns out that people are really unhappy when they have to run fsck on their multi-TB drive due to a crash, or find they can no longer boot their machine because of one.

stonemelody commented 3 years ago

@athena9 that's a bit of an over-simplification for mitigation strategies. Unfortunately the program will struggle with:

  1. not knowing what operations were in-flight when a crash happened
  2. not knowing what the expected database state should be
  3. not knowing what parts of in-flight operations were completed
  4. possibly opening the database file at all

Since there are no fsyncs at all from sqlite, there's no guarantee that transactions were committed in order and atomically. Therefore, if the database can be opened at all, it's not clear what state the data would be in. There could be parts of some transactions missing, and newer transactions could be persisted while older ones are not. Basically, one cannot say anything about the state of the data. I posted above about what it would take to even begin to have a recovery system for this, and it's not something that can be implemented easily (and probably not at all, to be honest).

cimrhanzel commented 3 years ago

yea, no... as someone who's spent quite a bit of time on crash consistency stuff, turning off things like this is pretty much always more trouble than it's worth. [...] we'll just tell people the unfortunate news that they need to sync from scratch if their database is corrupted

I have to agree with you. I also support one of the pools and each version update has become more and more of a joke. There seems to be no testing prior to a release, and with the update from 1.2.8 to 1.2.9, 80% of the users are met with the "spinners" and we have to tell them: I am sorry, you will have to delete your databases and maybe in 3 or 4 days you can be synced again.

2021-10-06T12:18:48.825 wallet wallet_server : INFO Cannot connect to host 127.0.0.1:8444 ssl:<ssl.SSLContext object at 0x109b77d40> [Connect call failed ('127.0.0.1', 8444)]
2021-10-06T12:18:49.970 harvester chia.harvester.harvester: INFO refresh_batch: loaded_plots 0, loaded_size 0.00 TiB, removed_plots 0, processed_plots 0, remaining_plots 0, duration: 0.00 seconds
2021-10-06T12:18:50.903 farmer farmer : INFO Reconnecting to peer {'host': '127.0.0.1', 'port': 8444}
2021-10-06T12:18:50.908 farmer farmer_server : INFO Cannot connect to host 127.0.0.1:8444 ssl:<ssl.SSLContext object at 0x112e1eac0> [Connect call failed ('127.0.0.1', 8444)]
2021-10-06T12:18:51.891 wallet wallet : INFO Reconnecting to peer {'host': '127.0.0.1', 'port': 8444}
2021-10-06T12:18:51.896 wallet wallet_server : INFO Cannot connect to host 127.0.0.1:8444 ssl:<ssl.SSLContext object at 0x109b77a40> [Connect call failed ('127.0.0.1', 8444)]
2021-10-06T12:18:53.909 farmer farmer : INFO Reconnecting to peer {'host': '127.0.0.1', 'port': 8444}
2021-10-06T12:18:53.916 farmer farmer_server : INFO Cannot connect to host 127.0.0.1:8444 ssl:<ssl.SSLContext object at 0x112e1e540> [Connect call failed ('127.0.0.1', 8444)]
2021-10-06T12:18:54.997 wallet wallet : INFO Reconnecting to peer {'host': '127.0.0.1', 'port': 8444}
2021-10-06T12:18:55.002 wallet wallet_server : INFO Cannot connect to host 127.0.0.1:8444 ssl:<ssl.SSLContext object at 0x109b77240> [Connect call failed ('127.0.0.1', 8444)]
2021-10-06T12:18:56.965 farmer farmer : INFO Reconnecting to peer {'host': '127.0.0.1', 'port': 8444}
2021-10-06T12:18:56.968 farmer farmer_server : INFO Cannot connect to host 127.0.0.1:8444 ssl:<ssl.SSLContext object at 0x112e1ec40> [Connect call failed ('127.0.0.1', 8444)]

You spin me right 'round, baby Right 'round like a record, baby

Screen Shot 2021-10-06 at 12 15 04 PM

bdowne01 commented 3 years ago

The setting could have 3 values, on, off and auto, defaulting to auto. In the auto mode we could have a heuristic to try to make and educated guess, just based on the system we're running on. e.g.
..snip... How does that sound?

As a user, I think that sounds great! My humble suggestion would be to default to the safest option (the SQLite docs seem to indicate 'normal' would be totally safe), whilst allowing advanced users to 'take off the seatbelt' if they choose to go faster (perhaps in lieu of, or complementing, an 'auto' setting). That way the risk is acknowledged and adverse outcomes can be expected.

On the general performance front, there may be an opportunity to SQL-tune the INSERT statements for significant gains, which could make up the difference too. Some preliminary glances at the coin_store.py code show individual INSERT statements being executed, whereas a batch of those could (should?) be wrapped into a single transaction and executed instead. There is some in-depth analysis here on SQLite INSERT performance tricks, but that's probably best dropped into a different GH issue altogether.
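As a rough illustration of the difference (the table and column names here are made up, not the actual coin_store.py schema):

```python
import sqlite3

conn = sqlite3.connect("example.sqlite")  # illustrative database
conn.execute("create table if not exists coins(name text primary key, data blob)")
rows = [(f"coin{i}", b"...") for i in range(10_000)]

# Committing after every row: sqlite pays the full commit overhead
# (including any fsyncs) once per INSERT.
for name, data in rows:
    conn.execute("insert or replace into coins values (?, ?)", (name, data))
    conn.commit()

# Batched: one transaction, one commit, one round of flushes for all rows.
with conn:  # the connection context manager wraps this in a transaction
    conn.executemany("insert or replace into coins values (?, ?)", rows)
```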

arvidn commented 3 years ago

@cimrhanzel I don't believe that has anything to do with a corrupt database. I believe that's caused by another issue (which I'm also looking into fixing) where we read every block header from the DB to build a height->block mapping. The startup time will increase as the chain gets longer, so what we do now is not sustainable. I don't have any reason to believe it's caused by corrupt DB though (but please share if you do).

arvidn commented 3 years ago

@bdowne01 yes, I've looked at the insert performance stack overflow post. We already do all these inserts as a transaction, and my attempt to make it cheaper by removing an unused index didn't impact it very much. But I'm still looking at improving the inserts.

arvidn commented 3 years ago

https://github.com/Chia-Network/chia-blockchain/pull/8753

stonemelody commented 3 years ago

@bdowne01 yes, I've looked at the insert performance stack overflow post. We already do all these inserts as a transaction, and my attempt to make it cheaper by removing an unused index didn't impact it very much. But I'm still looking at improving the inserts.

I was skimming the code yesterday, and it looks like the current code is doing a transaction per block added to the block store. It would be much more efficient to do a transaction per batch of blocks added to the store.

Each transaction commit in sqlite must perform multiple flush operations, so by batching block inserts, fewer fsyncs will need to be issued. Some rough estimates (I have not recently read through the sqlite code to see the order of things) for number of fsyncs per commit would be:

  1. flush WAL to disk for recovery
  2. flush (most likely) in-place updates to data
  3. flush truncation/removal of WAL file

There may be a few more in there, but that's probably close to the minimum number of flushes per commit

cimrhanzel commented 3 years ago

@cimrhanzel I don't believe that has anything to do with a corrupt database. I believe that's caused by another issue (which I'm also looking into fixing) where we read every block header from the DB to build a height->block mapping. The startup time will increase as the chain gets longer, so what we do now is not sustainable. I don't have any reason to believe it's caused by corrupt DB though (but please share if you do).

These messages:

2021-10-06T12:18:50.903 farmer farmer : INFO Reconnecting to peer {'host': '127.0.0.1', 'port': 8444}
2021-10-06T12:18:50.908 farmer farmer_server : INFO Cannot connect to host 127.0.0.1:8444 ssl:<ssl.SSLContext object at 0x112e1eac0> [Connect call failed ('127.0.0.1', 8444)]
2021-10-06T12:18:51.891 wallet wallet : INFO Reconnecting to peer {'host': '127.0.0.1', 'port': 8444}
2021-10-06T12:18:51.896 wallet wallet_server : INFO Cannot connect to host 127.0.0.1:8444 ssl:<ssl.SSLContext object at 0x109b77a40> [Connect call failed ('127.0.0.1', 8444)]

can run for days on my test machine without it ever connecting. I have noticed, from testing, that a 1.2.7 database has no problem connecting to a 1.2.7 client, but take that same 1.2.7 db and try to run it on a 1.2.9 machine and it will never start, at least on macOS. I have been using the Chia client since mainnet went live and have worked around a lot of issues, but this one I can not get past without a complete resync of the database.

stonemelody commented 3 years ago

These messages [...] can run for days on my test machine without it ever connecting. [...] this one I can not get past without a complete resync of the database.

Does running with strace show it repeatedly trying and failing at something? You'll probably need to attach strace to the process(es), since chia runs in the background.

emlowe commented 3 years ago

These messages [...] can run for days on my test machine without it ever connecting. [...] this one I can not get past without a complete resync of the database.

Some other troubleshooting steps: set your log level to DEBUG (`chia configure --log-level DEBUG`) and turn on SQL logging in config.yaml (under `full_node:`, set `log_sqlite_cmds: True`). Now just start the node, and not the farmer or wallet: `chia start node`.

xklech commented 3 years ago

image

Second time in 4 days. Now I have to sync the full node for 2 days again... guys... chia is not in a usable state right now. Going to downgrade to 1.2.6 as I did not see this problem there.

arvidn commented 3 years ago

@xklech do you have a reason to believe that issue is caused by a corrupt database?

xklech commented 3 years ago

@arvidn Every time it happens after a hard power off (breaker, and GF instructed to hold the power button for a few seconds if she can't sleep). At the first occurrence there were messages in the log that the database was corrupted and the daemon was not starting. At the last occurrence I could not see any error in the log at all, but it was not working; it kept trying to reconnect to the daemon in a loop for an hour. Deleting the databases in .chia and syncing again helped. "Helped" == at least something started to happen; I still had to sync from scratch.

stonemelody commented 3 years ago

Is there any plan to actually cut a release with the merged change soon? Just because the change has merged to main doesn't mean that people will pick up the changes in a non-release version. Nor does it mean the problem no longer exists for the majority of users.