cloudant / bigcouch

Putting the 'C' back in CouchDB
http://bigcouch.cloudant.com/
Apache License 2.0

DB corrupt after Power Failure #141

Open gauravsaini03-zz opened 7 years ago

gauravsaini03-zz commented 7 years ago

I ran into database corruption after a power failure while working on a project that uses BigCouch. I already asked on the project's mailing list, but they suggested I come to the BigCouch community for help.

I also noticed that it usually happens when the physical machine uses an SSD. When I checked the databases at the time of the issue, only the account-related DBs were missing; all the others, such as offnet and system_config, were still available.

So I am not sure whether this issue lies in BigCouch or is application-specific. Is there any way to avoid this kind of problem so that the system we build is production-ready?

Thanks, Gaurav

b20n commented 7 years ago

@gauravsaini03: It's difficult to diagnose based on merely "DB corrupt", but I can imagine a couple of ways this could happen.

  1. Something's up at the OS level or application level and some/all files have been removed from disk. That is, not BigCouch's fault. I'd put my money on this option.

  2. The /dbs meta-database had not checkpointed recent data at the time of power loss. This is unlikely given the replication topology of /dbs, but it could be possible, especially if you're running with delayed_commits=true and/or some odd cluster configuration and/or some odd hardware configuration. If this is the case (identifiable by the presence of shards on disk) you could simply add the relevant shard mappings to /dbs on port 5986 and your databases would be resurrected.
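To check for the second case, one option is to compare the shard files on disk against the databases that /dbs currently knows about. The following is a minimal Python sketch, not an official BigCouch tool. It assumes shard files live under `<data_dir>/shards/<range>/<dbname>.couch`, optionally with a ten-digit timestamp before `.couch`; the layout, the helper names, and the example database names are all assumptions to verify against your own installation.

```python
import os
import re

# Assumed on-disk layout: shards/<range>/<dbname>[.<timestamp>].couch
SHARD_RE = re.compile(
    r"^shards/([0-9a-f]{8}-[0-9a-f]{8})/(.+?)(?:\.\d{10})?\.couch$"
)


def parse_shard_path(rel_path):
    """Return (dbname, range) for a shard path relative to the data dir, else None."""
    match = SHARD_RE.match(rel_path.replace(os.sep, "/"))
    if match is None:
        return None
    return match.group(2), match.group(1)


def find_shard_files(data_dir):
    """Yield paths (relative to data_dir) of every file under data_dir/shards."""
    for root, _dirs, files in os.walk(os.path.join(data_dir, "shards")):
        for name in files:
            yield os.path.relpath(os.path.join(root, name), data_dir)


def orphaned_databases(shard_paths, known_dbs):
    """Map each database that has shards on disk but is absent from known_dbs
    to the sorted list of shard ranges found for it."""
    ranges = {}
    for path in shard_paths:
        parsed = parse_shard_path(path)
        if parsed is not None and parsed[0] not in known_dbs:
            ranges.setdefault(parsed[0], set()).add(parsed[1])
    return {db: sorted(found) for db, found in ranges.items()}
```

Here `known_dbs` would be the set of document IDs read from /dbs on port 5986 (for example via its `_all_docs` view), and `shard_paths` would come from `find_shard_files` pointed at the node's data directory. Any database this reports still has shards on disk but no mapping in /dbs, which is the situation described in option 2.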

gauravsaini03-zz commented 7 years ago

@banjiewen Thanks for looking into this. I understand your point. We are using 2600hz Kazoo, which uses BigCouch as its database. After a power failure, we found that some account databases had gone missing while all the others were still present. I raised the concern on the 2600hz mailing list, but they recommended discussing it here as well. (https://groups.google.com/forum/#!searchin/2600hz-dev/database$20corrupted|sort:relevance/2600hz-dev/TTcvio8-0wg/O8wMcGcvDAAJ)

Also, I have set delayed_commits=false, but that didn't help either.

Thanks