Thanks for doing corruption testing. Bolt definitely should not be corrupting due to power loss.
The huge number of corrupt pages suggests to me that the pages were written out of order within the transaction. Normally bolt writes all the data pages, fsyncs, writes the meta page, then fsyncs again. That way, if a failure occurs while writing the data pages, the meta page is never written, which means those data pages are effectively uncommitted.
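A minimal sketch of that write ordering, assuming a hypothetical commit helper (this is not Bolt's actual code; the buffers and offsets are illustrative):

package commitdemo

import "os"

// commit illustrates the ordering described above: data pages first, fsync,
// then the meta page, then a second fsync. If the process dies before step 3,
// no meta page references the new pages, so the transaction simply never happened.
func commit(f *os.File, dataPages [][]byte, dataOffsets []int64, metaPage []byte, metaOffset int64) error {
	// 1. Write all dirty data pages for the transaction.
	for i, p := range dataPages {
		if _, err := f.WriteAt(p, dataOffsets[i]); err != nil {
			return err
		}
	}
	// 2. fsync so the data pages are durable before the meta page points at them.
	if err := f.Sync(); err != nil {
		return err
	}
	// 3. Write the meta page that references the new root and freelist.
	if _, err := f.WriteAt(metaPage, metaOffset); err != nil {
		return err
	}
	// 4. fsync again; only now is the transaction committed.
	return f.Sync()
}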
I'll do some testing with a USB stick. Can you tell me what USB drive you're using? I don't think it should matter but it's good to narrow down the issue if possible. Also, I'll take a look at how Windows does file syncing to make sure it's correct in Bolt.
I'm about to go to sleep but I'll look at this first thing in the morning.
Also, can you send me the script you used during testing and any steps you performed during the test? e.g. how long did the test run before you pulled the USB?
Thanks for the quick response.
I'm using a (quite old) Kingston DataTraveler 1GB USB stick. I don't have a script at the moment as it was discovered as a part of a larger project I'm working on (storing bitcoin blocks on the file system while using bolt to store the metadata for their locations, headers, transactions indexing, etc).
I will put something smaller together to help track this down as I know how important it is to have a targeted and reproducible test program for situations like this.
EDIT: Oh, I didn't answer your timing question. I have typically been pulling it within 2-3 seconds of starting the writes; however, I've waited up to 10 seconds and encountered the same results. Also, I should note that it doesn't happen every time. Most of the time there are no issues, but I see the corruption maybe 1 time out of 50.
@davecgh I've been running bolt bench with the following parameters this morning against an Amazon Basics 16GB USB drive:
$ bolt bench --batch-size 1 --count 10000 --work --path /Volumes/USB16/bolt.db
It didn't take me long to hit corruption although the corruption is strange. My meta pages show the following:
// meta0
root: 24
freelist: 25
hwm: 26
txnid: 630
// meta1
root: 7
freelist: 10
hwm: 26
txnid: 629
But the data file itself shows nothing written to page 14 or higher; it's all just zeros filled in by truncate(). Since the previous transaction (meta1, txnid 629) had written its high-water mark up to page 26, there should be data in pages 15 through 26.
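For reference, a hedged sketch of how a reader can decide which meta page to trust in a situation like this (field names are illustrative, not Bolt's exact structs): validate both meta pages, then use the valid one with the higher txid, so a torn newer meta falls back to the older, fully written transaction.

package metademo

// meta mirrors the fields shown in the dump above; ok stands in for the
// magic/version/checksum validation a real implementation would perform.
type meta struct {
	root     uint64 // page id of the root bucket
	freelist uint64 // page id of the freelist
	hwm      uint64 // high-water mark (file size in pages)
	txid     uint64
	ok       bool
}

// pickMeta returns the newest valid meta page, or false if both are torn,
// which is the truly unrecoverable case.
func pickMeta(m0, m1 meta) (meta, bool) {
	switch {
	case m0.ok && m1.ok:
		if m0.txid > m1.txid {
			return m0, true
		}
		return m1, true
	case m0.ok:
		return m0, true
	case m1.ok:
		return m1, true
	}
	return meta{}, false
}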
So I googled around and it looks like there's a write cache in front of the USB drive on Windows (and on Mac). I can't for the life of me get it disabled on my Mac because the mounting goes through diskutil mount instead of the standard Unix mount.
It looks like there's a bunch of articles on Google about disabling it on Windows though. Can you try disabling write caching and see if you still have the issue?
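For what it's worth, on macOS a plain fsync does not necessarily flush the drive's write cache; asking the OS to push data all the way to stable storage requires F_FULLFSYNC. A hedged sketch (macOS only, uses golang.org/x/sys/unix; Bolt itself did not do this at the time, and whether a given USB drive honors the request is another question):

package main

import (
	"log"
	"os"

	"golang.org/x/sys/unix"
)

// fullSync asks the OS to flush f through the drive's write cache.
func fullSync(f *os.File) error {
	_, err := unix.FcntlInt(f.Fd(), unix.F_FULLFSYNC, 0)
	return err
}

func main() {
	// The path is illustrative.
	f, err := os.OpenFile("/Volumes/USB16/bolt.db", os.O_RDWR, 0600)
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()
	if err := fullSync(f); err != nil {
		log.Fatal(err)
	}
}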
@davecgh @benbjohnson I am not sure it is just a USB issue. I did some similar testing with a regular disk and found no issues.
Can you reproduce the issue on an AWS disk? @benbjohnson
@benbjohnson I have been running with write caching disabled already. The issue still happens periodically. I was able to reproduce it with the bench tool as well (although I had to add the --path option to the bench sub-command, since I don't see that on master).
$ git diff bench.go main.go
diff --git a/cmd/bolt/bench.go b/cmd/bolt/bench.go
index 91af960..1e51e76 100644
--- a/cmd/bolt/bench.go
+++ b/cmd/bolt/bench.go
@@ -32,7 +32,12 @@ func Bench(options *BenchOptions) {
}
// Find temporary location.
- path := tempfile()
+ path := options.Path
+ if path == "" {
+ path = tempfile()
+ }
if options.Clean {
defer os.Remove(path)
@@ -368,6 +373,7 @@ type BenchOptions struct {
FillPercent float64
NoSync bool
Clean bool
+ Path string
}
// BenchResults represents the performance results of the benchmark.
diff --git a/cmd/bolt/main.go b/cmd/bolt/main.go
index 183d1f2..7e8e908 100644
--- a/cmd/bolt/main.go
+++ b/cmd/bolt/main.go
@@ -99,6 +99,7 @@ func NewApp() *cli.App {
&cli.Float64Flag{Name: "fill-percent", Value: bolt.DefaultFillPercent, Usage: "Fill percentage"},
&cli.BoolFlag{Name: "no-sync", Usage: "Skip fsync on every commit"},
&cli.BoolFlag{Name: "work", Usage: "Print the temp db and do not delete on exit"},
+ &cli.StringFlag{Name: "path", Usage: "Specify the db path"},
},
Action: func(c *cli.Context) {
statsInterval, err := time.ParseDuration(c.String("stats-interval"))
@@ -121,6 +122,7 @@ func NewApp() *cli.App {
FillPercent: c.Float64("fill-percent"),
NoSync: c.Bool("no-sync"),
Clean: !c.Bool("work"),
+ Path: c.String("path"),
})
},
}}
I've also run into this while stress testing, but in my case I'm going against an SSD on OS X. What can I provide to help debug? I still have the db, but it is 2 GB.
I've also run into this issue using an SSD on Google Compute Engine instances. Bolt built from: https://github.com/boltdb/bolt/commit/033d4ec028192f38aef67ae47bd7b89f343145b5
Any progress on this?
I work with @mdmarek; we've been hitting this panic more often now. Same stack trace as posted by the original poster. In our case, we call tx.Commit() on a timed interval, and we've only seen the panic for databases under heavy load (large commits). We've been able to recover from this, but we're letting people know we're still hitting it.
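For context, a rough sketch of the pattern described above (the interval, bucket name, and data source are illustrative, not our actual code): keep one write transaction open, buffer puts into it, and commit/reopen it on each tick, so commits get large under heavy load.

package main

import (
	"log"
	"time"

	"github.com/boltdb/bolt"
)

// run applies incoming key/value pairs inside a single write transaction and
// commits on a timed interval; large commits happen when many updates arrive
// within one window.
func run(db *bolt.DB, updates <-chan [2][]byte) error {
	tx, err := db.Begin(true) // writable transaction
	if err != nil {
		return err
	}
	ticker := time.NewTicker(5 * time.Second)
	defer ticker.Stop()

	for {
		select {
		case kv, ok := <-updates:
			if !ok {
				return tx.Commit() // flush whatever is pending and stop
			}
			b, err := tx.CreateBucketIfNotExists([]byte("data"))
			if err != nil {
				return err
			}
			if err := b.Put(kv[0], kv[1]); err != nil {
				return err
			}
		case <-ticker.C:
			if err := tx.Commit(); err != nil {
				return err
			}
			if tx, err = db.Begin(true); err != nil {
				return err
			}
		}
	}
}

func main() {
	db, err := bolt.Open("interval.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	updates := make(chan [2][]byte)
	go func() {
		for i := byte(0); i < 100; i++ {
			updates <- [2][]byte{{'k', i}, {'v'}}
		}
		close(updates)
	}()
	if err := run(db, updates); err != nil {
		log.Fatal(err)
	}
}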
Also seeing this error frequently, but bolt check doesn't show any db corruption. Panic is
page 14 already freed:
...
gopkg.in/boltdb/bolt%2ev1.(*freelist).free(0xc208110c00, 0x17c, 0x7f2e83109000)
/go/src/gopkg.in/boltdb/bolt.v1/freelist.go:117 +0x355
gopkg.in/boltdb/bolt%2ev1.(*node).spill(0xc208014690, 0x0, 0x0)
/go/src/gopkg.in/boltdb/bolt.v1/node.go:358 +0x279
gopkg.in/boltdb/bolt%2ev1.(*node).spill(0xc208014620, 0x0, 0x0)
/go/src/gopkg.in/boltdb/bolt.v1/node.go:345 +0x12d
gopkg.in/boltdb/bolt%2ev1.(*Bucket).spill(0xc20810d2c0, 0x0, 0x0)
/go/src/gopkg.in/boltdb/bolt.v1/bucket.go:540 +0x1c4
gopkg.in/boltdb/bolt%2ev1.(*Bucket).spill(0xc2080371f8, 0x0, 0x0)
/go/src/gopkg.in/boltdb/bolt.v1/bucket.go:507 +0xac2
gopkg.in/boltdb/bolt%2ev1.(*Tx).Commit(0xc2080371e0, 0x0, 0x0)
/go/src/gopkg.in/boltdb/bolt.v1/tx.go:154 +0x1ee
gopkg.in/boltdb/bolt%2ev1.(*DB).Update(0xc208038b60, 0xc2080c9810, 0x0, 0x0)
/go/src/gopkg.in/boltdb/bolt.v1/db.go:554 +0x169
@jezell @epsniff @benbjohnson @nicpottier @gyuho I really think we should get this fixed. Can anyone provide a way to reproduce this? I would like to look into it.
@xiang90 I don't have a way to reproduce right now. I'm doing some refactoring and clean up on the test suite in order to get better, long running simulation testing. I think that'll help to reproduce these types of issues.
We're having major trouble with one of our production bolt databases continually dying with an error similar to @jezell's.
@newhook @jezell can you give me any details about your usage of bolt? DB size, key & value sizes, sequential writes or random writes, delete frequency, nested buckets, etc. Anything really. I'm trying to reproduce the issue but I'm not sure what's triggering it right now. It seems to happen to a few people frequently but other people it doesn't happen at all.
We had been using bolt for a git LFS implementation that was running on top of GlusterFS. We were able to mitigate the issue by migrating everything away from GlusterFS and moving bolt to run directly on top of mounted SSD storage. The git repos were not experiencing any corruption, but the bolt db databases used to track the LFS chunks were being corrupted regularly. I suspect it was related to some kind of concurrency or fsync type problem with Gluster and how Bolt writes to disk, since corruption hasn't occurred since we eliminated Gluster from the equation.
We're running on a RAID 5 array of 3 SSDs (INTEL SSDSC2BB800H4) on Ubuntu 14 using an LSI MegaRAID SAS 9271-4i. The application had been running for a long time (4 months or so) until the corruption occurred.
@newhook Can you post your stack trace too? Also, what bolt commit are you using?
@jezell Are you still using gopkg.in/boltdb/bolt.v1? I wonder if master would work better.
@benbjohnson Did we fix anything related recently?
panic: page 4 already freed
goroutine 818021 [running]:
github.com/boltdb/bolt.(*freelist).free(0xc2fd236c90, 0x5b0, 0x7ecaaacc4000)
/source/.gopack/vendor/src/github.com/boltdb/bolt/freelist.go:117 +0x355
github.com/boltdb/bolt.(*node).spill(0xc48b570e70, 0x0, 0x0)
/source/.gopack/vendor/src/github.com/boltdb/bolt/node.go:358 +0x279
github.com/boltdb/bolt.(*Bucket).spill(0xc55bcffdc0, 0x0, 0x0)
/source/.gopack/vendor/src/github.com/boltdb/bolt/bucket.go:536 +0x1c4
github.com/boltdb/bolt.(*Bucket).spill(0xc55bcffd80, 0x0, 0x0)
/source/.gopack/vendor/src/github.com/boltdb/bolt/bucket.go:503 +0xac2
github.com/boltdb/bolt.(*Bucket).spill(0xc6581b6b78, 0x0, 0x0)
/source/.gopack/vendor/src/github.com/boltdb/bolt/bucket.go:503 +0xac2
github.com/boltdb/bolt.(*Tx).Commit(0xc6581b6b60, 0x0, 0x0)
/source/.gopack/vendor/src/github.com/boltdb/bolt/tx.go:151 +0x1ee
github.com/boltdb/bolt.(*DB).Update(0xc27dc99380, 0xca1e051d68, 0x0, 0x0)
/source/.gopack/vendor/src/github.com/boltdb/bolt/db.go:554 +0x169
customerio/index/index.(*DBMBolt).UpdateShard(0xc20800d5f0, 0x90da, 0xc5a9d20a50, 0x0, 0x0)
/source/src/customerio/index/index/db_mbolt.go:448 +0xc9
customerio/index/index.func·057(0xc5000090da)
/source/src/customerio/index/index/db_memory.go:359 +0x503
created by customerio/index/index.(*DBMemory).Flush
/source/src/customerio/index/index/db_memory.go:367 +0x34a
@xiang90 There haven't been many code changes recently except the CoreOS ones. v1 is from Nov 2014. There have been a decent number of changes between v1 and v1.1, which was released in Oct 2015.
@xiang90 Are you seeing this issue?
@benbjohnson No. But I would love to fix it if anyone can reliably reproduce it.
@benbjohnson I'm not sure exactly what version, but chances are it was whatever was current on Jul 8 11:18:40 2015.
@benbjohnson the version @newhook is using is commit 04a3e85793043e76d41164037d0d7f9d53eecae3 (I work with him and just checked the vendored source).
@benbjohnson I wrote some code to reproduce this. So far I've gotten it to trigger twice after running for about 10 minutes. I'll post a link in a few minutes.
It's not consistent how long it needs to run before it panics, but it has produced the panic 3 out of the 3 times I've run it: twice within 10 minutes, once after running longer. https://gist.github.com/epsniff/f919369dccb4d5c13504
I'm running on an 8-core i7 CPU with 8 GB of RAM and a Sony SSD, on Ubuntu 14, but I've seen it on Debian too.
@epsniff Can you try to reproduce it (confirm it can be reproduced) in a cloud environment? Or I can try to do it next week. Then it would be easier for us to look into the issue.
@xiang90 I was able to reproduce it on a Google Compute Engine (GCE) instance of type n1-standard-4 (4 vCPUs, 15 GB memory) with a 200 GB standard persistent disk. Go 1.5.2, Debian 7. The BoltDB version has commit 033d4ec028192f38aef67ae47bd7b89f343145b5 as the last commit.
@epsniff Awesome! That helps a ton. Also, what Go version are you running? And do you know what file system your /tmp directory is on?
Go 1.5.2. And /tmp was on a standard persistent disk, which I believe is equivalent to AWS's EBS General Purpose (SSD) disks.
@epsniff Great! I will try and get back to you guys next week.
The disks were formatted with ext4...
@epsniff I'm able to get the panic on my MBP after about 20m. That gives me something to work with! :boom:
@benbjohnson, @xiang90 did either of you ever get a chance to work on this one?
@epsniff Sorry. I should have. I will try this week.
@epsniff I've spent time on it and made progress to reproduce it consistently but I haven't had time to resolve the issue yet.
I found the issue. Fixed with https://github.com/boltdb/bolt/pull/539.
@epsniff Thanks again for the program to reproduce. I refactored it a bit to continuously run in smaller chunks until it panics and then it saves the transaction log so it can be replayed:
https://gist.github.com/benbjohnson/9d2ebbc90b8b52f3fe25
I ran the script to reproduce for 24h without any issue. Previously it failed in 10-20m. If anyone else wants to test the changes, let me know.
That is awesome, thanks so much Ben!
@benbjohnson thank you for fixing it.
We are using boltdb vendored at commit 583e8937c61f1af6513608ccc75c97b6abdf4ff9 (v1.3.0). Unfortunately, I just got this panic running gomobile on an iPhone 6s, iOS 10 (Go 1.7). We are doing multiple write transactions with fsync disabled.
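For reference, a minimal sketch of that configuration (db.NoSync is the actual Bolt option; the path, bucket, and key are illustrative). With NoSync set, commits skip fsync, so an unclean shutdown can leave the on-disk freelist and B+tree pages out of step, and reading such a file can later surface as a panic like the one below:

package main

import (
	"log"

	"github.com/boltdb/bolt"
)

func main() {
	db, err := bolt.Open("app.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Skip fsync on commit: faster writes, but recent transactions (and, after
	// an unclean shutdown, file consistency) are no longer guaranteed.
	db.NoSync = true

	err = db.Update(func(tx *bolt.Tx) error {
		b, err := tx.CreateBucketIfNotExists([]byte("blocks"))
		if err != nil {
			return err
		}
		return b.Put([]byte("height"), []byte("0"))
	})
	if err != nil {
		log.Fatal(err)
	}
}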
panic: page 311 already freed
goroutine 16 [running]:
panic(0x100d91820, 0x10e3bc710)
/usr/local/go/src/runtime/panic.go:500 +0x390
github.com/.../vendor/github.com/boltdb/bolt.(*freelist).free(0x10d84ec00, 0x3de, 0x12fb9c000)
/Users/zllak/work/go/src/github.com/.../vendor/github.com/boltdb/bolt/freelist.go:117 +0x280
github.com/.../vendor/github.com/boltdb/bolt.(*node).spill(0x10e458540, 0x0, 0x0)
/Users/zllak/work/go/src/github.com/.../vendor/github.com/boltdb/bolt/node.go:363 +0x24c
github.com/.../vendor/github.com/boltdb/bolt.(*node).spill(0x10e247b90, 0x0, 0x0)
/Users/zllak/work/go/src/github.com/.../vendor/github.com/boltdb/bolt/node.go:350 +0x108
github.com/.../vendor/github.com/boltdb/bolt.(*Bucket).spill(0x10e33f300, 0x0, 0x0)
/Users/zllak/work/go/src/github.com/.../vendor/github.com/boltdb/bolt/bucket.go:541 +0x190
github.com/.../vendor/github.com/boltdb/bolt.(*Bucket).spill(0x10da5c2b8, 0x0, 0x0)
/Users/zllak/work/go/src/github.com/.../vendor/github.com/boltdb/bolt/bucket.go:508 +0x840
github.com/.../vendor/github.com/boltdb/bolt.(*Tx).Commit(0x10da5c2a0, 0x0, 0x0)
/Users/zllak/work/go/src/github.com/.../vendor/github.com/boltdb/bolt/tx.go:163 +0x19c
github.com/.../vendor/github.com/boltdb/bolt.(*DB).Update(0x10d6d0b40, 0x10ec0ba18, 0x0, 0x0)
/Users/zllak/work/go/src/github.com/.../vendor/github.com/boltdb/bolt/db.go:602 +0x12c
...
Should I open a new issue, or could we reopen this one? cc @benbjohnson @xiang90 @davecgh
I also ran into this while testing write speed.
I will try to reproduce it.
"C:\Program Files (x86)\JetBrains\Gogland 163.10615.6\bin\runnerw.exe" C:/Go\bin\go.exe run G:/编程/golang/src/github.com/gamexg/go-test/数据库/boltdb/1.go
panic: page 178 already freed
goroutine 8846 [running]:
panic(0x4c8980, 0xc04235b890)
C:/Go/src/runtime/panic.go:500 +0x1af
github.com/boltdb/bolt.(*freelist).free(0xc04203ff50, 0x2254, 0x2522000)
G:/编程/golang/src/github.com/boltdb/bolt/freelist.go:117 +0x2c7
github.com/boltdb/bolt.(*node).spill(0xc04230a620, 0xc042462280, 0xc042462280)
G:/编程/golang/src/github.com/boltdb/bolt/node.go:358 +0x1e3
github.com/boltdb/bolt.(*node).spill(0xc04230a5b0, 0xc0424d6630, 0xc0421119e0)
G:/编程/golang/src/github.com/boltdb/bolt/node.go:345 +0xbe
github.com/boltdb/bolt.(*Bucket).spill(0xc0423bae80, 0xc0424d6500, 0xc042111c50)
G:/编程/golang/src/github.com/boltdb/bolt/bucket.go:541 +0x442
github.com/boltdb/bolt.(*Bucket).spill(0xc04206d898, 0x219cf650, 0x566800)
G:/编程/golang/src/github.com/boltdb/bolt/bucket.go:508 +0x942
github.com/boltdb/bolt.(*Tx).Commit(0xc04206d880, 0x0, 0x0)
G:/编程/golang/src/github.com/boltdb/bolt/tx.go:163 +0x12c
github.com/boltdb/bolt.(*DB).Update(0xc042078000, 0xc04225bed0, 0x0, 0x0)
G:/编程/golang/src/github.com/boltdb/bolt/db.go:602 +0x114
github.com/boltdb/bolt.(*batch).run(0xc042249300)
G:/编程/golang/src/github.com/boltdb/bolt/db.go:722 +0x106
github.com/boltdb/bolt.(*batch).(github.com/boltdb/bolt.run)-fm()
G:/编程/golang/src/github.com/boltdb/bolt/db.go:696 +0x31
sync.(*Once).Do(0xc042249310, 0xc04225bf58)
C:/Go/src/sync/once.go:44 +0xe2
github.com/boltdb/bolt.(*batch).trigger(0xc042249300)
G:/编程/golang/src/github.com/boltdb/bolt/db.go:696 +0x53
github.com/boltdb/bolt.(*batch).(github.com/boltdb/bolt.trigger)-fm()
G:/编程/golang/src/github.com/boltdb/bolt/db.go:666 +0x31
created by time.goFunc
C:/Go/src/time/sleep.go:154 +0x4b
exit status 2
Windows 7 in ESXi, with the database on NFS.
package main

import (
	"fmt"
	"log"
	"os"
	"time"

	"github.com/boltdb/bolt"
)

func test1() {
	// Open the test1.db data file in the current directory.
	// It will be created if it doesn't exist.
	os.Remove("test1.db")
	db, err := bolt.Open("test1.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.Update(func(tx *bolt.Tx) error {
		_, err := tx.CreateBucketIfNotExists([]byte("MyBucket"))
		return err
	})
	if err != nil {
		log.Fatal(err)
	}

	sTime := time.Now()
	for i := 0; i < 10000; i++ {
		err = db.Update(func(tx *bolt.Tx) error {
			b := tx.Bucket([]byte("MyBucket"))
			return b.Put([]byte(fmt.Sprint("key", i)), []byte("00000000000000000000"))
		})
		if err != nil {
			log.Fatal(err)
		}
	}
	// One Update per write: 33.2749032s
	fmt.Println("One Update per write:", time.Now().Sub(sTime))
}

func test2() {
	// Open the test2.db data file in the current directory.
	// It will be created if it doesn't exist.
	os.Remove("test2.db")
	db, err := bolt.Open("test2.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.Update(func(tx *bolt.Tx) error {
		_, err := tx.CreateBucketIfNotExists([]byte("MyBucket"))
		return err
	})
	if err != nil {
		log.Fatal(err)
	}

	sTime := time.Now()
	for i := 0; i < 10000; i++ {
		err = db.Batch(func(tx *bolt.Tx) error {
			b := tx.Bucket([]byte("MyBucket"))
			return b.Put([]byte(fmt.Sprint("key", i)), []byte("00000000000000000000"))
		})
		if err != nil {
			log.Fatal(err)
		}
	}
	fmt.Println("Batch operations:", time.Now().Sub(sTime))
}

func test3() {
	// Open the test3.db data file in the current directory.
	// It will be created if it doesn't exist.
	os.Remove("test3.db")
	db, err := bolt.Open("test3.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.Update(func(tx *bolt.Tx) error {
		_, err := tx.CreateBucketIfNotExists([]byte("MyBucket"))
		return err
	})
	if err != nil {
		log.Fatal(err)
	}

	sTime := time.Now()
	err = db.Update(func(tx *bolt.Tx) error {
		b := tx.Bucket([]byte("MyBucket"))
		for i := 0; i < 10000; i++ {
			if err := b.Put([]byte(fmt.Sprint("key", i)), []byte("00000000000000000000")); err != nil {
				return err
			}
		}
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
	// Single transaction: 100.0057ms
	fmt.Println("Single transaction:", time.Now().Sub(sTime))
}

func main() {
	//test1()
	test2()
}
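One aside on the benchmark above: db.Batch only coalesces writes when it is called from multiple goroutines at once; called serially from a single loop as in test2, each call mostly just waits out the batch delay before committing alone. A concurrent variant would look roughly like this sketch (same bucket and key scheme as above; the file name is illustrative):

package main

import (
	"fmt"
	"log"
	"sync"

	"github.com/boltdb/bolt"
)

func main() {
	db, err := bolt.Open("test2c.db", 0600, nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	err = db.Update(func(tx *bolt.Tx) error {
		_, err := tx.CreateBucketIfNotExists([]byte("MyBucket"))
		return err
	})
	if err != nil {
		log.Fatal(err)
	}

	// Issue the puts concurrently so Batch can group many of them into a
	// single transaction.
	var wg sync.WaitGroup
	for i := 0; i < 10000; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			err := db.Batch(func(tx *bolt.Tx) error {
				b := tx.Bucket([]byte("MyBucket"))
				return b.Put([]byte(fmt.Sprint("key", i)), []byte("00000000000000000000"))
			})
			if err != nil {
				log.Println(err)
			}
		}(i)
	}
	wg.Wait()
}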
The G: drive is an NFS network drive mapped from a FreeNAS server. The panic occurs when the test runs against the G: drive; it does not appear when testing against the local disk.
I've been doing a bit of corruption testing under unexpected power-off scenarios (using an external USB drive and disconnecting it midstream) and, unfortunately, I'm fairly consistently hitting unrecoverable database corruption issues due to things such as unreachable unfreed or already freed. I understand write corruption is expected in this scenario, but the database shouldn't be left in an unrecoverable state. EDIT: I should also note this is on Windows 7 using all default bolt options.
I have a 4KiB database that was corrupted in this scenario that I'll keep around to help debug. To get started, I'm providing the following pertinent details:
- Output of the bolt pages command
- Output of the bolt check command. NOTE: This command hangs and never completes.