I did some extra pprof profiling, and the big hit in CPU time is:
runtime.mallocgc
badger.Iterator.Seek
=> b.Iterator.prefetch()
=> b.Iterator.parseItem()
b.(*blockIterator).Key()
b.y.elemHeap.Less()
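For reference, collecting that kind of CPU profile is generic runtime/pprof usage; a minimal sketch (the runIteratorBenchmark placeholder stands in for the Seek/Rewind code being measured):

package main

import (
	"log"
	"os"
	"runtime/pprof"
)

func main() {
	// Write a CPU profile covering the slow iterator calls,
	// then inspect it with `go tool pprof cpu.prof`.
	f, err := os.Create("cpu.prof")
	if err != nil {
		log.Fatal(err)
	}
	defer f.Close()

	if err := pprof.StartCPUProfile(f); err != nil {
		log.Fatal(err)
	}
	defer pprof.StopCPUProfile()

	runIteratorBenchmark()
}

// runIteratorBenchmark is a placeholder for the badger iterator code under test.
func runIteratorBenchmark() {}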
Thanks for reporting this @justinfx.
The db has 500k items in a single keyspace
What do you mean by a single keyspace? Are all the keys prefixed (like foo1, foo2, foo3)?
@justinfx I ran the following code snippet and the iterator seems to be fast enough. Am I missing something?
package main
import (
"fmt"
"os"
"strconv"
"time"
"github.com/dgraph-io/badger"
)
func main() {
initialize()
opts := badger.DefaultOptions
opts.Dir = "./badgerdb"
opts.ValueDir = "./badgerdb"
db, err := badger.Open(opts)
if err != nil {
panic(err)
}
txn := db.NewTransaction(false)
it := txn.NewIterator(badger.DefaultIteratorOptions)
start := time.Now()
prefix := []byte("test11")
it.Seek(prefix)
fmt.Println("Time taken for seek", time.Now().Sub(start))
start = time.Now()
it.Rewind()
fmt.Println("Time taken for rewind", time.Now().Sub(start))
start = time.Now()
iopt := badger.DefaultIteratorOptions
// Use prefix iterator
iopt.Prefix = []byte("test2222")
it = txn.NewIterator(iopt)
start = time.Now()
it.Rewind()
fmt.Println("Time taken for rewind with Prefix iterator", time.Now().Sub(start))
os.RemoveAll(opts.Dir)
db.Close()
}
func initialize() {
opts := badger.DefaultOptions
opts.Dir = "./badgerdb"
opts.ValueDir = "./badgerdb"
db, err := badger.Open(opts)
if err != nil {
panic(err)
}
numKeys := 5000000
batch := db.NewWriteBatch()
for i := 0; i < numKeys; i++ {
if err = batch.Set([]byte("test"+strconv.FormatInt(int64(i), 10)), []byte(strconv.FormatInt(int64(i), 10)), 0); err != nil {
panic(err)
}
}
if err := batch.Flush(); err != nil {
panic(err)
}
fmt.Println("Number of keys inserted:", numKeys)
db.Close()
}
and this gives me
➜ foo go run main.go
badger 2019/04/15 19:35:26 INFO: All 0 tables opened in 0s
badger 2019/04/15 19:35:28 DEBUG: Storing value log head: {Fid:1 Len:43 Offset:10066123}
badger 2019/04/15 19:35:31 DEBUG: Storing value log head: {Fid:2 Len:43 Offset:20132289}
badger 2019/04/15 19:35:32 DEBUG: Storing value log head: {Fid:3 Len:43 Offset:30198455}
Number of keys inserted: 5000000
badger 2019/04/15 19:35:34 DEBUG: Storing value log head: {Fid:4 Len:43 Offset:38678701}
badger 2019/04/15 19:35:34 INFO: Got compaction priority: {level:0 score:1.73 dropPrefix:[]}
badger 2019/04/15 19:35:34 INFO: Running for level: 0
badger 2019/04/15 19:35:35 DEBUG: LOG Compact. Added 2201560 keys. Skipped 0 keys. Iteration took: 692.365671ms
badger 2019/04/15 19:35:36 DEBUG: LOG Compact. Added 2195921 keys. Skipped 0 keys. Iteration took: 683.35922ms
badger 2019/04/15 19:35:36 DEBUG: LOG Compact. Added 602522 keys. Skipped 0 keys. Iteration took: 170.211644ms
badger 2019/04/15 19:35:36 DEBUG: Discard stats: map[]
badger 2019/04/15 19:35:36 INFO: LOG Compact 0->1, del 4 tables, add 3 tables, took 2.026911211s
badger 2019/04/15 19:35:36 INFO: Compaction for level: 0 DONE
badger 2019/04/15 19:35:36 INFO: Force compaction on level 0 done
badger 2019/04/15 19:35:37 INFO: All 3 tables opened in 248ms
badger 2019/04/15 19:35:37 DEBUG: Value Log Discard stats: map[]
badger 2019/04/15 19:35:37 INFO: Replaying file id: 4 at offset: 38678744
badger 2019/04/15 19:35:37 INFO: Replay took: 5.34µs
Time taken for seek 151.98µs
Time taken for rewind 120.21µs
Time taken for rewind with Prefix iterator 157.09µs
badger 2019/04/15 19:35:37 INFO: Got compaction priority: {level:0 score:1.73 dropPrefix:[]}
Your issue could be related to CPU or memory. How much memory does your system have? What CPU are you using?
Thanks so much for putting together a test case. You are right that I don't get the same performance issues with this case. I've been trying to rewrite your case to match my application as closely as possible. First let me start with some of the test specs:
Here is my modified test case:
package main
import (
"bytes"
"encoding/binary"
"fmt"
"os"
"time"
"github.com/dgraph-io/badger"
)
const dbdir = "/tmp/dbtesting/badgerdb"
var prefix = []byte(`consumer.queue:c:`)
func main() {
opts := badger.DefaultOptions
opts.Dir = dbdir
opts.ValueDir = dbdir
db, err := badger.Open(opts)
if err != nil {
panic(err)
}
writedb(db)
db.Close()
db, err = badger.Open(opts)
if err != nil {
panic(err)
}
defer db.Close()
defer os.RemoveAll(dbdir)
err = db.View(func(txn *badger.Txn) error {
opts := badger.DefaultIteratorOptions
opts.PrefetchValues = false
it := txn.NewIterator(opts)
defer it.Close()
start := time.Now()
it.Rewind()
fmt.Println("Time taken for rewind", time.Since(start))
start = time.Now()
it.Seek(prefix)
fmt.Println("Time taken for seek", time.Since(start))
if !it.ValidForPrefix(prefix) {
panic("not valid for prefix")
}
count := 0
for ; it.Valid(); it.Next() {
if count%50000 == 0 {
item := it.Item()
fmt.Printf("%s ; size=%v\n", item.String(), item.EstimatedSize())
}
count++
}
fmt.Println("Count:", count)
return nil
})
if err != nil {
panic(err)
}
}
func writedb(db *badger.DB) {
numKeys := 700000
seq, err := db.GetSequence([]byte("consumer.id.counter:c:"), uint64(numKeys))
if err != nil {
panic(err)
}
defer seq.Release()
msg := bytes.Repeat([]byte("X"), 2048)
txn := db.NewTransaction(true)
for i := 0; i < numKeys; i++ {
if txn == nil {
txn = db.NewTransaction(true)
}
id, _ := seq.Next()
// Leave 0 index unassigned
if id == 0 {
id, _ = seq.Next()
}
key := make([]byte, len(prefix), len(prefix)+8)
copy(key, prefix)
b := make([]byte, 8)
binary.BigEndian.PutUint64(b, id)
key = append(key, b...)
if err := txn.SetWithMeta(key, msg, byte(1)); err != nil {
panic(err)
}
if i%50000 == 0 {
if err := txn.Commit(nil); err != nil {
panic(err)
}
fmt.Printf("Wrote %d / %d\n", i, numKeys)
txn = nil
}
}
if txn != nil {
if err := txn.Commit(nil); err != nil {
panic(err)
}
}
fmt.Println("Number of keys inserted:", numKeys)
}
Now if I try it with a completely new dataset from this test, I get the following:
2019/04/16 11:21:27 Replaying from value pointer: {Fid:0 Len:0 Offset:0}
2019/04/16 11:21:27 Iterating file id: 0
2019/04/16 11:21:27 Iteration took: 26.775µs
Wrote 0 / 700000
Wrote 50000 / 700000
Wrote 100000 / 700000
Wrote 150000 / 700000
Wrote 200000 / 700000
Wrote 250000 / 700000
Wrote 300000 / 700000
Wrote 350000 / 700000
Wrote 400000 / 700000
Wrote 450000 / 700000
Wrote 500000 / 700000
Wrote 550000 / 700000
Wrote 600000 / 700000
Wrote 650000 / 700000
Number of keys inserted: 700000
Time taken for rewind 15.224µs
Time taken for seek 15.173µs
key="consumer.queue:c:\x00\x00\x00\x00\x00\x00\x00\x01", version=2, meta=42 ; size=2103
key="consumer.queue:c:\x00\x00\x00\x00\x00\x00\xc3Q", version=3, meta=42 ; size=2103
key="consumer.queue:c:\x00\x00\x00\x00\x00\x01\x86\xa1", version=4, meta=42 ; size=2103
key="consumer.queue:c:\x00\x00\x00\x00\x00\x02I\xf1", version=5, meta=42 ; size=2103
key="consumer.queue:c:\x00\x00\x00\x00\x00\x03\rA", version=6, meta=42 ; size=2103
key="consumer.queue:c:\x00\x00\x00\x00\x00\x03Б", version=7, meta=42 ; size=2103
key="consumer.queue:c:\x00\x00\x00\x00\x00\x04\x93\xe1", version=8, meta=42 ; size=2103
key="consumer.queue:c:\x00\x00\x00\x00\x00\x05W1", version=9, meta=42 ; size=2103
key="consumer.queue:c:\x00\x00\x00\x00\x00\x06\x1a\x81", version=10, meta=42 ; size=2103
key="consumer.queue:c:\x00\x00\x00\x00\x00\x06\xdd\xd1", version=11, meta=42 ; size=2103
key="consumer.queue:c:\x00\x00\x00\x00\x00\a\xa1!", version=12, meta=42 ; size=2103
key="consumer.queue:c:\x00\x00\x00\x00\x00\bdq", version=13, meta=42 ; size=2103
key="consumer.queue:c:\x00\x00\x00\x00\x00\t'\xc1", version=14, meta=42 ; size=2103
key="consumer.queue:c:\x00\x00\x00\x00\x00\t\xeb\x11", version=15, meta=42 ; size=2103
Count: 700000
If I take the existing database from a local test of my application, which contains 600k items, and run the test against that existing db, I get this:
2019/04/16 11:28:43 Replaying from value pointer: {Fid:0 Len:0 Offset:0}
2019/04/16 11:28:43 Iterating file id: 0
2019/04/16 11:28:45 Iteration took: 1.901999173s
2019/04/16 11:28:45 Iterating file id: 1
2019/04/16 11:28:48 Iteration took: 2.849960486s
2019/04/16 11:28:48 Iterating file id: 2
2019/04/16 11:28:51 Iteration took: 2.972867513s
2019/04/16 11:28:51 Iterating file id: 3
2019/04/16 11:28:53 Iteration took: 2.06459984s
Time taken for rewind 2.279651902s
Time taken for seek 2.449865035s
key="consumer.queue:c:\x00\x00\x00\x00\x00\x13\xd6!", version=4183, meta=42 ; size=3213
key="consumer.queue:c:\x00\x00\x00\x00\x00\x14\x99q", version=4213, meta=42 ; size=1391
key="consumer.queue:c:\x00\x00\x00\x00\x00\x15\\\xc1", version=4243, meta=42 ; size=3213
key="consumer.queue:c:\x00\x00\x00\x00\x00\x16 \x11", version=4273, meta=42 ; size=1391
key="consumer.queue:c:\x00\x00\x00\x00\x00\x16\xe3a", version=4303, meta=42 ; size=3213
key="consumer.queue:c:\x00\x00\x00\x00\x00\x17\xa6\xb1", version=4333, meta=42 ; size=1391
key="consumer.queue:c:\x00\x00\x00\x00\x00\x18j\x01", version=4363, meta=42 ; size=3213
key="consumer.queue:c:\x00\x00\x00\x00\x00\x19-Q", version=4393, meta=42 ; size=1391
key="consumer.queue:c:\x00\x00\x00\x00\x00\x19\xf0\xa1", version=4423, meta=42 ; size=3213
key="consumer.queue:c:\x00\x00\x00\x00\x00\x1a\xb3\xf1", version=4453, meta=42 ; size=1391
key="consumer.queue:c:\x00\x00\x00\x00\x00\x1bwA", version=4483, meta=42 ; size=3213
key="consumer.queue:c:\x00\x00\x00\x00\x00\x1c:\x91", version=4633, meta=42 ; size=1391
Count: 600000
What are these high version numbers telling me? I must be missing something in comparing my application to this test case. My application is basically a FIFO queue where it stores items and then pops them over time. The db for this test app is 3.7G. Do you have advice for how I can go about isolating this issue and providing more info?
I noticed that in your sample output, your badger 2.0.0 test is doing a compaction on Close, which my default 1.5.4 options don't seem to be doing? I tried my same test against HEAD (2.0.0rc) and after the first slow run it performed a compaction. Then on the second run I got faster timing:
... Replay took: 19.445us
...
Time taken for rewind 338.83ms
Time taken for seek 340.888ms
It is still not very fast, but after the compaction it was able to start up very quickly, and the rewind and seek at least got into the < 1s range instead of > 2s. So I gather that in my production 1.5.4 build I must have a very fragmented database that isn't compacting as expected. But I would still hope to get these iterator ops down into the < 10ms range.
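For reference, forcing the compaction and value-log GC by hand looks roughly like this. This is only a sketch against the 1.x/2.0 API (Flatten and RunValueLogGC are real badger calls; the 0.5 discard ratio and the compactAndGC wrapper are illustrative):

// compactAndGC forces an LSM flatten followed by repeated value-log GC
// passes. Sketch only: Flatten and RunValueLogGC are part of the badger
// API, while the 0.5 discard ratio is just an illustrative choice.
func compactAndGC(db *badger.DB) error {
	if err := db.Flatten(1); err != nil {
		return err
	}
	for {
		err := db.RunValueLogGC(0.5)
		if err == badger.ErrNoRewrite {
			return nil // nothing left to reclaim
		}
		if err != nil {
			return err
		}
	}
}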
@justinfx Badger is an append-only system. If you insert the same key multiple times, each insert is stored with a different version. Old versions are removed during compaction (you can also choose to keep them). I ran the program you provided and it runs faster on my machine, with the number of keys in the DB in the millions. Given the system configuration you described, it should run fast on your system as well. Can you confirm whether you are using an SSD?
My application runs on a VM hosted in an on-prem vSphere. The storage may or may not be SSD, but it's running out of its local block storage as opposed to the NFS mounts. As for the timing tests in this ticket, I am running against local disk but I will have to confirm whether it is SSD. It may be a spinning disk.
But like I said, I noticed a behavioral difference between badger 1.5.4 and 2.0.0rc: the former, in production, is likely not performing automatic compaction properly, yet the 2.0.0 test did a compaction on shutdown. Even after compacting the test production data, it is still pretty slow. I leave the default settings, which say it shouldn't keep old versions, but that doesn't seem to make it much faster after compaction. I don't really know what to do here. My application isn't meant to be rewriting the same key over and over, since I use the badger Sequence for new items and only rewrite a key once, when I change the meta value on it.
There seems to be something very fiddly going on here. I have equivalent implementations of an interface for Badger and Bolt, and Bolt doesn't show the same slow timing to get the first item with an iterator. There must be something extra I need to do with Badger to keep it performing the same way.
I spent the whole day yesterday debugging this issue and managed to isolate the cause into a small reproduction! The slow Rewind() happens after many deletes have taken place. Here is a summary of the criteria:
- /dev/shm (ram)
- WriteBatch
- Sequence w/ binary encoding, but also tested with the simpler int string format

Reproduction, which mimics production usage of large item queues filling up and draining over time:
package main
import (
"fmt"
"os"
"strconv"
"time"
"github.com/dgraph-io/badger"
)
const (
numKeys = 600000
dbdir = `./badgerdb_test`
)
func main() {
os.RemoveAll(dbdir)
defer os.RemoveAll(dbdir)
opts := badger.DefaultOptions
opts.Dir = dbdir
opts.ValueDir = dbdir
opts.CompactL0OnClose = false
fmt.Println("Opening", dbdir)
db, err := badger.Open(opts)
if err != nil {
panic(err)
}
defer db.Close()
checkIterator(db)
fmt.Println("Writing to DB")
keys := writedb(db)
checkIterator(db)
fmt.Println("Deleting from DB")
deleteKeys(db, keys)
// Doing the most aggressive compaction that would
// happen during production
db.Flatten(1)
checkIterator(db)
}
func checkIterator(db *badger.DB) {
err := db.View(func(txn *badger.Txn) error {
opts := badger.DefaultIteratorOptions
opts.PrefetchValues = false
it := txn.NewIterator(opts)
defer it.Close()
start := time.Now()
it.Rewind()
fmt.Printf("** Time taken for rewind: %s **\n", time.Since(start))
return nil
})
if err != nil {
panic(err)
}
}
func writedb(db *badger.DB) [][]byte {
keys := make([][]byte, 0, numKeys)
// Normally using manual batching with txn/iterator in production
batch := db.NewWriteBatch()
for i := 0; i < numKeys; i++ {
// Normally using binary encode uint64 in production
key := []byte("test" + strconv.FormatInt(int64(i), 10))
if err := batch.Set(key, nil, 0); err != nil {
panic(err)
}
keys = append(keys, key)
if i > 0 && i%100000 == 0 {
fmt.Printf("Wrote %d / %d\n", i, numKeys)
}
}
if err := batch.Flush(); err != nil {
panic(err)
}
fmt.Println("Number of keys inserted:", numKeys)
return keys
}
func deleteKeys(db *badger.DB, keys [][]byte) {
// Normally using manual batching with txn/iterator in production
batch := db.NewWriteBatch()
for i, k := range keys {
if err := batch.Delete(k); err != nil {
panic(err)
}
if i > 0 && i%100000 == 0 {
fmt.Printf("Deleted %d / %d\n", i, len(keys))
}
}
if err := batch.Flush(); err != nil {
panic(err)
}
fmt.Println("Number of keys deleted:", len(keys))
}
Results (snip):
Open..
** Time taken for rewind: 1.346us **
Write...
** Time taken for rewind: 7.512us **
Delete...
** Time taken for rewind: 318.240733ms **
So I have seen sustained iterator rewind times of 300-420ms after keys have been deleted. Is there any explanation for this?
@justinfx I ran the above code and below is the output.
Opening ./badgerdb_test
badger 2019/04/25 18:14:27 INFO: All 0 tables opened in 0s
** Time taken for rewind: 1.877µs **
Writing to DB
Wrote 100000 / 600000
Wrote 200000 / 600000
Wrote 300000 / 600000
Wrote 400000 / 600000
Wrote 500000 / 600000
Number of keys inserted: 600000
** Time taken for rewind: 7.937µs **
Deleting from DB
Deleted 100000 / 600000
Deleted 200000 / 600000
Deleted 300000 / 600000
Deleted 400000 / 600000
Deleted 500000 / 600000
Number of keys deleted: 600000
badger 2019/04/25 18:14:30 INFO:
badger 2019/04/25 18:14:30 INFO: Level: 0. 0 B Size. 0 B Max.
badger 2019/04/25 18:14:30 INFO: Level: 1. 0 B Size. 268 MB Max.
badger 2019/04/25 18:14:30 INFO: Level: 2. 0 B Size. 2.7 GB Max.
badger 2019/04/25 18:14:30 INFO: Level: 3. 0 B Size. 27 GB Max.
badger 2019/04/25 18:14:30 INFO: Level: 4. 0 B Size. 268 GB Max.
badger 2019/04/25 18:14:30 INFO: Level: 5. 0 B Size. 2.7 TB Max.
badger 2019/04/25 18:14:30 INFO: Level: 6. 0 B Size. 27 TB Max.
badger 2019/04/25 18:14:30 INFO: All tables consolidated into one level. Flattening done.
** Time taken for rewind: 238.060639ms **
badger 2019/04/25 18:14:30 DEBUG: Storing value log head: {Fid:1 Len:43 Offset:7223243}
From the output it can be seen that the size of the data on all levels is 0, so there is no data on disk. Everything is present in the memtables (in memory). Flatten won't change anything here.
Since Badger is append-only, if we insert the same key again it is stored with a different version each time. During a read, the key with the highest version is returned.
By default the iterator reads only the latest version of each key (you can change this). If that version indicates a delete, it moves on to the next key in sorted order. The purpose of Rewind is to position the iterator at the first valid key.
Now let's walk through the output for each Rewind:
- Open: there is nothing in the db yet, so it runs fast.
- Insert: there is a single version for every key in the db. Rewind points at the smallest key and returns.
- Delete: there are now two versions of every key, and the latest version indicates a delete. The iterator tries to find the first valid (not deleted) key by traversing keys in sorted order. Since every key is deleted, it ends up traversing all of them, hence the delay.
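As an aside, you can observe those extra versions directly by turning on AllVersions in the iterator options. A sketch (the dumpVersions wrapper is illustrative; AllVersions, Item.Version() and Item.UserMeta() are part of the badger API):

// dumpVersions walks every stored version, including delete markers,
// rather than just the latest version of each key.
func dumpVersions(db *badger.DB) error {
	return db.View(func(txn *badger.Txn) error {
		opts := badger.DefaultIteratorOptions
		opts.PrefetchValues = false
		opts.AllVersions = true // surface every version, not just the latest
		it := txn.NewIterator(opts)
		defer it.Close()
		for it.Rewind(); it.Valid(); it.Next() {
			item := it.Item()
			fmt.Printf("key=%q version=%d usermeta=%d\n",
				item.Key(), item.Version(), item.UserMeta())
		}
		return nil
	})
}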
Hope this clarifies your doubt.
I appreciate you running the test and breaking down what is happening at each step. But I feel this answers only half of my question and leaves me with even more questions.
You are saying that it is perfectly normal for the iterator to go from an 8µs rewind time with one version to hundreds of milliseconds after the deletes create a second version?
Flatten makes no difference? When do I get to recover the 8µs performance of an empty db, if a flatten or a close/open doesn't do it? Are you saying that the baseline rewind time will continue to grow as I create and delete new keys over time? What do I need to do other than delete the db and start over every 24 hours?
Is badger suitable to be used for a FIFO queue application, if iterators have a climbing performance hit with keys being added and deleted constantly?
Sorry for having more questions but now I am left feeling like I shouldn't be using badger for this app.
A few things here: instead of doing Rewind(), then Seek(), you could use a prefix iterator by just setting the prefix in the iterator options (on the 2.0 series). That'd avoid picking up tables which don't have that key at all.
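A sketch of what that looks like, using the IteratorOptions.Prefix field from the 2.0 series shown elsewhere in this thread (the scanPrefix wrapper is only illustrative):

// scanPrefix iterates keys under a prefix without touching values.
// With opts.Prefix set, the iterator skips tables that cannot contain
// the prefix and stops once it walks past it.
func scanPrefix(db *badger.DB, prefix []byte) error {
	return db.View(func(txn *badger.Txn) error {
		opts := badger.DefaultIteratorOptions
		opts.PrefetchValues = false // key-only scan
		opts.Prefix = prefix
		it := txn.NewIterator(opts)
		defer it.Close()
		for it.Rewind(); it.Valid(); it.Next() {
			_ = it.Item() // first and subsequent keys under the prefix
		}
		return nil
	})
}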
However, it looks like there aren't that many SSTables to begin with, and even those are all in the memtable or L0 -- so it's really about how fast we can iterate over the key versions. Note that deletions are still versions; LSM trees are append-only structures (which has its own benefits). So every write is a new version. Over time, as compactions happen, these versions get discarded (dropped on the floor) based on some logic.
Is badger suitable to be used for a FIFO queue application, if iterators have a climbing performance hit with keys being added and deleted constantly?
It depends upon how many keys the iterator needs to go through. As compactions happen, the delete markers discard all earlier versions (and sometimes themselves), so the number of keys in the tree is reduced. It is quite possible that your Level 1 would only have 1 SSTable, while you started with 10 in Level 0, if you're deleting your keys often.
FIFO applications tend to be write-heavy. If that's your scenario, then you'd be better suited with an LSM tree, because their write throughput is unmatched. B+-trees tend to give the best read latency, because they can update records in place (which has its downsides). Badger sits in between: it acts like an LSM tree, but can provide read latencies closer to B+-trees.
Are you saying that the baseline rewind time will continue to grow as I create and delete new keys over time? What do I need to do other than delete the db and start over every 24 hours?
That shouldn't be the case. Most likely your rewinds are being serviced entirely by in-memory tables. In that case, the number of keys there makes a difference to your Rewind/Seek latency. You can limit how many tables would be in-memory, to decrease the total number of keys present in memory.
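A sketch of that kind of tuning (these Options fields exist in the 1.x/2.0 series; the values here are purely illustrative and would need benchmarking):

// openTuned opens badger with fewer/smaller in-memory tables, so a
// Rewind/Seek has fewer keys to merge over before reaching data on disk.
func openTuned(dir string) (*badger.DB, error) {
	opts := badger.DefaultOptions
	opts.Dir = dir
	opts.ValueDir = dir
	opts.NumMemtables = 2            // memtables kept in memory
	opts.MaxTableSize = 32 << 20     // flush a memtable once it reaches ~32 MB
	opts.NumLevelZeroTables = 2      // L0 tables before compaction is triggered
	opts.NumLevelZeroTablesStall = 4 // L0 tables before new writes stall
	return badger.Open(opts)
}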
If your data gets trashed even before it hits Level 1, then your DB size would mostly stay constant.
If read performance in micro-seconds is super critical to your app, more so than write performance, then yeah, you're serviced better by a B+-tree DB.
A few things here: instead of doing Rewind(), then Seek(), you could use a prefix iterator by just setting the prefix in the iterator options (on the 2.0 series). That'd avoid picking up tables which don't have that key at all.
My production app will have 3 prefixes per database:
I just tried using a prefix iterator and only a call to Rewind in the test case, and it still results in a +300ms Rewind time to see the first item in the prefix.
You can limit how many tables would be in-memory, to decrease the total number of keys present in memory.
I just tried setting:
opts.NumMemtables = 1
opts.NumLevelZeroTables = 1
opts.NumLevelZeroTablesStall = 2
and it made no difference in the Rewind time.
If read performance in micro-seconds is super critical to your app, more so than write performance, then yeah, you're serviced better by a B+-tree DB.
Well, I don't need microsecond performance. But there is a big difference between 1ms and 300-400ms when an application query needs to find the first and last item in the database. I'm only comparing this to my query times with the BoltDB approach, where I could make these queries in less than 10ms. If my high-write FIFO queue application means that my baseline iterator time is going to be +300ms because of the heavy deletions, I can no longer make these queries efficiently.
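For context, a first/last lookup under a prefix can be expressed with badger iterators roughly like this. This is only an illustrative sketch, not the application's actual logic; Reverse is a real IteratorOptions field, and seeking to prefix+0xff is the usual trick for landing on the last key of a prefix:

// firstAndLastKey does a key-only scan for the first and last keys
// under a prefix.
func firstAndLastKey(db *badger.DB, prefix []byte) (first, last []byte, err error) {
	err = db.View(func(txn *badger.Txn) error {
		opts := badger.DefaultIteratorOptions
		opts.PrefetchValues = false

		// Forward iterator: first key >= prefix.
		it := txn.NewIterator(opts)
		it.Seek(prefix)
		if it.ValidForPrefix(prefix) {
			first = it.Item().KeyCopy(nil)
		}
		it.Close()

		// Reverse iterator: seek just past the prefix range to land on its last key.
		opts.Reverse = true
		rit := txn.NewIterator(opts)
		defer rit.Close()
		rit.Seek(append(append([]byte{}, prefix...), 0xff))
		if rit.ValidForPrefix(prefix) {
			last = rit.Item().KeyCopy(nil)
		}
		return nil
	})
	return first, last, err
}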
I have to admit that I don't fully understand all the settings that are available in the db configuration (NumLevelZeroTablesStall, for instance) and what I should change to better suit my usage. But I gather from your collective feedback that for this use case I should probably stick with Bolt. I'm going to close this since I don't think there is much else to do. Thanks for helping me with this!
I ran into the same situation: even though my database has no items, it.Rewind() still costs 1.8 seconds.
The code is very simple:
opts := badger.DefaultIteratorOptions
opts.PrefetchSize = 10
opts.Prefix = []byte("accesslog_")
it := txn.NewIterator(opts)
defer it.Close()
b1 := time.Now()
it.Rewind()
logs.Println("=>", time.Since(b1).Seconds(), "seconds")
And I have to delete the Badger database files and recreate the database from scratch.
@iwind if your situation is similar to mine, then a db.Flatten() or GC might drop that into the hundreds-of-ms range. And my problem was that I couldn't get any lower than that in the running database.
I tried to ask this first as a question on Slack, but it hasn't gotten any replies.
My service has modular support for both BoltDB and, more recently, Badger. I have some stats-gathering logic that periodically wants to look up the first and last item for a prefix keyspace. I'm seeing a pretty huge performance difference between Bolt and Badger, so I feel I must be doing something wrong on the Badger side. The db has 500k items in a single keyspace. In Bolt I am able to scan out the first and last item in less than 1ms. But with Badger I am seeing over 4 seconds for the same logic. There are technically two scans going on, so if I disable one of them it is in the 2sec range. I'm doing a key-only scan as I will only need to pull the value for the first and last item. This is an example of my logic:
I see these isolated timings:
Is this normal to take such a big hit on a 500k keyspace just to rewind and seek the iterator to the first item?