dgraph-io / badger

Fast key-value DB in Go.
https://dgraph.io/badger
Apache License 2.0

PurgeOlderVersions + RunValueLogGC does not seem to work for me immediately #444

Closed (fingon closed 6 years ago)

fingon commented 6 years ago

Given a long-running process (that only uses short-lived db.View and db.Update transactions), calling db.PurgeOlderVersions and then db.RunValueLogGC(0.5) does not cause any data to be deleted, even if I set and subsequently delete almost every key. I have a 'few' (5) vlog files and their associated SSTs in this case.

A backup of the database is minimal at this stage (16MB, as opposed to the 5GB on-disk size; the restored size is 37MB).

However, if I start a fresh Badger DB instance and do the same (PurgeOlderVersions + RunValueLogGC), it behaves as expected. Is there anything to take care of other than 'The caller must make sure that there are no long-running read transactions running before this function is called, otherwise they will not work as expected.'?

I might have short-lived read transactions colliding with PurgeOlderVersions, but most likely not; the behavior seems repeatable, and at least my reading of that comment is that a collision would not blow things up altogether..? (The documentation on this could be a bit more precise anyway.)
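
For concreteness, the call pattern is roughly the following (a minimal sketch against the v1 API, not my actual code):

// Sketch of the purge + GC sequence described above; db is an open
// *badger.DB, and all read/write transactions are short-lived.
func cleanup(db *badger.DB) error {
    if err := db.PurgeOlderVersions(); err != nil {
        return err
    }
    // Rewrites a value log file only if at least 50% of it is
    // discardable; returns an error if nothing was rewritten.
    return db.RunValueLogGC(0.5)
}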

janardhan1993 commented 6 years ago

@fingon: Can you please share a reproducible example?

fingon commented 6 years ago

I will try to come up with one. The problem comes up when using the relatively minimal API in https://github.com/fingon/go-tfhfs/blob/master/storage/badger/badger.go from a number of goroutines in parallel; eventually the flush part simply stops doing anything (RunValueLogGC always returns the error saying it did not do anything).
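
The flush part is roughly shaped like this sketch (not the actual go-tfhfs code; error handling is simplified and log is the standard library logger):

// Rough sketch of the flush step; the real logic is in the linked
// go-tfhfs file. The symptom: RunValueLogGC never succeeds even once.
func flush(db *badger.DB) {
    if err := db.PurgeOlderVersions(); err != nil {
        log.Printf("purge failed: %v", err)
        return
    }
    for {
        // Returns an error when no value log was rewritten; here that
        // happens on the very first call, every time.
        if err := db.RunValueLogGC(0.5); err != nil {
            log.Printf("value log GC stopped: %v", err)
            return
        }
    }
}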

tobiasrm commented 6 years ago

I have a similar problem using Badger to store an index for faster data lookup. After some trials, I tested the cleaning by simply storing 100,000 string keys/values (without overwrites) using the following code together with the write/delete example code from the Badger documentation; the cleaning is always done afterwards.

for i := 0; i < 100000; i++ {
    WriteToDB(strconv.Itoa(i), strconv.Itoa(i), db)
    DeleteFromDB(strconv.Itoa(i), db) // un-/commented this line between runs
}
// Cleaning, always done afterwards; both calls return errors worth checking.
db.PurgeOlderVersions()
db.RunValueLogGC(0.3)
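
WriteToDB and DeleteFromDB are thin wrappers, roughly the following, modeled on the transaction examples from the Badger documentation:

// Rough equivalents of the helpers used above, wrapping the
// update-transaction pattern from the Badger examples.
func WriteToDB(key, value string, db *badger.DB) error {
    return db.Update(func(txn *badger.Txn) error {
        return txn.Set([]byte(key), []byte(value))
    })
}

func DeleteFromDB(key string, db *badger.DB) error {
    return db.Update(func(txn *badger.Txn) error {
        return txn.Delete([]byte(key))
    })
}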

Results: it seems to me that the cleaning didn't work, and the delete transactions blew up the value log (which makes sense if deletes are not removed). However, the description on the Badger project site says that purging old versions plus GC should clean up deletes, too.

Can you confirm that this is unusual, and if so, do you have a bugfix? My current DB size grows to multiple GB for ~500MB of production data, where the uncompressed JSON-serialized data structure is a mere ~200MB ...

jiminoc commented 6 years ago

I've experienced this as well and put a repro case in https://github.com/dgraph-io/badger/issues/464

manishrjain commented 6 years ago

@tobiasrm : So, I created this code: https://gist.github.com/manishrjain/647cc6ea41d2c10769a61f8c517dddae

and tested write-delete pairs, where every key written is then deleted. With the latest change #471, the SSTable compaction discards most keys, leaving a 212-byte SSTable.

$ badger info --dir=.
Listening for /debug HTTP requests at port: 8080

[2018-05-01T06:11:29-07:00] MANIFEST       48 B MA
[                      now] 000002.sst    212 B L1
[                      now] 000000.vlog   20 MB VL

[Summary]
Level 0 size:          0 B
Level 1 size:        212 B
Total index size:    212 B
Value log size:      20 MB

Abnormalities: None.
0 extra files.
0 missing files.
0 empty files.
0 truncated manifests.
SSTable [L1, 002] [216261646765722168656164, v200001     ->           3939393939, v200000    ]

Note that the value log is at 20MB. GC will not reclaim the latest value log, which is the read-write log; it only works on value logs that have become "immutable". So, in a long-running process, you'll see value logs being reclaimed over time.
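
In a long-running process you would typically trigger value log GC periodically, for example from a ticker goroutine (a sketch; the interval and discard ratio shown are arbitrary):

// Sketch: periodic value log GC in a long-running process.
go func() {
    ticker := time.NewTicker(5 * time.Minute)
    defer ticker.Stop()
    for range ticker.C {
        // Repeat until GC reports there is nothing left to rewrite.
        for db.RunValueLogGC(0.5) == nil {
        }
    }
}()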

Also, if you want to avoid the storage costs of the value log, you could increase ValueThreshold (https://github.com/dgraph-io/badger/blob/e597fb721a61d074b8f7d128a40a6833409dc68b/options.go#L120) so that values are always stored along with the LSM tree. That way, whenever GC runs, it would be able to reclaim entire value log files (minus the latest one).
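
For example (v1 options; the 1KB threshold and path are just an illustration):

// Sketch: store all values up to 1KB inline in the LSM tree instead
// of the value log (the 1KB figure is arbitrary).
opts := badger.DefaultOptions
opts.Dir = "/path/to/db"
opts.ValueDir = "/path/to/db"
opts.ValueThreshold = 1024 // values below this size live in the LSM tree
db, err := badger.Open(opts)
if err != nil {
    log.Fatal(err)
}
defer db.Close()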

I think the above-mentioned PR is a resolution to this issue. If you have other concerns, feel free to reopen or create a new issue.

fingon commented 6 years ago

I, at least, still encounter the same issue using Badger from commit 754278dbecbe8bfa0fec445d6af3bc2e0cf21911. Unfortunately, the project I was working on has stalled, so I am not really motivated to tease a minimal test case out of it.

What I am doing:

Re-opening the database cleans it.

manishrjain commented 6 years ago

It sounds like you'd have one or two value logs, so value log GC won't touch them. Also, 10K keys would barely fill even one memtable. In other words, there's not much data to clean up.

fingon commented 6 years ago

To test this with a smaller dataset, I set a somewhat smaller config value (opts.ValueLogFileSize = 1 << 27), which gives 6 value logs in this case; still no go, all of them stay around even with 0 values visible in the database. The same thing also happened at larger scale back when I opened the bug (~50GB of data in 400 value log files).
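
That is, the database is opened with something like this sketch (dir is the test directory; everything else stays at defaults):

// Sketch: smaller value log files so even the small test dataset
// spans several vlog files (Badger v1 options).
opts := badger.DefaultOptions
opts.Dir = dir
opts.ValueDir = dir
opts.ValueLogFileSize = 1 << 27 // 128MB per value log file
db, err := badger.Open(opts)
if err != nil {
    panic(err)
}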

Being lazy, I only ran the smaller test on this most recent version.

Restarting and running the same steps again cleans the value logs.

manishrjain commented 6 years ago

If you can share a working code example, I could run it and see if I can reproduce it. Otherwise, it's hard to tell what's going on. My experimentation based on https://gist.github.com/manishrjain/647cc6ea41d2c10769a61f8c517dddae showed things working as expected.