BBVA / raft-badger

Raft backend implementation using BadgerDB
Apache License 2.0

How to use? #2

Closed: nkev closed this issue 5 years ago

nkev commented 5 years ago

Thanks for sharing. I can't believe nobody has starred this until me!

I'm not quite sure how to use this with my BadgerDB. Do I use 2 BadgerDB databases, as in one for Raft and one for my data store?

If it's not too much trouble, could you please create a simple distributed BadgerDB example?

aalda commented 5 years ago

Hi,

Yes, you should use two independent instances of Badger. A write-ahead log usually has a workload quite different from the typical use cases of a key-value store (a high write load, range deletes, and range reads over the most recently written keys), and it could add too much compaction pressure to your backend store.
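
Roughly, the two-instance setup looks like the sketch below. This is only a minimal sketch: the directory paths are invented, it assumes Badger's v2-style options API, and I've left out the raft-badger constructor itself, so check this repo's README for the exact call.

package main

import (
    "log"

    badger "github.com/dgraph-io/badger/v2"
)

func main() {
    // Instance 1: your application's data store.
    dataDB, err := badger.Open(badger.DefaultOptions("/var/lib/myapp/data"))
    if err != nil {
        log.Fatal(err)
    }
    defer dataDB.Close()

    // Instance 2: a second, fully independent Badger instance dedicated to
    // the Raft write-ahead log; this directory (or the open DB, depending on
    // the raft-badger version) is what the LogStore/StableStore sit on top of.
    // Keeping it separate means the WAL's write-heavy workload adds no
    // compaction pressure to the data store above.
    raftDB, err := badger.Open(badger.DefaultOptions("/var/lib/myapp/raft"))
    if err != nil {
        log.Fatal(err)
    }
    defer raftDB.Close()

    // ... build the raft.Config, transport, snapshot store and FSM, then
    // wire the raft-badger LogStore/StableStore into raft.NewRaft.
}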

We are using raft-badger together with Badger as a storage engine in one of our projects, but I'm afraid it's not a very simple example. You can check it out here:

https://github.com/BBVA/qed

The main difficulty we have found is integrating Hashicorp's Raft notion of snapshots with Badger snapshots. Hashicorp's implementation assumes you can keep the whole state of the FSM in memory, not in a persistent store. If your database holds a large volume of data, generating a snapshot of the entire dataset every time, with no support for incremental snapshots, can lead to performance problems. We haven't solved this yet.

I also recommend enabling the LogCache for the Raft LogStore to improve read throughput.
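
The LogCache is just a wrapper around whatever LogStore you already have; something like the sketch below (the capacity of 512 is an arbitrary number chosen for illustration):

package raftutil

import (
    "github.com/hashicorp/raft"
)

// WrapWithCache fronts a persistent LogStore (e.g. the one raft-badger
// provides) with an in-memory cache of the most recent entries, so reads
// of the hot tail of the log don't have to hit Badger at all.
func WrapWithCache(store raft.LogStore) (raft.LogStore, error) {
    return raft.NewLogCache(512, store)
}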

nkev commented 5 years ago

Thanks for your reply. Sounds complicated... :) If I'm OK with eventual consistency, would it be simpler to skip Raft and just write to the other two servers after the local one (say, in a separate goroutine), and mark the result in both the affected local and remote records? The client would be aware of all three servers and switch accordingly.

Alternatively, a service like Google Cloud Storage auto replicates files on SSDs across regions. Would that work for DB storage of an app using embedded BadgerDB or would that be too slow?

nkev commented 5 years ago

Also, would Serf work for multi-master, gossip-based replication?

aalda commented 5 years ago

Well, yes, if you don't need strong write consistency, you could choose a master-master design. It depends on your use case. You would have to deal with conflict resolution, depending on whether the state machine you need to replicate is order-independent or not. There are several techniques: vector clocks, last-write-wins, or CRDTs. And you can still offer strong read consistency, as Dynamo or Cassandra do, provided that long latencies are not a problem.
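
Just to make last-write-wins concrete, a toy merge might look like the sketch below; the Record type and its timestamp are purely illustrative and not part of any library mentioned here.

package main

import (
    "fmt"
    "time"
)

// Record is a hypothetical replicated value tagged with the wall-clock time
// of its last write on whichever node accepted it.
type Record struct {
    Key       string
    Value     []byte
    UpdatedAt time.Time
}

// mergeLWW resolves a conflict between two replicas of the same key by
// keeping the most recently written version. It silently discards the losing
// write and relies on reasonably synchronized clocks, which is exactly the
// kind of caveat that vector clocks and CRDTs are designed to avoid.
func mergeLWW(a, b Record) Record {
    if b.UpdatedAt.After(a.UpdatedAt) {
        return b
    }
    return a
}

func main() {
    older := Record{Key: "post:1", Value: []byte("draft"), UpdatedAt: time.Now().Add(-time.Minute)}
    newer := Record{Key: "post:1", Value: []byte("published"), UpdatedAt: time.Now()}
    fmt.Println(string(mergeLWW(older, newer).Value)) // prints "published"
}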

You could use Serf for cluster membership and failure detection, but it's not designed for doing the replication work, just for gossiping a small number of events. And note that the broadcast mechanism uses UDP.

Google Cloud Storage, like S3, is an object storage solution and not suitable for databases. You should use block storage like local SSDs or persistent disk.

nkev commented 5 years ago

Thanks again for your valuable insight. I'm currently using DynamoDB, and I think it uses the last-write-wins technique for conflict resolution, which is fine for me. But I'm finding the querying options quite limited in DynamoDB without resorting to a lot of denormalisation via global indexes. So I'm thinking of using BadgerDB and replicating it somehow to other servers, at least for redundancy but preferably for scalability as well.

My use case is a site where each customer has their data (say, a blog) in their own BadgerDB folder. So it's a matter of one-way (or preferably two-way) syncing these folders to other servers, either via TCP or file replication (rsync?). Any recommendations would be appreciated.

nkev commented 5 years ago

On another note, instead of Hashicorp's Raft, you might want to have a look at https://github.com/lni/dragonboat

aalda commented 5 years ago

Thanks for your suggestion, I didn't know anything about Dragonboat and I find it interesting.

If I've understood you correctly, what you are considering is taking partitioning to the extreme of one shard per user. How many shards per server do you estimate you will have? It seems to me that such a deployment may be problematic if the number of users and/or the number of ops per user is high. It could lead to resource starvation and I/O saturation, with too many compactors working at the same time. I know that RocksDB lets you share the block cache and the thread pool among multiple instances through a single Env object; this way you can set the maximum number of concurrent compactions and flushes per server. But I'm afraid Badger is not prepared to deal with such scenarios yet.
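
One knob you do have on the Badger side is the number of compaction goroutines each instance starts, although unlike RocksDB's shared Env it's a per-instance limit, not a global budget. A sketch assuming Badger's v2-style options (the path and the value are made up, and the default differs across Badger versions):

package main

import (
    "log"

    badger "github.com/dgraph-io/badger/v2"
)

// openTenantDB opens one customer's Badger instance with an explicit cap on
// compaction goroutines, so that many instances on the same server don't all
// spin up their compactors at full tilt at the same time. The right value
// depends entirely on your workload and hardware.
func openTenantDB(dir string) (*badger.DB, error) {
    opts := badger.DefaultOptions(dir).WithNumCompactors(2)
    return badger.Open(opts)
}

func main() {
    db, err := openTenantDB("/var/lib/myapp/tenants/customer-1")
    if err != nil {
        log.Fatal(err)
    }
    defer db.Close()
}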

As regards the replication problem, I'd go for a state machine replication. I think it's the most suitable approach but, of course, it's harder to implement. I'm not very sure that you could use rsync in a master-master scenario. I see rsync as useful for backups, because SST files are immutable, but I don't think it's possible to ingest SST files directly without writing first into the vlog and without conflicts with the global incrementing counter. Anyway, I don't know those details; you should ask the Badger guys.
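
To give an idea of what state machine replication means with Hashicorp's library: you implement its FSM interface on top of the store, and every replica applies the same committed entries in the same order. Below is a bare-bones sketch; the command encoding is invented for illustration, and Snapshot/Restore are left as stubs, which is exactly where the snapshotting difficulty I mentioned earlier lives.

package fsm

import (
    "bytes"
    "encoding/gob"
    "io"

    badger "github.com/dgraph-io/badger/v2"
    "github.com/hashicorp/raft"
)

// command is a hypothetical replicated operation; a real system would version
// and validate this much more carefully.
type command struct {
    Key   []byte
    Value []byte
}

// badgerFSM applies committed Raft log entries to a Badger instance.
type badgerFSM struct {
    db *badger.DB
}

var _ raft.FSM = (*badgerFSM)(nil)

// Apply is called once a log entry has been committed by a quorum; because
// every replica applies the same entries in the same order, their Badger
// instances converge to the same state.
func (f *badgerFSM) Apply(l *raft.Log) interface{} {
    var cmd command
    if err := gob.NewDecoder(bytes.NewReader(l.Data)).Decode(&cmd); err != nil {
        return err
    }
    return f.db.Update(func(txn *badger.Txn) error {
        return txn.Set(cmd.Key, cmd.Value)
    })
}

// Snapshot and Restore are deliberately left as stubs: hashicorp/raft expects
// a snapshot of the full FSM state, which is the awkward part for a large
// persistent store.
func (f *badgerFSM) Snapshot() (raft.FSMSnapshot, error) { return nil, nil }

func (f *badgerFSM) Restore(rc io.ReadCloser) error { return rc.Close() }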

nkev commented 5 years ago

taking partitioning to the extreme of one shard per user

No, more like one shard for each of my customers, who would store their own blog subscribers and blog posts within their shard. I was thinking isolating customer data into separate DBs would increase security and robustness. If one customer DB gets corrupted, the others remain operational. The number of DBs per server would depend on each DB activity level and the performance capability of the server.

I'm not very sure that you could use rsync in a master-master scenario

No, I was thinking of rsync only for backup as a first step (before a state-machine-based solution). I would like to at least replicate each of these DBs to another machine for redundancy. As you know, Badger not only has backup/restore functionality, it is also rsync-friendly for quick delta snapshots. From the Badger repo:

Badger is also rsync-friendly because all files are immutable, barring the latest value log which is append-only. So, rsync can be used as rudimentary way to perform a backup. In the following script, we repeat rsync to ensure that the LSM tree remains consistent with the MANIFEST file while doing a full backup.

#!/bin/bash
set -o history
set -o histexpand
# Makes a complete copy of a Badger database directory.
# Repeat rsync if the MANIFEST and SSTables are updated.
rsync -avz --delete db/ dst
while !! | grep -q "(MANIFEST\|\.sst)$"; do :; done

As regards the replication problem, I'd go for a state machine replication

This is the part I'm not sure about for my use case. I don't expect all of the customer databases to be active all the time, but I'm thinking there would still be a lot of gossip traffic for multiple DB replications :)

aalda commented 5 years ago

No, more like one shard for each of my customers, who would store their own blog subscribers and blog posts within their shard. I was thinking isolating customer data into separate DBs would increase security and robustness. If one customer DB gets corrupted, the others remain operational. The number of DBs per server would depend on each DB activity level and the performance capability of the server.

Ah, ok. That makes a lot of sense.

This is the part I'm not sure about for my use case. I don't expect all of the customer databases to be active all the time, but I'm thinking there would still be a lot of gossip traffic for multiple DB replications :)

If you finally choose a strong-consistency design, you could run a multi-raft cluster to mitigate gossip traffic, in the same way that CockroachDB does. Unfortunately, Hashicorp's library doesn't implement multi-raft yet. But it seems that Dragonboat does, and from what I've seen, it has a more flexible snapshot mechanism. I think I'll give it a try.

nkev commented 5 years ago

Thanks. What about etcd or Consul? Have you played with those? I couldn't get Dragonboat running on my MacBook because the instructions for installing RocksDB didn't work at all (even with gcc 8). The Dragonboat author also doesn't think BadgerDB is up to the task. A very interesting thread details an argument between the authors of Dragonboat and BadgerDB... I'd love to know your thoughts.

gdiazlo commented 5 years ago

We used etcd as part of an IPAM about three years ago. The KV store was a bit slow (a BoltDB-backed version), but the main problem was that etcd leaked memory and crashed often. We used Consul for configuration management and DNS (at that time too). We found issues with compaction of the KV (I can't remember now which one, as they have changed it at some point).

Probably this is all solved by now.

Consul worked fine when used for service discovery and configuration management, and that's one of the reasons we chose their libraries for our project.

About the thread you posted: it started well but turned non-constructive. A pity. I think the Badger people will try to address those concerns anyway. We're investigating a bug in our Badger code in another project, and we are also testing RocksDB. We don't have a conclusion yet, but we appreciate your comments and links.

Lastly, there are a lot of options for file replication. GlusterFS might solve some of your needs; I have used it in the past and it worked well on commodity machines.

aalda commented 5 years ago

Consul worked fine when used for service discovery and configuration management, and that's one of the reasons we chose their libraries for our project.

In addition, we found etcd's Raft library a bit too complex and hard to reason about. That's why we chose Hashicorp's implementation. The latter has a cleaner design, but on the other hand its interfaces are more rigid to extend (e.g. the snapshotter).

As @gdiazlo has commented, we are planning to replace QED's storage engine with RocksDB, but not the Raft WAL. We have observed better stability and space efficiency in our benchmarks, but the nature of our use case is a bit special, so we can't generalize to other key-value store use cases.

nkev commented 5 years ago

Thanks guys! I learned a lot from this thread.