RJ / www.metabrew.com

Static site generation for my blog

Anti-RDBMS: A list of distributed key-value stores #12

Open RJ opened 3 years ago

RJ commented 3 years ago

Written on 01/19/2009 19:38:43

URL: http://www.metabrew.com/article/anti-rdbms-a-list-of-distributed-key-value-stores

RJ commented 3 years ago

Comment written by Jan Lehnardt on 01/19/2009 19:51:15

Excellent overview, thanks!

For CouchDB: We are working on partitioning but it is currently not a priority as we are working on the 0.9 release. This is definitely a planned feature, though. In the meantime, a consistent-hashing HTTP proxy or storage layer in your app will get you partitioning really easily, just not as conveniently as if CouchDB did it for you.

Cheers
Jan
--
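The consistent-hashing layer Jan suggests can be sketched in a few lines of Python. This is an illustrative sketch only (the node names are hypothetical): each document key is hashed onto a ring of CouchDB nodes, with virtual nodes to smooth the distribution, so partitioning happens in front of the database.

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring: maps each key to one of N nodes so
    a proxy or storage layer can partition data across them."""

    def __init__(self, nodes, replicas=100):
        self._ring = []  # sorted list of (hash, node)
        for node in nodes:
            # virtual nodes smooth out the key distribution across nodes
            for i in range(replicas):
                self._ring.append((self._hash(f"{node}#{i}"), node))
        self._ring.sort()

    @staticmethod
    def _hash(key):
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def node_for(self, key):
        # first ring position clockwise from the key's hash, wrapping around
        idx = bisect.bisect(self._ring, (self._hash(key),)) % len(self._ring)
        return self._ring[idx][1]

# hypothetical node names, for illustration only
ring = ConsistentHashRing(["couch-a:5984", "couch-b:5984", "couch-c:5984"])
print(ring.node_for("user:42"))
```

Because only keys near a removed node's ring positions move, adding or removing a node remaps a small fraction of keys rather than reshuffling everything.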

RJ commented 3 years ago

Comment written by scott on 01/19/2009 20:28:50

If the documentation is accurate, then MemcacheDB supports replication via BerkeleyDB's replication features.

See this pdf starting at slide 39: http://memcachedb.org/memca...

RJ commented 3 years ago

Comment written by Jake Luciani on 01/19/2009 20:32:22

Thanks for the nice breakdown of services. We want Thrudb to be the efficient layer that glues storage and indexing backends together, since each one has its own strengths and weaknesses.

RJ commented 3 years ago

Comment written by X6J8X on 01/19/2009 21:18:51

Did you take a look at MongoDB?

http://www.mongodb.org/

Looks promising...

RJ commented 3 years ago

Comment written by Jay Kreps on 01/19/2009 21:24:22

Hi Richard,

We only just got the site up for project voldemort a week ago and started getting linked to before it was finished, so you are right that a lot of data is missing about how LinkedIn uses it, basic performance numbers, and especially about operational necessities like JVM settings, data archival, etc. Of course all of this information is quite critical if you are thinking of storing your precious data in it. I am going to work on getting all of this in shape over the next few weeks. We will probably do a LinkedIn engineering blog post soon which has more specifics about how we are using it.

I would also add that, as you mention, having a JVM-only API is a big weakness right now. The plan for this is to move all the network serialization to Protocol Buffers or Thrift and support first class clients in popular languages in our source tree. This will use the ability to push all the routing and conflict resolution to the server so that the non-java clients need only implement the code necessary to put the message on the wire. I will post the branch as soon as I get a chance to start development, in hopes that people will be willing to work on a client in their favorite language.

RJ commented 3 years ago

Comment written by Toby DiPasquale on 01/19/2009 21:58:06

Werner Vogels, Amazon's CTO, seems to have some disparaging things to say about Scalaris:

http://twitter.com/Werner/s...

What are your thoughts on his thoughts?

RJ commented 3 years ago

Comment written by Rob on 01/19/2009 22:17:26

You may want to add M/DB to your list: a free API-compatible alternative to SimpleDB (http://www.mgateway.com/mdb...

RJ commented 3 years ago

Comment written by Marton Trencseni on 01/19/2009 22:32:22

The Google stack uses Paxos, Amazon's Dynamo uses vector clocks. I'm not sure what these systems use (I know Voldemort uses vector clocks). It's important to see the difference.

Paxos is a consensus protocol that requires the majority of the nodes to be up. For example, Google's Chubby systems use Paxos as a primitive to reach consensus on the value to append to a replicated log (a new round of Paxos is run for each entry in the replicated log). The entries are treated as commands, which are then applied to the database. This is called the distributed state machine approach. Because of Paxos, roughly speaking, you can be sure that all nodes see the same entries in the same order in their replicated log - there is no chance of inconsistency due to Paxos. (Your implementation could of course be buggy..) This also means that in case of a netsplit, or enough nodes failing, the minority will not be able to make progress.

Dynamo, on the other hand, uses vector clocks as the distributed primitive. In this approach, clients select some nodes to write to. This selection uses a hash table, but in terms of consistency guarantees, this is not that important. As Werner Vogels describes on his blog, you can tune how many nodes you write to; this is usually denoted W. Usually W = 2 or 3. If nodes fail, other nodes can accept the write requests. This means that in this approach, in case of a netsplit, the disjoint parts of the network will accept diverging versions. This is what Amazon calls the "always-on" experience. When the disjoint parts join up, these versions have to be reconciled. This is usually handled by writing application-specific code, or doing something primitive like relying on the node's local time and keeping the newer versions. Because of these weak consistency semantics, I think Paxos-based systems are more likely to get a foothold in the real world: although they require a majority to be up, their consistency semantics are far easier to grasp, and additionally, they don't require application-specific code for reconciliation. AFAIK, Amazon only uses Dynamo where the reconciliation can be handled naturally, like the shopping cart use-case.

As an interesting sidenote, both Paxos and vector-clocks were introduced by Leslie Lamport.

For a collection of papers, see http://bytepawn.com/reading...

It's possible for the distributed system to be made up of many Paxos cells, in which case, if a Paxos cell is down (not enough nodes to make progress), the controller redirects the write request to a live cell, thus, at a higher level, introducing inconsistent versions and emulating the reduced availability requirements of a Dynamo-like architecture.

There are additional design-options to make Paxos more attractive, but I can't say more...
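The vector-clock mechanics Marton describes can be sketched in a few lines of Python (an illustrative sketch, not any particular system's implementation): each version carries a per-node counter map, and two versions whose clocks are mutually incomparable are "concurrent" divergent writes that the application must reconcile.

```python
def vc_merge(a, b):
    """Element-wise max of two vector clocks (dicts of node -> counter)."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

def vc_compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent'."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    # neither clock dominates: divergent writes, application must reconcile
    return "concurrent"

# two replicas accept writes during a netsplit -> concurrent versions
v1 = {"node1": 2, "node2": 1}
v2 = {"node1": 1, "node2": 2}
print(vc_compare(v1, v2))  # concurrent
```

Note the contrast with Paxos: here the system detects divergence after the fact and hands it to the application, whereas Paxos prevents divergence by refusing progress without a majority.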

RJ commented 3 years ago

Comment written by joew on 01/19/2009 22:42:10

What about Tokyo Cabinet and Tyrant? It looks like Tyrant (the network interface) speaks memcached's protocol and allows for replication. Might be worth a look, but probably falls into your second category of distributed key-value stores.

RJ commented 3 years ago

Comment written by Russ on 01/19/2009 23:02:01

Is SimpleDB itself not a viable alternative?

RJ commented 3 years ago

Comment written by Jason Huggins on 01/19/2009 23:26:55

@Russ SimpleDB is tied to one vendor (Amazon) and the only way to use it is on their cloud. If you could download and run SimpleDB on your own server/workstation or on another cloud vendor, then it *might* be viable. FWIW, I'm using CouchDB in production at my startup and really liking it so far. Though I do miss SQL for simple queries sometimes.

RJ commented 3 years ago

Comment written by Cliff Moon on 01/19/2009 23:28:09

Hi, thanks for the Dynomite mention. I just wanted to let you know that Dynomite currently does read-repair using vector clocks. And there's a small community starting to form around it. And I agree, I really need to write up some more comprehensive documentation.

At Powerset we're using it for scraping and serving images from our data sources (currently only wikipedia). I've been doing a lot of work on performance recently and will do some posts about getting the 99.9% latency under control.

RJ commented 3 years ago

Comment written by vicaya on 01/20/2009 00:55:58

Do you have any numbers to back up your claims? For example, I haven't seen you asking any related questions on the Hypertable mailing lists/forums. Do you know Hypertable can support 10K+ random reads/s (1 KB values) *per node* (one 4-core 2.33 GHz Xeon) using a single-threaded client, with tables configured as "in memory" (still persisted, checksummed and replicated 3-ways on disk for recovery)?

As for random write throughput, nobody on the list comes close to Hypertable (1TB injection replicated 3-ways at 1M cells/s on a 8 node cluster).

OTOH, serious web services should still use memcached (which, as a recent Facebook engineering note pointed out, can serve up to 400k ops/s per node) for application-specific caching (as only the application has most of the information to do intelligent invalidation) and use a distributed DB for transactions and persistence.

RJ commented 3 years ago

Comment written by Adinel on 01/20/2009 00:57:27

I was just looking into MonetDB http://monetdb.cwi.nl/
It didn't make your list...

Great post!

RJ commented 3 years ago

Comment written by Marton Trencseni on 01/20/2009 01:17:30

Russ, your system doesn't have to be very complicated for consistency to be an issue. With n=3 nodes, if you have a netsplit between them and you get divergent versions, then your application will have to deal with it. Of course, depending on the application, this may be trivial. (For example, in the Amazon shopping cart case, you can keep the maximal item count per item in the divergent shopping carts.)

Anyway, my point was, if you're designing a distributed system where you want wide applicability then systems with stronger consistency guarantees (= easier to understand) win out. This is why RDBMS/SQL is king.
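The shopping-cart reconciliation Marton mentions is simple to sketch (Python; the function name is hypothetical): merge divergent carts by keeping, per item, the maximal count seen in any divergent version.

```python
def reconcile_carts(carts):
    """Merge divergent shopping-cart versions by taking, for each item,
    the maximum count seen across all versions."""
    merged = {}
    for cart in carts:  # each cart is a dict of item -> count
        for item, count in cart.items():
            merged[item] = max(merged.get(item, 0), count)
    return merged

# two divergent versions produced during a netsplit
print(reconcile_carts([{"book": 1, "cd": 2}, {"book": 3}]))
```

This is the kind of application-specific merge code that Dynamo-style systems push onto the developer, which is exactly the trade-off being discussed.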

RJ commented 3 years ago

Comment written by Marton Trencseni on 01/20/2009 01:46:33

Russ, that's just what I was saying in my original comment. Anyways, to me, the table in the article is really missing a 'consistency' column that would tell me what I can expect from these KV-stores [what we're discussing, ie. do I have to write reconciliation code or not].

RJ commented 3 years ago

Comment written by vicaya on 01/20/2009 02:15:21

@Russ, please consult our community if you have performance problems. Many problems are discovered and fixed all the time. Hypertable has many knobs to turn. There is R&D underway (with UC Berkeley etc.) to use machine learning to let users specify consistency, latency & throughput requirements declaratively, and let an adaptive model pick the right implementation strategy.

RJ commented 3 years ago

Comment written by Steve Chu on 01/20/2009 06:00:18

Hi, thanks for your mention of MemcacheDB. Currently, it does support HA, which is a Paxos-compatible master-client replication scheme with election (we built it on top of BerkeleyDB's replication framework). And there are a couple of users, including China's biggest portal, Sina.com.cn, and perhaps Digg and Reddit, plus other dotcom companies and startups.

RJ commented 3 years ago

Comment written by Carl Byström on 01/20/2009 18:41:25

How would one index all the data available in a distributed k/v store? Just like the data, the index should be distributed as well to allow scalability. So far I've only seen BigTable and SimpleDB having support for this. (Well, CouchDB has it with its map/reduce queries and views, but it's not distributed in my sense. Yet.)

RJ commented 3 years ago

Comment written by Marek Majkowski on 01/20/2009 20:51:22

> Carl: How would one index all data available in
> distributed k/v store?

You don't. You use another tool.

RJ commented 3 years ago

Comment written by agustin deschamps on 01/20/2009 22:56:42

What exactly were the reasons why CouchDB didn't make the final cut?

I ask because we are looking to start development on a system that would use a product like this for building output for WS calls in real time.

It looks like the only product on the list that is being used heavily in real-world settings today is Hadoop, but CouchDB is going to start to see production deployments soon. Am I mistaken about this?

RJ commented 3 years ago

Comment written by Jeff on 01/21/2009 08:42:50

Umm, there's also Neptune, I guess: http://www.openneptune.com.

RJ commented 3 years ago

Comment written by Nick Radov on 01/22/2009 01:19:29

You missed IBM Lotus Domino. The document database at its core does all that stuff.
http://www-01.ibm.com/softw...

RJ commented 3 years ago

Comment written by Takeru on 01/23/2009 12:44:43

Hi, thanks for the Kai mention.

In Kai, data is replicated like Amazon's Dynamo. Like Dynomite, Kai currently does read-repair using vector clocks. And there's a small community on sourceforge.net, where the main language spoken is Japanese, as you pointed out.

Kai is under testing at goo.jp, Japan's 2nd-largest portal site, for production use. We're going to write up more comprehensive documentation.

RJ commented 3 years ago

Comment written by Alexander Staubo on 01/26/2009 01:56:38

CouchDB is not really distributed at the moment, although it is certainly designed to be. Currently, replication is done by manually running a command-line tool that synchronizes stuff between databases. I don't think they even have the conflict resolution stuff worked out.

RJ commented 3 years ago

Comment written by Kunthar on 01/27/2009 19:33:42

Hope some Erlang masters check out dets bottleneck for mnesia and we can get changeable cool dht/db for Erlang.
Peace.

RJ commented 3 years ago

Comment written by Alexander Staubo on 01/27/2009 22:46:54

Forget what I said: CouchDB apparently has some replication now: http://wiki.apache.org/couc....

RJ commented 3 years ago

Comment written by Poppy on 01/28/2009 01:53:20

I'm not sure I understand why you disapprove of Cassandra and CouchDB. I don't see a clear reason listed for either.

Have you got any performance statistics?

RJ commented 3 years ago

Comment written by Thomas Louis on 01/28/2009 11:37:37

Hi Richard,
it has been said a lot of times, but Great Site.

You mentioned Hazelcast once without any comment. I tried it out, but haven't had the chance to evaluate it with very large amounts of data or high throughput. It has a very simple API and the developer made it open source lately. It should be very fast and reliable, but your knowledge is much deeper than mine for judging this.

It would be very interesting to hear opinions.

RJ commented 3 years ago

Comment written by nharward on 02/18/2009 23:35:36

I 2nd Tokyo Tyrant... at least initially; we're in the middle of evaluating it but have no concrete test results as yet. The administration of master/master[/master...][/slave.../] is about as easy as it gets, as is general operational administration. I've found in the past (as I'm sure many have) that operational maintenance is a major pain point and one reason I'm leaning towards it.

And thanks to your post we have some other avenues to pursue as well.

RJ commented 3 years ago

Comment written by Anders Nawroth on 02/19/2009 17:32:53

If you want the benefits of a higher abstraction level than key/value stores provide, a graph database is an interesting alternative to RDBMS as well. Most data fits naturally into a graph representation. http://neo4j.org/ is an open source database of this new breed. - I'm part of the team behind it.

RJ commented 3 years ago

Comment written by Serdar on 02/20/2009 20:25:54

Hi,

What do you think about Tokyo Cabinet, Tyrant and Dystopia tripod ?

RJ commented 3 years ago

Comment written by Andrew Cencini on 02/22/2009 21:33:10

Hi Richard,

I have also implemented a version of Chord in C# (nChord.sourceforge.net) in case you'd like to include it in your list.

In terms of key-value stores, around the same time many of these other systems were being built, I also had the opportunity to build, work on, deploy and test at scale another similar storage system (Resonance) based on Chord. It is no longer in use but I wrote up a short paper on it with so-so results if you'd be interested in taking a look.

Thanks,
--andrew

RJ commented 3 years ago

Comment written by Phil on 02/23/2009 11:25:40

I went through and documented (at a high level) the API of Cassandra. It turns out that you either get a map from (key1, key2) to a value, or a map from (key1, key2) to a list of values.

http://www.brightyellowcow....
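The (key1, key2) → value model Phil describes can be roughed out as a nested map (an illustrative Python sketch, not Cassandra's actual Thrift API): a column family behaves like a map from row key to a sorted set of (column, value) pairs.

```python
class ColumnFamily:
    """Toy in-memory model of the (key1, key2) -> value shape:
    row key -> {column name -> value}."""

    def __init__(self):
        self._rows = {}

    def insert(self, key1, key2, value):
        self._rows.setdefault(key1, {})[key2] = value

    def get(self, key1, key2):
        return self._rows[key1][key2]

    def get_slice(self, key1):
        # columns within a row come back ordered by column name
        return sorted(self._rows.get(key1, {}).items())

cf = ColumnFamily()
cf.insert("user:1", "name", "phil")
cf.insert("user:1", "city", "london")
print(cf.get_slice("user:1"))
```

The "list of values" variant Phil mentions corresponds to super columns, where each (key1, key2) cell holds a further collection rather than a single value.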

RJ commented 3 years ago

Comment written by Steven Roussey on 02/25/2009 00:17:25

And what of CloudBase? (https://sourceforge.net/pro... How does that fit in?

RJ commented 3 years ago

Comment written by antirez on 02/25/2009 21:19:12

I'm working on this: http://code.google.com/p/re...
That looks a lot like what you need, but it is still in beta (actually alpha, to be honest).

In a few weeks, after a few more beta releases, we hope to have Redis stable and ready for prime time, since I'm actually using it inside my company.

RJ commented 3 years ago

Comment written by Brian Aker on 02/26/2009 19:27:38

Hi!

Thanks for the great list!

Memcachedb does do sharding. This is built into any of the drivers. Also, you may want to take a look at the DB that Mixi.jp uses :
http://tokyocabinet.sourcef...

Cheers,
-Brian

RJ commented 3 years ago

Comment written by ryan on 03/02/2009 12:50:30

Hey,

I'm a developer who has been contributing to HBase - specifically, I closed out HBASE-61, a new file format for HBase. It's been recently integrated into HBase's mainline, and is on track for the 0.20 release (the next major release).

That's a lot of project-management detail for what I am really trying to say: HBase's latency has been worked on. The goal is to serve websites straight from HBase, which means faster than MySQL is the goal.

Lots of exciting things are happening in HBase: new cache systems, new file formats, ZooKeeper (a Paxos lock/data system) integration, all of which should make 0.20 more stable, less prone to outages and faster than ever.

Join us on IRC, #hbase (freenode), or via email if you'd like to talk more!

RJ commented 3 years ago

Comment written by Matt Hamilton on 03/02/2009 15:24:37

Oddly, you missed probably the longest-established object store there is (strange given last.fm's heritage): ZODB.

http://en.wikipedia.org/wik...

Given it is over a decade old, supports MVCC, clustering, multiple backend storages, etc etc it is a very worthy contender for your list.

-Matt

RJ commented 3 years ago

Comment written by Jimmy on 03/08/2009 09:13:06

what about Microsoft's project Velocity?!

RJ commented 3 years ago

Comment written by skyde on 03/13/2009 23:06:31

Have you tried LightCloud? It's based on Tokyo Tyrant, with multi-master replication and consistent hashing.
http://opensource.plurk.com...

RJ commented 3 years ago

Comment written by Albert on 03/17/2009 11:21:11

Excellent article!

But my concerns are about writing speed. It seems that the comparison is always based on the latency a database gives you when you want to serve content, picking docs from the database.

I started some tests with Tokyo cabinet+Tyrant and CouchDB but I am now at the research point.

Does anyone know which one of these databases is the fastest at WRITING? The reading will be launched in background cron processes, no matter how long it takes. The application will receive thousands of connections per second (requesting writes), so modifying the database through HTTP requests would be the best option.

IDEAS?

RJ commented 3 years ago

Comment written by RJ on 03/17/2009 18:02:45

Thanks for all the comments and feedback - since I posted this I discovered many more interesting non-relational databases that already existed, and several new projects have surfaced too (mentioned in the comments already). The rate they are cropping up at the moment means I don't have time to keep the blog post up-to-date with the latest and greatest.

Most of the interesting projects I find are added to my delicious.

This comments thread is invaluable tho, and is serving as an excellent reference.. so please keep adding to it :)

RJ

RJ commented 3 years ago

Comment written by Pete on 03/29/2009 17:40:04

Are there any good Java open-source column-based databases out there?

Thanks

-Pete

RJ commented 3 years ago

Comment written by Roy on 04/20/2009 19:33:07

Great write-up. One overlooked distributed solution is
to use a commercial, fully supported, federated database.
Objectivity DB is one of these. It is massively scalable,
fully replicated, and supports not only Java but also C++
and C#, all working concurrently against the same data.

Cheers

Roy

RJ commented 3 years ago

Comment written by alps on 04/27/2009 19:58:19

Have you thought about LDAP? If you are looking things up by key (i.e., base scope search), then it is hard to find anything faster, more accurate, and more reliable than LDAP. Take a look at OpenDS (version 2 will be out in June). On my little MacBook I was running OpenDS 1.3 (build 4) with 2.3 million records and averaging 6.5 requests/second. LDAP is super simple to scale, too. (But you'll want to use Sun's LDAP Proxy Server for that, which is free, but not open source.)

RJ commented 3 years ago

Comment written by kidehen on 04/29/2009 13:45:28

What about RDF-based quad stores, especially those implemented using hybrid database engines (by this I mean RDBMS and Entity-Attribute-Value model combos)?

An example of a hybrid database engine is: OpenLink Virtuoso.

Links:

1. http://virtuoso.openlinksw.com - Virtuoso home page
2. http://delicious.com/kidehen/virtu... - white paper collection re. Virtuoso
3. http://bit.ly/euz2O - my post about the end of RDBMS primacy and why
4. http://www.readwriteweb.com... - interesting discussion that needs to be linked to this article (which I am doing by commenting)
5. http://bit.ly/bRwDm - Virtuoso 6.0 FAQ

Kingsley

RJ commented 3 years ago

Comment written by Dick Davies on 04/30/2009 13:12:11

@Lucas LDAP may be an option if you have a very high read/write ratio, but it's not really designed for heavy writing. And when you do need to write, LDAP clients are generally pretty unwieldy.

RJ commented 3 years ago

Comment written by Schemafree on 05/10/2009 02:47:36

There is also Schemafree.

http://code.google.com/p/sc...

RJ commented 3 years ago

Comment written by Henrik Ingo on 05/18/2009 06:15:47

Hi

You are omitting a "public secret" which is pretty widely used: the NDB database/storage engine, also known as *MySQL Cluster*.

While it is commonly used through an SQL interface, the architecture and performance are exactly what you want: cloud-like sharding, very good performance on key-value lookups, etc. And if you don't want the SQL, you can use the NDB API directly, or REST through the mod_ndb Apache module (http://code.google.com/p/mo....

This would score high on your list if you evaluated it:

* Transparent sharding: Data is distributed through an md5sum hash on your primary key (or user defined key), yet you connect to whichever MySQL server you want, the partitions/shards are transparent behind that.

* Transparent re-sharding: In version 7.0, you can add more data nodes in an online manner, and re-partition tables without blocking traffic.

* Replication: Yes. (MySQL replication).

* Durable: Yes, ACID. (When using a redundant setup, which you always will.)

* Commits to disk: Not on commit, but with a 200ms-2000ms delay. Durability comes from committing to more than 1 node, *on commit*.

* Less than 10ms response times: You bet! 1-2 ms for quite complex queries even.

* ...
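The transparent-sharding idea Henrik describes (hashing the primary key to pick a partition) can be sketched as follows. This is a hypothetical illustration of the concept; NDB's actual partitioning scheme differs in its details.

```python
import hashlib

def shard_for(primary_key, num_partitions):
    """Map a primary key to a partition via an md5 hash, so any frontend
    node can route a request to the right shard deterministically."""
    digest = hashlib.md5(str(primary_key).encode()).hexdigest()
    return int(digest, 16) % num_partitions

# every node computes the same partition for the same key,
# which is what makes the sharding "transparent" to the client
print(shard_for("user:42", 8))
```

Because the mapping is a pure function of the key, clients can connect to any server and still reach the right shard, matching the "connect to whichever MySQL server you want" behaviour described above.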