dominictarr / crdt

Commutative Replicated Data Types for easy collaborative/distributed systems.
MIT License

Millions of rows #6

Open Raynos opened 12 years ago

Raynos commented 12 years ago

There is the millions of rows & memory issue.

I'm envisioning being able to have a document with large N rows which are lazy loaded.

Then you can create limited Sets (with smarter queries on what defines a set) which you know to be of a small size M.

Then you just need a way to synchronize changes to Sets without synchronizing the entire document.
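A rough sketch of that idea: a Set is defined by a predicate (the "smarter query"), and replication only ships rows and updates that match it. The function names here are illustrative, not the crdt module's actual API.

```javascript
// Select the small subset (size M) of a large document that a client
// actually needs, given a predicate that defines the Set.
function selectSet(doc, predicate) {
  var set = {}
  for (var key in doc) {
    if (predicate(doc[key])) set[key] = doc[key]
  }
  return set
}

// Filter a stream of scuttlebutt-style updates ([changes, ts, source])
// so only updates to rows inside the Set are synchronized.
function filterUpdates(updates, predicate) {
  return updates.filter(function (update) {
    return predicate(update[0])
  })
}
```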

dominictarr commented 12 years ago

There is already some support for replicating only a single item. hmm... no I think I removed that feature when I switched to using scuttlebutt.

Basically, you need to detect when an object enters a set and when it leaves. hmm. when it enters you might need to send the entire state (the significant events that created the current state)

When it leaves a set, a node can just forget that object, but when it's added you need to update all the events from that object. hmm. You can't resend old events, if it's past the scuttlebutt timestamp for that source... you could send the current snapshot, but that will not be eventually consistent with concurrent updates to that object...

I think you will need a special event that attaches the history for that update into the enter event, and then that is inserted into the model. hmm, crdt values are always {}, so you could make an array have a special meaning...

[key, [HISTORY], ts, source]

ts would be the latest change in HISTORY, the one which added it to the set. hmm, yeah I think this would work.
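The "enter event" described above might look something like this sketch: when a row joins a set, its full update history is bundled into one message, and receivers distinguish it from ordinary updates because the second element is an Array rather than the usual `{}` changes object. This is an illustration of the idea, not code from the crdt module.

```javascript
// Enter event: [key, [HISTORY], ts, source], where HISTORY is the list of
// scuttlebutt-style updates ([changes, ts, source]) that produced the
// row's current state. ts is the latest change in HISTORY.
function makeEnterEvent(key, history, source) {
  var ts = history.length ? history[history.length - 1][1] : 0
  return [key, history, ts, source]
}

// Ordinary crdt updates carry a {} of changes, never an Array, so an
// Array in the second slot can safely mean "history attached".
function isEnterEvent(update) {
  return Array.isArray(update[1])
}

// A receiving node that has forgotten (or never seen) the row replays the
// attached history into its local model.
function applyEnterEvent(model, event) {
  var key = event[0]
  var row = model[key] || (model[key] = {})
  event[1].forEach(function (update) {
    var changes = update[0]
    for (var k in changes) row[k] = changes[k]
  })
  return model
}
```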

dominictarr commented 12 years ago

I agree, this would be a really cool feature.

Raynos commented 12 years ago

@dominictarr not just really cool, but fundamentally important.

Alternatively we just write a distributed DHT.

The dataset for an application will never fit into memory unless it is a) very well designed or b) has a low number of users.

dominictarr commented 12 years ago

Yes, but this is why it's called a crdt Document, and not a crdt Database. If you are using a database like couchdb, then the client will only load a few records at a time.

So I figured, you could always just partition the documents onto many processes, each having a more reasonable number of rows. Is that what you mean by DHT?

it would be possible to make it larger but we'd have to rewrite it in something that's not javascript.

hmm, perhaps if we just made the scuttlebutt functions all async, that could work too. Probably a combination of these approaches is best.

dominictarr commented 12 years ago

A DHT should be pretty easy, actually, using crdt to replicate the list of nodes, (which will still be a relatively small document, a few hundred nodes is pretty big) and then keeping a replica of a given set of crdt documents in a couple of nodes. Just like dynamo, but REALTIME.

Raynos commented 12 years ago

@dominictarr a few hundred nodes is a few hundred concurrent customers.

I should go read dynamo and read about DHTs.

CoderPuppy commented 11 years ago

:+1: