dominictarr / scuttlebutt

peer-to-peer replicatable data structure

Ability to forget history? #20

Open DamonOehlman opened 11 years ago

DamonOehlman commented 11 years ago

Hi All,

After mucking around with a few different streaming implementations, I've decided to use a CRDT-based approach (and thus scuttlebutt) as a way of implementing a chat room. The project is definitely a work in progress, with the modules chat and iceman each doing part of the work.

Using CRDT documents, with the sequence abstraction for room messages and sets for room users, works really, really well. The only downside is that when users are removed from the room, their history records are still communicated when new clients come on board. Over time I can see this history (and the message backlog) making it take a while for new clients to get in sync.

As such, I did something a little radical and removed the history records manually:

https://github.com/DamonOehlman/iceman/commit/723e43eade1a8dfe0478ae8437046b7b0e89afcf#L2R40
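The manual cleanup in that commit boils down to filtering a departed user's entries out of the stored history. A minimal sketch of the idea (the `history` array shape and the `source` field here are hypothetical illustrations, not scuttlebutt's actual internals):

```javascript
// Hypothetical sketch: drop all history entries that originated from a
// user who has left the room. Each update is assumed to carry the id
// of the node that produced it.
function forgetUser(history, userId) {
  // keep only updates that did NOT come from the removed user
  return history.filter(function (update) {
    return update.source !== userId;
  });
}

var history = [
  { source: 'alice', value: 'hi' },
  { source: 'bob', value: 'hello' },
  { source: 'alice', value: 'bye' }
];

var pruned = forgetUser(history, 'alice');
// pruned now contains only bob's update
```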

I'm wondering whether this is acceptable practice, or there is a better way to deal with this when using scuttlebutt / crdt.

Cheers, Damon.

Raynos commented 11 years ago

I have an expiry-model ( https://github.com/Raynos/expiry-model/blob/master/index.js#L23 ) that uses an LRU cache for the history and is aggressive about removing things from it.

Even going as far as cleaning up the internal vector clock ( https://github.com/Raynos/expiry-model/blob/master/index.js#L44 ) when there is no more data left for that particular device.

I think this is fine but requires some thought about what's the best way to do this safely.
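Roughly, the approach can be pictured like this (a simplified sketch, not the actual expiry-model code; `maxAge`, the entry shape, and the store API are stand-ins): entries past a maximum age get dropped, and when a source has no surviving entries its vector clock entry is deleted too, so the clock can't grow without bound.

```javascript
// Simplified sketch of an expiring history store (illustrative, not
// expiry-model's real implementation).
function ExpiryStore(maxAge) {
  this.maxAge = maxAge;
  this.history = []; // [{ source, ts, value }, ...]
  this.clock = {};   // source id -> latest timestamp seen
}

ExpiryStore.prototype.add = function (update, now) {
  this.history.push(update);
  this.clock[update.source] = update.ts;
  this.expire(now);
};

ExpiryStore.prototype.expire = function (now) {
  var maxAge = this.maxAge;
  // drop entries older than maxAge
  this.history = this.history.filter(function (u) {
    return now - u.ts <= maxAge;
  });
  // clean the vector clock for sources with no surviving entries
  var live = {};
  this.history.forEach(function (u) { live[u.source] = true; });
  for (var source in this.clock) {
    if (!live[source]) delete this.clock[source];
  }
};

var store = new ExpiryStore(1000);
store.add({ source: 'a', ts: 0 }, 0);
store.add({ source: 'b', ts: 2000 }, 2000);
// 'a' has expired, and its vector clock entry is gone with it
```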

DamonOehlman commented 11 years ago

@Raynos thanks for the feedback. I'll take a look through expiry-model and take some cues from your implementation.

dominictarr commented 11 years ago

I think this is okay, as long as you realize what is happening. The one problem that I can see is that if a certain user disconnects, the history is cleaned up, and then they reconnect, the disconnected user might resend the old data.

This needs to be handled consistently: basically, all nodes need to be able to make the same decision about what to discard.

Probably, you want something like: clear data from users who have not posted within a given amount of time, if there has been a lot of data since then.

It's also possible to store the data on the client, which will mean less data needs to be sent.

I'd probably edge towards removing old data after a time period has elapsed, and only if there is a lot of data. (Then, if the room is quiet, you won't lose anything - you only discard things when it gets busy, when people are less likely to notice.)
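The key property is that the discard rule is a pure function of the shared data, so every node reaches the same decision independently. A sketch of such a rule (the thresholds and entry shape are made up for illustration):

```javascript
// Deterministic discard rule: every node applies the same test to the
// same data, so all nodes drop the same entries.
function shouldDiscard(update, now, totalUpdates) {
  var QUIET_FOR = 60 * 60 * 1000; // user silent for an hour...
  var BUSY_ABOVE = 1000;          // ...and the room has a big backlog
  return totalUpdates > BUSY_ABOVE && (now - update.ts) > QUIET_FOR;
}

function prune(history, now) {
  var total = history.length;
  return history.filter(function (u) {
    return !shouldDiscard(u, now, total);
  });
}

var now = 2 * 60 * 60 * 1000; // "two hours in"

// a quiet room: few messages, nothing is discarded even though it's old
var quiet = [{ ts: 0 }, { ts: 0 }, { ts: 0 }];
var quietKept = prune(quiet, now);

// a busy room: 1500 messages, the stale half is discarded
var busy = [];
for (var i = 0; i < 1500; i++) busy.push({ ts: i % 2 ? now : 0 });
var busyKept = prune(busy, now);
```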

Another important thing is to keep a consistent ID for each user. Use the https://github.com/dominictarr/udid module, which will save the user's ID in local storage, so that when they reconnect they will get the same ID, and the vector clock won't grow too large.

Note: this needs to create a new ID per VM, so if they open your app in two tabs, they'll get a different ID per tab. This is necessary because the tabs do not share memory. When they reopen tabs, they will reuse the IDs, however.
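The idea behind persisting the ID can be sketched in a few lines (this is an illustration, not udid's actual code; `localStorage` is browser-only, so the storage object is injectable here to keep the sketch runnable anywhere):

```javascript
// Sketch: persist a random id so reconnects reuse the same id and the
// vector clock on other peers doesn't gain a new entry per session.
function getId(storage) {
  var id = storage.getItem('udid');
  if (!id) {
    id = Math.random().toString(16).slice(2);
    storage.setItem('udid', id);
  }
  return id;
}

// in-memory stand-in for window.localStorage, for demonstration
var mem = {};
var storage = {
  getItem: function (k) { return k in mem ? mem[k] : null; },
  setItem: function (k, v) { mem[k] = v; }
};

var first = getId(storage);
var second = getId(storage);
// first === second: the same id comes back on "reconnect"
```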

I have a module (https://npmjs.org/package/tab-stream) that implements streams between tabs, which you may find useful.