dominictarr / lexiographic-between

MIT License

using LSEQ allocation strategy #2

Open marcelklehr opened 8 years ago

marcelklehr commented 8 years ago

Hey dominic!

I recently stumbled upon https://hal.archives-ouvertes.fr/file/index/docid/921633/filename/fp025-nedelec.pdf and had tried to compose a PR for the original between. Then I saw this module, so while things are still in flux (and perhaps you've already seen the paper) I thought I'd bring it to your attention.

The drawback, or rather the complication, of the LSEQ approach is that the choice of allocation method somehow needs to be remembered. I've tried to circumvent this, but I'm not sure the result is ideal.

dominictarr commented 8 years ago

no, I haven't seen that paper. can you summarize how it differs from this module?

marcelklehr commented 8 years ago

The paper's subject is finding an allocation strategy for the CRDT identifiers that produces short identifiers on average, to save space. I believe the original between used a middle strategy, computing the arithmetic mean of two identifiers (I'm not sure about lexiographic between). The LSEQ paper proposes a strategy that combines insert-at-beginning and insert-at-end, the rationale being that the space of a single level should be used as exhaustively as possible. While both alone are not useful (they have an "antagonistic weakness": insert-at-beginning is not useful for front editing, insert-at-end is not useful for end editing), LSEQ combines them by choosing randomly between the two whenever a new level in the identifier tree is created. So it kind of adapts to any editing behavior, even though it of course sacrifices some levels.
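A minimal sketch of the per-level choice, to make it concrete (this is not the paper's exact algorithm; `BOUNDARY`, `strategyFor` and `allocate` are names I'm making up for illustration):

```js
var BOUNDARY = 10   // cap on how far we step into the free interval
var strategies = [] // remembered strategy choice, one per tree depth

function strategyFor (depth) {
  // LSEQ flips a coin the first time a depth is used, then remembers it
  if (strategies[depth] === undefined)
    strategies[depth] = Math.random() < 0.5 ? 'boundary+' : 'boundary-'
  return strategies[depth]
}

function allocate (lo, hi, depth) {
  // assumes hi - lo > 1; otherwise LSEQ would descend to the next level
  var step = Math.min(BOUNDARY, hi - lo - 1)
  var offset = 1 + Math.floor(Math.random() * step)
  return strategyFor(depth) === 'boundary+'
    ? lo + offset // near the left neighbour: cheap appends
    : hi - offset // near the right neighbour: cheap prepends
}
```

The `strategies` array here is exactly the extra state I mean below.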

As I said above, though, the choice of strategy needs to be stored somewhere for all levels of the tree, according to the paper. This is not very beautiful, IMO, and I've been trying to get around this by doing some clever inspection of the identifiers fed to between.

Another useful feature of LSEQ is base doubling, which will probably come in handy for lexiographic between: instead of starting with the full base (i.e. lo=0 and hi=Number.MAX_SAFE_INTEGER) for the first level of identifiers, you start with a small base and double it at each new level. This way the identifier size on each level adapts to the size of the document: if there are many insertions, the identifier size automatically increases; otherwise you get short ones.
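Roughly like this (the starting base is a tuning parameter; 2^5 here is just an assumption for illustration):

```js
// each level of the identifier tree doubles the number of available slots
function baseAt (depth) {
  return Math.pow(2, 5 + depth) // depth 0 has 32 slots, depth 1 has 64, ...
}

// so a fresh level's free interval is lo = 0, hi = baseAt(depth) - 1,
// and long identifiers only show up once a region actually gets crowded
```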

So, tl;dr: feature #1 is the dual strategy for choosing the positions of the identifiers on one level, feature #2 is base doubling to avoid long identifiers for small documents and in areas where nothing happens.

dominictarr commented 8 years ago

Okay, that makes sense, and thanks for the great summary! @noffle and I were hacking on this; he made https://github.com/noffle/bisecting-between, and I also measured relative performance with https://github.com/dominictarr/between-experiment

(I just realized that I forgot to push some stuff and my current experiment scripts are broken... fixing now)

We also had a detailed discussion about it, but not in plainspace... you'll have to join us in cypherspace. %BjUcP6Zg1FEMObv3OxGQXkGiPGaSLY/VT+IKmFk7YSw=.sha256 access via https://github.com/ssbc/patchwork

Let me know and I'll generate you an invite code. Oh, and someone mentioned that paper (but didn't do as good a job of summarizing it as you did ;)

In my experiment I compare relative performance with a simple model that assumes edits are appends, but jump to another position with some probability. Real text editing is probably not uniformly random, but it seems likely that an algorithm that works for uniformly random edits will also work for real ones.
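In pseudo-ish JS, the model is roughly this (the real scripts are in between-experiment; the names and the LO/HI sentinels here are made up):

```js
// simulate `edits` inserts against a given `between` function and return
// the average identifier length; p is the probability of jumping away
// from the current (append) position before an insert
function simulate (between, LO, HI, edits, p) {
  var ids = [LO, HI] // document boundaries
  var pos = 1        // insert position; starts out "appendy"
  var total = 0
  for (var i = 0; i < edits; i++) {
    if (Math.random() < p) // jump to a uniformly random position
      pos = 1 + Math.floor(Math.random() * (ids.length - 1))
    var id = between(ids[pos - 1], ids[pos])
    ids.splice(pos, 0, id)
    pos++ // keep inserting after the new id until the next jump
    total += id.length
  }
  return total / edits
}
```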

I did try to figure out an algorithm which could automatically trade off between random inserts and appends. I didn't consider prepends, though... but I guess that would be a pathological case in my model... hmm

dominictarr commented 8 years ago

okay, I fixed the scripts; the output is in the between-experiment repo. I figure it ought to be possible to have an algorithm which at least matches the most efficient algorithm on those graphs, by deciding which is the right strategy to use at a given time and then using it.
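One possible shape for that decision (pure speculation, none of this exists in either module): watch where recent inserts landed and pick the boundary strategy that would have served them best.

```js
// recent: array of 'front' | 'back' | 'middle' observations of where
// the last few inserts happened (hypothetical bookkeeping)
function pickStrategy (recent) {
  var backs = recent.filter(function (x) { return x === 'back' }).length
  var fronts = recent.filter(function (x) { return x === 'front' }).length
  if (backs > recent.length / 2) return 'boundary+'  // appendy session
  if (fronts > recent.length / 2) return 'boundary-' // prependy session
  // mixed editing: fall back to LSEQ's coin flip
  return Math.random() < 0.5 ? 'boundary+' : 'boundary-'
}
```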

marcelklehr commented 8 years ago

Sweet! I don't quite get the axes, though -- isn't a lower number of chars per edit (and thus shorter identifiers) better? Anyway, I just pushed my implementation of LSEQ: https://github.com/marcelklehr/lseq-between (I didn't implement base doubling yet).

-- I've just installed patchwork-electron. Do I need to set up a publicly accessible pod, too?

dominictarr commented 8 years ago

@marcelklehr no you don't have to run a server, just use the code I just sent you.

Y is always edit length, yes; smaller is better. on the first 4 graphs, X is the number of edits. on the last graph, X is the probability of a jump (read that graph as average edit length as the edit style changes from "appendy" to "inserty")

marcelklehr commented 8 years ago

Should be good for an experiment now. I implemented base doubling plus the alternating strategy on top of your numarray version.

I also get the graphs now; I'd interpreted the legend the wrong way... :)

dominictarr commented 8 years ago

@marcelklehr can you npm publish it?

marcelklehr commented 8 years ago

done