bmizerany / perks

Effective Computation of Things
BSD 2-Clause "Simplified" License
186 stars 53 forks source link

merge is invalid #15

Open kurtbrose opened 9 years ago

kurtbrose commented 9 years ago

The merge operation is not valid for this algorithm. It is not a distributed algorithm, and intermediate results from two streams cannot be combined.

The definition of g from the paper, Delta in the go code: "gi is the difference between the lowest possible rank of item i and the lowest possible rank of item i − 1"

Consider two sequences, with stored nodes marked with ():

...(1)111(3)...

...0(2)...

The node with value=3 will have g/Delta of 4. After the lists are merged:

...0(1)111(2)(3)...

The node with value=3 still has g/Delta of 4, but this is incorrect. The actual g/Delta should be 1. The invariant of the algorithm is broken, and therefore the guaranteed error bounds are also broken.

If there are no longer guaranteed error bounds, you might as well use a reservoir sample.

The fix is to remove the merge() operation. (Multiple goroutines would need to either push points into a queue, or use a lock around the data structure, or some other fancy architecture.)

Q-digest is an example of a distributed algorithm in this space: http://www.cs.virginia.edu/~son/cs851/papers/ucsb.sensys04.pdf

Edit: the original authors of the paper also extended it to be a distributed algorithm. http://www.cis.upenn.edu/~mbgreen/papers/pods04.pdf The data structure used is a bit different, but similar. This would require a total re-write of the core algorithm, but would allow for a valid merge() operation.