Shuffling: add F+tree sampler

mratsim commented 5 years ago

This adds another shuffling alternative that was taken from natural language processing.

Papers:

A New Data Structure for Cumulative Frequency Tables

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.14.8917&rep=rep1&type=pdf A Scalable Asynchronous Distributed Algorithm for Topic Modeling
https://www.cs.utexas.edu/~rofuyu/papers/nomad-lda-www.pdf

The algorithm is also described in this blog post Heaps for incremental computation with an alternative Python implementation using 1-based indexing (and 2x more memory if n is a power of 2).

Bench on my machine:

Testing prime shuffle
[40388, 24854, 44555, 69180, 37292, 27818, 85124, 51675, 75163, 16592]
[40388, 24854, 44555, 69180, 37292, 27818, 85124, 51675, 75163, 16592]
Total runtime:  1.651440143585205
Runtime to compute committee:  0.02932882308959961

Testing feistel shuffle
[82855, 3100, 89704, 87662, 7830, 16014, 57626, 95313, 53632, 97853]
[82855, 3100, 89704, 87662, 7830, 16014, 57626, 95313, 53632, 97853]
Total runtime:  0.15117597579956055
Runtime to compute committee:  0.005218982696533203

Testing Fisher-Yates shuffle
[93044, 39644, 88989, 78137, 17662, 17187, 41433, 85069, 64061, 6647]
Total runtime:  0.1256880760192871

Testing F+tree sampling
[93044, 39644, 88989, 78137, 17662, 17187, 41433, 85069, 64061, 6647]
[50405, 51879, 54080, 30514, 76290, 60097, 83697, 19454, 68194, 85273]
[75628, 50172, 45023, 21349, 80036, 81698, 2829, 994, 88511, 20356]
Total runtime:  1.5932817459106445
Runtime to compute first committee:  0.053829193115234375
Runtime to compute next committee:  0.008832931518554688

vbuterin commented 5 years ago

Not sure I understand; how does this accomplish the property that you can compute any arbitrary value in the shuffling in O(1) time? This seems like an incremental approach that requires computing the 1...n'th values before you can compute the n+1'th.

mratsim commented 5 years ago

That was actually sparked when you were looking into the prime shuffle but it was quite slow. So I researched alternatives. However the Feistel shuffle that was proposed within the same day definitely supersedes this one.

ethereum / research

Shuffling: add F+tree sampler #98