Possible alternative numer-theoretic shuffling algorithm

vbuterin commented 5 years ago

Motivation

Construct a shuffling algorithm where you can compute the value in the shuffled list at any specific position relatively cheaply without computing all of the other values at the same time. This could be used to replace the current shuffle.

Helpers

def is_prime(x):
    return [i for i in range(2, int(x**0.5)+1) if x%i == 0] == []

Algorithm

def value_at_position(n, value, seed):
    # We do the shuffling mod p, the lowest prime >= n, but if we actually shuffle into
    # the "forbidden" [n...p-1] slice we just reshuffle until we get out of that slice
    p = n 
    while not is_prime(p):
        p += 1 
    # x -> x**power is a permutation mod p
    power = 3 
    while (p-1) % power == 0 or not is_prime(power):
        power += 2 
    for round in range(40):
        a = int.from_bytes(seed[(round % 8)*4: (round % 8)*4 + 4], 'big')
        value = (pow(value, power, p) + a) % p
        while value >= n:
            value = (pow(value, power, p) + a) % p
        # Update the seed if needed
        if round % 8 == 0:
            seed = hash(seed)
    return value

def shuffle(values, seed):
    return [values[value_at_position(len(values), i, seed)] for i in range(len(values))]

Note that the above is a maximally simple definition, not an optimal implementation. An optimal implementation would calculate the prime, the exponent, and the a values once, and pass them as inputs to value_at_position, which would then need to do much less work per value.

The main weaknesses are: (i) the computation of the total shuffle is slower even with optimizations, and (ii) the security argument is more heuristic ("it repeats x -> x**3 + k[i] so it's like MIMC") than the current Fisher-Yates shuffle. However, the strengths may be worth it, especially if we wish to cease storing the shuffling in the state.

mratsim commented 5 years ago

What are the security properties that are needed?

Quick analysis of the bottlenecks

The main bottleneck in the current Fisher-Yates shuffling is the repeated modulo in the loop as it's an expensive operation. In the new proposed shuffling there are gcd, pow and modulo.

In my opinion the main advantage of Fisher-Yates is the ability to do it in-place, however I suppose to be able to rollback state, most clients will not implement it in-place but in a temporary location instead.

Alternative

In case we need a temporary location, I think reservoir sampling will be superior in terms of performance while preserving equi-probabilities of all sequences.

The algorithm is very simple and doesn't involve expensive operation, from Wikipedia (array indexing starts at one)

ReservoirSample(S[1..n], R[1..k])
  // fill the reservoir array
  for i = 1 to k
      R[i] := S[i]

  // replace elements with gradually decreasing probability
  for i = k+1 to n
    j := random(1, i)   // important: inclusive range
    if j <= k
        R[j] := S[i]

We only need to couple it with a prng. For example XorShift128+ is used by all major browsers (and could be a eWASM primitive):

#include <stdint.h>

/* The state must be seeded so that it is not all zero */
uint64_t s[2];

uint64_t xorshift128plus(void) {
    uint64_t x = s[0];
    uint64_t const y = s[1];
    s[0] = y;
    x ^= x << 23; // a
    s[1] = x ^ y ^ (x >> 17) ^ (y >> 26); // b, c
    return s[1] + y;
}

Or the latest Xoroshiro128+ by the same author, which is significantly faster and with much better statistical quality.

Benchmark, tests and paper

Implementation in Nim (disclaimer this is the default Nim PRNG):

type
  Rand* = object ## State of the random number generator.
                 ## The procs that use the default state
                 ## are **not** thread-safe!
    a0, a1: ui

var state = Rand(  # <---- We can use our own seed instead
  a0: 0x69B4C98CB8530805u64,
  a1: 0xFED1DD3004688D67CAu64)

proc rotl(x, k: ui): ui =
  result = (x shl k) or (x shr (ui(64) - k))

proc next*(r: var Rand): uint64 =
  ## Uses the state to compute a new ``uint64`` random number.
  let s0 = r.a0
  var s1 = r.a1
  result = s0 + s1
  s1 = s1 xor s0
  r.a0 = rotl(s0, 55) xor s1 xor (s1 shl 14) # a, b
  r.a1 = rotl(s1, 36) # c

Advantage

This is much more hardware friendly.

Drawback

Like the current Fisher-Yates, you need to keep a state.

Papers

Random Sampling with a reservoir, 1985, Vitter
An efficient algorithm for sequential random sampling, 1987, Vitter
- This refines the previous paper and introduces gap sampling
Scrambled Linear Pseudorandom Number Generators, 2018, Blackman, Vigna

mratsim commented 5 years ago

I've been thinking over this and unfortunately reservoir sampling wouldn't work as is.

We need a reservoir that hold the final selection, so it must be the size of the list to shuffle, i.e. we would start with the list and then we remove (~sample) elements randomly from it, so basically it just degrade into random sampling without replacement. However when doing random sampling the data structure used is quite important and a list/vector is not the most efficient one (in my benchmarks a Fenwick tree is very efficient) but if we need to reorganize into a new data structure we might as well do copy+Fisher Yates.

An alternative scheme would be to have a smaller reservoir (say 8 items) and add items ejected from the reservoir to the final list. Unfortunately even if we start from a random point in the original list, the order would be correlated to the original order.

So in short I think reservoir sampling is not suitable.

vbuterin commented 5 years ago

The main desired property is being maximally close to a random permutation. Specifically, it should be maximally computationally difficult to find seeds that ensure that in a shuffling of 1...N, any specific subset x1, x2, x3.... x[k] are in a specific range [v ..... v+k].

The reason why the number-theoretic shuffle above is interesting is that you don't need to compute the whole output to compute a small portion of the output. This makes it much easier to use in light clients, and to verify blocks generally.

vbuterin commented 5 years ago

I just optimized the code a bit; it can shuffle 100000 values in one second (vs 1 million in one second for the Fisher-Yates shuffle). That said, the hash complexity is smaller, so I suspect in an optimized C implementation the number-theoretic shuffle will not be much smaller than out current Fisher-Yates shuffle, as in python the hashes are optimized but the rest of the code is not, leading to hashing costs being understated.

vbuterin commented 5 years ago

Here is a hash-based alternative. Here's the core:

def numhash(x, i, seed, modulus):
  assert 0 <= i < 4
  return (int.from_bytes(hash(x.to_bytes(32, 'big') + seed), 'big') // modulus**i) % modulus

def feistel(x, modulus, seed):
   assert is_perfect_square(modulus) and modulus < 2**50
   h = int(modulus ** 0.5)
   L, R = x//h, x%h
   for i in range(4):
    new_R = (L + numhash(R, i, seed, h)) % h
    L = R
    R = new_R
   return L * h + R

Now for the actual function, let modulus be the smallest perfect square greater than or equal to n. To permute some value, set value = feistel(value, modulus, seed) and if needed repeat until the result is less than n, just as above. This takes 4 hashes to compute for a single number, but to permute an entire list you only need ~sqrt(n) hashes.

vbuterin commented 5 years ago

I implemented both options and the status quo here: https://github.com/ethereum/research/tree/master/shuffling

Here's the timing test output, testing both computing a full shuffle of 100000 and computing just a specific sub-committee of 500 out of the 100000:

Testing prime shuffle
[40388, 24854, 44555, 69180, 37292, 27818, 85124, 51675, 75163, 16592]
[40388, 24854, 44555, 69180, 37292, 27818, 85124, 51675, 75163, 16592]
Total runtime:  1.0141525268554688
Runtime to compute committee:  0.022953271865844727

Testing feistel shuffle
[82855, 3100, 89704, 87662, 7830, 16014, 57626, 95313, 53632, 97853]
[82855, 3100, 89704, 87662, 7830, 16014, 57626, 95313, 53632, 97853]
Total runtime:  0.10488557815551758
Runtime to compute committee:  0.00408935546875

Testing Fisher-Yates shuffle (status quo)
[93044, 39644, 88989, 78137, 17662, 17187, 41433, 85069, 64061, 6647]
Total runtime:  0.08116364479064941

mratsim commented 5 years ago

Feistel shuffling is impressively fast.

I've transformed the current state-of-the-art sampling technique from natural language processing into a shuffling algorithm in https://github.com/ethereum/research/pull/98.

Originally it was for weighted sampling hence why it is rearranging input to efficiently check cumulative probability distributions.

There is an initialisation cost but further sampling are 6.5x cheaper.

Testing prime shuffle
[40388, 24854, 44555, 69180, 37292, 27818, 85124, 51675, 75163, 16592]
[40388, 24854, 44555, 69180, 37292, 27818, 85124, 51675, 75163, 16592]
Total runtime:  1.651440143585205
Runtime to compute committee:  0.02932882308959961

Testing feistel shuffle
[82855, 3100, 89704, 87662, 7830, 16014, 57626, 95313, 53632, 97853]
[82855, 3100, 89704, 87662, 7830, 16014, 57626, 95313, 53632, 97853]
Total runtime:  0.15117597579956055
Runtime to compute committee:  0.005218982696533203

Testing Fisher-Yates shuffle
[93044, 39644, 88989, 78137, 17662, 17187, 41433, 85069, 64061, 6647]
Total runtime:  0.1256880760192871

Testing F+tree sampling
[93044, 39644, 88989, 78137, 17662, 17187, 41433, 85069, 64061, 6647]
[50405, 51879, 54080, 30514, 76290, 60097, 83697, 19454, 68194, 85273]
[75628, 50172, 45023, 21349, 80036, 81698, 2829, 994, 88511, 20356]
Total runtime:  1.5932817459106445
Runtime to compute first committee:  0.053829193115234375
Runtime to compute next committee:  0.008832931518554688

JustinDrake commented 5 years ago

Benedikt Bunz points to this paper as a potential candidate.

drozdziak1 commented 5 years ago

Is it certain that the present shuffling approach is going to be ditched?

JustinDrake commented 5 years ago

@drozdziak1 I'd say 80% :) Light-client friendly shuffles seem like a big win, and they are possible. It's now an engineering/cryptography question to find a really efficient one, maybe even a SNARK/STARK-friendly one.

esaulpaugh commented 5 years ago

It sounds like a CSPRNG is a requirement. Anything less just seems sketchy when deliberate attacks are expected. As they say, attacks only get better

th4s commented 5 years ago

@vbuterin have you run some random number tests (e.g. dieharder) against the primenumber implementation or is this algorithm some well-known CSPRNG?

djrtwo commented 5 years ago

closed by adding "swap or not" #563

ethereum / consensus-specs