GeoMop / MLMC

5 stars 2 forks source link

Bootstrapping, subsample quantity. #144

Open jbrezmorf opened 3 years ago

jbrezmorf commented 3 years ago

Reimplement Quantity.subsample:

Target usage is bootstrapping.

This makes #35 deprecated.

martinspetlik commented 3 years ago

I would wait until we deal with all the changes in othe pull requests, there are definitely changes that affect Quantity.subsample

jbrezmorf commented 3 years ago

Online subsampling algorithms discussed by Vitter

Suppose selection n items from N. The proposed online subsampling algorithm is correct and first proposed by Knuth, that call it S sampling with N random numbers. The improved version is A sampling, with $n$ random numbers.

However we may want to use numpy.random.choice optimized implementation (source). That means: choice(chunk, size = n/N*len(chunk) and then actualize n and N.

jbrezmorf commented 3 years ago

It seems (according experiments) that chunked version of the S-algorithm doesn't work since probability is not correctly updated through the chunk. But for limit cases with chunk size 1 or N it works. Size must be a random variable with suitable distribution.

jbrezmorf commented 3 years ago

Chunk subset size passed as an argument to the choice function should be a random variable with the hypergeometric distribution. The discrete variable k is the size of the chunk subset, parameters are: N - total number of samples, K - total size of subset, n - the chunk size.