Open jbrezmorf opened 3 years ago
I would wait until we deal with all the changes in othe pull requests, there are definitely changes that affect Quantity.subsample
Online subsampling algorithms discussed by Vitter
Suppose selection n items from N. The proposed online subsampling algorithm is correct and first proposed by Knuth, that call it S sampling with N random numbers. The improved version is A sampling, with $n$ random numbers.
However we may want to use numpy.random.choice
optimized implementation
(source).
That means:
choice(chunk, size = n/N*len(chunk)
and then actualize n
and N
.
It seems (according experiments) that chunked version of the S-algorithm doesn't work since probability is not correctly updated through the chunk. But for limit cases with chunk size 1 or N it works. Size must be a random variable with suitable distribution.
Chunk subset size passed as an argument to the choice function should be a random variable with the hypergeometric distribution. The discrete variable k
is the size of the chunk subset, parameters are: N
- total number of samples, K
- total size of subset, n
- the chunk size.
Reimplement
Quantity.subsample
:Target usage is bootstrapping.
This makes #35 deprecated.