JuliaStats / StatsBase.jl

Basic statistics for Julia
Other
583 stars 194 forks source link

Improvement of weighted direct_sample! by searching in increasing order of sample? #728

Open mehdibida opened 3 years ago

mehdibida commented 3 years ago

Would it be interesting to use a version of the weighted direct_sample! that searches for the uniformly sampled numbers in order?

Something that looks like this:

function direct_sample!(rng::AbstractRNG, a::AbstractArray,
                        wv::AbstractWeights, x::AbstractArray)    
    n = length(a)    
    length(wv) == n || throw(DimensionMismatch("Inconsistent lengths."))   
    t = rand(rng, length(x)) * sum(wv)    
    t_sp = sortperm(t)    
    i = 1    
    cw = wv[1]    
    for s_t=1:length(x)   
        while cw < t[t_sp[s_t]] && i < n    
            i += 1    
            @inbounds cw += wv[i]    
        end    
        @inbounds x[t_sp[s_t]] = a[i]    
    end    
    return x    
end

This should be in O(k log(k) + n) where k=length(x) and n=length(a) instead of O(nk), assuming that an efficient (generic) sorting algorithm is used.

nalimilan commented 3 years ago

Thanks for the suggestion. I imagine that could be interesting if you can show benchmarks where it's faster and references and thorough tests that confirm its correct.

mehdibida commented 3 years ago

Hello,

I uploaded the sorting algorithm and its analysis in this notebook: https://github.com/mehdibida/Weighted_Sampling/blob/main/SampleSort_Test_Benchmark.ipynb

Is this sufficient?

Another question: I thought of an improvement that seems to be faster than the alias method in most cases, although I am not a 100% sure of it working correctly, is this the right place to discuss it, or better do it on discourse? Thanks!