Firionus / FastRunningMedian.jl

Efficient running median for Julia

Add support for output pre-allocation #23

Closed: wuhuangjianCN closed this issue 1 year ago

wuhuangjianCN commented 1 year ago

It would be great to add a running_median! function that supports pre-allocation of the outputs, because GC time can account for 50% of the runtime when processing a large number of series. Alternatively, could you add support for multiple series and enable multi-threading?

Firionus commented 1 year ago

I get the feeling you want to optimize performance for a very specific use-case. If so, could you provide a minimum example that we can use for benchmarking?

As for your specific concerns:

  1. Feel free to implement a running_median!(output, input, window_size, tapering=:symmetric) function and put up a PR. It should probably throw an ArgumentError if the length or data type of the output does not fit. Currently, the output data type is always Float64. Other types would be possible, but I'd save that for another issue. The relevant allocating code is here: https://github.com/Firionus/FastRunningMedian.jl/blob/338b6c5c9f9bdb93a0617de8ebfea37dc6c29ad3/src/FastRunningMedian.jl#L73
  2. Multiple series -> Feel free to think of an API and put in a PR. I guess this is about reducing the allocations for the MedianFilter itself, rather than for the output? The allocating code lines for the two heaps and the ring buffer are here: https://github.com/Firionus/FastRunningMedian.jl/blob/338b6c5c9f9bdb93a0617de8ebfea37dc6c29ad3/src/stateful_api.jl#L29-L31 Until then, you can just use the stateful API directly and re-use the MedianFilter (see the sketch after this list). The only problem is that you can't completely empty a MedianFilter right now, since shrink! errors before removing the last element. Maybe put in an optional argument that lets you circumvent that check, so you can shrink it to 0 elements? What happens when you then call median on it? What happens when you call shrink! on a zero-size MedianFilter? All of that would probably be an easy pull request, and I'd appreciate it 👍.
  3. Multithreading -> Unlikely to bring big improvements at the algorithm level, because the algorithm is very serial: it has to maintain a heap structure for every change in the window. If you really want parallelization, split your data into parts, run running_median on each, and stitch the results back together. Figuring out the boundaries/overlaps will be fun, though. You can do all of that without modifying this library. If you come up with a good implementation, feel free to put in a PR; it might be of interest to others as well.
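Until a running_median! exists, a rough sketch of re-using the stateful API per column might look like this (assuming the stateful MedianFilter, grow!, roll! and median behave as documented; since a MedianFilter can't be emptied yet, a new one is built per column, and only the untapered medians are produced):

using FastRunningMedian

# Compute untapered running medians for every column of a matrix via the
# stateful API. Each column of length N yields N - window_size + 1 medians.
function column_running_medians(X::AbstractMatrix{<:Real}, window_size::Int)
    N = size(X, 1)
    out = Matrix{Float64}(undef, N - window_size + 1, size(X, 2))
    for j in axes(X, 2)
        mf = MedianFilter(Float64(X[1, j]), window_size)
        for i in 2:window_size              # fill the window
            grow!(mf, Float64(X[i, j]))
        end
        out[1, j] = median(mf)
        for i in window_size+1:N            # slide the window down the column
            roll!(mf, Float64(X[i, j]))
            out[i-window_size+1, j] = median(mf)
        end
    end
    return out
end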
wuhuangjianCN commented 1 year ago

> I get the feeling you want to optimize performance for a very specific use-case. If so, could you provide a minimum example that we can use for benchmarking?

Sure, here is an example.

using FastRunningMedian
xall = rand(24*30*3, 6, 10000);
ftest(xall) = mapslices(x -> running_median(x, 24*15*2), xall, dims=1)
@time ftest(xall);

and the output is:

 46.135678 seconds (2.21 M allocations: 6.005 GiB, 65.58% gc time)

wuhuangjianCN commented 1 year ago

> 2. Until then, you can just use the stateful API directly and re-use the MedianFilter. The only problem is that you can't completely empty a MedianFilter right now, since shrink! errors before removing the last element. Maybe put in an optional argument that lets you circumvent that check, so you can shrink it to 0 elements? What happens when you then call median on it? What happens when you call shrink! on a zero-size MedianFilter? All of that would probably be an easy pull request, and I'd appreciate it 👍.

I have added support for multiple series and submitted a pull request. However, this is my first PR and I'm not sure if I have done it right. Let me know if you received the PR, thanks.

In the updated version, the number of allocations does not increase with the number of series, and the runtime can be reduced by about 50% when using multiple threads. Here is the test code:

using Revise
using DataStructures
includet("../src/FastRunningMedian.jl")
using Main.FastRunningMedian

X_all = rand(24*30*3, 6, 10000)
wSize = 24*15*2 + 1

f1(X_all, wSize) = mapslices(x -> running_median(x, wSize), X_all, dims=1);
f2(X_all, wSize) = running_median(X_all, wSize);

function f3(X_all, wSize)
    Y_all = similar(X_all)
    Threads.@threads for n_var in axes(X_all, 2)
        Y_all[:, n_var, :] = mapslices(x -> running_median(x, wSize), view(X_all, :, n_var, :), dims=1)
    end
    return Y_all
end

function f4(X_all, wSize)
    Y_all = similar(X_all)
    Threads.@threads for n_var in axes(X_all, 2)
        Y_all[:, n_var, :] = running_median(view(X_all, :, n_var, :), wSize)
    end
    return Y_all
end

@time Y_base = f1(X_all, wSize); # single series
# 13.929382 seconds (2.21 M allocations: 6.009 GiB, 6.55% gc time)

@time Y_test = f2(X_all, wSize); # multiple series
# 13.338668 seconds (65 allocations: 1.333 GiB, 1.80% gc time)

@time Y_par = f3(X_all, wSize); # single series with multiple threads
# 5.952394 seconds (2.27 M allocations: 6.980 GiB, 47.94% gc time)

@time Y_par = f4(X_all, wSize); # multiple series with multiple threads
# 2.810954 seconds (383 allocations: 2.367 GiB, 0.17% gc time)

println("Y_base == Y_test ?  ", Y_base == Y_test)
# Y_base == Y_test ?  true
Firionus commented 1 year ago

About the example: I can't reproduce that large GC percentage on my laptop. Here are my results on Julia 1.9 (please ignore the exact timings, they are imprecise due to CPU temperature and boosting):

julia> @time ftest(xall);
 11.904905 seconds (1.25 M allocations: 3.918 GiB, 0.86% gc time)

julia> versioninfo()
Julia Version 1.9.0-rc1
Commit 3b2e0d8fbc1 (2023-03-07 07:51 UTC)
Platform Info:
  OS: Linux (x86_64-linux-gnu)
  CPU: 8 × Intel(R) Core(TM) i7-10510U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, skylake)
  Threads: 1 on 8 virtual cores

Similar results with Julia 1.6.7:

julia> @time ftest(xall);
 12.775667 seconds (3.17 M allocations: 6.352 GiB, 2.55% gc time)

and 1.2.0:

julia> @time ftest(xall);
 12.329736 seconds (3.17 M allocations: 6.352 GiB, 2.48% gc time)

Not sure why we have such different GC behavior. :man_shrugging:


About your own code changes: I didn't get your pull request, unfortunately. It should be visible on this very repository if you click on "Pull requests" in the repository navigation bar. Maybe this guide can help you: https://opensource.com/article/19/7/create-pull-request-github

I'm looking forward to seeing what your API idea and code changes are, and to seeing how we can incorporate them :blush:


Even though I haven't received your code yet, my curiosity was piqued. First, here is your benchmark again with v0.2:

julia> @time ftest(xall);
 12.553681 seconds (1.97 M allocations: 5.998 GiB, 1.20% gc time)

Then I created the avoid-allocation branch. The first improvement was to pre-size the heaps with sizehint!. This already reduced allocations by about 30 %:

julia> @time ftest(xall);
 11.873653 seconds (1.25 M allocations: 3.918 GiB, 0.62% gc time)
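As a generic illustration of the sizehint! idea (a toy example on a plain Vector, not the actual heap internals in the branch): reserving capacity up front avoids repeated growth reallocations while elements are pushed.

# Reserve capacity so the push! loop below does not trigger
# repeated reallocations of the vector's buffer.
buf = Float64[]
sizehint!(buf, 10_000)
for x in rand(10_000)
    push!(buf, x)
end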

Then, I added running_median! where the output and median filter can be provided by the user, further reducing allocations by 60 %:

function ftest2(xall)
    output = Array{Float64, 3}(undef, (2159, 6, 10000))
    mf = MedianFilter(0., 24*15*2)
    for i in axes(xall, 2), j in axes(xall, 3)
        running_median!(mf, @view(output[:, i, j]), @view(xall[:, i, j]))
    end
    output
end

julia> @time ftest2(xall);
 11.480872 seconds (240.04 k allocations: 1.432 GiB, 0.06% gc time)

So overall, this new branch reduces allocations by over 70 %, leading to a slight performance improvement on my machine. And the results are the same of course:

julia> ftest(xall) == ftest2(xall)
true

Would you mind testing your use case with the avoid-allocation branch and running_median!?
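In case it helps, an unregistered branch can be installed via Pkg; a sketch, using the branch name from above:

using Pkg
# Install the package directly from the GitHub branch under test
Pkg.add(PackageSpec(url="https://github.com/Firionus/FastRunningMedian.jl", rev="avoid-allocation"))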

Firionus commented 1 year ago

I'd consider this done as of https://github.com/Firionus/FastRunningMedian.jl/commit/5527221f9c93ea779ef2227739c13190986ac6ad. There is a new function running_median! which allows custom allocation of the MedianFilter and the output. Also, running_median and running_median! now take multidimensional arrays and calculate the running median independently over all columns. Pre-sizing of the heaps has also been included. I'll release the changes in the near future, but I might bundle in some more changes, since this comes with some breaking improvements to the stateful API. Thanks for your questions, and should you need anything else in the future, feel free to open another issue :smile:
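For orientation, usage along the lines described above might look like the following sketch (based on the code in this thread; exact signatures and defaults should be checked against the released documentation, since the stateful API changes are breaking):

using FastRunningMedian

X = rand(2160, 6, 100)
w = 721                          # odd window, so the symmetric output keeps the input length

# running_median on a multidimensional array filters each column independently
Y = running_median(X, w)

# running_median! with a user-provided MedianFilter and output to avoid allocations
mf = MedianFilter(0.0, w)
out = similar(X)
for i in axes(X, 2), j in axes(X, 3)
    running_median!(mf, view(out, :, i, j), view(X, :, i, j))
end
Y == out                         # expected to return true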