FourierFlows / FourierFlows.jl

Tools for building fast, hackable, pseudospectral partial differential equation solvers on periodic domains
https://bit.ly/FourierFlows
MIT License
207 stars 29 forks source link

Use FastBroadcast.jl to speedup broadcasting #338

Open navidcy opened 1 year ago

navidcy commented 1 year ago

Closes #337

glwagner commented 1 year ago

benchmark?

navidcy commented 1 year ago

I did one and it turned out same! Working on it...

glwagner commented 1 year ago

Maybe that means the operations can't benefit from further compiler optimization.

As for multithreading (the other feature of FastBroadcast), you have to write

@.. thread=true

to make it multithreaded:

https://github.com/YingboMa/FastBroadcast.jl

But even that may not yield a speed up because it depends on the relative cost of transforms vs broadcasts.

doraemonho commented 1 year ago

I think I may know part of the reasons. To get a speed up from multithreading, we have to under the compute-bound regime, in which processing speed is the limit. If not, we may under the memory-bound regime, in which the speed of moving the data from memory to CPU is the bottleneck.

For @. A = B + C, CPU may spend more time in moving the data than computing the B + C. As a result, we dont get a speed-up for large arrays. (Btw, you will still get a speed-up if A is small enough)

For example, for my 5900x (8 threads),

using BenchmarkTools
using FastBroadcast

# "Small" Array Test
A = abs.(rand(100,100,100));
B = copy(A);
C = copy(A);

@btime @. A = B + C;
# 185.929 μs (2 allocations: 64 bytes)
@btime @.. thread=true A = B + C;
# 28.839 μs (2 allocations: 64 bytes)

# "Large" Array Test
A = abs.(rand(300,300,300));
B = copy(A);
C = copy(A);

@btime @. A = B + C;
#21.720 ms (2 allocations: 64 bytes)

@btime @.. thread=true A = B + C;
#20.708 ms (2 allocations: 64 bytes)

However, if we are under the compute-bound regime, such as @. A= 0.1*log10(B) + 0.25*log(C), since log10 is a computationally expensive operation, the CPU will actually spend more time on computing rather than moving data. So, we would expect a descant speed up.

@btime @. A = 0.1*log10(B) + 0.25*log(C);
# 240.630 ms (10 allocations: 288 bytes)
@btime @.. thread=true A = 0.1*log10(B) + 0.25*log(C);
#35.817 ms (10 allocations: 288 bytes)

Nonetheless, for small 2D problem, I think @.. will still benfit the computation

glwagner commented 1 year ago

@doraemonho thank you for that very clear explanation! It's fortunate there isn't a slowdown in the memory-bound regime...

Either way, I suppose this shows that this package may be more important downstream (eg GeophysicalFlows or users of GeophysicalFlows), where we might encounter more complex operations like log10.