cafaxo opened 1 year ago
julia> function foo_debug(y, x, b)
           return LoopVectorization.@turbo_debug for i in axes(y, 1), c in axes(y, 2), n in axes(y, 3)
               y[i, c, n] = x[i, c, n] + b[c]
           end
           return nothing
       end
foo_debug (generic function with 1 method)
julia> ls = foo_debug(y,x,b);
julia> LoopVectorization.choose_order(ls)
([:c, :n, :i], :n, Symbol("##undefined##"), :i, 4, -1)
It was ordering the loops c, n, i, probably to try and minimize the number of times you reload from b.
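Roughly, that order corresponds to a loop nest like the following (a hand-written sketch of the chosen order, not the code LoopVectorization actually generates):

# Sketch of the [:c, :n, :i] order, outermost to innermost. With c outermost,
# b[c] only needs to be loaded once per value of c:
for c in axes(y, 2)
    bc = b[c]
    for n in axes(y, 3)
        @simd for i in axes(y, 1)   # i is the vectorized loop
            @inbounds y[i, c, n] = x[i, c, n] + bc
        end
    end
end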
So foo! does use fewer CPU instructions than foo2!:
julia> using LinuxPerf
julia> foreachf(f::F, N::Int, arg1::A, args::Vararg{Any,K}) where {F,A,K} = foreach(_ -> Base.donotdelete(f(arg1, args...)), 1:N)
foreachf (generic function with 1 method)
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
           foreachf(foo!, 10_000, y, x, b)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles 6.11e+09 33.3% # 4.0 cycles per ns
└ task-clock 1.54e+09 33.3% # 1.5 s
┌ instructions 7.56e+08 66.7% # 0.1 insns per cycle
│ branch-instructions 2.38e+07 66.7% # 3.2% of insns
└ branch-misses 2.47e+04 66.7% # 0.1% of branch insns
┌ L1-dcache-load-misses 6.69e+08 33.3% # 200.9% of dcache loads
│ L1-dcache-loads 3.33e+08 33.3%
│ cache-misses 1.26e+05 33.3% # 0.0% of cache refs
└ cache-references 6.51e+08 33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
julia> @pstats "(cpu-cycles,task-clock),(instructions,branch-instructions,branch-misses), (L1-dcache-load-misses, L1-dcache-loads, cache-misses, cache-references)" begin
           foreachf(foo2!, 10_000, y, x, b)
       end
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
┌ cpu-cycles 6.11e+09 33.3% # 4.0 cycles per ns
└ task-clock 1.53e+09 33.3% # 1.5 s
┌ instructions 8.91e+08 66.7% # 0.1 insns per cycle
│ branch-instructions 2.60e+07 66.7% # 2.9% of insns
└ branch-misses 3.13e+04 66.7% # 0.1% of branch insns
┌ L1-dcache-load-misses 6.57e+08 33.3% # 157.4% of dcache loads
│ L1-dcache-loads 4.17e+08 33.3%
│ cache-misses 7.52e+04 33.3% # 0.0% of cache refs
└ cache-references 6.55e+08 33.3%
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
I'm running them 10_000 times to get a decent sample. Note that foo2! took 8.91e4 instructions per call, while foo! required only 7.56e4.
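Those per-call figures are just the @pstats totals above divided by the 10_000 calls:

7.56e8 / 10_000   # = 75_600 instructions per call of foo!
8.91e8 / 10_000   # = 89_100 instructions per call of foo2!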
@turbo doesn't seem to properly consider that traversing the memory of y[i, c, n] and x[i, c, n] in the c, n, i order is potentially more costly than n, c, i.
"Potentially", FWIW:
julia> @btime foo!($y, $x, $b)
149.690 μs (0 allocations: 0 bytes)
julia> @btime foo2!($y, $x, $b)
148.181 μs (0 allocations: 0 bytes)
julia> @btime foo3!($y, $x, $b)
148.989 μs (0 allocations: 0 bytes)
julia> @btime foo4!($y, $x, $b) # foo3! but with `@inbounds`
149.214 μs (0 allocations: 0 bytes)
Notice the pitiful 0.1 insns per cycle from my @pstats reports above. That means the CPU is only executing one instruction every 10 clock cycles. This is really bad: this CPU is theoretically capable of up to 4 instructions/cycle, 40x higher.
Why so few instructions per cycle? Because in all 4 versions of the code, the CPU spends all of its time waiting on memory. Memory bandwidth is the bottleneck, so what the CPU is doing and how much work it has to do is totally irrelevant.
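As a back-of-the-envelope check (assuming the arrays here are the 64×32×256 Float32 arrays from the benchmarks below, i.e. 2 MiB each):

# Each call reads all of x and writes all of y:
traffic = 2 * 64 * 32 * 256 * sizeof(Float32)   # ≈ 4 MiB of traffic per call
traffic / 150e-6 / 1e9                          # ≈ 28 GB/s at ~150 μs per call, in the ballpark of DRAM bandwidth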
Still, it's the same amount of memory being loaded, and I don't see that foo! (using an order meant to minimize the number of instructions needed) is any worse than the others (which traverse memory in a theoretically nicer order).
Unfortunately, LinuxPerf is Linux-only.
But I can reproduce the timings you reported:
julia> @btime foo!($y, $x, $b)
133.917 μs (0 allocations: 0 bytes)
julia> @btime foo2!($y, $x, $b)
51.708 μs (0 allocations: 0 bytes)
julia> @btime foo3!($y, $x, $b)
55.083 μs (0 allocations: 0 bytes)
julia> @btime foo4!($y, $x, $b)
55.083 μs (0 allocations: 0 bytes)
julia> function foo_debug(y, x, b)
           return LoopVectorization.@turbo_debug for i in axes(y, 1), c in axes(y, 2), n in axes(y, 3)
               y[i, c, n] = x[i, c, n] + b[c]
           end
           return nothing
       end
foo_debug (generic function with 1 method)
julia> ls = foo_debug(y,x,b);
julia> LoopVectorization.choose_order(ls)
([:c, :n, :i], :n, Symbol("##undefined##"), :i, 4, -1)
The M1 has much higher memory bandwidth than most x64 CPUs, including mine. Hence it's able to do much better here, at least given the nicer order.
If I shrink the size of the arrays on my x64 computer (the first one I benchmarked above), making them 8x smaller, the code runs about 30x faster, as memory becomes less of an issue:
julia> y32 = randn_stridearray(64, 32, 32);
julia> x32 = randn_stridearray(64, 32, 32);
julia> @btime foo!($y32, $x32, $b)
5.777 μs (0 allocations: 0 bytes)
julia> @btime foo2!($y32, $x32, $b)
4.614 μs (0 allocations: 0 bytes)
julia> @btime foo3!($y32, $x32, $b)
5.007 μs (0 allocations: 0 bytes)
julia> @btime foo4!($y32, $x32, $b)
5.084 μs (0 allocations: 0 bytes)
and now I can also see on this computer that foo! seems to be worse than foo2!.
If you want to try fiddling with it:
https://github.com/JuliaSIMD/LoopVectorization.jl/blob/2776b6dc5b8b0cd83d66a11f612019efabb3905a/src/modeling/determinestrategy.jl#L808
See the 10.0 there? Try making it larger. That's a coefficient on how heavily to penalize iterating over loops in a bad order. Increasing it to 50.0 seems to fix it:
julia> LoopVectorization.choose_order(ls)
([:n, :c, :i], :c, Symbol("##undefined##"), :i, 4, -1)
Now n would be the outer-most loop. You'd have to actually force recompilation after making a change like that (e.g. by restarting Julia, or by editing _turbo_!, such as commenting a useless line like 1+1 in or out).
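For example, one workflow for experimenting with it (just a suggested setup, not anything the package provides):

# Dev the package so Julia loads a local, editable copy:
using Pkg
Pkg.develop("LoopVectorization")   # checkout lands in ~/.julia/dev/LoopVectorization

# Edit the 10.0 in src/modeling/determinestrategy.jl, restart Julia, then:
using LoopVectorization
ls = foo_debug(y, x, b)
LoopVectorization.choose_order(ls)  # with 50.0 this reports [:n, :c, :i] here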
With that, I now get:
julia> @btime foo!($y32, $x32, $b)
4.620 μs (0 allocations: 0 bytes)
for the size(..., 3) == 32 example.
I can make a PR, or you're welcome to play with it to try and find a heuristic that works well.
BTW, have you seen VectorizedRNG.jl? On the M1:
julia> vrandn_stridearray(size...) = StrideArray(randn(local_rng(), Float32, size...), static.(size))
vrandn_stridearray (generic function with 1 method)
julia> @benchmark randn_stridearray(64, 32, 256)
BenchmarkTools.Trial: 2344 samples with 1 evaluation.
Range (min … max): 1.969 ms … 3.901 ms ┊ GC (min … max): 0.00% … 44.07%
Time (median): 2.078 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.132 ms ± 267.936 μs ┊ GC (mean ± σ): 2.60% ± 7.65%
▃ ▇█▂
█▇▆████▇▆▅▅▃▆▃▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▅▅▃▄▆▅▄▆▅▃▃▁▁▁▁▃▄▇▇▆▅▅▄▆▆ █
1.97 ms Histogram: log(frequency) by time 3.54 ms <
Memory estimate: 2.00 MiB, allocs estimate: 6.
julia> @benchmark vrandn_stridearray(64, 32, 256)
BenchmarkTools.Trial: 6752 samples with 1 evaluation.
Range (min … max): 622.333 μs … 3.232 ms ┊ GC (min … max): 0.00% … 77.36%
Time (median): 708.791 μs ┊ GC (median): 0.00%
Time (mean ± σ): 739.788 μs ± 165.578 μs ┊ GC (mean ± σ): 4.59% ± 10.53%
▄ ▃█▆▂ ▁ ▁
█▅▃▃████▇▄▅▄▄▄▄▃▅▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▁▃▇▇▅▄▅█▆█▆▅▁▁▁▁▃▃▃▃██ █
622 μs Histogram: log(frequency) by time 1.55 ms <
Memory estimate: 2.00 MiB, allocs estimate: 10.
On my Linux x64 machine with AVX512:
julia> vrandn_stridearray(size...) = StrideArray(randn(local_rng(), Float32, size...), static.(size))
vrandn_stridearray (generic function with 1 method)
julia> @benchmark randn_stridearray(64, 32, 256)
BenchmarkTools.Trial: 2349 samples with 1 evaluation.
Range (min … max): 2.053 ms … 4.074 ms ┊ GC (min … max): 0.00% … 37.17%
Time (median): 2.092 ms ┊ GC (median): 0.00%
Time (mean ± σ): 2.126 ms ± 196.034 μs ┊ GC (mean ± σ): 1.46% ± 5.89%
█▅
▃███▇▄▄▁▄▁▃▃▃▄▆▅▁▁▁▁▁▃▄▃▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▄▃▆▅▅▅▅▅▅▄▃ █
2.05 ms Histogram: log(frequency) by time 3.29 ms <
Memory estimate: 2.00 MiB, allocs estimate: 6.
julia> @benchmark vrandn_stridearray(64, 32, 256)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
Range (min … max): 195.751 μs … 2.561 ms ┊ GC (min … max): 0.00% … 68.61%
Time (median): 202.900 μs ┊ GC (median): 0.00%
Time (mean ± σ): 235.975 μs ± 202.390 μs ┊ GC (mean ± σ): 12.61% ± 12.55%
█ ▁
██▆▄▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▆▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▃▇██ █
196 μs Histogram: log(frequency) by time 1.48 ms <
Memory estimate: 2.00 MiB, allocs estimate: 2.
Thanks for the elaborate answer! I'll try playing with that constant. But yeah, I am trying to train a convolutional neural network for chess on the CPU, and memory seems to be quite the bottleneck. I'll have to check with perf on my Linux machine.
I haven't tried VectorizedRNG but that could be extremely useful for a different project I am working on (sampling a distribution using random-walk metropolis).
> But yeah, I am trying to train a convolutional neural network for chess on the CPU, and memory seems to be quite the bottleneck. I'll have to check with perf on my Linux machine.
Using batch sizes small enough to fit in cache should help, if possible. Have you looked at SimpleChains.jl at all?
It could use more documentation and the API needs work (especially around memory management), but it does have a convolutional network example that performs much better than Flux or PyTorch do on the CPU, at least for that (small) problem size.
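For a rough sense of scale (hypothetical numbers, assuming 64×32 Float32 feature maps per sample, like the arrays above):

# Activation footprint of a single 64×32×N Float32 array as a function of batch size N:
act_mib(N) = 64 * 32 * N * sizeof(Float32) / 2^20
act_mib(256)   # ≈ 2.0 MiB per array, already larger than many L2 caches
act_mib(32)    # ≈ 0.25 MiB, much more likely to stay cache-resident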
I just tested it: smaller batch sizes are indeed faster! (That's exactly the opposite of the behavior I see with pytorch+GPU.)

I first looked at SimpleChains.jl, and its performance convinced me that I could reach better training speed for my use case on the M1's CPU than with pytorch on the GPU backend. However, I needed features such as zero-padding and residual connections, which SimpleChains does not support. Some parts of SimpleChains also seemed somewhat convoluted (pun intended). You seem to be doing some nontrivial stuff for memory management, right? In any case, in order to get a better intuition for what the backward passes look like for my network, I decided to roll my own forward+backward passes on top of StrideArrays+LoopVec.
With SimpleChains, I also ran into this error: https://julialang.slack.com/archives/C690QRAA3/p1668460255442179 Should I open an issue for that?

Also, by the way: the way the transpose for 2d convolution is implemented in SimpleChains (manually checking for inbounds in the inner for loop) is super slow on my machine. I am getting much better performance by first copying into a preallocated zero-padded array and then doing the convolution without the inbounds check.
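For reference, a minimal sketch of that pad-then-convolve idea (single-channel 2d case; the function and kernel loop here are illustrative, not the SimpleChains implementation):

using LoopVectorization

# padded is preallocated with size size(img) .+ (kh - 1, kw - 1);
# out has the same size as img ("same" padding, odd kernel dims assumed).
function conv_padded!(out, padded, img, kern)
    kh, kw = size(kern)
    ph, pw = (kh - 1) ÷ 2, (kw - 1) ÷ 2
    fill!(padded, zero(eltype(padded)))                        # zero border
    padded[ph+1:ph+size(img, 1), pw+1:pw+size(img, 2)] .= img  # copy the interior
    # With the input pre-padded, the inner loops never leave the array,
    # so no per-element bounds branch is needed:
    @turbo for j in axes(out, 2), i in axes(out, 1)
        acc = zero(eltype(out))
        for l in axes(kern, 2), k in axes(kern, 1)
            acc += padded[i + k - 1, j + l - 1] * kern[k, l]
        end
        out[i, j] = acc
    end
    return out
end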
Consider the following: I would expect foo2 to perform slightly worse than, or the same as, foo. However, the result is the opposite: foo2 is significantly faster. What is happening here? Am I doing something wrong? If it matters: I am running this on an Apple M1.

Edit: A plain for loop is about as fast as foo2!.