It looks like there is a performance crossover for Tuples of length 16:
julia> t = Tuple(shuffle(1:15))
(4, 3, 2, 8, 14, 9, 11, 13, 12, 7, 6, 5, 1, 15, 10)
julia> @btime TupleTools.isperm($t)
43.287 ns (0 allocations: 0 bytes)
true
julia> @btime Base.isperm($t)
0.015 ns (0 allocations: 0 bytes)
true
julia> t = Tuple(shuffle(1:16))
(13, 4, 3, 1, 5, 14, 15, 6, 9, 16, 8, 2, 7, 10, 12, 11)
julia> @btime TupleTools.isperm($t)
41.737 ns (0 allocations: 0 bytes)
true
julia> @btime Base.isperm($t)
56.686 ns (2 allocations: 128 bytes)
true
Probably hitting some inference issue. Still, the base version is generally faster. Perhaps TupleTools.isperm could use Base.isperm for tuples of length 15 or less and the current version for tuples of length 16 or more. I could make a PR if that sounds reasonable to you.
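Something like this minimal sketch (hypothetical dispatch, not an actual PR; the cutoff of 15 is taken from the timings above):

using TupleTools

# Hypothetical hybrid: defer to Base's non-allocating check for short tuples
# and to TupleTools for the longer tuples where Base starts allocating.
hybrid_isperm(t::Tuple) = length(t) <= 15 ? Base.isperm(t) : TupleTools.isperm(t)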
Hi Matt, thanks for the nice comments regarding Strided. I can certainly accept that Base.isperm is faster than the one from TupleTools in certain regimes, but I think there is something off with your timings: sub-nanosecond results indicate that something is wrong. I think that for Base.isperm in your case constant folding kicks in, so everything being timed is optimized away. There was indeed a PR in Julia Base to implement isperm for tuples, where a detailed analysis was performed, and I think this problem also came up there. Let me see if I can recover the PR. In particular, I would expect the implementation from Base to be faster at least for larger tuples, but maybe also for smaller ones.
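For example, this is a minimal illustration (not output from this thread) of how interpolating a constant tuple lets the whole call fold away:

using BenchmarkTools

t = (2, 3, 1)
# `$t` bakes the tuple in as a compile-time constant, so `isperm` can be
# constant-folded to `true` and the reported sub-nanosecond time is
# meaningless.
@btime Base.isperm($t)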
This is the one; I am too tired to read the discussion right now, but I will take a look tomorrow to refresh my memory: https://github.com/JuliaLang/julia/pull/35234
Ah right, I forgot about that. Here are some benchmarks using the Ref trick (https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/manual.md#understanding-compiler-optimizations):
julia> function f(N)
           t = Tuple(shuffle(1:N))
           println("Length = $N")
           println("TupleTools:")
           @btime TupleTools.isperm($(Ref(t))[])
           println("Base:")
           @btime Base.isperm($(Ref(t))[])
           println()
       end
f (generic function with 1 method)
julia> for n in 1:20 f(n) end
Length = 1
TupleTools:
1.179 ns (0 allocations: 0 bytes)
Base:
1.179 ns (0 allocations: 0 bytes)
Length = 2
TupleTools:
1.181 ns (0 allocations: 0 bytes)
Base:
1.181 ns (0 allocations: 0 bytes)
Length = 3
TupleTools:
2.811 ns (0 allocations: 0 bytes)
Base:
2.110 ns (0 allocations: 0 bytes)
Length = 4
TupleTools:
4.679 ns (0 allocations: 0 bytes)
Base:
2.112 ns (0 allocations: 0 bytes)
Length = 5
TupleTools:
5.608 ns (0 allocations: 0 bytes)
Base:
4.482 ns (0 allocations: 0 bytes)
Length = 6
TupleTools:
8.437 ns (0 allocations: 0 bytes)
Base:
7.033 ns (0 allocations: 0 bytes)
Length = 7
TupleTools:
12.142 ns (0 allocations: 0 bytes)
Base:
9.596 ns (0 allocations: 0 bytes)
Length = 8
TupleTools:
16.163 ns (0 allocations: 0 bytes)
Base:
9.561 ns (0 allocations: 0 bytes)
Length = 9
TupleTools:
20.294 ns (0 allocations: 0 bytes)
Base:
15.329 ns (0 allocations: 0 bytes)
Length = 10
TupleTools:
21.488 ns (0 allocations: 0 bytes)
Base:
17.359 ns (0 allocations: 0 bytes)
Length = 11
TupleTools:
25.075 ns (0 allocations: 0 bytes)
Base:
21.084 ns (0 allocations: 0 bytes)
Length = 12
TupleTools:
28.421 ns (0 allocations: 0 bytes)
Base:
19.587 ns (0 allocations: 0 bytes)
Length = 13
TupleTools:
28.816 ns (0 allocations: 0 bytes)
Base:
40.336 ns (0 allocations: 0 bytes)
Length = 14
TupleTools:
36.251 ns (0 allocations: 0 bytes)
Base:
44.136 ns (0 allocations: 0 bytes)
Length = 15
TupleTools:
39.280 ns (0 allocations: 0 bytes)
Base:
51.793 ns (0 allocations: 0 bytes)
Length = 16
TupleTools:
45.356 ns (0 allocations: 0 bytes)
Base:
58.012 ns (2 allocations: 128 bytes)
Length = 17
TupleTools:
47.538 ns (0 allocations: 0 bytes)
Base:
60.237 ns (2 allocations: 128 bytes)
Length = 18
TupleTools:
55.415 ns (0 allocations: 0 bytes)
Base:
62.752 ns (2 allocations: 128 bytes)
Length = 19
TupleTools:
58.755 ns (0 allocations: 0 bytes)
Base:
63.228 ns (2 allocations: 128 bytes)
Length = 20
TupleTools:
66.417 ns (0 allocations: 0 bytes)
Base:
65.484 ns (2 allocations: 128 bytes)
Does that look reasonable? It looks like the picture is not so clear.
Ok, very interesting timings. Quite strange indeed; which Julia version is this? I would have expected the TupleTools version to become slow at some point, once inference gives up on the recursion in the TupleTools implementation.
Anyway, you could maybe try to just replace the isperm call in Strided with Base.isperm and see how this affects your benchmarks? I am interested in learning more about your C++ versus Julia ITensor benchmarks, and would of course also like to compare with my own TensorKit package.
Those were with Julia v1.5.2.
I don't see any recursion in TupleTools.isperm: https://github.com/Jutho/TupleTools.jl/blob/master/src/TupleTools.jl#L287
It looks like Base.isperm special-cases length 2, then uses recursion up to length 15, then uses a loop after that: https://github.com/JuliaLang/julia/blob/master/base/combinatorics.jl#L84
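A rough paraphrase of that strategy (illustrative only, not Base's actual code; see the linked source for the real definitions):

# Short tuples get a non-allocating membership check; long tuples fall back
# to a generic scan over heap-allocated scratch storage, which is where the
# 2 allocations at length >= 16 in the timings above come from.
sketch_isperm(p::Tuple) =
    length(p) <= 15 ? all(i -> i in p, 1:length(p)) : Base.isperm(collect(p))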
I tried swapping in Base.isperm in Strided and it didn't make much difference in my benchmarks. The tensors are order 12 with blocks of size 1, so from the benchmark above it should have been a bit faster, but it wasn't noticeable. Anyway, overall they are close enough that I don't think it makes much of a difference whether TupleTools.isperm or Base.isperm is used. In any more realistic operation (probably anything with arrays larger than 1), the actual permutation will dominate the cost of isperm, and I could just make a special case in my code to not do a permutation if the block is size 1.
We are in the process of finishing up some benchmarks for the ITensor paper; I can let you know when that is done. Indeed, it would be very interesting to get benchmarks comparing to TensorKit.
Actually, here are some interesting results. I was inspired by Base._isperm here: https://github.com/JuliaLang/julia/blob/master/base/combinatorics.jl#L71
If you replace the Vector storage there with an MVector, the results look quite good for tuples of length greater than 5:
using BenchmarkTools, Random, StaticArrays, TupleTools
function _isperm(A)
    n = length(A)
    # fixed-size scratch flags; avoids the heap-allocated Vector in Base._isperm
    used = @MVector zeros(Bool, n)
    for a in A
        # out-of-range entries fail immediately; a repeated entry flips its
        # flag back to false, which also fails the check
        (0 < a <= n) && (used[a] ⊻= true) || return false
    end
    true
end
function f(N)
    t = Tuple(shuffle(1:N))
    println("Length = $N")
    println("TupleTools:")
    @assert @btime TupleTools.isperm($(Ref(t))[])
    println("Base:")
    @assert @btime Base.isperm($(Ref(t))[])
    println("_isperm:")
    @assert @btime _isperm($(Ref(t))[])
    println()
end
gives:
julia> f.(1:20)
Length = 1
TupleTools:
1.180 ns (0 allocations: 0 bytes)
Base:
1.208 ns (0 allocations: 0 bytes)
_isperm:
2.112 ns (0 allocations: 0 bytes)
Length = 2
TupleTools:
1.184 ns (0 allocations: 0 bytes)
Base:
1.184 ns (0 allocations: 0 bytes)
_isperm:
3.281 ns (0 allocations: 0 bytes)
Length = 3
TupleTools:
2.809 ns (0 allocations: 0 bytes)
Base:
2.063 ns (0 allocations: 0 bytes)
_isperm:
3.824 ns (0 allocations: 0 bytes)
Length = 4
TupleTools:
3.747 ns (0 allocations: 0 bytes)
Base:
2.110 ns (0 allocations: 0 bytes)
_isperm:
5.602 ns (0 allocations: 0 bytes)
Length = 5
TupleTools:
6.085 ns (0 allocations: 0 bytes)
Base:
4.482 ns (0 allocations: 0 bytes)
_isperm:
5.632 ns (0 allocations: 0 bytes)
Length = 6
TupleTools:
8.629 ns (0 allocations: 0 bytes)
Base:
7.035 ns (0 allocations: 0 bytes)
_isperm:
5.882 ns (0 allocations: 0 bytes)
Length = 7
TupleTools:
11.889 ns (0 allocations: 0 bytes)
Base:
9.592 ns (0 allocations: 0 bytes)
_isperm:
6.845 ns (0 allocations: 0 bytes)
Length = 8
TupleTools:
16.093 ns (0 allocations: 0 bytes)
Base:
9.562 ns (0 allocations: 0 bytes)
_isperm:
7.935 ns (0 allocations: 0 bytes)
Length = 9
TupleTools:
19.393 ns (0 allocations: 0 bytes)
Base:
14.973 ns (0 allocations: 0 bytes)
_isperm:
7.811 ns (0 allocations: 0 bytes)
Length = 10
TupleTools:
20.987 ns (0 allocations: 0 bytes)
Base:
16.943 ns (0 allocations: 0 bytes)
_isperm:
8.655 ns (0 allocations: 0 bytes)
Length = 11
TupleTools:
24.721 ns (0 allocations: 0 bytes)
Base:
20.900 ns (0 allocations: 0 bytes)
_isperm:
9.305 ns (0 allocations: 0 bytes)
Length = 12
TupleTools:
27.867 ns (0 allocations: 0 bytes)
Base:
19.587 ns (0 allocations: 0 bytes)
_isperm:
9.795 ns (0 allocations: 0 bytes)
Length = 13
TupleTools:
31.809 ns (0 allocations: 0 bytes)
Base:
39.250 ns (0 allocations: 0 bytes)
_isperm:
10.510 ns (0 allocations: 0 bytes)
Length = 14
TupleTools:
36.148 ns (0 allocations: 0 bytes)
Base:
44.129 ns (0 allocations: 0 bytes)
_isperm:
11.134 ns (0 allocations: 0 bytes)
Length = 15
TupleTools:
40.569 ns (0 allocations: 0 bytes)
Base:
53.399 ns (0 allocations: 0 bytes)
_isperm:
11.949 ns (0 allocations: 0 bytes)
Length = 16
TupleTools:
45.357 ns (0 allocations: 0 bytes)
Base:
57.564 ns (2 allocations: 128 bytes)
_isperm:
12.361 ns (0 allocations: 0 bytes)
Length = 17
TupleTools:
56.485 ns (0 allocations: 0 bytes)
Base:
61.537 ns (2 allocations: 128 bytes)
_isperm:
13.056 ns (0 allocations: 0 bytes)
Length = 18
TupleTools:
55.631 ns (0 allocations: 0 bytes)
Base:
61.557 ns (2 allocations: 128 bytes)
_isperm:
14.982 ns (0 allocations: 0 bytes)
Length = 19
TupleTools:
63.139 ns (0 allocations: 0 bytes)
Base:
66.175 ns (2 allocations: 128 bytes)
_isperm:
14.806 ns (0 allocations: 0 bytes)
Length = 20
TupleTools:
64.908 ns (0 allocations: 0 bytes)
Base:
67.971 ns (2 allocations: 128 bytes)
_isperm:
15.157 ns (0 allocations: 0 bytes)
I would have figured there would be some sort of tuple-related slowdown when working with larger MVectors, but that doesn't seem to be the case.
Looks promising indeed; I would prefer not to depend on the whole of StaticArrays.jl though. I can maybe mimic the basic MVector functionality.
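For instance, a hedged sketch of how such a minimal MVector-like type could look (the name MiniMVector and the pointer-based setindex! are illustrative, not TupleTools' actual implementation):

# Mutable fixed-size vector over a bits type, stored as an inline NTuple;
# setindex! writes through a raw pointer into the object's tuple storage,
# which is also how StaticArrays implements mutation for MArray.
mutable struct MiniMVector{N,T}
    data::NTuple{N,T}
    function MiniMVector(data::NTuple{N,T}) where {N,T}
        isbitstype(T) || error("the pointer trick requires a bits element type")
        return new{N,T}(data)
    end
end

Base.length(::MiniMVector{N}) where {N} = N
Base.getindex(v::MiniMVector, i::Int) = v.data[i]

function Base.setindex!(v::MiniMVector{N,T}, x, i::Int) where {N,T}
    @boundscheck 1 <= i <= N || throw(BoundsError(v, i))
    GC.@preserve v unsafe_store!(convert(Ptr{T}, pointer_from_objref(v)), convert(T, x), i)
    return v
end

A buffer like this could then play the role of the MVector scratch flags in the _isperm above.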
I've pushed some experimental changes to TupleTools with a lightweight MVector (rather, a MutableNTuple{N,T}) and implemented isperm using it (like your _isperm). So by using the master branch of TupleTools, you could easily test this in your benchmarks.
Thanks! I'll try it out. Feel free to close this, since it will be improved by the next version of TupleTools.
If you say it makes a difference, yes. I thought I would be able to speed up more methods using the MutableNTuple type, but aside from probably sort I am not actually sure. Though I guess it's a useful addition, so it's probably worth keeping.
In my case, I ended up just avoiding calling permutedims entirely, so this optimization wasn't really needed. Though I use TupleTools.isperm elsewhere in NDTensors.jl, so I think it is still nice to have it as fast as possible. I think really the only case where it would give a noticeable speedup is something like this:
julia> using Strided
julia> A = reshape([2.0], ones(Int, 16)...)
1×1×1×1×1×1×1×1×1×1×1×1×1×1×1×1 Array{Float64,16}:
[:, :, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] =
2.0
julia> @btime @strided permutedims($A, $(Tuple(reverse(1:16))))
354.282 ns (4 allocations: 656 bytes)
1×1×1×1×1×1×1×1×1×1×1×1×1×1×1×1 StridedView{Float64,16,Array{Float64,1},typeof(identity)}:
[:, :, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] =
2.0
which could come up if you are taking a slice of a high-order tensor. Of course I understand this is likely outside the scope of Strided, and externally someone should check that length(A) == 1 and not call permutedims (perhaps that logic could be in Strided as well); a guard along those lines is sketched below.
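A hedged sketch of such a call-site guard (hypothetical helper, not part of the Strided API):

# For length-1 arrays every permutation is a no-op up to reshape, so the
# permutedims machinery can be skipped entirely.
function maybe_permutedims(A::AbstractArray, p)
    if length(A) == 1
        return reshape(copy(vec(A)), ntuple(i -> size(A, p[i]), ndims(A)))
    end
    return permutedims(A, p)
end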
Hi @mtfishman, can this issue be closed or is there still something to be done here?
Looks like this can be closed now, thanks Jutho!
Here are some current timings:
julia> using Strided
julia> @btime TupleTools.isperm($(Ref(Tuple(reverse(1:16))))[])
13.221 ns (0 allocations: 0 bytes)
true
and:
julia> using Strided
julia> A = reshape([2.0], ones(Int, 16)...)
1×1×1×1×1×1×1×1×1×1×1×1×1×1×1×1 Array{Float64, 16}:
[:, :, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] =
2.0
julia> @btime @strided permutedims($A, $(Ref(Tuple(reverse(1:16))))[])
173.989 ns (2 allocations: 80 bytes)
1×1×1×1×1×1×1×1×1×1×1×1×1×1×1×1 StridedView{Float64, 16, Vector{Float64}, typeof(identity)}:
[:, :, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] =
2.0
Hi Jutho,
Thanks for the nice package.
It looks like permutedims(::StridedView, p) makes use of TupleTools.isperm. However, it appears that Base.isperm may be faster (see the timings quoted at the top of this issue). I'm guessing Base.isperm was not always so good, which is why you wrote your own version.
This showed up in ITensors.jl. I am benchmarking operations involving high order tensors with many very small blocks, and TupleTools.isperm is getting called many times and contributing a nontrivial amount of time.
On a side note, we've been doing some benchmarking comparing C++ ITensor to ITensors.jl and have seen that our new Julia version is generally faster, and we believe it is in no small part because of Strided.jl!
Cheers, Matt