Jutho / Strided.jl

A Julia package for strided array views and efficient manipulations thereof

Move from TupleTools.isperm to Base.isperm #15

Closed · mtfishman closed this issue 2 years ago

mtfishman commented 3 years ago

Hi Jutho,

Thanks for the nice package.

It looks like permutedims(::StridedView, p) makes use of TupleTools.isperm. However, it appears that Base.isperm may be faster:

julia> versioninfo()
Julia Version 1.5.2
Commit 539f3ce943 (2020-09-23 23:17 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) E-2176M  CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

julia> using TupleTools, Random, BenchmarkTools

(@v1.5) pkg> st TupleTools
Status `~/.julia/environments/v1.5/Project.toml`
  [9d95972d] TupleTools v1.2.0

julia> t = Tuple(shuffle(1:5))
(5, 1, 3, 4, 2)

julia> @btime TupleTools.isperm($t)
  4.235 ns (0 allocations: 0 bytes)
true

julia> @btime Base.isperm($t)
  0.015 ns (0 allocations: 0 bytes)
true

julia> t = Tuple(shuffle(1:8))
(1, 7, 6, 3, 8, 2, 5, 4)

julia> @btime TupleTools.isperm($t)
  18.206 ns (0 allocations: 0 bytes)
true

julia> @btime Base.isperm($t)
  0.015 ns (0 allocations: 0 bytes)
true

I'm guessing Base.isperm was not always this good, which is why you wrote your own version. This showed up in ITensors.jl: I am benchmarking operations involving high-order tensors with many very small blocks, and TupleTools.isperm is getting called many times, contributing a nontrivial amount of the total time.

On a side note, we've been doing some benchmarking comparing C++ ITensor to ITensors.jl and have seen that our new Julia version is generally faster, and we believe it is in no small part because of Strided.jl!

Cheers, Matt

mtfishman commented 3 years ago

It looks like there is a performance crossover for tuples of length 16:

julia> t = Tuple(shuffle(1:15))
(4, 3, 2, 8, 14, 9, 11, 13, 12, 7, 6, 5, 1, 15, 10)

julia> @btime TupleTools.isperm($t)
  43.287 ns (0 allocations: 0 bytes)
true

julia> @btime Base.isperm($t)
  0.015 ns (0 allocations: 0 bytes)
true

julia> t = Tuple(shuffle(1:16))
(13, 4, 3, 1, 5, 14, 15, 6, 9, 16, 8, 2, 7, 10, 12, 11)

julia> @btime TupleTools.isperm($t)
  41.737 ns (0 allocations: 0 bytes)
true

julia> @btime Base.isperm($t)
  56.686 ns (2 allocations: 128 bytes)
true

Probably Base.isperm is hitting some inference issue there. Still, the Base version is generally faster. Perhaps TupleTools.isperm could use Base.isperm for tuples of length 15 or less and keep the current version for tuples of length 16 or more, as sketched below. I could make a PR if that sounds reasonable to you.
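For concreteness, a hybrid could simply dispatch on the tuple length, so the branch is resolved at compile time (a hypothetical sketch of the proposal, including the name hybrid_isperm, not actual TupleTools code):

function hybrid_isperm(p::NTuple{N,Int}) where {N}
  # N is a compile-time constant, so this branch folds away at compilation
  return N <= 15 ? Base.isperm(p) : TupleTools.isperm(p)
end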

Jutho commented 3 years ago

Hi Matt, thanks for the nice comments regarding Strided. I can certainly accept that Base.isperm is faster than the one from TupleTools in certain regimes, but I think there is something off with your timings: sub-nanosecond results indicate that nothing is actually being measured. I think that for Base.isperm in your case constant folding kicks in, so everything inside the timed expression is optimized away. There was indeed a PR in Julia Base to implement isperm for tuples, where a detailed analysis was performed, and I think this problem came up there as well. Let me see if I can recover the PR. In particular, I would expect the implementation from Base to be faster at least for larger tuples, but maybe also for smaller ones.
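As a quick sanity check (a suggested verification, not something already done in this thread), one can wrap the call with a literal tuple in a zero-argument function and inspect the generated code:

julia> g() = Base.isperm((5, 1, 3, 4, 2))
g (generic function with 1 method)

julia> @code_llvm g()

If constant folding kicks in, the emitted body should reduce to returning the constant true (essentially a single ret instruction), meaning the benchmark loop is timing a no-op, which would explain the sub-nanosecond results.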

Jutho commented 3 years ago

This is the one; I am too tired to read the discussion right now, but I will take a look tomorrow to refresh my memory: https://github.com/JuliaLang/julia/pull/35234

mtfishman commented 3 years ago

Ah right, I forgot about that. Here are some benchmarks using the Ref trick (https://github.com/JuliaCI/BenchmarkTools.jl/blob/master/doc/manual.md#understanding-compiler-optimizations):

julia> function f(N)
         t = Tuple(shuffle(1:N))
         println("Length = $N")
         println("TupleTools:")
         @btime TupleTools.isperm($(Ref(t))[])
         println("Base:")
         @btime Base.isperm($(Ref(t))[])
         println()
       end
f (generic function with 1 method)

julia> for n in 1:20 f(n) end
Length = 1
TupleTools:
  1.179 ns (0 allocations: 0 bytes)
Base:
  1.179 ns (0 allocations: 0 bytes)

Length = 2
TupleTools:
  1.181 ns (0 allocations: 0 bytes)
Base:
  1.181 ns (0 allocations: 0 bytes)

Length = 3
TupleTools:
  2.811 ns (0 allocations: 0 bytes)
Base:
  2.110 ns (0 allocations: 0 bytes)

Length = 4
TupleTools:
  4.679 ns (0 allocations: 0 bytes)
Base:
  2.112 ns (0 allocations: 0 bytes)

Length = 5
TupleTools:
  5.608 ns (0 allocations: 0 bytes)
Base:
  4.482 ns (0 allocations: 0 bytes)

Length = 6
TupleTools:
  8.437 ns (0 allocations: 0 bytes)
Base:
  7.033 ns (0 allocations: 0 bytes)

Length = 7
TupleTools:
  12.142 ns (0 allocations: 0 bytes)
Base:
  9.596 ns (0 allocations: 0 bytes)

Length = 8
TupleTools:
  16.163 ns (0 allocations: 0 bytes)
Base:
  9.561 ns (0 allocations: 0 bytes)

Length = 9
TupleTools:
  20.294 ns (0 allocations: 0 bytes)
Base:
  15.329 ns (0 allocations: 0 bytes)

Length = 10
TupleTools:
  21.488 ns (0 allocations: 0 bytes)
Base:
  17.359 ns (0 allocations: 0 bytes)

Length = 11
TupleTools:
  25.075 ns (0 allocations: 0 bytes)
Base:
  21.084 ns (0 allocations: 0 bytes)

Length = 12
TupleTools:
  28.421 ns (0 allocations: 0 bytes)
Base:
  19.587 ns (0 allocations: 0 bytes)

Length = 13
TupleTools:
  28.816 ns (0 allocations: 0 bytes)
Base:
  40.336 ns (0 allocations: 0 bytes)

Length = 14
TupleTools:
  36.251 ns (0 allocations: 0 bytes)
Base:
  44.136 ns (0 allocations: 0 bytes)

Length = 15
TupleTools:
  39.280 ns (0 allocations: 0 bytes)
Base:
  51.793 ns (0 allocations: 0 bytes)

Length = 16
TupleTools:
  45.356 ns (0 allocations: 0 bytes)
Base:
  58.012 ns (2 allocations: 128 bytes)

Length = 17
TupleTools:
  47.538 ns (0 allocations: 0 bytes)
Base:
  60.237 ns (2 allocations: 128 bytes)

Length = 18
TupleTools:
  55.415 ns (0 allocations: 0 bytes)
Base:
  62.752 ns (2 allocations: 128 bytes)

Length = 19
TupleTools:
  58.755 ns (0 allocations: 0 bytes)
Base:
  63.228 ns (2 allocations: 128 bytes)

Length = 20
TupleTools:
  66.417 ns (0 allocations: 0 bytes)
Base:
  65.484 ns (2 allocations: 128 bytes)

Does that look reasonable? It looks like the picture is not so clear-cut.

Jutho commented 3 years ago

Ok, very interesting timings. Quite strange indeed; which Julia version is this? I would have expected the TupleTools version to become slow at some point, once inference gives up on the recursion in the TupleTools implementation.

Anyway, you could maybe try to just replace the isperm call in Strided with Base.isperm and see how this affects your benchmarks? I am interested in learning more about your C++ versus Julia ITensor benchmarks, and would of course also like to compare with my own TensorKit package.

mtfishman commented 3 years ago

Those were with Julia v1.5.2.

I don't see any recursion in TupleTools.isperm: https://github.com/Jutho/TupleTools.jl/blob/master/src/TupleTools.jl#L287

It looks like Base.isperm special-cases length 2, then uses recursion up to length 15, then uses a loop after that: https://github.com/JuliaLang/julia/blob/master/base/combinatorics.jl#L84
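Roughly, the structure described there looks like this (a paraphrase of the linked source, not the verbatim Base code):

isperm(p::Tuple{}) = true
isperm(p::Tuple{Int}) = p[1] == 1
isperm(p::Tuple{Int,Int}) = ((p[1] == 1) & (p[2] == 2)) | ((p[1] == 2) & (p[2] == 1))
# lengths up to 15: an unrolled/recursive check over the tuple;
# longer tuples fall back to a method that allocates a Vector{Bool}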

I tried replacing the isperm call in Strided with Base.isperm and it didn't make much difference in my benchmarks. The tensors are order 12 with blocks of size 1, so from the benchmark above it should have been a bit faster, but the difference wasn't noticeable. Anyway, overall they are close enough that I don't think it matters much whether TupleTools.isperm or Base.isperm is used. In any more realistic operation (probably anything with arrays of more than one element), the actual permutation will dominate the cost of isperm, and I could just add a special case in my code to skip the permutation when the block has size 1.

We are in the process of finishing up some benchmarks for the ITensor paper, I can let you know when that is done. Indeed, it would be very interesting to get benchmarks comparing to TensorKit.

mtfishman commented 3 years ago

Actually, here are some interesting results. I was inspired by Base._isperm here: https://github.com/JuliaLang/julia/blob/master/base/combinatorics.jl#L71

If you replace the Vector storage there with an MVector, the results look quite good for tuples of length greater than 5:

using BenchmarkTools, Random, StaticArrays, TupleTools

function _isperm(A)
  n = length(A)
  # mutable, stack-allocated "seen" buffer; avoids the heap-allocated
  # Vector{Bool} used by Base._isperm
  used = @MVector zeros(Bool, n)
  for a in A
    # xor-toggle: flips to true on the first visit; an out-of-range or
    # repeated entry makes the chain false and triggers the early return
    (0 < a <= n) && (used[a] ⊻= true) || return false
  end
  true
end

function f(N)
  t = Tuple(shuffle(1:N))
  println("Length = $N")
  println("TupleTools:")
  @assert @btime TupleTools.isperm($(Ref(t))[])
  println("Base:")
  @assert @btime Base.isperm($(Ref(t))[])
  println("_isperm:")
  @assert @btime _isperm($(Ref(t))[])
  println()
end

gives:

julia> f.(1:20)
Length = 1
TupleTools:
  1.180 ns (0 allocations: 0 bytes)
Base:
  1.208 ns (0 allocations: 0 bytes)
_isperm:
  2.112 ns (0 allocations: 0 bytes)

Length = 2
TupleTools:
  1.184 ns (0 allocations: 0 bytes)
Base:
  1.184 ns (0 allocations: 0 bytes)
_isperm:
  3.281 ns (0 allocations: 0 bytes)

Length = 3
TupleTools:
  2.809 ns (0 allocations: 0 bytes)
Base:
  2.063 ns (0 allocations: 0 bytes)
_isperm:
  3.824 ns (0 allocations: 0 bytes)

Length = 4
TupleTools:
  3.747 ns (0 allocations: 0 bytes)
Base:
  2.110 ns (0 allocations: 0 bytes)
_isperm:
  5.602 ns (0 allocations: 0 bytes)

Length = 5
TupleTools:
  6.085 ns (0 allocations: 0 bytes)
Base:
  4.482 ns (0 allocations: 0 bytes)
_isperm:
  5.632 ns (0 allocations: 0 bytes)

Length = 6
TupleTools:
  8.629 ns (0 allocations: 0 bytes)
Base:
  7.035 ns (0 allocations: 0 bytes)
_isperm:
  5.882 ns (0 allocations: 0 bytes)

Length = 7
TupleTools:
  11.889 ns (0 allocations: 0 bytes)
Base:
  9.592 ns (0 allocations: 0 bytes)
_isperm:
  6.845 ns (0 allocations: 0 bytes)

Length = 8
TupleTools:
  16.093 ns (0 allocations: 0 bytes)
Base:
  9.562 ns (0 allocations: 0 bytes)
_isperm:
  7.935 ns (0 allocations: 0 bytes)

Length = 9
TupleTools:
  19.393 ns (0 allocations: 0 bytes)
Base:
  14.973 ns (0 allocations: 0 bytes)
_isperm:
  7.811 ns (0 allocations: 0 bytes)

Length = 10
TupleTools:
  20.987 ns (0 allocations: 0 bytes)
Base:
  16.943 ns (0 allocations: 0 bytes)
_isperm:
  8.655 ns (0 allocations: 0 bytes)

Length = 11
TupleTools:
  24.721 ns (0 allocations: 0 bytes)
Base:
  20.900 ns (0 allocations: 0 bytes)
_isperm:
  9.305 ns (0 allocations: 0 bytes)

Length = 12
TupleTools:
  27.867 ns (0 allocations: 0 bytes)
Base:
  19.587 ns (0 allocations: 0 bytes)
_isperm:
  9.795 ns (0 allocations: 0 bytes)

Length = 13
TupleTools:
  31.809 ns (0 allocations: 0 bytes)
Base:
  39.250 ns (0 allocations: 0 bytes)
_isperm:
  10.510 ns (0 allocations: 0 bytes)

Length = 14
TupleTools:
  36.148 ns (0 allocations: 0 bytes)
Base:
  44.129 ns (0 allocations: 0 bytes)
_isperm:
  11.134 ns (0 allocations: 0 bytes)

Length = 15
TupleTools:
  40.569 ns (0 allocations: 0 bytes)
Base:
  53.399 ns (0 allocations: 0 bytes)
_isperm:
  11.949 ns (0 allocations: 0 bytes)

Length = 16
TupleTools:
  45.357 ns (0 allocations: 0 bytes)
Base:
  57.564 ns (2 allocations: 128 bytes)
_isperm:
  12.361 ns (0 allocations: 0 bytes)

Length = 17
TupleTools:
  56.485 ns (0 allocations: 0 bytes)
Base:
  61.537 ns (2 allocations: 128 bytes)
_isperm:
  13.056 ns (0 allocations: 0 bytes)

Length = 18
TupleTools:
  55.631 ns (0 allocations: 0 bytes)
Base:
  61.557 ns (2 allocations: 128 bytes)
_isperm:
  14.982 ns (0 allocations: 0 bytes)

Length = 19
TupleTools:
  63.139 ns (0 allocations: 0 bytes)
Base:
  66.175 ns (2 allocations: 128 bytes)
_isperm:
  14.806 ns (0 allocations: 0 bytes)

Length = 20
TupleTools:
  64.908 ns (0 allocations: 0 bytes)
Base:
  67.971 ns (2 allocations: 128 bytes)
_isperm:
  15.157 ns (0 allocations: 0 bytes)

I would have figured there would be some sort of tuple-related slowdown when working with larger MVectors, but that doesn't seem to be the case.

Jutho commented 3 years ago

Looks promising indeed; I would prefer not to depend on the whole of StaticArrays.jl though. I can maybe mimic the basic MVector functionality.

Jutho commented 3 years ago

I've pushed some experimental changes to TupleTools with a lightweight MVector substitute (rather, a MutableNTuple{N,T}) and implemented isperm using it (like your _isperm). So by using the master branch of TupleTools, you can easily test this in your benchmarks.
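In rough terms, the idea looks like this (a simplified sketch of the approach, not necessarily the code on master):

mutable struct MutableNTuple{N,T}
  data::NTuple{N,T}
end

MutableNTuple{N,T}() where {N,T} = MutableNTuple{N,T}(ntuple(_ -> zero(T), N))

Base.getindex(m::MutableNTuple, i::Int) = m.data[i]

function Base.setindex!(m::MutableNTuple{N,T}, v, i::Int) where {N,T}
  # Base.setindex returns a copy of the tuple with slot i replaced;
  # for small N the compiler keeps this cheap
  m.data = Base.setindex(m.data, convert(T, v), i)
  return m
end

# isperm can then follow the same xor-toggle pattern as _isperm above:
function isperm_mnt(p::NTuple{N,Int}) where {N}
  used = MutableNTuple{N,Bool}()
  for a in p
    (0 < a <= N) && (used[a] ⊻= true) || return false
  end
  return true
end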

mtfishman commented 3 years ago

Thanks! I'll try it out. Feel free to close this, since it will be improved by the next version of TupleTools.

Jutho commented 3 years ago

If you say it makes a difference, yes. I thought I would be able to speed up more methods using the MutableNTuple type, but aside from probably sort, I am not actually sure. Still, I guess it's a useful addition, so it's probably worth keeping.

mtfishman commented 3 years ago

In my case, I ended up avoiding the call to permutedims entirely, so this optimization wasn't really needed. Though I use TupleTools.isperm elsewhere in NDTensors.jl, so I think it is still nice to have it as fast as possible. Really, the only case where it would give a noticeable speedup is something like this:

julia> using Strided

julia> A = reshape([2.0], ones(Int, 16)...)
1×1×1×1×1×1×1×1×1×1×1×1×1×1×1×1 Array{Float64,16}:
[:, :, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] =
 2.0

julia> @btime @strided permutedims($A, $(Tuple(reverse(1:16))))
  354.282 ns (4 allocations: 656 bytes)
1×1×1×1×1×1×1×1×1×1×1×1×1×1×1×1 StridedView{Float64,16,Array{Float64,1},typeof(identity)}:
[:, :, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] =
 2.0

which could come up if you are taking a slice of a high-order tensor. Of course, I understand this is likely outside the scope of Strided, and externally someone should check that length(A) == 1 and not call permutedims; a sketch of such a guard is below (perhaps that logic could live in Strided as well).
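A sketch of such a guard (hypothetical caller-side code, with maybe_permutedims as an assumed name, not part of Strided's API):

function maybe_permutedims(A::AbstractArray, p)
  # for a single-element array a permutation only reshuffles unit dimensions,
  # so a reshape to the permuted sizes avoids the full permutedims machinery
  length(A) == 1 && return reshape(copy(A), map(d -> size(A, d), p))
  return permutedims(A, p)
end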

Jutho commented 2 years ago

Hi @mtfishman, can this issue be closed, or is there still something to be done here?

mtfishman commented 2 years ago

Looks like this can be closed now, thanks Jutho!

Here are some current timings:

julia> using Strided

julia> @btime TupleTools.isperm($(Ref(Tuple(reverse(1:16))))[])
  13.221 ns (0 allocations: 0 bytes)
true

and:

julia> using Strided

julia> A = reshape([2.0], ones(Int, 16)...)
1×1×1×1×1×1×1×1×1×1×1×1×1×1×1×1 Array{Float64, 16}:
[:, :, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] =
 2.0

julia> @btime @strided permutedims($A, $(Ref(Tuple(reverse(1:16))))[])
  173.989 ns (2 allocations: 80 bytes)
1×1×1×1×1×1×1×1×1×1×1×1×1×1×1×1 StridedView{Float64, 16, Vector{Float64}, typeof(identity)}:
[:, :, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1] =
 2.0