JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.46k stars 5.46k forks source link

Set operation (`setdiff`, `union`, `intersect`, `symdiff`) performance issues #40501

Open mtfishman opened 3 years ago

mtfishman commented 3 years ago

EDIT: This has morphed into a more general discussion about set operation performance, beyond the initial issue presented.

I'm finding that setdiff is slightly slower with exactly 2 inputs:

julia> versioninfo()
Julia Version 1.6.0
Commit f9720dc2eb (2021-03-24 12:55 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) E-2176M  CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = vim

julia> @btime setdiff($([1, 2, 3]), $([1, 2]))
  284.924 ns (7 allocations: 624 bytes)
1-element Vector{Int64}:
 3

julia> @btime setdiff($([1, 2, 3]), $([1, 2]), $([]))
  158.302 ns (5 allocations: 592 bytes)
1-element Vector{Int64}:
 3

julia> @btime setdiff($([1, 2, 3]), $([1, 2]), $([]), $([]))
  167.004 ns (6 allocations: 608 bytes)
1-element Vector{Int64}:
 3

and the same on nightly:

julia> versioninfo()
Julia Version 1.7.0-DEV.931
Commit 29d5158d27* (2021-04-15 20:13 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) E-2176M  CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-11.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = vim

julia> @btime setdiff($([1, 2, 3]), $([1, 2]))
  319.466 ns (8 allocations: 640 bytes)
1-element Vector{Int64}:
 3

julia> @btime setdiff($([1, 2, 3]), $([1, 2]), $([]))
  154.513 ns (5 allocations: 592 bytes)
1-element Vector{Int64}:
 3

julia> @btime setdiff($([1, 2, 3]), $([1, 2]), $([]), $([]))
  166.866 ns (6 allocations: 608 bytes)
1-element Vector{Int64}:
 3

It's not the case for Set inputs, so is probably caused by the generic code here. I've been staring at that code trying to figure out what would be special about inputting 2 Vectors and can't figure it out...

oscardssmith commented 2 years ago

I'm not able to reproduce this, but setdiff does appear to be significantly slower than the one in numpy which we should fix https://stackoverflow.com/questions/70766215/problem-with-memory-allocation-in-julia-code

mtfishman commented 2 years ago

Interesting, I see this in Julia 1.7.1:

julia> versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) E-2176M  CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

julia> @btime setdiff($([1, 2, 3]), $([1, 2]))
  329.487 ns (8 allocations: 528 bytes)
1-element Vector{Int64}:
 3

julia> @btime setdiff($([1, 2, 3]), $([1, 2]), $([]))
  167.212 ns (5 allocations: 480 bytes)
1-element Vector{Int64}:
 3

julia> @btime setdiff($([1, 2, 3]), $([1, 2]), $([]), $([]))
  181.436 ns (6 allocations: 496 bytes)
1-element Vector{Int64}:
 3

but it looks like it is now fixed on nightly:

julia> versioninfo()
Julia Version 1.8.0-DEV.1342
Commit 1f266b895e (2022-01-18 19:22 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) E-2176M  CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.0 (ORCJIT, skylake)
Environment:
  LD_PRELOAD = libgtk3-nocsd.so.0
  LD_LIBRARY_PATH = /usr/lib/x86_64-linux-gnu/libcutensor/11.1/:/opt/intel/compilers_and_libraries_2020.0.166/linux/tbb/lib/intel64_lin/gcc4.7:/opt/intel/compilers_and_libraries_2020.0.166/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2020.0.166/linux/mkl/lib/intel64_lin:/usr/local/cuda-11.1/lib64::/home/mfishman/software/itensor/lib
  JULIA_EDITOR = vim

julia> @btime setdiff($([1, 2, 3]), $([1, 2]))
  191.334 ns (6 allocations: 544 bytes)
1-element Vector{Int64}:
 3

julia> @btime setdiff($([1, 2, 3]), $([1, 2]), $([]))
  234.517 ns (6 allocations: 512 bytes)
1-element Vector{Any}:
 3

julia> @btime setdiff($([1, 2, 3]), $([1, 2]), $([]), $([]))
  244.295 ns (7 allocations: 528 bytes)
1-element Vector{Any}:
 3

@oscardssmith, I'm not sure if this issue is related. For larger sets, it only leads to a small slowdown:

julia> versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) E-2176M  CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)

julia> @btime setdiff($(1:50), $(1:49))
  1.221 μs (14 allocations: 2.27 KiB)
1-element Vector{Int64}:
 50

julia> @btime setdiff($(1:50), $(1:49), $([]))
  1.097 μs (10 allocations: 2.18 KiB)
1-element Vector{Int64}:
 50

julia> @btime setdiff($(1:100), $(1:99))
  2.242 μs (14 allocations: 3.79 KiB)
1-element Vector{Int64}:
 100

julia> @btime setdiff($(1:100), $(1:99), $([]))
  2.109 μs (10 allocations: 3.70 KiB)
1-element Vector{Int64}:
 100

Interestingly, it looks like the two-input versions speed up on nightly, but the three-input version slows down:

julia> versioninfo()
Julia Version 1.8.0-DEV.1342
Commit 1f266b895e (2022-01-18 19:22 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) E-2176M  CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-13.0.0 (ORCJIT, skylake)
Environment:
  LD_PRELOAD = libgtk3-nocsd.so.0
  LD_LIBRARY_PATH = /usr/lib/x86_64-linux-gnu/libcutensor/11.1/:/opt/intel/compilers_and_libraries_2020.0.166/linux/tbb/lib/intel64_lin/gcc4.7:/opt/intel/compilers_and_libraries_2020.0.166/linux/compiler/lib/intel64_lin:/opt/intel/compilers_and_libraries_2020.0.166/linux/mkl/lib/intel64_lin:/usr/local/cuda-11.1/lib64::/home/mfishman/software/itensor/lib
  JULIA_EDITOR = vim

julia> @btime setdiff($(1:50), $(1:49))
  938.607 ns (9 allocations: 1.74 KiB)
1-element Vector{Int64}:
 50

julia> @btime setdiff($(1:50), $(1:49), $([]))
  1.850 μs (10 allocations: 1.74 KiB)
1-element Vector{Any}:
 50

julia> @btime setdiff($(1:100), $(1:99))
  1.861 μs (9 allocations: 2.87 KiB)
1-element Vector{Int64}:
 100

julia> @btime setdiff($(1:100), $(1:99), $([]))
  3.833 μs (10 allocations: 2.87 KiB)
1-element Vector{Any}:
 100

Code reorganizations like the one proposed by @bkamins may be more important in that example.

mtfishman commented 2 years ago

In terms of performance of the algorithms themselves, I found that in practice for small sets (maybe the crossover is around 10), it is faster to avoid converting to Set and instead just use Vector directly. This avoids the expense of converting to Set, at the cost of slower search through the elements. I have some code for that here: https://github.com/ITensor/ITensors.jl/blob/bdfbd3c9a3d64dc97ca04c39ef597b766142f095/src/set_operations.jl

Perhaps something like that would be of interest in Base, or in a separate package.

oscardssmith commented 2 years ago

Oh. I was looking on nightly.

oscardssmith commented 2 years ago

I think the solution might just be to do a vector based implementation for small sets. That would belong in Base

bkamins commented 2 years ago

Thank you for opening this issue (I wanted to do it after the SO answer, but had some other things to do today).

I think we should also benchmark also cases of large setdiff (on SO they are up to 10,000 elements and also seemed to be slow).

Also we could add a special case for setdiff between integer vectors, where we could check in what cases it is worth to check their extrema and use the mapping similar to what I did on SO (also like BitSet).

mtfishman commented 2 years ago

I've made a gist of implementations of Julia set operations (setdiff, intersect, symdiff, union) which avoid converting to Set: https://gist.github.com/mtfishman/2b75ac7d4d047e5be97b693a39127a60

Here are some results:

julia> using BenchmarkTools

julia> include("small_set_operations.jl")
_union! (generic function with 2 methods)

julia> versioninfo()
Julia Version 1.7.1
Commit ac5cc99908 (2021-12-22 19:35 UTC)
Platform Info:
  OS: Linux (x86_64-pc-linux-gnu)
  CPU: Intel(R) Xeon(R) E-2176M  CPU @ 2.70GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-12.0.1 (ORCJIT, skylake)
Environment:
  JULIA_EDITOR = vim

julia> @btime setdiff($([1, 2, 3, 4]), $([1, 2, 3]))
  338.877 ns (7 allocations: 528 bytes)
1-element Vector{Int64}:
 4

julia> @btime _setdiff($([1, 2, 3, 4]), $([1, 2, 3]))
  51.320 ns (1 allocation: 96 bytes)
1-element Vector{Int64}:
 4

julia> @btime intersect($([1, 2, 3, 4]), $([1, 2, 3]))
  484.041 ns (12 allocations: 944 bytes)
3-element Vector{Int64}:
 1
 2
 3

julia> @btime _intersect($([1, 2, 3, 4]), $([1, 2, 3]))
  57.184 ns (2 allocations: 144 bytes)
3-element Vector{Int64}:
 1
 2
 3

julia> @btime symdiff($([1, 2, 3, 4]), $([1, 2, 3]))
  222.391 ns (6 allocations: 544 bytes)
1-element Vector{Int64}:
 4

julia> @btime _symdiff($([1, 2, 3, 4]), $([1, 2, 3]))
  68.136 ns (2 allocations: 144 bytes)
1-element Vector{Int64}:
 4

julia> @btime union($([1, 2, 3, 4]), $([1, 2, 3]))
  188.991 ns (6 allocations: 544 bytes)
4-element Vector{Int64}:
 1
 2
 3
 4

julia> @btime _union($([1, 2, 3, 4]), $([1, 2, 3]))
  86.902 ns (3 allocations: 208 bytes)
4-element Vector{Int64}:
 1
 2
 3
 4

I think a downside of this approach is that it doesn't handle repeated elements (input vectors with nonunique vectors) in the same way as the versions which convert back and forth from Set, but maybe those can be handled in a special way.

EDIT: And here are some results using BitSet:

julia> @btime setdiff($(BitSet([1, 2, 3, 4])), $(BitSet([1, 2, 3])))
  68.885 ns (3 allocations: 176 bytes)
BitSet with 1 element:
  4

julia> @btime intersect($(BitSet([1, 2, 3, 4])), $(BitSet([1, 2, 3])))
  71.939 ns (3 allocations: 176 bytes)
BitSet with 3 elements:
  1
  2
  3

julia> @btime symdiff($(BitSet([1, 2, 3, 4])), $(BitSet([1, 2, 3])))
  68.020 ns (3 allocations: 176 bytes)
BitSet with 1 element:
  4

julia> @btime union($(BitSet([1, 2, 3, 4])), $(BitSet([1, 2, 3])))
  68.975 ns (3 allocations: 176 bytes)
BitSet with 4 elements:
  1
  2
  3
  4

Obviously these results will depend strongly on the set sizes involved.

mtfishman commented 2 years ago

I haven't investigated the large set limit at all, but I'm curious if Sets based on Dictionaries.jl could help in that limit, I've found that for things like iteration it is faster than Base.Dict.

oscardssmith commented 2 years ago

Oh, this is probably one of the cases where it would be a lot better if we had https://github.com/JuliaLang/julia/pull/10116

mtfishman commented 2 years ago

At some point I tried benchmarking various available dictionary/set options for implementing set operations on small sets:

I didn't do a very systematic job and can't find the code now, but in the end for small sets I concluded it was best to just roll my own version which just directly used Vector (which heavily made use of Base definitions that use Set).

It looks like Dictionaries.jl has some relevant benchmarks of set operations with various data structures. Giving it a quick glance, it looks like Dictionaries.ArrayIndices is fastest for small sets and Dictionaries.Indices is fastest for large sets. @andyferris, do you mind jumping in and commenting if that sounds right to you?

andyferris commented 2 years ago

Yes, correct, that’s the intention with those two types. Note that sometimes Set is a bit faster than Indices, for certain workloads.

I thought OrderedCollections split out the code from DataStructures that was originally #10116? (Maybe I am confused on this point).

mtfishman commented 2 years ago

Great, thanks for confirming. Looks like I could probably just use Dictionaries.ArrayIndices for my small set applications.

Sounds like that is correct: https://github.com/JuliaLang/julia/pull/10116#issuecomment-948922677