JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License

regression: occursin() and startswith() slow on large tuples #49671

Open hhaensel opened 1 year ago

hhaensel commented 1 year ago

Version Info

julia> versioninfo()
Julia Version 1.9.0-rc3
Commit 1853b90328 (2023-04-26 15:51 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: 16 × 11th Gen Intel(R) Core(TM) i7-11800H @ 2.30GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-14.0.6 (ORCJIT, tigerlake)
  Threads: 16 on 16 virtual cores
Environment:
  JULIA_NUM_THREADS = auto

My Installation

Julia is installed via juliaup

Observation

The following code is an order of magnitude slower on v1.9.0-rc3 than on v1.8.5:

v1.9.0-rc3

julia> t = Tuple("x$(lpad(n, 5, '0'))" for n in 1:100)
("x00001", "x00002", "x00003", "x00004", "x00005", "x00006", "x00007", "x00008", "x00009", "x00010", "x00011", "x00012", "x00013", "x00014", "x00015", "x00016", "x00017", "x00018", "x00019", "x00020", "x00021", "x00022", "x00023", "x00024", "x00025", "x00026", "x00027", "x00028", "x00029", "x00030", "x00031", "x00032", "x00033", "x00034", "x00035", "x00036", "x00037", "x00038", "x00039", "x00040", "x00041", "x00042", "x00043", "x00044", "x00045", "x00046", "x00047", "x00048", "x00049", "x00050", "x00051", "x00052", "x00053", "x00054", "x00055", "x00056", "x00057", "x00058", "x00059", "x00060", "x00061", "x00062", "x00063", "x00064", "x00065", "x00066", "x00067", "x00068", "x00069", "x00070", "x00071", "x00072", "x00073", "x00074", "x00075", "x00076", "x00077", "x00078", "x00079", "x00080", "x00081", "x00082", "x00083", "x00084", "x00085", "x00086", "x00087", "x00088", "x00089", "x00090", "x00091", "x00092", "x00093", "x00094", "x00095", "x00096", "x00097", "x00098", "x00099", "x00100")

julia> @time occursin.("_", t) .& startswith.(t, "_")
 33.161712 seconds (77.48 k allocations: 5.093 MiB, 0.07% gc time, 99.99% compilation time)
(false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false)

v1.8.5

julia> t = Tuple("x$(lpad(n, 5, '0'))" for n in 1:100)
("x00001", "x00002", "x00003", "x00004", "x00005", "x00006", "x00007", "x00008", "x00009", "x00010", "x00011", "x00012", "x00013", "x00014", "x00015", "x00016", "x00017", "x00018", "x00019", "x00020", "x00021", "x00022", "x00023", "x00024", "x00025", "x00026", "x00027", "x00028", "x00029", "x00030", "x00031", "x00032", "x00033", "x00034", "x00035", "x00036", "x00037", "x00038", "x00039", "x00040", "x00041", "x00042", "x00043", "x00044", "x00045", "x00046", "x00047", "x00048", "x00049", "x00050", "x00051", "x00052", "x00053", "x00054", "x00055", "x00056", "x00057", "x00058", "x00059", "x00060", "x00061", "x00062", "x00063", "x00064", "x00065", "x00066", "x00067", "x00068", "x00069", "x00070", "x00071", "x00072", "x00073", "x00074", "x00075", "x00076", "x00077", "x00078", "x00079", "x00080", "x00081", "x00082", "x00083", "x00084", "x00085", "x00086", "x00087", "x00088", "x00089", "x00090", "x00091", "x00092", "x00093", "x00094", "x00095", "x00096", "x00097", "x00098", "x00099", "x00100")

julia> @time occursin.("_", t) .& startswith.(t, "_")
  9.872833 seconds (127.62 k allocations: 7.099 MiB, 99.97% compilation time)
(false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false, false)
giordano commented 1 year ago

The following code is an order of magnitude slower

The ratio looks to me closer to 3 than to ~10, which is what would justify "an order of magnitude".

hhaensel commented 1 year ago

True. I first tested with a tuple of approximately 200 elements, and there the difference was an order of magnitude. Then I made the MWE and the ratio dropped.

barucden commented 1 year ago

Just to add to that: at least on my machine, the compilation is slower on v1.9 but the runtime is slightly faster:

julia> @time occursin.("_", t) .& startswith.(t, "_");
  # 1.8.5
  4.021965 seconds (132.36 k allocations: 7.493 MiB, 99.97% compilation time)
  # 1.9.0-rc3
 24.180451 seconds (80.60 k allocations: 5.439 MiB, 100.00% compilation time)

julia> @btime occursin.("_", t) .& startswith.(t, "_");
  # 1.8.5
  25.418 μs (9 allocations: 6.52 KiB)
  # 1.9.0-rc3
  16.245 μs (9 allocations: 6.52 KiB)
christiangnrd commented 1 year ago

The offending commit is the upgrade from LLVM 13 to LLVM 14. I also see increased compile times but better runtime performance on my Linux machines, whereas on my Apple Silicon laptop I see identical runtime performance but increased compile times.

giordano commented 1 year ago

Well, then there is probably not much that can be done about this on the Julia side if just changing the LLVM version negatively affects compilation time (while positively affecting runtime).

hhaensel commented 1 year ago

Maybe we should put a caveat in the "What's New?" section?

giordano commented 1 year ago

A caveat about your own specific case? It isn't as if all compilation regressed; this is just a very specific and unfortunate case. It will always happen that some compilation times go down while others go up; that's just how LLVM is.

hhaensel commented 1 year ago

Not a caveat about my specific case, of course. One would need to dig a bit deeper into whether this is common for tuple handling. In that case one could mention that broadcasting over large tuples produces longer compile times. The workaround would be to switch to arrays. For our MWE I find:

julia> a = collect(t);

julia> @time occursin.("_", a) .& startswith.(a, "_")
  0.155513 seconds (242.44 k allocations: 16.160 MiB, 99.58% compilation time)

julia> @btime occursin.("_", a) .& startswith.(a, "_")
  2.925 μs (10 allocations: 368 bytes)
adienes commented 1 year ago

@hhaensel in your full example, are you hitting this compilation cost a large number of times, or is it a one-time cost of 150 ms?

hhaensel commented 1 year ago

I was hitting compilation several times, as broadcasting over NTuples apparently compiles a different method for each N. And I was not hitting the 150 ms compile time but rather the 20-40 s one.
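
As a minimal sketch of that effect (illustrative only; exact timings will vary):

# Broadcasting specializes on the full tuple type, which includes the length N,
# so each new N triggers fresh compilation.
t100 = Tuple("x$(lpad(n, 5, '0'))" for n in 1:100)   # NTuple{100, String}
t101 = Tuple("x$(lpad(n, 5, '0'))" for n in 1:101)   # NTuple{101, String}

@time startswith.(t100, "_");   # pays compilation for N = 100
@time startswith.(t100, "_");   # same type again: no compilation
@time startswith.(t101, "_");   # N = 101 is a different type: compiles again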

I read more about the differences between tuples and vectors (e.g. https://discourse.julialang.org/t/when-to-use-tuples/18675/18) to understand the underlying principles. Although it's getting a bit off-topic, I want to share my basic findings (also to have them verified by the more experienced Julia folks here ;-) ).

julia> t = Tuple(Symbol("x$(lpad(n, 5, '0'))") for n in 1:100);

julia> @time String.(t);
  0.230550 seconds (529.87 k allocations: 34.547 MiB, 16.87% gc time, 98.88% compilation time)

julia> @time tuple(String(f) for f in t)
  0.026668 seconds (2.04 k allocations: 139.259 KiB, 97.73% compilation time)

I wonder whether compilation for NTuples couldn't be optimised better so that the compiled methods are not specialised on N. Even collect(::NTuple{N, T}) is compiled anew for each N. That's very surprising to me.
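
As a rough illustration of the per-N specialisation of collect (sketch only; timings will vary):

# collect specializes on the concrete tuple type NTuple{N, T},
# so every new length N pays its own compilation cost.
t30 = ntuple(identity, 30)   # NTuple{30, Int}
t31 = ntuple(identity, 31)   # NTuple{31, Int}

@time collect(t30);   # first call for this N: mostly compilation
@time collect(t30);   # same N: no recompilation
@time collect(t31);   # N = 31 is a different type: compiles again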

hhaensel commented 1 year ago

A caveat about your own specific case? It isn't as if all compilation regressed; this is just a very specific and unfortunate case. It will always happen that some compilation times go down while others go up; that's just how LLVM is.

Actually, I think it is a very general case. Sorry for this long post ...

Today I stumbled across FilePathsBase, where I observed the following:

using FilePathsBase

t1 = ("hello", "world")
t2 = ("hello", "world", "foo")

@time WindowsPath(t1)
#   0.019310 seconds (27.02 k allocations: 1.816 MiB, 99.81% compilation time)
# p"hello/world"

@time WindowsPath(t2)
#   0.019089 seconds (27.80 k allocations: 1.861 MiB, 99.82% compilation time)
# p"hello/world/foo"

Similarly, PosixPath() is recompiled for each different number of elements, although only the constructor is called.

One might argue that it was not a wise decision to store segments in tuples, but I'm quite sure that few people know about this detail.

But let's have a closer look at WindowsPath, switch the storage type to a Vector, and even add some @nospecialize annotations:

struct WindowsPath2 <: FilePathsBase.SystemPath
    segments::Vector{String}
    root::String
    drive::String
    separator::String

    function WindowsPath2(
        @nospecialize(segments::Union{Vector, Tuple}), 
        root::String,
        drive::String,
        separator::String=FilePathsBase.WIN_PATH_SEPARATOR
    )

        return new(filter!(!isempty, collect(segments)), root, drive, separator)
    end
end

function WindowsPath2(@nospecialize segments::Union{Vector, Tuple}; root="", drive="", separator="\\")
    return WindowsPath2(segments, root, drive, separator)
end

We see that WindowsPath2 is still recompiled:

@time WindowsPath2(t1)
#   0.039903 seconds (34.32 k allocations: 2.294 MiB, 99.82% compilation time)
# p"hello\world"

@time WindowsPath2(t2)
#  0.012491 seconds (17.02 k allocations: 1.120 MiB, 99.68% compilation time)
# p"hello\world\foo"

The reason is that, as mentioned above, collect() is also recompiled for each new tuple type. I find this quite unfortunate, as conversion between NTuples and Vectors really should be optimised in Julia.

With the following lines we can, however, remedy the recompilation of collect():

function Base.collect(::Type{T}, @nospecialize(t::Tuple)) where T
    res = Vector{T}(undef, length(t))
    i = 1
    for x in t # in eachindex(t)
        res[i] = x #t[i]
        i += 1
    end
    res
end

function Base.collect(@nospecialize t::Tuple)
    if isempty(t)
        []
    elseif t isa NTuple
        T = typeof(t)
        Base.collect(T.parameters[1], t)
    else
        Base._collect(1:1 #= Array =#, t, Base.IteratorEltype(t), Base.IteratorSize(t))
    end
end

and then find:

@time WindowsPath2(t1)
#   0.028509 seconds (11.37 k allocations: 754.023 KiB, 99.71% compilation time: 52% of which was recompilation)
# p"hello\world"

@time WindowsPath2(t2)
#   0.000023 seconds (2 allocations: 128 bytes)
# p"hello\world\foo"

Maybe I'm overlooking something here? The remedy of a faster collect() for NTuples is certainly an improvement, but I think the same kind of improvement should be done internally for any broadcasting over NTuples.

JeffBezanson commented 1 year ago

Maybe we should add a performance tip to the manual about not using tuples for large, varying-size collections.
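
As a sketch of what such a tip could demonstrate (illustrative code only, not proposed manual text; the helper f is hypothetical):

# A Vector has one concrete type regardless of its length, so the broadcast
# kernel compiles once; each tuple length N is a distinct type and compiles
# separately.
f(xs) = startswith.(xs, "_")

@time f(["x$i" for i in 1:50]);    # compiles for Vector{String}
@time f(["x$i" for i in 1:500]);   # same type: no recompilation
@time f(ntuple(i -> "x$i", 50));   # compiles for NTuple{50, String}
@time f(ntuple(i -> "x$i", 60));   # compiles again for NTuple{60, String}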

hhaensel commented 1 year ago

Maybe I'm overlooking something here?

Well, I did miss a decrease in performance, which is annoying. My bad! (More details in #50960.)

For FilePathsBase it is probably OK to sacrifice a bit of compile time, as the number of nested directories is limited.