JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.91k stars 5.49k forks source link

Performance regression in 1.0.1 #29464

Closed jmichel7 closed 6 years ago

jmichel7 commented 6 years ago

One of my programs which took 1mn on 1.0.0 takes 5mn on 1.0.1 I tracked at least part of the problem to the following code snippet:

function foo(gens,elts::Vararg{Vector{Int}})
  ee=collect(elts)
  for i in eachindex(gens)
    if let i=i
      all(w->w[i]>w[i+1],ee)
       end
      error("should not happen")
    end
  end
end

l=map(i->collect(1:5),1:4)
for i in 1:4 l[i][[i,i+1]]=l[i][[i+1,i]] end
const gens=l
const a=[2,1,3,4,5]
const b=[1,2,4,3,5]
test()=foo(gens,a,b)

On 1.0.0

julia> @btime test()
  176.287 ns (2 allocations: 128 bytes)

On 1.0.1:

julia> @btime test()
  695.735 ns (6 allocations: 192 bytes)

Does anyone know what’s happening here?

KristofferC commented 6 years ago

I'm bisecting

KristofferC commented 6 years ago
commit b5e2c2ff2f563140fa707674ec0789e03951b170
Author: Klaus Crusius <KlausC@users.noreply.github.com>
Date:   Wed Sep 12 23:35:11 2018 +0200

    Allow `Iterators.flatten` for empty tuple (#29112)

    (cherry picked from commit b4c370dd05ac33d0c8c02b440b861e8582d1fc74)

:040000 040000 b0f6ee2c6abb7b729a78c14b39dc9f823289976c 817f36cd24bdc1696ce76067b707506378748205 M      base
:040000 040000 07f50c8dce7d7a9b2fb21608bd1a0c94ac991c20 adf4a997f804fbcdab58c7df9356eb74ddde2a6d M      test

cc @KlausC #29112

KristofferC commented 6 years ago

@code_typed is identical for the function on the commit with the regression and the one before...

jmichel7 commented 6 years ago

Oops! I closed the issue by a misclick. I wanted to say: I do not know if the issue has been identified, but if this may help here is a simplified snippet which still shows exactly the same problem:

 function foo(elts::Vararg{Vector{Int}})
  ee=collect(elts)
  for i in 1:4
    if all(w->w[i]>w[i+1],ee)
      error("should not happen")
    end
  end
end

const a=[2,1,3,4,5]
const b=[1,2,4,3,5]
test()=foo(a,b)
KristofferC commented 6 years ago

Here is a way to generate arbitrary slowdown:

N = 10000
a = collect(1:N)
b = collect(2:N+1)

function foo(elts::Vararg{Vector{Int}})
    ee=collect(elts)
    for i in 1:length(ee[1])-1
       f = w -> w[i] > w[i+1]
       all(f,ee)
    end
end

using BenchmarkTools

@btime foo($a,$b)

0.7:

julia> @btime foo($a,$b)
  12.316 μs (2 allocations: 128 bytes)

1.0.1

julia> @btime foo($a,$b)
  3.672 ms (38979 allocations: 765.39 KiB)

Profiling shows that we spend a lot in the runtime.

julia> Profile.print(noisefloor=5, C=true)
3255 /Users/kristoffer/julia/src/task.c:271; start_task
 3255 /Users/kristoffer/julia/src/./julia.h:1559; jl_apply
  3255 ./task.jl:259; (::getfield(REPL, Symbol("##26#27")){REPL.REPLBackend})()
   3255 .../julia/stdlib/v1.1/REPL/src/REPL.jl:117; macro expansion
    3255 .../julia/stdlib/v1.1/REPL/src/REPL.jl:85; eval_user_input(::Any, ::REPL.REPLBackend)
     3255 ./boot.jl:319; eval(::Module, ::Any)
      3255 ...rs/kristoffer/julia/src/builtins.c:622; jl_toplevel_eval_in
       3255 ...rs/kristoffer/julia/src/toplevel.c:812; jl_toplevel_eval_flex
        3255 /Users/kristoffer/julia/src/gf.c:1843; jl_fptr_trampoline
         3255 ...tdlib/v1.1/Profile/src/Profile.jl:25; top-level scope
          3255 ./REPL[9]:1; macro expansion
           291  ./REPL[4]:3; foo(::Array{Int64,1}, ::Vararg{Array{Int64,1},N} ...
            204 .../kristoffer/julia/src/builtins.c:731; jl_f_getfield
             79 .../kristoffer/julia/src/datatype.c:543; jl_new_bits
              74 .../kristoffer/julia/src/datatype.c:681; jl_box_int64
               63 ...er/julia/src/./julia_internal.h:263; jl_gc_alloc_
           1388 ./REPL[4]:4; foo(::Array{Int64,1}, ::Vararg{Array{Int64,1},N} ...
            778 ...s/kristoffer/julia/src/jltypes.c:879; jl_apply_type
             755 ...s/kristoffer/julia/src/jltypes.c:925; jl_instantiate_unionall
              535 .../kristoffer/julia/src/jltypes.c:1499; inst_type_w_
               535 .../kristoffer/julia/src/jltypes.c:1307; inst_datatype
                493 ...kristoffer/julia/src/jltypes.c:1301; inst_datatype_env
                 437 ...kristoffer/julia/src/jltypes.c:1297; inst_datatype_env
                  293 ...ristoffer/julia/src/jltypes.c:1092; inst_datatype_inner
                   221 ...ristoffer/julia/src/jltypes.c:716; lookup_type
                    118 ...ristoffer/julia/src/jltypes.c:678; lookup_type_idx
           1553 ./REPL[4]:5; foo(::Array{Int64,1}, ::Vararg{Array{Int64,1},N} ...
            828 /Users/kristoffer/julia/src/gf.c:2194; jl_apply_generic
             643 /Users/kristoffer/julia/src/gf.c:2148; jl_lookup_generic_
              628 ...er/julia/src/./julia_internal.h:876; jl_typemap_assoc_exact
               431 .../kristoffer/julia/src/typemap.c:798; jl_typemap_entry_assoc_exact
                385 ...kristoffer/julia/src/subtype.c:1229; jl_tuple_isa
                 261 ...kristoffer/julia/src/subtype.c:1283; jl_isa
                  261 ...ristoffer/julia/src/subtype.c:1198; jl_subtype
                   261 ...ristoffer/julia/src/subtype.c:1180; jl_subtype_env
                    115 ...ristoffer/julia/src/subtype.c:1120; forall_exists_subtype
                     111 ...istoffer/julia/src/subtype.c:1092; exists_subtype
            280 ...er/julia/usr/lib/julia/sys.dylib:?; jfptr_iterate_7082
             156 /Users/kristoffer/julia/src/gc.c:965; jl_gc_pool_alloc
             328 ./reducedim.jl:658; all(::Function, ::Array{Array{Int64,1},1})
              285 ./reducedim.jl:658; #all#550(::Function, ::Function, ::Function, ::A...
               94  /Users/kristoffer/julia/src/gf.c:2194; jl_apply_generic
KristofferC commented 6 years ago

Ok, it is a code specialization heuristic. Changing the function declaration from

foo(elts::Vararg{Vector{Int}})

to

foo(elts::Vararg{Vector{Int}, N}) where {N}

fixes it:

julia> @btime foo($a,$b)
  11.868 μs (1 allocation: 96 bytes)
mauro3 commented 6 years ago

Is this more or less what is described here: https://docs.julialang.org/en/v1/devdocs/functions/#Compiler-efficiency-issues-1?

KristofferC commented 6 years ago

I guess, don't know how it is related to #29112 though. Or if I just worked around the problem with the extra specialization but the original issue is not fixed.

martinholters commented 6 years ago

The collect call there cannot be inferred in the non-specialized case. (Check code_warntype(foo, Tuple{Vararg{Vector{Int}}})). That's a consequence of Base.IteratorEltype not be inferable in that case:

julia> code_warntype(Base.IteratorEltype, Tuple{Tuple{Vararg{Vector{Int}}}})
Body::Union{EltypeUnknown, HasEltype}
123 1 ─ %1 = (Base.typeof)(x)::Type{#s55} where #s55<:Tuple{Vararg{Array{Int64,1},N} where N}                   │
    │   %2 = (isa)(%1, Type{Union{}})::Bool                                                                     │
    └──      goto #3 if not %2                                                                                  │
    2 ─      goto #4                                                                                            │
    3 ─ %5 = (Base.IteratorEltype)(%1)::Union{EltypeUnknown, HasEltype}                                         │
    └──      goto #4                                                                                            │
    4 ┄ %7 = φ (#2 => $(QuoteNode(Base.EltypeUnknown())), #3 => %5)::Union{EltypeUnknown, HasEltype}            │
    └──      return %7     

Removing this method, added in #29112: https://github.com/JuliaLang/julia/blob/1968b239fa6f5ab266e62e2240d828f37622ee83/base/generator.jl#L129 we get:

julia> code_warntype(Base.IteratorEltype, Tuple{Tuple{Vararg{Vector{Int}}}})
Body::Base.HasEltype
123 1 ─     return $(QuoteNode(Base.HasEltype()))     

and performance of the original example is back at 0.7 levels. Not sure what to do here.

StefanKarpinski commented 6 years ago

@JeffBezanson, any ideas how to do better here?

KristofferC commented 6 years ago

Fixed https://github.com/JuliaLang/julia/pull/29548