JuliaMath / FixedPointNumbers.jl

fixed point types for julia

Don't specialize on type in showtype #232

Closed KristofferC closed 4 years ago

KristofferC commented 4 years ago

A significant fraction of the time spent precompiling the package goes to compiling methods like:

...
precompile(Tuple{typeof(FixedPointNumbers.showtype), Base.GenericIOBuffer{Array{UInt8, 1}}, Type{FixedPointNumbers.Normed{UInt64, 53}}})
precompile(Tuple{typeof(FixedPointNumbers.showtype), Base.GenericIOBuffer{Array{UInt8, 1}}, Type{FixedPointNumbers.Normed{UInt64, 54}}})
precompile(Tuple{typeof(FixedPointNumbers.showtype), Base.GenericIOBuffer{Array{UInt8, 1}}, Type{FixedPointNumbers.Normed{UInt64, 55}}})
precompile(Tuple{typeof(FixedPointNumbers.showtype), Base.GenericIOBuffer{Array{UInt8, 1}}, Type{FixedPointNumbers.Normed{UInt64, 56}}})
precompile(Tuple{typeof(FixedPointNumbers.showtype), Base.GenericIOBuffer{Array{UInt8, 1}}, Type{FixedPointNumbers.Normed{UInt64, 57}}})
precompile(Tuple{typeof(FixedPointNumbers.showtype), Base.GenericIOBuffer{Array{UInt8, 1}}, Type{FixedPointNumbers.Normed{UInt64, 58}}})
precompile(Tuple{typeof(FixedPointNumbers.showtype), Base.GenericIOBuffer{Array{UInt8, 1}}, Type{FixedPointNumbers.Normed{UInt64, 59}}})
precompile(Tuple{typeof(FixedPointNumbers.showtype), Base.GenericIOBuffer{Array{UInt8, 1}}, Type{FixedPointNumbers.Normed{UInt64, 60}}})
precompile(Tuple{typeof(FixedPointNumbers.showtype), Base.GenericIOBuffer{Array{UInt8, 1}}, Type{FixedPointNumbers.Normed{UInt64, 61}}})
precompile(Tuple{typeof(FixedPointNumbers.showtype), Base.GenericIOBuffer{Array{UInt8, 1}}, Type{FixedPointNumbers.Normed{UInt64, 62}}})
precompile(Tuple{typeof(FixedPointNumbers.showtype), Base.GenericIOBuffer{Array{UInt8, 1}}, Type{FixedPointNumbers.Normed{UInt64, 63}}})
precompile(Tuple{typeof(FixedPointNumbers.showtype), Base.GenericIOBuffer{Array{UInt8, 1}}, Type{FixedPointNumbers.Normed{UInt64, 64}}})
...

in

https://github.com/JuliaMath/FixedPointNumbers.jl/blob/f57e96b43ba575a05d08f62cb216ba633daf1ab8/src/fixed.jl#L32-L41

By avoiding specialization of showtype on the type argument, precompilation time is roughly halved.
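The approach can be sketched as follows (a hypothetical simplification for illustration, not the actual diff; the real method lives in src/fixed.jl and prints the alias form):

```julia
# Hypothetical sketch of the approach, not the actual diff.
# Marking the type argument with @nospecialize tells the compiler to
# compile one generic method instead of a fresh specialization for
# every Fixed/Normed parameterization encountered during precompilation.

# Before: a new specialization is compiled for each X
# (Normed{UInt64,53}, Normed{UInt64,54}, ...):
# showtype(io::IO, ::Type{X}) where {X <: FixedPoint} = ...

# After: a single compiled method covers all type arguments:
function showtype(io::IO, @nospecialize(X::Type))
    print(io, X)  # illustrative body only
    return io
end
```

Since the method body only does IO, giving up specialization costs at most a dynamic dispatch per call, while precompilation no longer has to compile one copy per type parameter.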

Before:

1.999983 seconds (732.52 k allocations: 42.130 MiB, 0.25% gc time)

After:

0.899459 seconds (729.63 k allocations: 41.771 MiB, 0.70% gc time)

Another option could be to run that piece of code in a separate module and have that module run without inference.

codecov[bot] commented 4 years ago

Codecov Report

Merging #232 into master will not change coverage. The diff coverage is 100.00%.


@@           Coverage Diff           @@
##           master     #232   +/-   ##
=======================================
  Coverage   96.24%   96.24%           
=======================================
  Files           6        6           
  Lines         692      692           
=======================================
  Hits          666      666           
  Misses         26       26           
Impacted Files Coverage Δ
src/FixedPointNumbers.jl 96.49% <100.00%> (ø)


kimikage commented 4 years ago

Doesn't this affect the run-time performance of commonly used types like N0f8? I think it's better not to use showtype around the @eval macros for aliases.

KristofferC commented 4 years ago

Doesn't this affect the run-time performance of commonly used types like N0f8?

Could you give an example benchmark of what you mean?

kimikage commented 4 years ago

This isn't the most up-to-date, but it's FYI. https://github.com/JuliaMath/FixedPointNumbers.jl/pull/189#issuecomment-653765909

The absolute value of the benchmark itself is not very meaningful because of the IO overhead, but it does affect the speed of printing for ColorTypes.

Edit: Perhaps I misled you, but I showed there that there is a problem with the current implementation of showtype. As I said in that comment, my feeling is that it is too slow.

The reason I used write instead of print is not that write lends itself better to optimization; I used write as a mitigation because showtype did not work well with optimizations (specializations). Of course, this is a side point, not what I mainly wanted to tell you.

timholy commented 4 years ago

@kimikage, could you test this branch and see?

Another alternative would be to write this as

@inline function showtype(io::IO, ::Type{X}) where {X <: FixedPoint}
    hasalias(X) && return _showtype(io, typechar(X), nbitsfrac(X), nbitsint(X))
    _print(io, X)
    return io
end

@noinline function _showtype(io::IO, c::Char, f::Int, m::Int)
    write(io, c)
    m > 9 && write(io, Char(m ÷ 10 + 0x30))
    write(io, Char(m % 10 + 0x30), 'f')
    f > 9 && write(io, Char(f ÷ 10 + 0x30))
    write(io, Char(f % 10 + 0x30))
    return io
end

_print(io::IO, @nospecialize(X)) = print(io, X)

_showtype gets specialized only once, which might deliver most of the benefits of this PR (I get 2.5s vs 1.5s).

kimikage commented 4 years ago

Ah, I submitted that PR to clarify my idea; I don't mean to force the use of PR #233. I think it's essentially the same as @timholy's idea. (Edit: in terms of aggregating a large amount of specialization into a single non-specialized method.)

KristofferC commented 4 years ago

This isn't the most up-to-date, but it's FYI. #189 (comment) The absolute value of the benchmark itself is not very meaningful because of the overhead of IO, but it actually affects the speed of printing for ColorTypes.

I tried

julia> io = IOBuffer()

julia> @btime FixedPointNumbers.showtype($io, N63f1)
  34.884 ns (0 allocations: 0 bytes)

and I get no difference between this PR and master.

Is the worry that you will print a FixedPoint type hundreds of millions of times?

kimikage commented 4 years ago

Using $io encourages constant folding. Unfortunately, that doesn't match most actual use cases.

It's not up to a million, but printing a type 1000 times is a realistic figure for images. That's not the recommended use case, though.
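The disagreement over `$io` can be made concrete with a sketch of the two measurement styles (hypothetical, using BenchmarkTools; which style better models real callers is exactly the point of contention here):

```julia
using FixedPointNumbers, BenchmarkTools

io = IOBuffer()

# Style 1: interpolating with $io splices the buffer in as a
# benchmark-local value of known concrete type, giving the compiler
# maximum room to inline and devirtualize around the call.
@btime FixedPointNumbers.showtype($io, N0f8)

# Style 2: without interpolation, io is a non-constant global, so each
# call goes through dynamic dispatch -- the opposite extreme.
@btime FixedPointNumbers.showtype(io, N0f8)
```

Real callers (e.g. show methods in ColorTypes) sit between these extremes: io arrives as a function argument with a concrete inferred type, but without the constant-propagation opportunities that `$`-interpolation can create.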

timholy commented 4 years ago

But the io argument's specialization is not affected by this PR?

Could you come up with a benchmark that shows the effect you're worried about?

KristofferC commented 4 years ago

Using $io encourages the constant folding. Unfortunately, that doesn't match most actual use cases.

I don't really get it. How would you "constant fold" on an IOBuffer()?

kimikage commented 4 years ago

I don't really get it. How would you "constant fold" on an IOBuffer()?

I'm not sure whether constant folding is the proper term in this case, but specialization reduces the number of function calls.

You are much more familiar with Julia, so I'm not going to complain about this any more.

KristofferC commented 4 years ago

You are much more familiar with Julia, so I'm not going to complain about this any more.

I'm happy to make any changes you want. Just trying to figure out a good way to measure so I know they are effective.

timholy commented 4 years ago

Agreed; this isn't a critique. We're just trying to understand whether there's something important we're overlooking.

Let's leave this open for a day or so and see if someone comes up with a benchmark that favors one approach over the other. If not, let's merge.

kimikage commented 4 years ago

It's still slow but this PR (and upcoming changes in ColorTypes) can reduce the number of times where the type aliases are printed. Therefore, I will not optimize showtype with a string dictionary or something.

Did you read this comment? It's a fair argument in one sense that there's no need for specialization because there's no difference in the current slow implementation, but in another sense it's not fair.

Currently, a "single" ARGB32 color is printed as follows.

julia> ARGB32(0.1,0.2,0.3,0.4)
ARGB32(0.102N0f8,0.2N0f8,0.298N0f8,0.4N0f8)

In this case (i.e. without https://github.com/JuliaGraphics/ColorTypes.jl/pull/206), the "fully-optimized" version will certainly speed up the ARGB32s printing.

I'm not inherently opposed to this PR, but I am uncomfortable with the bias in the decision-making process.

KristofferC commented 4 years ago

It's a fair argument in one sense that there's no need for specialization because there's no difference in the current slow implementation, but in another sense it's not fair.

How is that not fair? If you get the same performance with and without it, isn't it better not to compile hundreds of redundant versions of the same method?

I'm not inherently opposed to this PR, but I am uncomfortable with the bias in the decision-making process.

All I have done is:

Which bias are you pointing to? Also, even if this made the function say two times slower you would have to print the type tens of millions of times to notice a difference. And printing types is not done that much.

kimikage commented 4 years ago

I think the printing is still slow (in ColorTypes).

but it actually affects the speed of printing for ColorTypes.

It's a matter of feeling, but you seem to be neglecting the issue.

Again, I don't see a problem with this PR. I think it might be better to avoid calling the showtype than to specify @nospecialize.

KristofferC commented 4 years ago

It's a matter of feeling, but you seem to be neglecting the issue.

I don't think that the time to print something can be a matter of feeling since it can be measured. With "the issue", do you mean I should benchmark the time to print a matrix of ColorTypes to the REPL, for example?

I think it might be better to avoid calling the showtype than to specify @nospecialize.

If you have an option you like more, feel free to go with that. But if there is no performance gain from specializing, it seems unnecessary to compile this function for all the different types at runtime as well. This @nospecialize is not much different from many places in Julia Base where arguments that don't benefit significantly from specialization get a @nospecialize (frequently done for e.g. DataTypes).

timholy commented 4 years ago

I'm totally happy to close this without merging, all that it would require is a benchmark that proves it's a bad idea. I don't see any bias in that. @KristofferC has supplied a benchmark that shows a convincing compile-time advantage; at the moment all we have is a theoretical runtime disadvantage, but efforts to actually demonstrate it have not indicated that it's real. Currently the balance of incontrovertible evidence is strongly in favor of merging, but that could change in an instant with the right benchmark.

If you need more time to generate one, that's fine too; this isn't urgent, since the times involved are not huge either way.

kimikage commented 4 years ago

I don't mean to pick a fight, but @KristofferC doesn't benchmark the disadvantages of not doing specialization. I'm unhappy that I'm being treated like I'm not being fair. :confused:

but efforts to actually demonstrate it have not indicated that it's real.

Efforts? We cannot benchmark what does not currently exist. How can I show the problem other than in theory? In theory, the importance of showtype in printing Colors is obvious.

using FixedPointNumbers

function FixedPointNumbers.showtype(io::IO, ::Type{N0f8})
    print(io, "N0f8")
    io
end

using ColorTypes, BenchmarkTools

img = reinterpret.(ARGB32, rand(UInt32, 32, 32));

function dump(io, img32x32)
    for y in 1:32
        for x in 1:32
            @inbounds print(io, img32x32[y, x], '\t')
        end
        println(io)
    end
end

io = IOBuffer();

julia> versioninfo()
Julia Version 1.5.1
Commit 697e782ab8 (2020-08-25 20:08 UTC)
Platform Info:
  OS: Windows (x86_64-w64-mingw32)
  CPU: Intel(R) Core(TM) i7-8565U CPU @ 1.80GHz
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-9.0.1 (ORCJIT, skylake)

julia> @benchmark dump(io, $img)
BenchmarkTools.Trial:
  memory estimate:  1.70 MiB
  allocs estimate:  9216
  --------------
  minimum time:     621.000 μs (0.00% GC)
  median time:      1.327 ms (0.00% GC)
  mean time:        1.516 ms (17.37% GC)
  maximum time:     176.787 ms (0.18% GC)
  --------------
  samples:          3306
  evals/sample:     1

julia> @benchmark dump(io, $img) # nospecialize
BenchmarkTools.Trial:
  memory estimate:  1.70 MiB
  allocs estimate:  9216
  --------------
  minimum time:     756.400 μs (0.00% GC)
  median time:      1.542 ms (0.00% GC)
  mean time:        2.029 ms (14.41% GC)
  maximum time:     527.388 ms (1.27% GC)
  --------------
  samples:          2583
  evals/sample:     1

Of course, this is bogus and does not indicate a problem with this PR. I'm not saying we have a problem with using @nospecialize, I'm saying it would be better to not use @nospecialize.

First of all, Julia v1.6.0-DEV has a fatal regression in printing Colors. That shouldn't have anything to do with this PR, but I'm interested in the benchmark by which you determined there are no practical problems with the printing. (Perhaps it relates to the abandonment of inference in Ryu, but I haven't kept track of it.)

kimikage commented 4 years ago

I didn't submit PR #233 to point out the problems with @nospecialize. I believe PR #233 will be needed for #228 eventually (before the release of v0.9 due to the depwarn). I thought the idea of PR #233 would be acceptable to you folks, so I'm surprised that the subject of discussion went to @nospecialize.

KristofferC commented 4 years ago

I don't mean to pick a fight, but @KristofferC doesn't benchmark the disadvantages of not doing specialization.

I thought I did: benchmarking the function that has the @nospecialize should measure that, but maybe there are better ways, or maybe that doesn't benchmark it properly. Happy to learn.

I'm surprised that the subject of discussion went to @nospecialize.

I was just trying out some ways of seeing which functions end up getting compiled when packages are precompiled. I loaded Plots (I think) and it stuck out to me that hundreds of versions of the same method were being compiled. Since it was a method where performance is not usually important, I did a quick fix, noticed it halved the time to precompile that package, and made a PR.

Anyway, feel free to go with https://github.com/JuliaMath/FixedPointNumbers.jl/pull/233/ or whatever you prefer. This PR went in a strange direction indeed.

kimikage commented 4 years ago

I apologize for disrupting the discussion.

I think the timing was a bit unlucky this time. While refactoring show in ColorTypes, I noticed that FixedPointNumbers.showtype was too slow. So I fixed the immediate problem, but I was faced with the problem that specialization would not remove the function calls in a "safe way". (I also realized that benchmarking with $io was not good.) In the end, I merged PR #189 with the slow showtype.

Evaluating the difference in speed due to changes (with/without @nospecialize) is generally a reasonable benchmark. However, it is just a comparison to the status quo. I recognized that it was clearly bad to need ~100 ns to write just 4 bytes of "N0f8", and I know that it would affect the text printing of images. I regret that I did not complete the necessary work before you submitted this PR.

PR #233 is not technically superior to this PR; it is simply a matter of convenience in the development of this package. (From a technical standpoint, @timholy's idea may be the most technical. :smile:)

I believe the developers have benefited from the fact that such a small change can make a big difference in precompilation time. :+1: