JuliaLang / julia

The Julia Programming Language
https://julialang.org/
MIT License
45.6k stars 5.48k forks source link

println is 10 ms slower than print (first use), likely missing precompiles on 1.11.1, and down to 1.10.5 actually; both allocate too [much] (and not as much), after precompilation #56200

Open PallHaraldsson opened 5 hours ago

PallHaraldsson commented 5 hours ago

A. That PR is (upcoming, unless it slightly modified for this) 1.11.1:

$ hyperfine 'julia +pr56180 -e "print(1)"'
Benchmark 1: julia +pr56180 -e "print(1)"
  Time (mean ± σ):     193.9 ms ±  15.5 ms    [User: 354.7 ms, System: 64.3 ms]
  Range (min … max):   168.7 ms … 212.1 ms    14 runs

I never get close to 168.7 ms for:

$ hyperfine 'julia +pr56180 -e "println(UInt64(1))"'
Benchmark 1: julia +pr56180 -e "println(UInt64(1))"
  Time (mean ± σ):     205.8 ms ±  15.9 ms    [User: 456.6 ms, System: 64.5 ms]
  Range (min … max):   178.8 ms … 225.2 ms    13 runs

$ julia +pr56180 --trace-compile=stderr -e "print(1)"
precompile(Tuple{typeof(Base.print), Int64})
precompile(Tuple{typeof(Base.print), Base.TTY, Int64})
1

$ julia +pr56180 --trace-compile=stderr -e "println(UInt64(1))"
precompile(Tuple{typeof(Base.println), UInt64})
precompile(Tuple{typeof(Base.println), Base.TTY, UInt64})
1

Both seem to miss the same types of precompiles, and I'm not sure if it's println vs print (and for what reason; slower to compile first) and/or Int[64] vs UInt64. Would be nice to figure out and fix before 1.11.1... (or 1.11.2 or 1.12).

This sort of thing is common to most programs, so argument for being precompiled into the sysimage? Will it bloat it much, even slow the above down?

B. Note, faster AND also fewer allocations:

julia> @time print(1);
1  0.000053 seconds (3 allocations: 72 bytes)

than:
julia> @time println(1)
1
  0.000061 seconds (5 allocations: 120 bytes)

this has even 1 fewer allocations:

julia> @time (print(1); print("\n"))
1
  0.000066 seconds (4 allocations: 88 bytes)

usually but NOT always...

julia> @time (print(1); print("\n"))
1
  0.000064 seconds (5 allocations: 376 bytes)

Ideally these would have no allocations at runtime, nor at compile-time, i.e. since it should be eliminated...

You might think with Char, '\n', faster, but it has actually 1-2 allocations more (non-ideal for explaining optimal code [for benchmarks...]):

julia> @time (print(1); print('\n'))
1
  0.000065 seconds (6 allocations: 392 bytes)

C.

Strange optimization saving (an allocation):

julia> @time println(big"1")
1
  0.000056 seconds (4 allocations: 88 bytes)

vs.

julia> @time println(1)
1
  0.000064 seconds (5 allocations: 120 bytes)

Could this or any of my tests be about the global scope?
KristofferC commented 4 hours ago

precompile(Tuple{typeof(Base.setindex!), Base.EnvDict, Bool, String})

I guess this is from your startup file?

Ideally these would have no allocations at runtime, nor at compile-time, i.e. since it should be eliminated...

As you have seen (https://github.com/JuliaLang/julia/pull/38227), currently printing bitstype requires an allocation because of some internal reason that it passes something as a pointer.

PallHaraldsson commented 3 hours ago

YES! Base.EnvDict now redacted from the top, it was a distraction (not changing the conclusion), I usually do not forget --startup-file=no and only this was in my file:

1+2
ENV["JULIA_PKG_PRESERVE_TIERED_INSTALLED"] = false # true