JuliaSIMD / VectorizedRNG.jl

Vectorized uniform and normal random samplers.
MIT License
33 stars 7 forks source link

Startup time #5

Open PallHaraldsson opened 4 years ago

PallHaraldsson commented 4 years ago

I see you threw out PCG, because of your concern for the startup time.

While ProfileView.jl is really good, it's maybe not too helpful to do:

@profview using VectorizedRNG

or at least I'm not going to try to read into it (it's the first time I used in on using). For the first run using time (excluding when Precompiling) however, there's a gap, seemingly like it's doing nothing (much) for about 10% of the time.

That can't be right, I can upload a screenshot if you don't notice this right away. Any idea what's happening?

The second using while much faster isn't really too fast, and while you wouldn't maybe or at least shouldn't(?) be doing that, it also has an interesting profile.

Your package is not unique to this second using "issue", bu it's not shared by all packages so (so not a julia issue), I'm a little curious about it too.

[ProfileSVG.jl is also interesting, I tried it, and it generates 4 MB SVG file, and my Firefox took its time displaying fully with "your web page is slowing down your browser" and showing 8 screen-fulls if I recall...]

$ julia
julia> @time using VectorizedRNG
  2.090021 seconds (3.48 M allocations: 178.768 MiB, 7.77% gc time)

julia> @time using VectorizedRNG
  0.644264 seconds (1.16 M allocations: 54.428 MiB, 1.76% gc time)

julia> @time using VectorizedRNG
  0.000276 seconds (566 allocations: 29.734 KiB)

First time (not really a concern, but for some, maybe all the delay they'll ever see?):
julia> @time using VectorizedRNG
[ Info: Precompiling VectorizedRNG [33b4df10-0173-11e9-2a0c-851a7edac40e]
 12.832126 seconds (4.41 M allocations: 222.948 MiB, 0.89% gc time)
chriselrod commented 4 years ago

Your package is not unique to this second using "issue", bu it's not shared by all packages so (so not a julia issue), I'm a little curious about it too.

Hmm. At first I thought it may have something to do with the init function that gets called when the library gets loaded. However, it seems __init__ only gets called the first time. I base that on the fact that the init function randomly initializes the seed. Setting VectorizedRNG.seed! followed by using VectorizedRNG doesn't seem to have any effect on the RNG.

I suspect the init, and the need to compile some of the functions it calls, is a part of the reason for the lag. Still, that makes me uncomfortable with making it a dependency for other libraries.

I'm far from an expert on compilation time, so anything you find would be greatly appreciated.

PallHaraldsson commented 4 years ago

there's a gap, seemingly like it's doing nothing (much)

I found out about my so-called "gap", I at least wouldn't worry about it, even, the larger the better. I would worry more about the non-gap time, IF it's running your code.

For you _include_from_serialized was the "gap", and I think it's part of the process of loading the precompiled code in, very useful. For you, eyeballing, about 1/4 of the time (for line 681).

But that code takes almost all of your time (a good thing), if you look carefully (not just that line 681 though).

I no longer think it's doing nothing/waiting for disk, as the gap disappears if you profile into C code too:

julia> using Profile
julia> using ProfileView

julia> @profile using VectorizedRNG

julia> ProfileView.view(C = true, fontsize = 30)
chriselrod commented 3 years ago

Solved on Julia master?

julia> @time using VectorizedRNG
[ Info: Precompiling VectorizedRNG [33b4df10-0173-11e9-2a0c-851a7edac40e]
  1.079043 seconds (332.52 k allocations: 20.824 MiB, 0.77% gc time, 7.75% compilation time)

julia> @time using VectorizedRNG
  0.000069 seconds (81 allocations: 5.984 KiB)

julia> @time using VectorizedRNG
  0.000072 seconds (81 allocations: 5.984 KiB)
PallHaraldsson commented 3 years ago

Yes, the second using problem is gone (but seems could load faster, see e.g. with OhMyREPL below):

julia> @time @time using VectorizedRNG
  0.948739 seconds (1.01 M allocations: 71.165 MiB, 2.72% gc time, 0.75% compilation time)
  1.008083 seconds (1.07 M allocations: 75.008 MiB, 2.56% gc time, 0.71% compilation time)

julia> @time @time using VectorizedRNG
  0.000195 seconds (81 allocations: 6.062 KiB)
  0.041186 seconds (12.95 k allocations: 813.111 KiB)

julia> @time @time using VectorizedRNG
  0.000182 seconds (81 allocations: 6.062 KiB)
  0.000309 seconds (119 allocations: 8.297 KiB)

It's your call if just under 1 sec. is good enough.

@timholy, Also FYI, a bit strange that the second using "problem" is back if I use my startup file (and that (first) using gets this much slower) even without Revise installed:

┌ Warning: error while importing Revise
│   e =
│    ArgumentError: Package Revise not found in current path:
│    - Run `import Pkg; Pkg.add("Revise")` to install the Revise package.
│    
└ @ Main ~/.julia/config/startup.jl:5
julia> @time @time using VectorizedRNG
  2.383549 seconds (2.10 M allocations: 131.875 MiB, 0.66% gc time, 0.36% compilation time)
  2.454419 seconds (2.16 M allocations: 135.683 MiB, 0.64% gc time, 0.35% compilation time)

julia> @time @time using VectorizedRNG
  0.059570 seconds (32.79 k allocations: 1.891 MiB, 99.72% compilation time)
  0.059708 seconds (32.83 k allocations: 1.894 MiB, 99.49% compilation time)

julia> @time @time using VectorizedRNG
  0.000189 seconds (81 allocations: 6.062 KiB)
  0.043761 seconds (12.95 k allocations: 813.111 KiB)

julia> @time @time using VectorizedRNG
  0.000186 seconds (81 allocations: 6.062 KiB)
  0.000317 seconds (119 allocations: 8.297 KiB)

It's an outdated method (but I thought wouldn't slow this much down):

~/.julia/config/startup.jl

atreplinit() do repl
    try
        @eval @time using Revise
    catch e
        @warn "error while importing Revise" e
    end
end

Note, since Revise wasn't installed, I tried that, then not as bad, still slower (and with same speed with the above startup-file, so only catch-clause seems slow):

julia> @time using Revise
  0.650733 seconds (563.69 k allocations: 39.741 MiB, 2.32% gc time, 1.69% compilation time)

julia> @time @time using VectorizedRNG
  1.745308 seconds (1.43 M allocations: 95.068 MiB, 1.14% gc time)
  1.745396 seconds (1.43 M allocations: 95.071 MiB, 1.14% gc time)

julia> @time @time using VectorizedRNG
  0.000213 seconds (81 allocations: 6.062 KiB)
  0.043323 seconds (12.95 k allocations: 813.111 KiB)

julia> @time @time using VectorizedRNG
  0.000207 seconds (81 allocations: 6.062 KiB)
  0.000357 seconds (119 allocations: 8.297 KiB)

In contrast if I only do, then a bit faster/fewer allocations, than using your package alone with or without Revise:

julia> @time @time using OhMyREPL

julia> @time @time using VectorizedRNG
  0.899325 seconds (896.82 k allocations: 63.453 MiB, 1.95% gc time)
  0.899427 seconds (896.87 k allocations: 63.456 MiB, 1.95% gc time)

[here, not a lot faster but fewer allocations above, I think indication that you could do more procompilation.]

julia> @time @time using VectorizedRNG
  0.000095 seconds (81 allocations: 6.062 KiB)
  0.000157 seconds (119 allocations: 8.297 KiB)

julia> @time @time using VectorizedRNG
  0.000147 seconds (81 allocations: 6.062 KiB)
  0.000254 seconds (119 allocations: 8.297 KiB)

This is not really your concern, just showing you what can happen even if Revise isn't installed (since I recently upgraded to 1.7).

I needed to do:

(@v1.7) pkg> add https://github.com/timholy/Revise.jl/

strangely (even on Julia 1.5):

(@v1.7) pkg> add Revise
[ Info: Resolving package identifier `Revise` as a directory at `~/Revise`.
ERROR: Did not find a git repository at `Revise`
chriselrod commented 3 years ago

The startup time can probably mostly be blamed on VectorizationBase.jl, one of VectorizedRNG.jl's dependencies:

chriselrod@chriselrod-XPS-13-9310 ~> jm --startup=no -e "@time @time using VectorizationBase"
  0.452552 seconds (1.01 M allocations: 70.875 MiB, 8.86% compilation time)
  0.485739 seconds (1.07 M allocations: 74.718 MiB, 8.25% compilation time)
chriselrod@chriselrod-XPS-13-9310 ~> jm --startup=no -e "@time @time using VectorizationBase"
  0.447231 seconds (1.01 M allocations: 70.875 MiB, 8.99% compilation time)
  0.480055 seconds (1.07 M allocations: 74.718 MiB, 8.37% compilation time)
chriselrod@chriselrod-XPS-13-9310 ~> jm --startup=no -e "@time @time using VectorizationBase"
  0.461119 seconds (1.01 M allocations: 70.875 MiB, 8.93% compilation time)
  0.493712 seconds (1.07 M allocations: 74.718 MiB, 8.34% compilation time)
chriselrod@chriselrod-XPS-13-9310 ~> jm --startup=no -e "@time @time using VectorizedRNG"
  0.492746 seconds (1.04 M allocations: 73.331 MiB, 9.30% compilation time)
  0.522391 seconds (1.10 M allocations: 77.174 MiB, 8.77% compilation time)
chriselrod@chriselrod-XPS-13-9310 ~> jm --startup=no -e "@time @time using VectorizedRNG"
  0.491345 seconds (1.04 M allocations: 73.331 MiB, 9.36% compilation time)
  0.523852 seconds (1.10 M allocations: 77.174 MiB, 8.78% compilation time)
chriselrod@chriselrod-XPS-13-9310 ~> jm --startup=no -e "@time @time using VectorizedRNG"
  0.510427 seconds (1.04 M allocations: 73.331 MiB, 9.01% compilation time)
  0.541200 seconds (1.10 M allocations: 77.174 MiB, 8.50% compilation time)

jm is an alias for julia master.

Also, as of Julia 1.5, you only need to using Revise in your startup.jl.

Also, using Revise and using OhMyREPL first:

chriselrod@chriselrod-XPS-13-9310 ~> jm --startup=no -e "using OhMyREPL; @time @time using VectorizedRNG"
  0.418813 seconds (923.11 k allocations: 65.085 MiB)
  0.449518 seconds (982.77 k allocations: 68.927 MiB)
chriselrod@chriselrod-XPS-13-9310 ~> jm --startup=no -e "using OhMyREPL; @time @time using VectorizedRNG"
  0.423458 seconds (923.11 k allocations: 65.087 MiB)
  0.453815 seconds (982.77 k allocations: 68.929 MiB)
chriselrod@chriselrod-XPS-13-9310 ~> jm --startup=no -e "using Revise; @time @time using VectorizedRNG"
  0.922822 seconds (1.51 M allocations: 99.688 MiB, 2.59% gc time)
  0.951921 seconds (1.57 M allocations: 103.530 MiB, 2.51% gc time)
chriselrod@chriselrod-XPS-13-9310 ~> jm --startup=no -e "using Revise; @time @time using VectorizedRNG"
  0.938457 seconds (1.51 M allocations: 99.688 MiB, 2.57% gc time)
  0.968043 seconds (1.57 M allocations: 103.530 MiB, 2.49% gc time)

I take it that some code loading code isn't precompiled, so loading another package first helps. But that Revise's code tracking methods must get compiled the first time you load a package? This is still a little slower:

chriselrod@chriselrod-XPS-13-9310 ~> jm --startup=no -e "using Revise; using OhMyREPL; @time @time using VectorizedRNG"
  0.670337 seconds (1.09 M allocations: 74.920 MiB)
  0.699811 seconds (1.15 M allocations: 78.761 MiB)
chriselrod@chriselrod-XPS-13-9310 ~> jm --startup=no -e "using Revise; using OhMyREPL; @time @time using VectorizedRNG"
  0.674070 seconds (1.09 M allocations: 74.920 MiB)
  0.704318 seconds (1.15 M allocations: 78.761 MiB)

There's a PR to VectorizationBase that'll merge fairly soon that has a fair bit of impact on the package's loading, I'll probably spend some time looking at package latency there. Both VectorizationBase and VectorizedRNG do non-trivail things in their inits, especially VectorizationBase in that PR, which unfortunately could hurt package loading times. But those changes will make it safe to precompile more code; the work being done are checks to see if it has to invalidate anything.

chriselrod commented 3 years ago

That PR of VectorizationBase has a definite latency regression:

chriselrod@chriselrod-XPS-13-9310 ~> jm --startup=no -e "@time @time using VectorizedRNG"
  0.702441 seconds (1.86 M allocations: 121.201 MiB, 1.73% gc time, 7.57% compilation time)
  0.731387 seconds (1.92 M allocations: 125.044 MiB, 1.67% gc time, 7.27% compilation time)
chriselrod@chriselrod-XPS-13-9310 ~> jm --startup=no -e "@time @time using VectorizedRNG"
  0.703449 seconds (1.86 M allocations: 121.201 MiB, 1.45% gc time, 6.43% compilation time)
  0.732682 seconds (1.92 M allocations: 125.044 MiB, 1.39% gc time, 6.17% compilation time)
chriselrod@chriselrod-XPS-13-9310 ~> jm --startup=no -e "@time @time using VectorizedRNG"
  0.706482 seconds (1.86 M allocations: 121.201 MiB, 1.42% gc time, 6.77% compilation time)
  0.736739 seconds (1.92 M allocations: 125.044 MiB, 1.36% gc time, 6.49% compilation time)
chriselrod@chriselrod-XPS-13-9310 ~> jm --startup=no -e "@time @time using VectorizationBase"
  0.687375 seconds (1.83 M allocations: 118.905 MiB, 1.44% gc time, 7.12% compilation time)
  0.716584 seconds (1.89 M allocations: 122.748 MiB, 1.38% gc time, 6.83% compilation time)
chriselrod@chriselrod-XPS-13-9310 ~> jm --startup=no -e "@time @time using VectorizationBase"
  0.675909 seconds (1.83 M allocations: 118.907 MiB, 1.45% gc time, 6.78% compilation time)
  0.703710 seconds (1.89 M allocations: 122.750 MiB, 1.40% gc time, 6.52% compilation time)
chriselrod@chriselrod-XPS-13-9310 ~> jm --startup=no -e "@time @time using VectorizationBase"
  0.663497 seconds (1.83 M allocations: 118.905 MiB, 1.52% gc time, 6.92% compilation time)
  0.691504 seconds (1.89 M allocations: 122.748 MiB, 1.46% gc time, 6.64% compilation time)

Much of that PR was about reducing the number of @generated functions, which would ideally help. I'll have to look at this further.

As an update after further modifications, the @time using VectorizationBase has regressed further, but the total time it takes to run its test suite has improved by more than 10 seconds.