Performances 0.5 vs 0.6

matthieugomez commented 5 years ago

On my computer, StatsModels v0.6.0 is twice slower than v0.5.0 to construct a ModelMatrix.

using DataFrames, StatsModels
N = 10000000
df = DataFrame(y = randn(N), x1 = randn(N), x2 = randn(N))
@time ModelMatrix(ModelFrame(@formula(y~x1 + x2), df))
#> 0.365609 seconds (930 allocations: 840.530 MiB, 12.85% gc time)

and in v0.6

using DataFrames, StatsModels
N = 10000000
df = DataFrame(y = randn(N), x1 = randn(N), x2 = randn(N))
@time ModelMatrix(ModelFrame(@formula(y~x1 + x2), df))
#>  0.838185 seconds (278 allocations: 764.150 MiB, 43.21% gc time)

kleinschmidt commented 5 years ago

Interesting, I can't reproduce this on my laptop (I get about 1.2 for each of those if I use a NamedTuple instead of a DataFrame for 0.6, or ~1.3-4 (depending on the run) if I use a DataFrame). I thought you might be getting some compile time but the first run is actually even faster for 0.6 so I'm not sure what's driving this.

That being said, I don't doubt that there's a possible performance hit with 0.6. Could you try profiling to see where the slowdown is happening? And has performance become a problem in real use-cases for you?

matthieugomez commented 5 years ago

Environment with the best performances

(Try) pkg> status
    Status `~/FormerDataFrames/Project.toml`
  [a93c6f00] DataFrames v0.17.1 ⚲
  [3eaba693] StatsModels v0.5.0 ⚲

Contrary to what I said in my previous posts, I actually use both an old version of DataFrames and an old version of StatsModels. Output of @profile is here: https://gist.github.com/matthieugomez/5a38fa67716c21117466d1777fd60a93

My current Environment has:

  [3eaba693] StatsModels v0.6.3
  [a93c6f00] DataFrames v0.19.2

Output of @profile is here: https://gist.github.com/matthieugomez/4025512d58ff1798ddda9c976d019ecc

It is not a big deal, but it means we're now slower than Stata for linear models.

kleinschmidt commented 4 years ago

I'm looking at this again now; based on those profile outputs it seems like we're spending a huge amount of time in computing the schema. I'm wondering if there might be some difference between our systems that would lead to different performance of that code (which for continuous variables is just calling StatsBase.mean_and_var on the data column).

I'm also wondering, in a broader sense, whether it is really necessary to compute schemas by default for continuous terms. One (admittedly extreme) possibility would be to replace ContinuousTerms with just Terms which would just pull the underlying column out of the data and slap it on the model matrix (or multiply for an interaction term etc.). My original motivation for the current design is that it would be useful for doing things like spline or polynomial regression but I'm starting to think that it's 1) not providing enough information for all those applications (e.g. order-n splines need to know the n-quantiles of the data and we're not going to compute that for every reasonable choice of n) and 2) maybe causing more of a performance penalty than it's worth in some cases. The alternative would be to allow extensions to hook into the schema computation stage too, either by promoting function call terms to specialized terms before schema is called (e.g., #154) or by doing some kind of defensive schema extraction for all possible interpretations of a call (which seems like a bad idea the more I think about it).

kleinschmidt commented 4 years ago

Could you also benchmark the schema calls, too? Here's what I'm seeing:

On 0.6.11:

Total time is around 650 ms:

julia> f = @formula(y ~ x1 + x2);

julia> @benchmark  ModelMatrix(ModelFrame($f, $df))
BenchmarkTools.Trial:
  memory estimate:  764.15 MiB
  allocs estimate:  226
  --------------
  minimum time:     654.064 ms (1.16% GC)
  median time:      667.347 ms (6.52% GC)
  mean time:        703.305 ms (9.25% GC)
  maximum time:     856.611 ms (20.11% GC)
  --------------
  samples:          8
  evals/sample:     1

Just doing the schema-related bits (analogous to the old ModelFrame call) is almost a third of that time:

julia> @benchmark(apply_schema($f, schema($f, $df)))
BenchmarkTools.Trial:
  memory estimate:  4.75 KiB
  allocs estimate:  69
  --------------
  minimum time:     217.056 ms (0.00% GC)
  median time:      223.090 ms (0.00% GC)
  mean time:        226.688 ms (0.00% GC)
  maximum time:     252.471 ms (0.00% GC)
  --------------
  samples:          23
  evals/sample:     1

...and if we call ModelFrame directly it's even worse for reasons that aren't totally clear to me at the moment:

julia> @benchmark ModelFrame($f, $df)
BenchmarkTools.Trial:
  memory estimate:  306.38 MiB
  allocs estimate:  201
  --------------
  minimum time:     391.166 ms (0.00% GC)
  median time:      426.175 ms (4.06% GC)
  mean time:        444.168 ms (7.34% GC)
  maximum time:     566.443 ms (22.87% GC)
  --------------
  samples:          12
  evals/sample:     1

Once we've computed and applied the schema generating the model matrix is pretty fast:

julia> f2 = apply_schema(f, schema(f, df));
FormulaTerm
Response:
  y(continuous)
Predictors:
  x1(continuous)
  x2(continuous)

julia> @benchmark modelcols($f2, $df)
BenchmarkTools.Trial:
  memory estimate:  381.47 MiB
  allocs estimate:  26
  --------------
  minimum time:     207.257 ms (10.83% GC)
  median time:      219.150 ms (10.25% GC)
  mean time:        229.108 ms (14.30% GC)
  maximum time:     342.800 ms (42.68% GC)
  --------------
  samples:          22
  evals/sample:     1

Old version (0.5.0)

julia> @benchmark ModelMatrix(ModelFrame($f, $df))
BenchmarkTools.Trial:
  memory estimate:  840.55 MiB
  allocs estimate:  1219
  --------------
  minimum time:     475.430 ms (5.61% GC)
  median time:      523.050 ms (7.66% GC)
  mean time:        529.259 ms (9.64% GC)
  maximum time:     615.873 ms (16.58% GC)
  --------------
  samples:          10
  evals/sample:     1

julia> @benchmark ModelFrame($f, $df)
BenchmarkTools.Trial:
  memory estimate:  535.34 MiB
  allocs estimate:  910
  --------------
  minimum time:     337.232 ms (0.00% GC)
  median time:      382.772 ms (7.43% GC)
  mean time:        396.151 ms (10.89% GC)
  maximum time:     547.141 ms (31.57% GC)
  --------------
  samples:          13
  evals/sample:     1

julia> @benchmark ModelMatrix($(ModelFrame(f, df)))
BenchmarkTools.Trial:
  memory estimate:  305.20 MiB
  allocs estimate:  309
  --------------
  minimum time:     126.003 ms (0.00% GC)
  median time:      142.834 ms (0.00% GC)
  mean time:        154.821 ms (9.12% GC)
  maximum time:     203.342 ms (19.31% GC)
  --------------
  samples:          33
  evals/sample:     1

What's interesting to me is that the ModelMatrix bit is still faster, but the ModelFrame takes FOREVER, again, for reasons that aren't really clear to me since it's basically just copying the dataframe and parsing the terms so it should be pretty fast.

Overall suggests to me that there is a fair amount of performance for these big 1M+ row data sets left on the table (just from not computing more schemas than we need to).

matthieugomez commented 4 years ago

I'm on the same computer anymore. On the most recent version of StatsModels v0.6.11 DataFrames v0.20.2

julia> @benchmark  ModelMatrix(ModelFrame($f, $df))
BenchmarkTools.Trial: 
  memory estimate:  764.15 MiB
  allocs estimate:  226
  --------------
  minimum time:     467.644 ms (1.08% GC)
  median time:      478.772 ms (7.36% GC)
  mean time:        490.981 ms (9.07% GC)
  maximum time:     571.865 ms (20.67% GC)
  --------------
  samples:          11
  evals/sample:     1

julia> @benchmark(apply_schema($f, schema($f, $df)))
BenchmarkTools.Trial: 
  memory estimate:  4.75 KiB
  allocs estimate:  69
  --------------
  minimum time:     206.697 ms (0.00% GC)
  median time:      211.541 ms (0.00% GC)
  mean time:        214.904 ms (0.00% GC)
  maximum time:     242.207 ms (0.00% GC)
  --------------
  samples:          24
  evals/sample:     1

julia> @benchmark ModelFrame($f, $df)
BenchmarkTools.Trial: 
  memory estimate:  306.38 MiB
  allocs estimate:  201
  --------------
  minimum time:     291.338 ms (0.00% GC)
  median time:      316.318 ms (4.68% GC)
  mean time:        321.147 ms (6.66% GC)
  maximum time:     383.726 ms (22.36% GC)
  --------------
  samples:          16
  evals/sample:     1

old version

julia> status
Status `~/test/Project.toml`
  [6e4b80f9] BenchmarkTools v0.5.0
  [a93c6f00] DataFrames v0.16.0
  [3eaba693] StatsModels v0.5.0
julia> @benchmark ModelMatrix(ModelFrame($f, $df))
BenchmarkTools.Trial: 
  memory estimate:  535.27 MiB
  allocs estimate:  323
  --------------
  minimum time:     208.522 ms (0.00% GC)
  median time:      241.222 ms (11.09% GC)
  mean time:        247.337 ms (12.72% GC)
  maximum time:     325.864 ms (32.72% GC)
  --------------
  samples:          21
  evals/sample:     1

julia> @benchmark ModelFrame($f, $df)
BenchmarkTools.Trial: 
  memory estimate:  230.09 MiB
  allocs estimate:  209
  --------------
  minimum time:     86.592 ms (0.00% GC)
  median time:      101.071 ms (10.76% GC)
  mean time:        104.532 ms (12.39% GC)
  maximum time:     193.014 ms (37.13% GC)
  --------------
  samples:          48
  evals/sample:     1

julia> @benchmark ModelMatrix($(ModelFrame(f, df)))
BenchmarkTools.Trial: 
  memory estimate:  305.18 MiB
  allocs estimate:  114
  --------------
  minimum time:     124.514 ms (0.00% GC)
  median time:      131.682 ms (1.86% GC)
  mean time:        148.649 ms (13.42% GC)
  maximum time:     236.371 ms (42.83% GC)
  --------------
  samples:          34
  evals/sample:     1

JuliaStats / StatsModels.jl

Performances 0.5 vs 0.6 #141

On 0.6.11:

Old version (0.5.0)