evanfields opened 3 years ago
I can reproduce. Here are my timings:
```julia
julia> using GLM, DataFrames

julia> df = DataFrame(rand(100, 100));

julia> function f(df, a, b, c)
           reg_form = Term(a) ~ Term(b) + Term(c)
           return r2(lm(reg_form, df))
       end
f (generic function with 1 method)

julia> @time f(df, :x1, :x2, :x3)
  7.844389 seconds (22.20 M allocations: 1.386 GiB, 5.80% gc time)
0.002045333822805362

julia> @time f(df, :x1, :x2, :x3)
  0.000191 seconds (235 allocations: 38.891 KiB)
0.002045333822805362

julia> @time f(df, :x1, :x2, :x4)
  0.167007 seconds (163.14 k allocations: 9.977 MiB, 98.56% compilation time)
0.0015202494005057687

julia> @time f(df, :x1, :x2, :x4)
  0.000167 seconds (235 allocations: 38.891 KiB)
0.0015202494005057687

julia> @time f(df, :x1, :x2, :x5)
  0.192700 seconds (163.12 k allocations: 9.974 MiB, 8.88% gc time, 98.73% compilation time)
0.041061833597036856
```
I suspect this has to do with the `modelcols` or `ModelMatrix` methods specializing on the data `NamedTuple` (where the names are type parameters). Currently, we implement generic Tables.jl support by coercing the input data to a `NamedTuple` of vectors before doing anything with it. I wonder whether there's some kind of alternative strategy which would a) avoid the conversion and b) not take such a big compilation hit. Something like `Tables.columns`.
I think one roadblock for just using `getcolumn` or `columns` everywhere is that we're also relying on the `NamedTuple` type in order to dispatch (and, I suspect, avoid method ambiguities) AND to special-case handling a single row vs. an entire table (e.g. for interaction terms). But we could get around that with some kind of internal wrapper types (or maybe Tables.jl provides something for this?)
Actually, I think using the `Tables.Columns` and `Tables.Row` wrappers would work just fine. They support everything that the `NamedTuple` does, are (IIUC) lazy, and also provide dispatch targets.
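To make the wrapper idea concrete, here's a minimal sketch (the `colpairs` helper is hypothetical, not StatsModels API): wrapping any Tables.jl source in `Tables.Columns` gives lazy column access plus a concrete type to dispatch on, instead of a `NamedTuple` whose names are type parameters.

```julia
using Tables

# A NamedTuple of vectors is itself a valid column table.
tbl = (x1 = [1.0, 2.0], x2 = [3.0, 4.0])

# Lazy wrapper: no data is copied, columns are fetched on demand.
cols = Tables.Columns(tbl)

# Hypothetical helper dispatching on the wrapper type.
colpairs(cols::Tables.Columns) =
    [nm => Tables.getcolumn(cols, nm) for nm in Tables.columnnames(cols)]

colpairs(cols)  # pairs of column name and column vector
```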
I've played around with this a bit more and I can't reproduce it using just `apply_schema` and `modelcols`. I suspected that because `modelcols` takes the `NamedTuple` of the data as one argument it would specialize and trigger re-compilation, but that doesn't seem to be the case. Here's what I tried:
```julia
julia> function g(df, a, b, c)
           reg_form = Term(a) ~ Term(b) + Term(c)
           return apply_schema(reg_form, schema(reg_form, df), RegressionModel)
       end
g (generic function with 1 method)

julia> @time g(df, :x1, :x2, :x5);
  0.122910 seconds (392.04 k allocations: 23.309 MiB, 99.87% compilation time)

julia> @time g(df, :x1, :x2, :x5);
  0.000068 seconds (128 allocations: 16.672 KiB)

julia> @time g(df, :x1, :x2, :x6);
  0.000065 seconds (128 allocations: 16.672 KiB)

julia> h(df, args...) = modelcols(g(df, args...), df)
h (generic function with 1 method)

julia> @time h(df, :x1, :x2, :x5);
  0.190115 seconds (586.57 k allocations: 35.274 MiB, 99.93% compilation time)

julia> @time h(df, :x1, :x2, :x5);
  0.000084 seconds (159 allocations: 30.922 KiB)

julia> @time h(df, :x1, :x2, :x6);
  0.000088 seconds (159 allocations: 30.922 KiB)
```
Even the first run with a new formula is fast after any formula with that structure has been compiled once.
So I suspect it has something to do with the `ModelMatrix` or `ModelFrame` wrappers...
Using a type that does not need to be specialized over and over again would be awesome! Or maybe use `@nospecialize` everywhere.
Yeah, it's strange... I'd figured that any specialization would hit those paths too, but it doesn't seem like it. I'll have to dig into where the specialization is taking place (or someone will ;)
Unfortunately Tables.Columns has a type parameter for the wrapped table type so I don't think it'll solve the problem in all cases, although it may help with sources that don't have structural information like column names/types in the type.
Yes, but that's actually perfect, no? If I pass a `DataFrame` then it won't specialize, whereas if I pass a `ColumnTable` it will specialize — that's to be expected.
Btw, I think the slowdown comes from `missing_omit`, which creates a new `NamedTuple` type depending on the variables in the formula.
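A minimal illustration of that effect, assuming `missing_omit` subsets the data to the formula's variables: each variable set produces a distinct `NamedTuple` type, so every method downstream of it recompiles even for structurally identical formulas.

```julia
# Column-table data as StatsModels sees it internally.
nt = (x1 = [1.0], x2 = [2.0], x3 = [3.0])

# Subsetting to the variables of two structurally identical formulas
# yields two *different* types, because the names are type parameters.
typeof(NamedTuple{(:x1, :x2)}(nt)) == typeof(NamedTuple{(:x1, :x3)}(nt))  # false
```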
Ahhh, that's interesting then, and would explain why I'm not hitting it in the tests above. Maybe the specialization was a red herring. I wonder if there's a generic-tables-compatible way of doing `missing_omit`...
You can do `TableOperations.filter`, maybe.
There is also `skipmissings` to identify all the observations that are missing.
I think it’s still about specialization — it’s just that everything after `missing_omit` is respecialized to the new dataset. Yes, I think the way forward would be to write a `missing_omit` that takes a `Tables.Columns` and creates a `Tables.Columns` if possible.
FWIW, I think it's likely that `@nospecialize` will help in this scenario.
```julia
julia> using Statistics  # for `mean`

julia> namedtuples = map(1:50) do _
           names = rand('a':'z', 10)
           v = [Symbol(n) => rand(10) for n in names]
           (; v...)
       end;

julia> function foo(t)
           nms = collect(keys(t))
           means = map(mean, collect(values(t)))
           return nms .=> means
       end;

julia> @time foo(namedtuples[1]);
  0.093913 seconds (214.92 k allocations: 12.716 MiB, 99.96% compilation time)

julia> @time foo(namedtuples[1]);
  0.000012 seconds (4 allocations: 720 bytes)

julia> @time foo(namedtuples[2]);
  0.061658 seconds (160.72 k allocations: 9.307 MiB, 99.95% compilation time)

julia> @time foo(namedtuples[2]);
  0.000013 seconds (4 allocations: 704 bytes)

julia> function bar(@nospecialize t)
           nms = collect(keys(t))
           means = map(mean, collect(values(t)))
           return nms .=> means
       end;

julia> @time bar(namedtuples[11]);
  0.034917 seconds (29.15 k allocations: 1.985 MiB, 99.64% compilation time)

julia> @time bar(namedtuples[11]);
  0.000042 seconds (8 allocations: 896 bytes)

julia> @time bar(namedtuples[12]);
  0.009303 seconds (3.31 k allocations: 212.256 KiB, 98.54% compilation time)

julia> @time bar(namedtuples[12]);
  0.000046 seconds (8 allocations: 800 bytes)
```
Since Tables 1.6 (https://github.com/JuliaData/Tables.jl/releases/tag/v1.6.0), `Columns` will actually reliably return a `Columns` object, so we could use that for dispatch. I started playing around with that in #247, but there are some design issues to work out (and I ran into the fact that `Columns` was a lie, which is now fixed).
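A hedged sketch of the dispatch pattern this enables (the `nobs` helper is made up for illustration): separate methods for whole-table vs. single-row input, keyed on Tables.jl's own wrapper types rather than on `NamedTuple`.

```julia
using Tables

# Hypothetical helpers: one method for whole-table (columns) input,
# one for a single row, dispatching on Tables.jl types.
nobs(cols::Tables.Columns) = length(Tables.getcolumn(cols, 1))
nobs(row::Tables.AbstractRow) = 1

cols = Tables.Columns((x1 = [1.0, 2.0, 3.0], x2 = [4.0, 5.0, 6.0]))
nobs(cols)                      # whole table: number of observations
nobs(first(Tables.rows(cols)))  # single row
```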
Calling `StatsModels.fit` with a not-yet-seen formula seems to trigger pretty slow compilation, even if a structurally equivalent formula with different names has been seen before. Triggering `fit` with a formula which has been seen before is very fast. The reproducing example below uses `GLM` and `DataFrames`, and closely mimics how I stumbled upon this issue in the wild. I'm not familiar with the StatsModels/GLM internals, but if this example isn't minimal enough I can try to drill down.