JuliaAI / MLJBase.jl

Core functionality for the MLJ machine learning framework
MIT License

Slow machine construction for large number of features #428

Closed ablaom closed 3 years ago

ablaom commented 4 years ago

As reported by a Slack user, this code is taking way too long:

using MLJ
X = MLJ.table(randn(200,10000));
y = randn(200)
model = ConstantRegressor()
mach = machine(model, X, y);

ablaom commented 4 years ago

The slowdown is due to the constructor checking that X and y have the same number of rows. However, MLJBase's fallback for nrows(X) first converts X into a column table, and materialising a Tables.MatrixTable as a column table is doomed if there is a large number of columns. The last line below hangs:

using Tables

X = Tables.table(randn(200,10000));
col_table = Tables.columntable(X);

Is this a Tables issue? No, not really. The problem is that Julia named tuples with a large number of keys take forever to construct, and I guess they are a bad idea anyway (I think large tuples are generally a bad idea in Julia). So, for example, this hangs:

names = tuple((Symbol(string("Column", j)) for j in 1:10000)...);
values = rand(10000);
NamedTuple{names}(values);

MLJBase uses column tables not just in nrows but also in the fallbacks for selectrows (essential to resampling) and selectcols. So just fixing the nrows implementation is not enough. We have a wider issue here.
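
For concreteness, generic Tables-based fallbacks of this shape look roughly like the following. This is a hypothetical sketch, not MLJBase's actual source; the point is that every call pays the cost of materialising the column table:

using Tables

# Illustration only (not MLJBase's actual code). Both helpers materialise a
# full column table, which hangs for tables with ~10^4 columns:
slow_nrows(X) = length(first(Tables.columntable(X)))

function slow_selectrows(X, r::AbstractVector{<:Integer})
    cols = Tables.columntable(X)                               # huge NamedTuple
    NamedTuple{keys(cols)}(map(col -> col[r], values(cols)))   # and another one
end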

See also: #309

Note that in the Slack user's use case the data is not sparse. In any case, for dimension reduction models (e.g., PCA) we would not want to restrict to sparse tables.

cc @OkonSamuel

OkonSamuel commented 4 years ago

@ablaom. Yes, this is a big issue. Maybe for now we could choose a table type, e.g. DataFrames or MatrixTable, and make it more efficient for high-dimensional data, then give efficient definitions for nrows, selectrows and selectcols. We could then state somewhere in the docs that users should use the selected table type for high-dimensional data?

ablaom commented 4 years ago

Sounds like a good workaround for now, thanks!

nignatiadis commented 3 years ago

FWIW, I have been using the following snippet (based on @OkonSamuel's suggestion) to run things locally as a temporary workaround (selectcols not implemented).

using Tables
import MLJModelInterface
const MMI = MLJModelInterface

MMI.nrows(X::Tables.MatrixTable) = size(MMI.matrix(X), 1)
MMI.selectrows(X::Tables.MatrixTable, ::Colon) = X
MMI.selectrows(X::Tables.MatrixTable, r::Integer) =
    MMI.selectrows(X, r:r)
function MMI.selectrows(X::Tables.MatrixTable, r)
    new_matrix = MMI.matrix(X)[r, :]
    _names = getfield(X, :names)
    MMI.table(new_matrix; names = _names)
end
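
A matching selectcols definition in the same spirit might look like the following. This is an untested sketch covering only selection of multiple columns by integer index; a full implementation would also need to handle Symbol names and single-column selection, which MMI.selectcols supports:

# Untested sketch, same pattern as above, integer-index column selection only:
MMI.selectcols(X::Tables.MatrixTable, ::Colon) = X
function MMI.selectcols(X::Tables.MatrixTable, c::AbstractVector{<:Integer})
    new_matrix = MMI.matrix(X)[:, c]
    _names = getfield(X, :names)[c]
    MMI.table(new_matrix; names = _names)
end
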
jbrea commented 3 years ago

Arghh, I just ran into this. For a not-so-large DataFrame with fewer rows than columns (708×4870), Julia seemed to hang whenever I ran machine(model, X, y), until I killed the session with Ctrl+C (pretty annoying to debug :)).

Is there anything speaking against the workaround mentioned above?

ablaom commented 3 years ago

@jbrea Thanks for reporting!

I suspect a different cause here, because you are using a DataFrame (for which the aforementioned methods are already overloaded), not a wrapped matrix. Can you report scitype(X) and scitype(y)? If scitype(X) is hanging and your dataset is large, this is likely because you have columns of some unusual eltype. In that case, what are the eltypes of your columns? What does MLJ.MLJBase.Tables.schema(X) return?
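
For reference, the checks I mean are just the following (assuming X and y are bound to your data, as in your own code):

using MLJ

scitype(X)                      # Table{AbstractVector{Continuous}} for all-Float64 columns
scitype(y)
MLJ.MLJBase.Tables.schema(X)    # shows each column's name and eltype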

jbrea commented 3 years ago

Thanks for your response.

The data is all Float64, scitype(X) == Table{AbstractVector{Continuous}}, scitype(y) == AbstractVector{Continuous}. What hangs is nrows(X) = nrows(get_interface_mode(), vtrait(X), X) with MLJModelInterface.vtrait(X) == Val{:table}(), because of Tables.columntable(X).

Why not simply define MMI.nrows(X::DataFrame) = DataFrames.nrow(X)?

OkonSamuel commented 3 years ago

> Arghh, I just ran into this. For a not-so-large DataFrame with fewer rows than columns (708×4870), Julia seemed to hang whenever I ran machine(model, X, y), until I killed the session with Ctrl+C (pretty annoying to debug :)).
>
> Is there anything speaking against the workaround mentioned above?

Your code is slow because MLJ's machine constructor calls MMI.nrows(), which is currently inefficient for tables with a large number of columns. I guess I'll open a PR to fix this soon.
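
The rough idea of such a fix (a sketch only, not necessarily the exact code that will land in the PR) is to count rows via the Tables.jl column accessors, touching a single column instead of materialising the whole column table:

using Tables

# Sketch of a row count that avoids building a full column table. For
# MatrixTable and DataFrame, Tables.columns involves no copying:
function fast_nrows(X)
    cols = Tables.columns(X)
    names = Tables.columnnames(cols)
    isempty(names) && return 0
    return length(Tables.getcolumn(cols, first(names)))
end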

ablaom commented 3 years ago

Thanks to @OkonSamuel, the following addresses the nrows issue. @jbrea Can you confirm your own issue is resolved?

https://github.com/JuliaRegistries/General/pull/40183

jbrea commented 3 years ago

Yes, my issue is resolved. Thanks a lot for the quick fix!