Closed: ablaom closed this issue 3 years ago.
The slowdown is due to the fact that the constructor checks that `y` and `X` have the same number of rows. However, MLJBase's implementation of `rows`, used to get `nrows(X)`, first turns `X` into a column table, and materialising a `Tables.MatrixTable` as a column table is doomed if there is a large number of columns. The last line below hangs:
```julia
using Tables
X = Tables.table(randn(200, 10_000));   # 200 rows, 10_000 columns
col_table = Tables.columntable(X);      # hangs: builds a 10_000-key NamedTuple
```
Is this a Tables issue? No, not really. The problem is that Julia named tuples with a large number of keys take forever to construct and are, I guess, a bad idea anyway (I think large tuples are generally a bad idea in Julia). So, for example, this hangs:
```julia
names = tuple((Symbol(string("Column", j)) for j in 1:10_000)...);
values = rand(10_000);
NamedTuple{names}(values);   # hangs: a tuple type with 10_000 parameters
```
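By contrast, keeping the names in an ordinary `Vector{Symbol}` alongside a plain matrix costs essentially nothing, because no tuple type with 10,000 parameters ever has to be compiled. A minimal Base-only sketch (the `col_index` dictionary is just for illustration of how column lookup could work without a named tuple):

```julia
# Column names as a Vector{Symbol}: constructed instantly, unlike a
# 10_000-key NamedTuple, because no huge tuple type is compiled.
names = [Symbol("Column", j) for j in 1:10_000]

# The data stays a plain matrix: 200 rows, 10_000 columns.
values = rand(200, 10_000)

# An O(1) name-to-index lookup, as column-table semantics would need.
col_index = Dict(n => j for (j, n) in enumerate(names))

col = values[:, col_index[:Column17]]   # fetch a single column
```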
MLJBase uses column tables not just in `nrows` but also in the fallbacks for `selectrows` (essential to resampling) and `selectcols`. So just fixing the `nrows` implementation is not enough. We have a wider issue here.
See also: #309
Note that in the Slack user's use-case the data is not sparse. In any case, for dimension reduction models (e.g., PCA) we would not want to restrict to sparse tables.
cc @OkonSamuel
@ablaom Yes, this is a big issue. Maybe for now we could choose a table type, e.g. `DataFrame` or `MatrixTable`, and make it more efficient for high-dimensional data, by giving efficient definitions for `nrows`, `selectrows`, and `selectcols`. Then state somewhere in the docs that users should use the selected table type for high-dimensional data?
Sounds like a good workaround for now, thanks!
FWIW I have been using the following snippet ( based on @OkonSamuel's suggestion) to run things locally as a temporary workaround (selectcols
not implemented).
using Tables
import MLJModelInterface
const MMI = MLJModelInterface
MMI.nrows(X::Tables.MatrixTable) = size(MMI.matrix(X), 1)
MMI.selectrows(X::Tables.MatrixTable, ::Colon) = X
MMI.selectrows(X::Tables.MatrixTable, r::Integer) =
MMI.selectrows(X::Tables.MatrixTable, r:r)
function MMI.selectrows(X::Tables.MatrixTable, r)
new_matrix = MMI.matrix(X)[r, :]
_names = getfield(X, :names)
MMI.table(new_matrix; names = _names)
end
Arghh, I just ran into this. For a not so large DataFrame with less rows than columns (708×4870) julia felt like hanging whenever I ran machine(model, X, y)
until I killed the session with Ctrl+C (pretty annoying to debug :)).
Is there anything speaking against the workaround mentioned above?
@jbrea Thanks for reporting!
I suspect a different cause here because you are using a DataFrame
(for which the aforementioned methods are already overloaded) not a wrapped matrix. Can you report scitype(X)
and scitype(y)
? If scitype(X)
is hanging, and your dataset is large, this is likely because you have columns of some unusual eltype. In that case, what are the eltypes of your columns? What does MLJ.MLJBase.Tables.schema(X)
return?
Thanks for your response.
The data is all Float64
, scitype(X) == Table{AbstractVector{Continuous}}
, scitype(y) == AbstractVector{Continuous}
.
What hangs is nrows(X) = nrows(get_interface_mode(), vtrait(X), X)
with MLJModelInterface.vtrait(X) == Val{:table}()
, because of Tables.columntable(X)
.
Why isn't just MMI.nrows(X::DataFrame) = DataFrames.nrow(X)
?
Arghh, I just ran into this. For a not so large DataFrame with less rows than columns (708×4870) julia felt like hanging whenever I ran
machine(model, X, y)
until I killed the session with Ctrl+C (pretty annoying to debug :)).Is there anything speaking against the workaround mentioned above?
Your slow code is because MLJ's machine
constructor calls MMI.nrows()
which is currently inefficient for tables with large number of columns.
I guess I'll open a PR to fix this soon.
Thanks to @OkonSamuel, the following addresses the nrows
issue. @jbrea Can you confirm your own issue is resolved?
Yes, my issue is resolved. Thanks a lot for the quick fix!
As reported by slack user, this code is taking way too long: