OkonSamuel opened 4 years ago
re (2) I raised that a while back, but I seem to remember that a `MatrixTable` can be formed without copies — I'm not sure... What's for sure is that in some cases (when you have to materialise the transpose, e.g. for GLM stuff) there's definitely a copy.
@tlienart, what do you think? Since we are still going to call `matrix(X)` in the inner methods, why don't we in some cases call `matrix(X)` earlier, so we can use `Xmatrix[inds, :]` many times instead of having to call both `selectrows` and `matrix` many times? Also, `Xmatrix[inds, :]` should be faster than `selectrows`. It's just a suggestion though.
In addition, I noticed that internally all tables are converted to dense matrices. Are there not cases where we would like to have a sparse matrix? (Maybe if something like a sparse table exists?)
> In addition i noticed that internally all tables are converted to dense matrices.

What code are you referring to here?
I think Anthony will have to chip in to give more clarity here, but in terms of the sparse case: I believe I tried that when we were working on LightGBM; you can have a Table wrapping around a sparse matrix and recover said sparse matrix (but I'm not certain).
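Something along these lines, if memory serves (not certain this avoids densification everywhere):

```julia
using SparseArrays, Tables

S = sprand(100, 20, 0.05)  # sparse feature matrix
t = Tables.table(S)        # table interface wrapping S, without densifying it
Tables.getcolumn(t, 1)     # columns are (views of) the sparse data
```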
> what code are you referring to here?

The code at MLJModels.
> you can have a Table wrapping around a sparse matrix and recover said sparse matrix

That's nice, if it exists. We need to be specific here, because there's a bunch of different things happening in MLJModels:

* `Xarray = MMI.matrix(X, transpose=true)`: this, as far as I'm aware, does a full copy (bad), but it is required by some shitty algorithms which require p×n data; we can't do anything about that.
* `Xarray = MMI.matrix(X)` or `Xarray = transpose(MMI.matrix(X))`: this potentially does not do a full copy; `MMI.matrix(X)` is (I think) just a view of the data (someone needs to confirm this).

Assuming I'm correct in the above, then if the table wraps around a sparse matrix, `Xarray` remains that sparse matrix. Note that very few available algorithms (at this point) can handle sparse data.
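Concretely, the distinction being drawn, assuming a plain column table `X` (whether the first call copies is exactly the open question):

```julia
using MLJModelInterface
const MMI = MLJModelInterface

X = (x1 = rand(100), x2 = rand(100))  # a column table

A = MMI.matrix(X)                   # n×p array (copy or view: to be confirmed)
B = MMI.matrix(X, transpose=true)   # p×n, fully materialised: definitely a copy
C = transpose(MMI.matrix(X))        # p×n, but only a lazy Transpose wrapper
```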
*The solution in this comment is superseded by #492.*
Let me address the issue of re-converting tables into matrices every time the rows to be sampled change, which relates to #151. I do not see a non-breaking way to address this, but agree it is worthwhile investigating the benefits of a fix.
Here's a suggestion for a change that's probably the easiest to implement for purposes of investigation. There may be other (essentially equivalent) ways that are friendlier to the implementer, but I suggest we hash that out later.
Currently a model implementation plays no role in sampling the data. Row selection is performed in the `fit!` method (applied to a machine) through the Tables interface. Instead, we give the responsibility over to the implementation by extending the signatures of `MLJModelInterface.fit` and `MLJModelInterface.update` to include `rows` and `old_rows`:
```julia
fit(model::SomeModelType, verbosity, rows, data...) -> fitresult, cache, report

update(model::SomeModelType, verbosity, rows, old_rows,
       old_fitresult, old_cache, old_report, data...) -> fitresult, cache, report
```
An implementation can already cache the matrix version of `data`, by returning this in `cache`. But now, assuming `model` has not changed (same hyper-parameters), `fit!` calls `update`, which can sample the rows using the cached matrix (or construct a `view`, if the algorithm being interfaced supports this) rather than re-converting the table. (Previously, `fit!` would do the sampling and call `fit`, not `update`, when `rows` changed.)

Note that we also pass `old_rows` to `update` because if `rows == old_rows` there is no need to repeat the calculation at all - just return `old_fitresult, old_cache, old_report` (which also explains the addition of `old_report` to the signature of `update`).
To evaluate the benefit of this change we could fork MLJModelInterface, MLJBase and, say, EvoTrees, and repeat some of @jeremiedb's benchmarks, adding some that do a lot of resampling (repeated CV). Just note that `update` is already performing the role of iteration updates there.
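For example, a workload along these lines (hypothetical `model`, `X`, `y`) would stress the row-resampling path:

```julia
using MLJ

# repeated cross-validation re-samples the rows many times per call:
evaluate(model, X, y,
         resampling=CV(nfolds=6, shuffle=true),
         repeats=20,
         measure=rms)
```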
Here's a mock-up of how an implementation would look:
```julia
# least-squares coefficients from views of the data:
solve(Xview::SubArray, yview::SubArray) = Xview \ yview

function MLJModelInterface.fit(model::MyDeterministicRegressor, verbosity, rows, X, y)
    Xmatrix = Tables.matrix(X)
    Xview = view(Xmatrix, rows, :)
    yview = view(y, rows)
    fitresult = solve(Xview, yview)   # coefficients
    report = ...
    cache = Xmatrix                   # cache the converted table
    return fitresult, cache, report
end

function MLJModelInterface.update(model::MyDeterministicRegressor, verbosity,
        rows, old_rows, old_fitresult, old_cache, old_report, X, y)
    # nothing to do if the sample is unchanged:
    rows == old_rows && return old_fitresult, old_cache, old_report
    Xmatrix = old_cache               # reuse the cached matrix; no re-conversion
    Xview = view(Xmatrix, rows, :)
    yview = view(y, rows)
    fitresult = solve(Xview, yview)
    report = ...
    return fitresult, old_cache, report
end
```
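And a hypothetical driver, mimicking what `fit!` on a machine would do with these methods:

```julia
model = MyDeterministicRegressor()

# first call: full fit on the first 80 rows
fitresult, cache, report = MLJModelInterface.fit(model, 1, 1:80, X, y)

# rows changed, model unchanged: update re-samples from the cached matrix
fitresult, cache, report =
    MLJModelInterface.update(model, 1, 81:100, 1:80,
                             fitresult, cache, report, X, y)
```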
> An implementation can already cache the matrix version of `data`, by returning this in `cache`. But now, assuming `model` has not changed (same hyper-parameters), `fit!` calls `update`, which can sample the rows using the cached matrix (or construct a `view`, if the algorithm being interfaced supports this) rather than re-converting the table. (Previously, `fit!` would do the sampling and call `fit`, not `update`, when `rows` changed.)
I think this might lead to unexpected behavior if the user is using a mutable table type (e.g. `DataFrame`).
I guess we could have two versions of the `fit!` function: `fit_!`, used only by MLJ (which might have the implementation you suggested above), and `fit!`, exposed to users?

@OkonSamuel Thanks for looking deeper into this API. Looks like you are making an important observation here.
> I think this might lead to unexpected behavior if the user is using a mutable table type (e.g. `DataFrame`).
Are you referring to the fact that a user might mutate data bound to a machine in between a call to `fit` (dispatched by their first call to `fit!`) and a call to `update` (dispatched by a subsequent call to `fit!` on the same machine)? If so, this is potentially an issue already with the existing API, right? I admit that I had not imagined users doing this kind of thing. We could warn users in documentation that `fit!` should be called with `force=true` if data is mutated. Otherwise, we would have to do something more drastic, such as arranging for the machine constructor to make a deep copy of the data provided to the constructor by default (with an option not to copy, e.g. with `copy=false`). Mmm. We might need to automatically detect out-of-memory tables, like CSV files, that we would not want to copy by default, etc.

What are your thoughts? Or maybe I missed the point of your question?
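To make the scenario concrete (with a hypothetical `SomeRegressor`):

```julia
using MLJ, DataFrames

X = DataFrame(x = rand(10))
y = rand(10)
mach = machine(SomeRegressor(), X, y)
fit!(mach)               # first call dispatches fit

X.x[1] = 42.0            # user mutates the bound table
fit!(mach)               # dispatches update; a cached matrix is now stale
fit!(mach, force=true)   # documented escape hatch: cold restart re-reads X
```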
> Are you referring to the fact that a user might mutate data bound to a machine in between a call to `fit` (dispatched by their first call to `fit!`) and a call to `update` (dispatched by a subsequent call to `fit!` on the same machine)?

Yes. Thus caching the matrix would be different from the current behaviour, in which a call to `update` uses the mutated data.

> If so, this is potentially an issue already with the existing API, right?

I guess so.

> Otherwise, we would have to do something more drastic, such as arranging for the machine constructor to make a deep copy of the data provided to the constructor by default (with an option not to copy, e.g. with `copy=false`).

Yeah.
> `MMI.matrix(X)` is (I think) just a view of the data (someone needs to confirm this)

Okay, I looked up the code in Tables.jl and found out that this is technically a copy, due to the allocation:
```julia
function matrix(table; transpose::Bool=false)
    cols = columns(table)
    types = schema(cols).types
    T = reduce(promote_type, types)
    n, p = rowcount(cols), length(types)
    if !transpose
        matrix = Matrix{T}(undef, n, p)
        for (i, col) in enumerate(Columns(cols))
            matrix[:, i] = col
        end
    else
        matrix = Matrix{T}(undef, p, n)
        for (i, col) in enumerate(Columns(cols))
            matrix[i, :] = col
        end
    end
    return matrix
end
```
More proof:

```julia
julia> X = (x1=[1., missing, 3], x2=[4., 5, 6], x3=[7., 8, missing])
(x1 = Union{Missing, Float64}[1.0, missing, 3.0],
 x2 = [4.0, 5.0, 6.0],
 x3 = Union{Missing, Float64}[7.0, 8.0, missing],)

julia> Z = DataFrame(X)
3×3 DataFrame
│ Row │ x1       │ x2      │ x3       │
│     │ Float64? │ Float64 │ Float64? │
├─────┼──────────┼─────────┼──────────┤
│ 1   │ 1.0      │ 4.0     │ 7.0      │
│ 2   │ missing  │ 5.0     │ 8.0      │
│ 3   │ 3.0      │ 6.0     │ missing  │

julia> P = Tables.matrix(Z)
3×3 Array{Union{Missing, Float64},2}:
 1.0       4.0  7.0
  missing  5.0  8.0
 3.0       6.0   missing

julia> Z[!, 3][3] = 0
0

julia> P    # unchanged: P does not share Z's storage
3×3 Array{Union{Missing, Float64},2}:
 1.0       4.0  7.0
  missing  5.0  8.0
 3.0       6.0   missing

julia> Z    # mutated
3×3 DataFrame
│ Row │ x1       │ x2      │ x3       │
│     │ Float64? │ Float64 │ Float64? │
├─────┼──────────┼─────────┼──────────┤
│ 1   │ 1.0      │ 4.0     │ 7.0      │
│ 2   │ missing  │ 5.0     │ 8.0      │
│ 3   │ 3.0      │ 6.0     │ 0.0      │
```
But I guess the copy is needed only if algorithms modify `X` in place.

Creating a `MatrixTable` is pretty cheap.
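For example (a quick sketch):

```julia
using Tables

A = rand(3, 2)
t = Tables.table(A)               # MatrixTable: wraps A, no copy made
A[1, 1] = 0.0
Tables.getcolumn(t, 1)[1] == 0.0  # true: t shares A's storage (columns are views)
```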
*The solution in this comment is superseded by #492.*
Further to my suggestion above: if it proves worthwhile, there could be a transition from the old interface to the new, by calling the new methods `fit2` and `update2` and having fallbacks for these that simply call `selectrows` on the (outer) tabular data to do the sampling (as now) and then call the existing `fit` and `update` implementations for the training; see the sketch below.
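A minimal sketch of the fallbacks, with the signatures proposed above (nothing here is implemented yet):

```julia
using MLJModelInterface
const MMI = MLJModelInterface

# fallbacks: existing implementations keep working unchanged
function fit2(model, verbosity, rows, X, y)
    Xsub, ysub = MMI.selectrows(X, rows), MMI.selectrows(y, rows)
    return MMI.fit(model, verbosity, Xsub, ysub)
end

function update2(model, verbosity, rows, old_rows,
                 old_fitresult, old_cache, old_report, X, y)
    Xsub, ysub = MMI.selectrows(X, rows), MMI.selectrows(y, rows)
    # delegate to the existing update, which has no notion of rows:
    return MMI.update(model, verbosity, old_fitresult, old_cache, Xsub, ysub)
end
```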
I will flesh this out a bit and post a concrete POC shortly.
I know MLJ will definitely have some overhead, since it wraps other code. But I believe this overhead can be reduced below the current level with careful design considerations. Avoidable overhead that is neglected may come back to haunt us when the code is scaled. There are a few things I discovered which are important:

1. `selectrows` causes overhead when used in the `evaluate` method, which calls it a lot of times, depending on the `repeats` parameter.
2. In the `MLJBase.fit` method, the `matrix` method (which, to my understanding, copies data) is called on a given table `X`. Of course this isn't bad if I call the method only once, using the same data while changing a couple of model parameters, but it becomes important in the `evaluate` method, which calls `fit` a couple of times depending on `repeats`. (If `X` is large this isn't nice.) (I don't think the `update` method does this enough justice.)
3. Other methods also call these methods repeatedly, copying `X` in each case, e.g. `MLJBase.predict` for probabilistic classifiers. (I can't find the link to the issue.)

These are just some of the points (a quick benchmark sketch follows below). I believe there are other things which affect scalability. It is better we start treating these issues more seriously. Imagine a case where one wants to embark on a Kaggle competition on a large dataset, only to find out that the overhead is just unbearable. You may correct me; perhaps I'm missing something. @ablaom, @tlienart
for probabilistic Classifiers.( I can't find the link to the issue)These are just some of the point. I believe there are other things which affects scalability. It is better we start treating this issues more serious. Imagine a case where One wants to embark on a kaggle competition on a large dataset only to find out that the overhead is just unbearable in this case). You may correct me perhaps i'm missing something. @ablaom , @tlienart