JuliaAI / MLJBase.jl

Core functionality for the MLJ machine learning framework
MIT License

taking performance issue more seriously #309

Open OkonSamuel opened 4 years ago

OkonSamuel commented 4 years ago

I know MLJ will definitely have some overhead, since it wraps other code, but I believe this overhead can be reduced below its current level with careful design. Avoidable overhead that is neglected now may come back to haunt us when the code is scaled up. Here are a few things I discovered that seem important.

  1. selectrows, as pointed out in issue #151. This adds overhead in the evaluate method, which calls selectrows many times, depending on the repeats parameter.
  2. In MLJBase.fit methods, the matrix method (which, to my understanding, copies the data) is called on the given table X. This is fine if fit is called only once on the same data while a couple of model parameters are changed, but it matters in the evaluate method, which calls fit several times depending on repeats (if X is large, this is not nice, and I don't think the update method does enough to help). Other methods also call these methods repeatedly, copying X each time; see the sketch at the end of this comment.
  3. The return type of MLJBase.predict for probabilistic classifiers (I can't find the link to the issue).

These are just some of the points; I believe there are other things that affect scalability. It is better we start treating these issues more seriously. Imagine someone embarking on a Kaggle competition with a large dataset, only to find that the overhead is unbearable. You may correct me; perhaps I'm missing something. @ablaom, @tlienart
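
To make point 2 concrete, here is a minimal, hypothetical sketch (not code from MLJBase; the names convert_each_time and convert_once are invented) contrasting the current pattern, in which each fold re-subsets the table and re-materialises it with Tables.matrix, with materialising once and slicing the resulting matrix:

using MLJBase, Tables, Random   # selectrows is exported by MLJBase

X = (x1 = rand(10_000), x2 = rand(10_000), x3 = rand(10_000))   # a column table
folds = [randperm(10_000)[1:8_000] for _ in 1:30]                # 30 resampled row sets

# current pattern: every "fit" re-subsets the table and re-materialises it
function convert_each_time(X, folds)
    s = 0.0
    for rows in folds
        Xsub = Tables.matrix(selectrows(X, rows))   # fresh allocation per fold
        s += sum(Xsub)
    end
    return s
end

# suggested pattern: materialise once, then take cheap views
function convert_once(X, folds)
    Xmat = Tables.matrix(X)            # single allocation
    s = 0.0
    for rows in folds
        s += sum(@view Xmat[rows, :])  # no copy
    end
    return s
end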

tlienart commented 4 years ago

re (2): I raised that a while back, but I seem to remember that a MatrixTable can be formed without copies. I'm not sure... What's for sure is that in some cases (when you have to materialise the transpose, e.g. for GLM stuff) there's definitely a copy.

OkonSamuel commented 4 years ago

@tlienart, what do you think? Since we are still going to call matrix(X) in the inner methods, why don't we, in some cases, call matrix(X) earlier, so we can use Xmatrix[inds, :] many times instead of having to call both selectrows and matrix many times? Xmatrix[inds, :] should also be faster than selectrows. It's just a suggestion, though.

OkonSamuel commented 4 years ago

In addition, I noticed that internally all tables are converted to dense matrices. Are there not cases where we would like to have a sparse matrix? (Maybe if something like a sparse table exists?)

tlienart commented 4 years ago

In addition i noticed that internally all tables are converted to dense matrices.

what code are you referring to here?

I think Anthony will have to chip in to give more clarity here, but as for the sparse case, I believe I tried that when we were working on LightGBM: you can have a table wrapping around a sparse matrix and recover said sparse matrix (but I'm not certain).

OkonSamuel commented 4 years ago

what code are you referring to here?

Code at MLJModels

you can have a Table wrapping around a sparse matrix and recover said sparse matrix

that's nice if it exists

tlienart commented 4 years ago

we need to be specific here because there's a bunch of different things happening in MLJModels

Assuming I'm correct in the above, then if the table wraps around a sparse matrix, Xarray remains that sparse matrix. Note that very few available algorithms (at this point) can handle sparse data.
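
For reference, here is one hedged way to check that claim, assuming Tables.table (which wraps any AbstractMatrix as a MatrixTable) behaves as described; whether Tables.matrix round-trips without densifying depends on the Tables.jl version:

using SparseArrays, Tables

S = sprand(1_000, 20, 0.05)   # a sparse feature matrix
Xt = Tables.table(S)          # wrap as a Tables.jl table (MatrixTable); no copy expected

Tables.istable(Xt)                         # true
Tables.getcolumn(Tables.columns(Xt), 1)    # one column of the wrapped data

# does the sparse matrix survive a round trip without densification?
issparse(Tables.matrix(Xt))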

ablaom commented 4 years ago

(edited) The solution in this comment is superseded by #492

Let me address the issue of reconverting tables into matrices every time the rows to be sampled change, which relates to #151. I do not see a non-breaking way to address this, but I agree it is worthwhile investigating the benefits of a fix.

Here's a suggestion for a change that's probably the easiest to implement for purposes of investigation. There may be other (essentially equivalent) ways that are friendlier to the implementer, but I suggest we hash that out later.

Currently a model implementation plays no role in sampling the data: row selection is performed in the fit! method (applied to a machine) through the Tables interface. Instead, we give that responsibility to the implementation by extending the signatures of MLJModelInterface.fit and MLJModelInterface.update to include rows and old_rows:

fit(model::SomeModelType, verbosity, rows, data...) -> fitresult, cache, report
update(model::SomeModelType, verbosity, rows, old_rows, old_fitresult, old_cache, old_report, data...) -> fitresult, cache, report

An implementation can already cache the matrix version of data by returning it in cache. But now, assuming the model has not changed (same hyper-parameters), fit! calls update, which can sample the rows using the cached matrix (or construct a view, if the algorithm being interfaced supports this) rather than re-converting the table. (Previously, if the rows had changed, fit! would do the sampling and call fit, not update.)

Note that we also pass old_rows to update because, if rows == old_rows, there is no need to repeat the calculation at all: just return old_fitresult, old_cache, old_report (which also explains the addition of old_report to the signature of update).

To evaluate the benefit of this change we could fork MLJModelInterface, MLJBase and, say, EvoTrees, repeat some of @jeremiedb's benchmarks, and add some that do a lot of resampling (repeated CV). Just note that update is already performing the role of updating iterations there.
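
A rough sketch of the kind of resampling benchmark meant here, assuming EvoTrees and BenchmarkTools are installed and that evaluate supports the repeats keyword; make_regression, CV and rms come from MLJBase, and nrounds is an EvoTrees hyper-parameter:

using MLJBase, EvoTrees, BenchmarkTools

X, y = make_regression(100_000, 20)    # synthetic regression data (table + vector)
model = EvoTreeRegressor(nrounds=20)

# repeated cross-validation exercises the selectrows/matrix conversion path many times
@btime evaluate($model, $X, $y;
                resampling=CV(nfolds=6),
                repeats=5,
                measure=rms,
                verbosity=0)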

Here's a mock-up of how an implementation might look:

solve(Xview::AbstractMatrix, yview::AbstractVector) = Xview \ yview   # least-squares coefficients

function MLJModelInterface.fit(model::MyDeterministicRegressor, verbosity, rows, X, y)
    Xmatrix = Tables.matrix(X)        # materialise the table once
    Xview = view(Xmatrix, rows, :)
    yview = view(y, rows)
    fitresult = solve(Xview, yview)   # coefficients
    report = ...
    cache = Xmatrix                   # cache the matrix for reuse in update
    return fitresult, cache, report
end

function MLJModelInterface.update(model::MyDeterministicRegressor, verbosity,
        rows, old_rows, old_fitresult, old_cache, old_report, X, y)
    rows == old_rows && return old_fitresult, old_cache, old_report
    Xmatrix = old_cache               # reuse the cached matrix; no re-conversion
    Xview = view(Xmatrix, rows, :)
    yview = view(y, rows)
    fitresult = solve(Xview, yview)
    report = ...
    return fitresult, old_cache, report
end

OkonSamuel commented 4 years ago

An implementation can already cache the matrix version of data by returning it in cache. But now, assuming the model has not changed (same hyper-parameters), fit! calls update, which can sample the rows using the cached matrix (or construct a view, if the algorithm being interfaced supports this) rather than re-converting the table. (Previously, if the rows had changed, fit! would do the sampling and call fit, not update.)

I think this might lead to unexpected behavior if the user is using a mutable table type (e.g., a DataFrame).

OkonSamuel commented 4 years ago

I guess we could have two versions of the fit! function.

ablaom commented 4 years ago

@OkonSamuel Thanks for looking deeper into this API. Looks like you are making an important observation here.

I think this might lead to unexpected behavior if the user is using a mutable table type (e.g., a DataFrame).

Are you referring to the fact that a user might mutate data bound to a machine between a call to fit (dispatched by their first call to fit!) and a call to update (dispatched by a subsequent call to fit! on the same machine)? If so, this is potentially an issue already with the existing API, right? I admit I had not imagined users doing this kind of thing. We could warn users in the documentation that fit! should be called with force=true if data is mutated. Otherwise, we would have to do something more drastic, such as arranging for the machine constructor to make a deep copy of the data provided to it by default (with an option not to copy, e.g. copy=false). Hmm. We might then need to automatically detect out-of-memory tables, like CSV files, that we would not want to copy by default, etc.
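
A purely hypothetical sketch of the copy=false idea (the type and constructor names below are invented for illustration and are not MLJ API):

# hypothetical, simplified stand-in for MLJ's machine type
struct HypotheticalMachine{M}
    model::M
    data::Tuple
end

function hypothetical_machine(model, data...; copy::Bool=true)
    # deep-copy in-memory data so later user mutation cannot silently
    # invalidate a cached matrix; an out-of-memory source would be skipped
    stored = copy ? deepcopy(data) : data
    return HypotheticalMachine(model, stored)
end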

What are your thoughts? Or maybe I missed the point of your question?

OkonSamuel commented 4 years ago

Are you referring to the fact that a user might mutate data bound to a machine between a call to fit (dispatched by their first call to fit!) and a call to update (dispatched by a subsequent call to fit! on the same machine)?

Yes. Caching the matrix would thus differ from the current behaviour, in which a call to update uses the mutated data.

OkonSamuel commented 4 years ago

If so, this is potentially an issue already with the existing API, right?

I guess so.

OkonSamuel commented 4 years ago

Otherwise, we would have to do something more drastic, such as arranging for the machine constructor to make a deep copy of the data provided to it by default (with an option not to copy, e.g. copy=false)

yeah.

OkonSamuel commented 4 years ago

* `MMI.matrix(X)` is (I think) just a view of the data (someone needs to confirm this),

Okay, I looked up the code in Tables.jl and found that this is technically a copy, due to the allocation:

function matrix(table; transpose::Bool=false)
    cols = columns(table)
    types = schema(cols).types
    T = reduce(promote_type, types)
    n, p = rowcount(cols), length(types)
    if !transpose
        matrix = Matrix{T}(undef, n, p)
        for (i, col) in enumerate(Columns(cols))
            matrix[:, i] = col
        end
    else
        matrix = Matrix{T}(undef, p, n)
        for (i, col) in enumerate(Columns(cols))
            matrix[i, :] = col
        end
    end
    return matrix
end

More evidence, from the REPL:


julia> X = (x1=[1.,missing,3],x2=[4.,5,6],x3=[7., 8, missing])
(x1 = Union{Missing, Float64}[1.0, missing, 3.0],
 x2 = [4.0, 5.0, 6.0],
 x3 = Union{Missing, Float64}[7.0, 8.0, missing],)

julia> Z =DataFrame(X)
3×3 DataFrame
│ Row │ x1       │ x2      │ x3       │
│     │ Float64? │ Float64 │ Float64? │
├─────┼──────────┼─────────┼──────────┤
│ 1   │ 1.0      │ 4.0     │ 7.0      │
│ 2   │ missing  │ 5.0     │ 8.0      │
│ 3   │ 3.0      │ 6.0     │ missing  │

julia> P = Tables.matrix(Z)
3×3 Array{Union{Missing, Float64},2}:
 1.0       4.0  7.0
  missing  5.0  8.0
 3.0       6.0   missing

julia> Z[!, 3][3]=0
0

julia> P
3×3 Array{Union{Missing, Float64},2}:
 1.0       4.0  7.0
  missing  5.0  8.0
 3.0       6.0   missing

julia> Z
3×3 DataFrame
│ Row │ x1       │ x2      │ x3       │
│     │ Float64? │ Float64 │ Float64? │
├─────┼──────────┼─────────┼──────────┤
│ 1   │ 1.0      │ 4.0     │ 7.0      │
│ 2   │ missing  │ 5.0     │ 8.0      │
│ 3   │ 3.0      │ 6.0     │ 0.0      │

But I guess the copy is needed only if algorithms modify X in place. Creating a MatrixTable, on the other hand, is pretty cheap.
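
A quick REPL-style check of that last claim, assuming Tables.table wraps the matrix without copying, so that mutations of the parent are visible through the table:

using Tables

A = rand(3, 3)
T = Tables.table(A)   # MatrixTable wrapping A; no data should be copied

A[1, 1] = 99.0
Tables.getcolumn(Tables.columns(T), 1)[1]   # 99.0 if the table shares storage with A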

ablaom commented 3 years ago

(edited) The solution in this comment is superseded by #492

Further to my suggestion above: if it proves worthwhile, there could be a transition from the old interface to the new by calling the new methods fit2 and update2 and providing fallbacks for them that simply call selectrows on the (outer) tabular data to do the sampling (as now) and then call the existing fit and update implementations to do the training.

I will flesh this out a bit and post a concrete POC shortly.
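
A hedged sketch of what such a fallback could look like, using the fit2 name from the comment above (not an actual MLJModelInterface method) and assuming MLJBase is loaded so that the MLJModelInterface data utilities, such as selectrows, are active:

import MLJModelInterface as MMI

# fallback: sample rows from the table, as fit! does now, then delegate to the
# existing fit implementation; update2 would delegate to update analogously
function fit2(model::MMI.Model, verbosity, rows, data...)
    sampled = map(d -> MMI.selectrows(d, rows), data)
    return MMI.fit(model, verbosity, sampled...)
end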

ablaom commented 3 years ago

Okay, I have rethought this yet again and have a simpler and less disruptive solution here