JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/

Should some, if any, "classic" ML algorithms accept matrix input in addition to table input? #209

Closed juliohm closed 4 years ago

juliohm commented 5 years ago

Describe the bug: KMeans transform is not working.

To Reproduce

using MLJ

@load KMeans

m = KMeans()
X = rand(100, 2)

theta = MLJ.fit(m, 0, X)
MLJ.transform(m, theta, X)

ERROR: MethodError: no method matching pairwise(::Distances.SqEuclidean, ::LinearAlgebra.Transpose{Float64,Array{Float64,2}}, ::Tuple{Array{Float64,2},Nothing,NamedTuple{(:assignments,),Tuple{Array{Int64,1}}}})
Closest candidates are:
  pairwise(::Distances.PreMetric, ::AbstractArray{T,2} where T; dims) at /Users/julio.hoffimann@ibm.com/.julia/packages/Distances/HOWRG/src/generic.jl:180
  pairwise(::Distances.PreMetric, ::AbstractArray{T,2} where T, ::AbstractArray{T,2} where T; dims) at /Users/julio.hoffimann@ibm.com/.julia/packages/Distances/HOWRG/src/generic.jl:170
Stacktrace:
 [1] transform(::KMeans{Distances.SqEuclidean}, ::Tuple{Array{Float64,2},Nothing,NamedTuple{(:assignments,),Tuple{Array{Int64,1}}}}, ::Array{Float64,2}) at /Users/julio.hoffimann@ibm.com/.julia/dev/MLJModels/src/Clustering.jl:78

Expected behavior The code should work as far as I understand.

Additional context: none.

Versions: master branch of MLJModels.jl

tlienart commented 5 years ago

OK, here's code that works, with some explanations below:

using MLJ
@load KMeans
m = KMeans(k=3)

X = MLJ.table(randn(100, 5))
fr, _, _ = MLJ.fit(m, 0, X)
res = MLJ.transform(m, fr, X)

Notes:

There were essentially two issues in your code causing the bug:

  1. fit returns three things: a fitresult, a cache, and a report. You fed a tuple of all three to transform, which expects only the first element (the fitresult).
  2. You used a matrix throughout; MLJ wants tables, as it needs to keep track of the schema.
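To illustrate the first point, here is a minimal sketch of the three-element return contract; `Toy`, `toyfit`, and `toytransform` are made-up names, not MLJ internals:

```julia
# Sketch of the fit/transform contract: fit returns (fitresult, cache, report),
# and transform takes only the fitresult. Toy/toyfit/toytransform are
# illustrative stand-ins, not actual MLJ API.
struct Toy end

toyfit(::Toy, verbosity, X) = (sum(X), nothing, (nrows = size(X, 1),))
toytransform(::Toy, fitresult, X) = X ./ fitresult

X = ones(3, 2)
fr, cache, report = toyfit(Toy(), 0, X)  # destructure all three results
Xout = toytransform(Toy(), fr, X)        # pass only the fitresult
```

Feeding the whole tuple to `toytransform` would reproduce exactly the kind of MethodError seen above.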

More generally, and more importantly, MLJ.fit is not really meant to be the entry point for users, which is why you ran into these issues that may seem a bit weird. Consider:

km = Machine(KMeans(k=2), X)
fit!(km)
res = transform(km, X)

PS: we're aware that there need to be more such tutorials available, and they will come; at the moment we're working on interfacing with a bunch of packages for elementary models first, so that people can work with MLJ expecting to find simple things like KMeans, OLS, Ridge, Logistic, etc.

juliohm commented 5 years ago

Thank you, I will look into it during my flight in a few hours. I need access to the low-level API because I'm not interested in Machine but in other types of experiments inside GeoStats.jl.

Regarding the table requirement, could you please elaborate? Why can't a simple matrix be used?


juliohm commented 5 years ago

The regression API, for example, works just fine with Julia Arrays.


tlienart commented 5 years ago

Could you have a look at https://github.com/alan-turing-institute/MLJ.jl/issues/50#issuecomment-457031719 and subsequent comments for context?

In the present context, though, it errors because of a call to MLJBase.table(..., prototype=X), which doesn't work when istable(X) = false.

Maybe @ablaom can chip in and say more.

ablaom commented 5 years ago

Yes, all these "classic" machine learning algorithms currently expect the input to be Tables.jl compatible table.

The current design does not support non-tabular inputs in these cases, but upcoming changes will make it possible to allow both tables and matrices as inputs, provided the implementer of a model's MLJ interface chooses to support this (and writes the extra logic to handle it).

Whether the implementer should do so is open to debate. If one implementation chooses to allow matrices, then all should, for consistency; but no doubt not all will.

(A more interesting but separate issue is what the "output type" of the KMeans transformer should be. Currently it is a table, but this means we need to invent labels for output columns, which is a potential source of name conflicts in model composition, I suppose.)

juliohm commented 5 years ago

Regarding the name of the output in unsupervised models, I've decided to request a name from the user in GeoStatsBase.jl. I'm far from my computer now but you can check the src/learning folder of the GeoStatsBase.jl repo. There I have a couple of learning task definitions and clustering requires an output name.

I'm considering more complex tasks in the future like the CompositeTask in that repo which is a DAG of tasks that can be parallelized. By forcing tasks to have input and output variable names we can exploit this DAG in a way that variables are created inside the DAG even though they are not present in the data initially.

For example I can create a composite task with clustering as the first step. This clustering will produce a variable name :cluster for example that can be fed in the next step of the DAG. This is in agreement with that comment I made that tasks and data are separate things and that tasks create data on the fly.


juliohm commented 5 years ago

Regarding the table-versus-array issue, I'm still confused. For instance, what does KMeans need that an array of features can't provide? Why a table? When we write down the derivation of these algorithms there is never a table, right? I'm seeking simplicity, assuming that an array of features is general enough to encompass all learning models. Please clarify if this assumption is flawed.


davidbp commented 5 years ago

I'm excited about the possibility of allowing "standard Base.Array" objects to work as well. There are many cases, like NLP or computer vision, where tables feel like a bad approach. I actually think that these cases are where MLJ networks will make the most sense. Decoupling feature generation from model fitting can be very powerful, and having a machine that "just works" for this type of problem would be fantastic.

Giving MLJ developers the chance to write fit methods for arrays should not be a problem. In many cases the core of the fit method will actually use a standard Array object.

While I try to include my MultiClassPerceptron model in MLJ, I will go into the details of the framework and will probably be able to give a more robust opinion.

Sklearn context

To put us in context, other frameworks, like sklearn, have a line in the fit method checking whether the input is an array or a dataframe. This checking, done by the check_array method, actually does more than that. You can see here, for example, the fit for a tree.

The check_array method is a base feature that people are expected to use when coding algorithms for the framework. It allows specifying the order in which examples are laid out (row- or column-wise), whether fit may accept sparse input, etc.
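For comparison, a rough Julia analogue of that kind of input check might look like the following sketch; `check_input` is a made-up name, not an MLJ or sklearn function:

```julia
# Hypothetical input validation in the spirit of sklearn's check_array:
# reject non-matrix, non-numeric, or non-finite input before fitting.
# The allow_sparse keyword is a placeholder for the kind of options
# check_array exposes; it is not used in this sketch.
function check_input(X; allow_sparse::Bool=false)
    X isa AbstractMatrix || throw(ArgumentError("expected a matrix"))
    eltype(X) <: Real || throw(ArgumentError("expected numeric entries"))
    all(isfinite, X) || throw(ArgumentError("input contains NaN or Inf"))
    return X
end
```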

fkiraly commented 5 years ago

here's a stupid question of mine, we might have already discussed the answer to it:

Can we not do it as follows?

The "natural" way to deal with this would be to treat arrays as an object having a scitype "table with numeric/continuous columns and unspecified (but auto-assigned) column names".

I would guess there are various ways to deal with this:
(a) a "wrapper" interface that allows accessing an array like a table;
(b) dispatch on all model fit and predict methods: whenever called with an array, wrap or convert it to a table;
(c) combining (a) with (b), so the user doesn't see the wrapping and can just pass arrays.
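Options (a)/(c) might be sketched as auto-wrapping a matrix into a column table with generated names; `astable` here is an illustrative helper, not the real mechanism (MLJ uses MLJ.table / Tables.jl):

```julia
# Sketch of the wrapping idea: give a matrix auto-generated column names
# so table-only code can consume it. `astable` is a made-up helper.
function astable(X::AbstractMatrix)
    names = Tuple(Symbol(:x, j) for j in 1:size(X, 2))
    return NamedTuple{names}(Tuple(X[:, j] for j in 1:size(X, 2)))
end
astable(X) = X  # anything else is assumed to already be a table
```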

fkiraly commented 5 years ago

This might have the drawback that everything is interpreted as continuous input, but there's no way for MLJ to guess whether some columns should be categorical etc, since it can't read the user's thoughts...

davidbp commented 5 years ago
This might have the drawback that everything is interpreted as continuous input, but there's no way for MLJ to guess whether some columns should be categorical etc, since it can't read the user's thoughts...

I'm not sure I see this as a drawback. It is to be expected in most frameworks.

The "contract" with MLJ could be: "some models might do interesting transformations by default if you use a table with different types per column. If you use an array they won't do any fancy transformation since feature values will be considered as numeric/continuous".

To sum up my thoughts: the two statements are not incompatible.

About (b)

(b) dispatch on all model fit and predict methods: whenever called with an array, wrap or convert it to a table

Why not

(b') use multiple dispatch: for all models, define fit and predict taking the input type into account. Define methods for both Arrays and Tables. Make no conversion.
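A sketch of (b') under multiple dispatch; `MyKMeans` and `myfit` are made-up names, not MLJ's interface:

```julia
# Option (b'): one method per input type, no conversion in either path.
struct MyKMeans
    k::Int
end

myfit(m::MyKMeans, X::AbstractMatrix) = "matrix path: $(size(X, 1)) rows"
myfit(m::MyKMeans, X::NamedTuple) = "table path: columns $(keys(X))"
```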

fkiraly commented 5 years ago

"why not (b')" - too many case distinctions to keep track of, in my opinion, if we expect the user to write array and table methods.

This expectation towards the user is also very difficult to make explicit. An architecture becomes brittle when saying "let's allow the user/extender another alternative to do X" too often, as it dilutes the default.

It's especially a slippery slope once there are multiple case distinctions like this, since suddenly one needs to think carefully about combination cases to ensure interoperability.

A clean design, i.m.o., should not make any architectural distinctions by machine type, on the user side (under the hood it's fine, of course).

davidbp commented 5 years ago

"why not (b')" - too many case distinctions to keep track of, in my opinion, if we expect the user to write array and table methods.

In my humble opinion, most users won't write code for learning algorithms. They will want code that works with different formats and is easy to use. This might give a bit more work to MLJ developers, but I expect that group to be much smaller.

The more I look at the details, the more convinced I get that Arrays should be allowed (somehow). Maybe at some point I'll see ways to get around this and no need for Arrays. At the moment I just want to try to push a simple model while I look at the code intricacies.

Flux works with standard Arrays, and maybe it's a project to take into consideration for ideas.

Another problem with tables

So far I noticed that many implementations end up iterating over an array, not a table, in the fit method. This causes several problems at the moment. The most notable one is that every fit requires double the amount of memory, because the data is copied in the statement Xmatrix = MLJBase.matrix(X). This happens because, according to the docstring of matrix, the function does the following:

Convert a table source `X` into a `Matrix`;
or, if `X` is an `AbstractMatrix`, return `X`.
Optimized for column-based sources.

This happens in the tree method in MLJModels/DecisionTree and in the KMeans implementation. It seems weird that the first thing the code does is convert the table to an array, yet we expect users and developers to use tables by default.
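The copy concern can be seen in a minimal sketch; `tomatrix` is an illustrative stand-in for MLJBase.matrix's documented behavior, not its actual implementation:

```julia
# Pass a matrix through unchanged; materialize a table's columns into a
# fresh Matrix (which allocates a second copy of the data).
tomatrix(X::AbstractMatrix) = X
tomatrix(X::NamedTuple) = reduce(hcat, collect(values(X)))

M = [1.0 2.0; 3.0 4.0]
T = (x1 = M[:, 1], x2 = M[:, 2])  # column-table holding the same data
```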

fkiraly commented 5 years ago

I think we should separate these concerns.

I'd strongly be of the opinion that we can discuss whether arrays can be passed, but this should not be treated by fundamentally changing the design, e.g., by adding array-specific interface points.

juliohm commented 5 years ago

I didn't have the chance to read all the comments carefully, but I understood that the table interface was adopted because it allows heterogeneous types for the features. For example, the features of a person could be their age (an integer), their name (a string), and their height (a float). Representing this tuple in a matrix X would imply element type Any for the matrix.

Could you please confirm this is the rationale?

If the answer to the above question is yes, then we need to be consistent and change the supervised models to only accept tables as well, or any type that implements the Tables.jl interface.

ablaom commented 5 years ago

If the answer to the above question is yes, then we need to be consistent and change the supervised models to only accept tables as well, or any type that accepts the Tables.jl interface.

This is already the case at present. Supervised models require tables.

juliohm commented 5 years ago

I was using supervised models with matrices. I also updated to the master branch of MLJ* and now KMeans works with matrices.


davidbp commented 5 years ago

I also can use MLJ models with Arrays and do fit!. What I can't do is define a Machine using a model and then call fit! on the Machine. The following error appears:

ERROR: ArgumentError: Container should be a table or sparse table. 

 [1] scitypes(::Array{Float64,2}) at /Users/david/.julia/packages/MLJBase/vdhww/src/scitypes.jl:89

The funny thing is that currently in MLJBase there is no scitypes.jl file. In any case, on my computer, the error seems to come from the following function:

function scitypes(X)
    container_type(X) in [:table, :sparse] ||
        throw(ArgumentError("Container should be a table or sparse table. "))
    names = schema(X).names
    return NamedTuple{names}(scitype_union(selectcols(X, c)) for c in names)
end

Probably adding :array to the check (container_type(X) in [:table, :sparse, :array]) would make this work. To do so we would also have to change container_type(X).

ablaom commented 5 years ago

@davidbp If you give a model from MLJModels a matrix there is no guarantee of obtaining a reliable result. That's the reason you get the (appropriate) warning when you do fit! on a machine. There is no warning when you attempt to interact through the model API directly (using fit) because this is not the interface point for the general user. So use at your own risk.

The scitypes API has been completely overhauled and moved to a separate repo in MLJ 0.3.0 (which requires MLJBase 0.4.0). See NEWS for further details. In particular, this allows for the possibility of the model interface provider specifying non-tabular data as input. (However, what existing models require has not been changed.)

You can inspect the required scitype of target/input for any model in a number of ways. If the model code has been loaded (with, e.g., @load) then, in MLJ 0.3.0, you can do input_scitype(DecisionTreeRegressor) and target_scitype(DecisionTreeRegressor). And, whether or not the model is loaded, you can do info("DecisionTreeRegressor") to get all the metadata. (On master, this has changed to model("DecisionTreeRegressor").)

davidbp commented 5 years ago

Sorry, I should have mentioned "the API is not restrictive for models but it is for machines". I should add that it works fine for my own model implementations but might not work for all MLJModels (as you point out).

In particular, this allows for the possibility of the model interface provider specifying non-tabular data as input. (However, what existing models require has not been changed.)

That's great! I will take a look when I can.

OkonSamuel commented 4 years ago

@ablaom is this still an issue, since we now have a MatrixTable which wraps a matrix?

ablaom commented 4 years ago

Yeah, this is an old (if contentious) discussion. Closing.