Closed juliohm closed 4 years ago
OK, here's code that works, and below are some explanations:
using MLJ
@load KMeans
m = KMeans(k=3)
X = MLJ.table(randn(100, 5))
fr, _, _ = MLJ.fit(m, 0, X)
res = MLJ.transform(m, fr, X)
Notes:
There were essentially two issues with your code causing the bug:
- fit returns three things: a fitresult, a cache and a report; so you fed a tuple of these three things to transform, which expected only the first element (the fitresult).
- you used a matrix throughout; MLJ wants tables, as it needs to keep track of the schema.
More generally / more importantly, MLJ.fit is not really meant to be the entry point for users, which is why you were exposed to these issues, which may seem a bit weird.
Consider:
km = Machine(KMeans(k=2), X)
fit!(km)
res = transform(km, X)
PS: we're aware that there needs to be more such tutorials available and they will come, but at the moment we're working on interfacing with a bunch of packages for elementary models first so that people can work with MLJ expecting to find simple things like KMeans, OLS, Ridge, Logistic etc.
Thank you, I will look into it during my flight in a few hours. I need access to the low-level API because I'm not interested in Machine but in other types of experiments inside GeoStats.jl.
Regarding the table requirement, could you please elaborate? Why can't a simple matrix be used?
The regression API, for example, works just fine with Julia Arrays.
Could you have a look at https://github.com/alan-turing-institute/MLJ.jl/issues/50#issuecomment-457031719 and subsequent comments for context?
In the present context though, it errors because of a call to MLJBase.table(..., prototype=X), which doesn't work when istable(X) = false; maybe @ablaom can chip in and say more.
Yes, all these "classic" machine learning algorithms currently expect the input to be a Tables.jl-compatible table.
The current design does not support non-tabular inputs in these cases, but upcoming changes will make it possible to accept both tables and matrices, provided the implementer of a model's MLJ interface chooses to allow this (and writes the extra logic to handle it).
Whether the implementer should do so is open to debate. If one implementation chooses to allow matrices, then all should, for consistency, but no doubt not all will.
(A more interesting but separate issue is what the "output type" of the KMeans transformer should be. Currently it is a table, but this means we need to invent labels for the output columns, which is a potential source of name conflicts in model composition, I suppose.)
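As a toy illustration of that naming concern, in plain Julia (no MLJ involved; the column names here are invented): if two transformers in a composite both auto-label their output columns :x1, :x2, ..., merging their outputs silently drops one of them.

```julia
# Hypothetical sketch: two transformer outputs that both auto-name columns.
outA = (x1 = [0.1, 0.2], x2 = [0.3, 0.4])  # output of one transformer
outB = (x1 = [1.0, 2.0],)                  # output of another, reusing :x1

merged = merge(outA, outB)  # NamedTuple merge: later keys win
# merged.x1 is now outB's column; outA's :x1 has been silently overwritten
```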
Regarding the name of the output in unsupervised models, I've decided to request a name from the user in GeoStatsBase.jl. I'm far from my computer now, but you can check the src/learning folder of the GeoStatsBase.jl repo. There I have a couple of learning task definitions, and clustering requires an output name.
I'm considering more complex tasks in the future, like the CompositeTask in that repo, which is a DAG of tasks that can be parallelized. By forcing tasks to have input and output variable names, we can exploit this DAG in such a way that variables are created inside the DAG even though they are not present in the data initially.
For example, I can create a composite task with clustering as the first step. This clustering will produce a variable named :cluster, for example, which can be fed into the next step of the DAG. This is in agreement with the comment I made that tasks and data are separate things and that tasks create data on the fly.
Regarding the table-versus-array issue, I'm still confused. For instance, what does KMeans need that an array of features can't provide? Why a table? When we write down the derivation of these algorithms there is never a table, right? I'm seeking simplicity, assuming that an array of features is general enough to encompass all learning models. Please clarify if this assumption is flawed.
I'm excited about the possibility of allowing standard Base.Array objects to work as well. There are many cases, like NLP or computer vision, where tables feel like a bad approach. I actually think that these cases are where MLJ networks will make the most sense. Decoupling feature generation from model fitting can be very powerful, and having a machine that "just works" for this type of problem would be fantastic.
Giving MLJ developers the chance to write fit methods for arrays should not be a problem. In many cases the core of the fit method will actually use a standard Array object.
While I try to include my MultiClassPerceptron model in MLJ, I will go into the details of the framework and will probably be able to give a more robust opinion.
To put us in context: other frameworks, like sklearn, have a line in the fit method checking whether the input is an array or a dataframe. This check, done by the check_array method, actually does more; you can see here, for example, the fit for a tree.
The check_array method is a base feature that people are expected to use when coding algorithms for the framework. It allows implementers to specify the order in which examples should be selected (row-/column-wise), whether fit may accept sparse input, etc.
Here's a stupid question of mine; we might have already discussed the answer to it:
Can we not do it as follows?
The "natural" way to deal with this would be to treat arrays as objects having the scitype "table with numeric/continuous columns and unspecified (but auto-assigned) column names".
I would guess there are various ways to deal with this:
(a) a "wrapper" interface that allows accessing an array like a table
(b) dispatch on all model fits and predicts: whenever called with an array, wrap or convert to a table
(c) combining (a) with (b), so the user doesn't see the wrapping and can just pass arrays
This might have the drawback that everything is interpreted as continuous input, but there's no way for MLJ to guess whether some columns should be categorical etc., since it can't read the user's thoughts...
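A minimal sketch of option (a) in Base Julia (astable is a hypothetical helper, not anything in MLJ): wrap a matrix as a column-table with auto-assigned names, without copying the data.

```julia
# Hypothetical helper: present a matrix as a NamedTuple of column views,
# i.e. a minimal column-table with auto-assigned names :x1, :x2, ...
function astable(X::AbstractMatrix)
    names = ntuple(i -> Symbol(:x, i), size(X, 2))
    cols  = ntuple(i -> view(X, :, i), size(X, 2))
    return NamedTuple{names}(cols)
end

T = astable([1.0 2.0; 3.0 4.0])
keys(T)  # (:x1, :x2)
T.x1     # view of the first column; no data is copied
```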
This might have the drawback that everything is interpreted as continuous input, but there's no way for MLJ to guess whether some columns should be categorical etc., since it can't read the user's thoughts...
I'm not sure I see this as a drawback. It is to be expected in most frameworks.
The "contract" with MLJ could be: "some models might do interesting transformations by default if you use a table with different types per column. If you use an array they won't do any fancy transformation since feature values will be considered as numeric/continuous".
To sum up my thoughts:
It's great to have models that have been carefully designed to be used with dataframe like inputs and can take advantage of the extra information in a table.
It's great to allow models to use standard arrays.
The two statements are not incompatible.
(b) dispatch on all model fits and predicts: whenever called with an array, wrap or convert to a table
Why not:
(b') use multiple dispatch: for all models, define fit and predict taking the input type into account. Define methods for both Arrays and Tables. Make no conversion.
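A bare-bones sketch of what (b') could look like in plain Julia (the model and function names here are made up, not MLJ's API): one method per input type, with no conversion in either path.

```julia
struct ColumnMeans end  # hypothetical toy "model"

# Matrix method: column means computed directly on the array.
myfit(::ColumnMeans, X::AbstractMatrix) = [sum(c) / length(c) for c in eachcol(X)]

# Table method (NamedTuple of vectors): per-column, no conversion to a matrix.
myfit(::ColumnMeans, X::NamedTuple) = [sum(c) / length(c) for c in values(X)]

myfit(ColumnMeans(), [1.0 2.0; 3.0 4.0])                # [2.0, 3.0]
myfit(ColumnMeans(), (a = [1.0, 3.0], b = [2.0, 4.0]))  # [2.0, 3.0]
```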
"why not (b')" - too many case distinctions to keep track of, in my opinion, if we expect the user to write array and table methods.
This expectation towards the user is also very difficult to make explicit. An architecture becomes brittle when saying "let's allow the user/extender another alternative to do X" too often, as it dilutes the default.
It's especially a slippery slope once there are multiple case distinctions like this, since suddenly one needs to think carefully about combination cases to ensure interoperability.
A clean design, i.m.o., should not make any architectural distinctions by machine type, on the user side (under the hood it's fine, of course).
"why not (b')" - too many case distinctions to keep track of, in my opinion, if we expect the user to write array and table methods.
In my humble opinion, most users won't write code for learning algorithms. They will want code that works with different formats and is easy to use. This might mean a bit more work for MLJ developers, but I expect that group to be much smaller.
The more I look at the details, the more convinced I get that Arrays should be allowed (somehow). Maybe at some point I'll see a way around this and no need for Arrays. At the moment I just want to push a simple model while I look at the code intricacies.
Flux works with standard Arrays and maybe it's a project to take into consideration in order to bring ideas.
So far I have noticed that many implementations end up iterating over an array, not a table, in the fit method. This causes several problems at the moment. The most notable one is that every fit requires double the amount of memory, because the data is copied in the statement Xmatrix = MLJBase.matrix(X). This happens because, according to the help for the matrix method, the function does the following:
Convert a table source `X` into a `Matrix`; or, if `X` is an `AbstractMatrix`, return `X`. Optimized for column-based sources.
This happens in the tree method in MLJModels/DecisionTree and in the KMeans implementation. It seems weird that the first thing the code does is convert the table to an array, yet we expect users and developers to use tables by default.
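A toy illustration of the copying concern, in Base Julia only (tomatrix is a hypothetical stand-in for MLJBase.matrix, following the docstring quoted above): the matrix path returns the input unchanged, while the table path allocates a fresh matrix.

```julia
# Hypothetical stand-in for MLJBase.matrix, per the quoted docstring.
tomatrix(X::AbstractMatrix) = X                                 # pass through, no copy
tomatrix(X::NamedTuple)     = reduce(hcat, collect(values(X)))  # allocates new storage

M = [1.0 2.0; 3.0 4.0]
tomatrix(M) === M    # true: the very same object, no extra memory

tbl = (x1 = M[:, 1], x2 = M[:, 2])
tomatrix(tbl) == M   # true: same values ...
tomatrix(tbl) === M  # false: ... but a second copy of the data
```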
I think we should separate
I'd strongly be of the opinion that we can discuss whether arrays can be passed, but this should not be addressed by fundamentally changing the design, e.g., by adding array-specific interface points.
I didn't have the chance to read all the comments carefully, but I understood that the table interface was adopted because it allows heterogeneous types for the features. For example, the features of a person could be their age (an integer), their name (a string), and their height (a float). Representing this tuple in a matrix X would imply an element type of Any for the matrix.
Could you please confirm this is the rationale?
If the answer to the above question is yes, then we need to be consistent and change the supervised models to only accept tables as well, or any type that implements the Tables.jl interface.
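A toy example in Base Julia of exactly this point (no MLJ involved): a NamedTuple-of-vectors "table" keeps one element type per column, while packing the same data into a matrix collapses everything to Any.

```julia
tbl = (age = [25, 31], name = ["Ann", "Bob"], height = [1.70, 1.80])
map(eltype, values(tbl))  # one concrete type per column: Int, String, Float64

mat = hcat(tbl.age, tbl.name, tbl.height)
eltype(mat)               # Any: the per-column type information is gone
```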
If the answer to the above question is yes, then we need to be consistent and change the supervised models to only accept tables as well, or any type that accepts the Tables.jl interface.
This is already the case at present. Supervised models require tables.
I was using supervised models with matrices. I also updated to the master branch of MLJ* and now KMeans works with matrices.
I can also use MLJ models with Arrays and do fit!. What I can't do is define a Machine using a model and then call fit! on the Machine. The following error appears:
ERROR: ArgumentError: Container should be a table or sparse table.
[1] scitypes(::Array{Float64,2}) at /Users/david/.julia/packages/MLJBase/vdhww/src/scitypes.jl:89
The funny thing is that currently there is no scitypes.jl in MLJBase. In any case, on my computer, the error seems to come from the following function:
function scitypes(X)
container_type(X) in [:table, :sparse] ||
throw(ArgumentError("Container should be a table or sparse table. "))
names = schema(X).names
return NamedTuple{names}(scitype_union(selectcols(X, c)) for c in names)
end
Probably adding :array to that check (container_type(X) in [:table, :sparse, :array]) would make this work. To do so, we would also need to change container_type(X).
@davidbp If you give a model from MLJModels a matrix there is no guarantee of obtaining a reliable result. That's the reason you get the (appropriate) warning when you do fit! on a machine. There is no warning when you attempt to interact through the model API directly (using fit) because this is not the interface point for the general user. So use at your own risk.
The scitypes API has been completely overhauled and moved to a separate repo in MLJ 0.3.0 (which requires MLJBase 0.4.0). See NEWS for further details. In particular, this allows for the possibility of the model interface provider specifying non-tabular data as input. (However, what existing models require has not been changed.)
You can inspect the required scitype of the target/input for any model in a number of ways. If the model code has been loaded (with, e.g., @load) then, in MLJ 0.3.0, you can do input_scitype(DecisionTreeRegressor) and target_scitype(DecisionTreeRegressor). And, whether or not the model is loaded, you can do info("DecisionTreeRegressor") to get all the metadata. (On master, this has changed to model("DecisionTreeRegressor").)
Sorry, I should have mentioned that "the API is not restrictive for Models but it is for Machines". I should add that it works fine for my own model implementations but might not work for all MLJModels (as you point out).
In particular, this allows for the possibility of the model interface provider specifying non-tabular data as input. (However, what existing models require has not been changed.)
That's great! I will take a look when I can.
@ablaom is this still an issue, since we now have a MatrixTable which wraps a matrix?
Yeah, this is an old (if contentious) discussion. Closing.
Describe the bug
KMeans transform is not working.
To Reproduce
Expected behavior
The code should work as far as I understand.
Additional context
No additional context.
Versions
Master branch of MLJModels.jl