Closed: roland-KA closed this issue 2 years ago.
Thanks for your query.

There are of course plenty of models that specify `Multiclass`, `OrderedFactor` (or `Finite`, which covers either) for the target, and in those cases you are correct that this means the user is passing a categorical vector (single-target case) or a table of categorical vectors (multi-target case). For example, to see all the models that can handle a single `Multiclass` target with, say, 3 classes, do:

```julia
models() do m
    AbstractVector{Finite{3}} <: m.target_scitype
end
```
There aren't many models that handle a table of `Multiclass` features, but there are some:

```julia
julia> models() do m
           Table(Finite{3}) <: m.input_scitype
       end
13-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype), T} where T<:Tuple}:
 (name = ConstantClassifier, package_name = MLJModels, ... )
 (name = ConstantRegressor, package_name = MLJModels, ... )
 (name = ContinuousEncoder, package_name = MLJModels, ... )
 (name = DecisionTreeClassifier, package_name = BetaML, ... )
 (name = DecisionTreeRegressor, package_name = BetaML, ... )
 (name = DeterministicConstantClassifier, package_name = MLJModels, ... )
 (name = DeterministicConstantRegressor, package_name = MLJModels, ... )
 (name = FeatureSelector, package_name = MLJModels, ... )
 (name = FillImputer, package_name = MLJModels, ... )
 (name = OneHotEncoder, package_name = MLJModels, ... )
 (name = RandomForestClassifier, package_name = BetaML, ... )
 (name = RandomForestRegressor, package_name = BetaML, ... )
 (name = Standardizer, package_name = MLJModels, ... )
```
Under the hood some of these models convert the categorical vectors to integer vectors (and back again), but as a categorical array is essentially an array of integers plus metadata, I don't think there's a big performance cost. (You can reduce the cost further by implementing a data "front-end", but I doubt it's worth it unless your model has an iteration parameter, maybe.) The DecisionTree.jl models (not listed) support features that are `OrderedFactor`, and I don't think there is any conversion there, because the algorithm only needs the order operator `<`.
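The `OrderedFactor` point can be illustrated with CategoricalArrays directly: elements of an ordered categorical vector support `<` without any conversion. This is a small standalone sketch, not code taken from DecisionTree.jl:

```julia
using CategoricalArrays

# An ordered categorical vector; the level order carries the "<" semantics:
v = categorical(["low", "high", "mid"]; ordered=true)
levels!(v, ["low", "mid", "high"])  # declare the intended order explicitly

v[1] < v[2]  # "low" < "high" under the declared level order
```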
The advantage of having categorical data arrive as a `CategoricalArray` is that you always get the complete pool of classes, even if resampling has hidden some of them. If you haven't already, have a look at this section of "Working with categorical data".
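The "complete pool survives resampling" behaviour can be seen directly with CategoricalArrays (a small illustration, not specific to MLJ):

```julia
using CategoricalArrays

y = categorical(["a", "b", "c", "a"])
ysub = y[1:2]    # a "resampled" slice whose observations omit class "c"

levels(ysub)     # still ["a", "b", "c"]: the full class pool survives subsetting
```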
Does this adequately address your query?
Thanks for the comprehensive answer! In the meantime I've learned that I didn't have a complete understanding of how ScientificTypes work behind the scenes, so part of my question probably didn't make much sense 😬.

But there are still a few details about handling types outside and inside of MLJ which I don't understand yet:

- You write that users pass `CategoricalArray`s if the target is declared to be of type `Multiclass` or `OrderedFactor`. Is the use of a `CategoricalArray` mandatory in this case, or is it just more efficient (since the different classes are stored only once) and more information-bearing (since a `CategoricalArray` knows all classes)? ... or would it also be possible to pass a normal `Array`?

- My second question is about the situation when the features are categorical data and thus in MLJ are declared to be of type `Multiclass` (or `OrderedFactor`). If I have a model `MyModel` which implements its fit function `my_fit` as follows, with a relatively specific data type like `AbstractDataFrame`:

```julia
module MyModel

function my_fit(X::AbstractDataFrame, y::AbstractVector)
    ...
end

end
```

... and I want to integrate this model into MLJ (i.e. register it as a new model in MLJ): am I running into any trouble because the type of `X` might be too specific, or does this work as long as I'm using a type which conforms to `Tables`? Many models I've seen so far use an `AbstractMatrix` at this place. Are there any assumptions made in MLJ about this situation?
> Thanks for the comprehensive answer! In the meantime I've learned that I didn't have a complete understanding of how ScientificTypes work behind the scenes, so part of my question probably didn't make much sense 😬. But there are still a few details about handling types outside and inside of MLJ which I don't understand yet:
>
> - You write that users pass `CategoricalArray`s if the target is declared to be of type `Multiclass` or `OrderedFactor`. Is the use of a `CategoricalArray` mandatory in this case, or is it just more efficient (since the different classes are stored only once) and more information-bearing (since a `CategoricalArray` knows all classes)? ... or would it also be possible to pass a normal `Array`?
In case it's not clear (it probably is): under the hood you are free to use whatever types you like. What I think we are discussing here is the form in which the data arrives (for training) and the form in which data leaves (eg, prediction), which should match where appropriate.

I suppose it's not strictly mandatory to require the target to come in as a `CategoricalArray`. You just need to be able to articulate your data requirements using scientific types. So you could declare, say, the target scitype to be `AbstractVector{Count}` (or a union of types to allow more than one kind), which would imply the user passes an `AbstractVector{<:Integer}`, but that has three problems: (i) MLJ propaganda is that `Count` is for discrete, typically unbounded "frequency" data, so there is a danger of the user misinterpreting the kind of modelling that is happening; (ii) you need a separate mechanism for conveying information about the complete class pool (eg, a separate training argument to `fit`); (iii) there is user confusion around the fact that all the other MLJ classifiers declare the target scitype to be `AbstractVector{<:OrderedFactor}` or `AbstractVector{<:Multiclass}` or similar.
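The distinction between the two declarations can be checked directly with `scitype` (a sketch; assumes ScientificTypes and CategoricalArrays are available, which they are after `using MLJ`):

```julia
using ScientificTypes, CategoricalArrays

# An integer vector is interpreted as Count data under the convention:
scitype([1, 2, 3])                     # AbstractVector{Count}

# A categorical vector is interpreted as Multiclass, with the class count:
scitype(categorical(["a", "b", "a"]))  # AbstractVector{Multiclass{2}}
```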
Could you say more about why you might not want to force MLJ users to use CategoricalArrays?
And by the way, there's nothing to stop you having a "local" interface which completely dodges the scitype issue, and a separate MLJ interface.
> - My second question is about the situation when the features are categorical data and thus in MLJ are declared to be of type `Multiclass` (or `OrderedFactor`). If I have a model `MyModel` which implements its fit function `my_fit` as follows, with a relatively specific data type like `AbstractDataFrame`:
>
> ```julia
> module MyModel
>
> function my_fit(X::AbstractDataFrame, y::AbstractVector)
>     ...
> end
>
> end
> ```
>
> ... and I want to integrate this model into MLJ (i.e. register it as a new model in MLJ): am I running into any trouble because the type of `X` might be too specific, or does this work as long as I'm using a type which conforms to `Tables`? Many models I've seen so far use an `AbstractMatrix` at this place. Are there any assumptions made in MLJ about this situation?
As I say, you can use whatever type you like under the hood. However, if you are using `AbstractDataFrame`, there's a chance your model works for any Tables.jl-compatible table; you just need to drop the type annotation `AbstractDataFrame` and declare an `input_scitype` of `MLJModelInterface.Table(Finite)`, if all columns need to be `CategoricalArray`s, say.

Perhaps you want to share more detail about the model you have in mind?
Thanks for your explanations! I've just finished an update to the model which I would like to register with MLJ (roland-KA/OneRule). So we now have an example to look at.

This model uses categorical data for the features as well as for the target. Its internal fit function (`get_best_tree` in `trees.jl`) uses the data types I've mentioned above, as follows:

```julia
function get_best_tree(X::AbstractDataFrame, y::AbstractVector)
    trees = all_trees(X, y)
    return trees[argmin(trees)]
end
```

But I have defined the data types for use with MLJ as follows (in `OneRule_MLJ.jl`):

```julia
MMI.metadata_model(OneRuleClassifier,
    input_scitype = MMI.Table(MMI.Finite),
    target_scitype = AbstractVector{<:MMI.Finite},
    ...
```
So a user of the MLJ interface should pass a `DataFrame` with columns that are `CategoricalArray`s, and the target as well as the predictions should be `CategoricalArray`s too.

I hope this makes sense? With my questions above, I just wanted to make sure that the first version isn't a complete mess 🤓.

The tests in `runtests.jl` show more or less how the model can be used via the MLJ interface. Essentially it is:
```julia
using DataFrames
using OneRule
using MLJ
using CategoricalArrays

### create test data
weather = DataFrame(
    outlook     = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast", "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"],
    temperature = ["hot", "hot", "hot", "mild", "cool", "cool", "cool", "mild", "cool", "mild", "mild", "mild", "hot", "mild"],
    humidity    = ["high", "high", "high", "high", "normal", "normal", "normal", "high", "normal", "normal", "normal", "high", "normal", "high"],
    windy       = ["false", "true", "false", "false", "false", "true", "true", "false", "false", "false", "true", "true", "false", "true"]
)
play = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no"]

# create/adapt test data for use via MLJ interface
coerce!(weather, Textual => Multiclass)
play_cat = categorical(play)

# ML workflow
orc = OneRuleClassifier()
mach = machine(orc, weather, play_cat)
fit!(mach)
yhat_cat = MLJ.predict(mach, weather)
fitted_tree = report(mach).tree
```
Cool. Hopefully I can take a look next week.

So, I'm closing this issue, as the remaining questions are better addressed in the issue you opened on OneRule.
The discussion here and in roland-KA/OneRule.jl#2 helped me to understand several aspects of using ScientificTypes and CategoricalArrays inside and outside of MLJ much better. Therefore I will summarize the take-aways here for the points where I had difficulties in the beginning. Perhaps it will help other users to get started faster with these topics.

ScientificTypes:

- ScientificTypes define a type system that is conceptually an abstraction layer on top of the Julia type system (but technically the types defined by ScientificTypes are ordinary Julia types) and that specifically addresses the needs of machine learning.
- After `using ScientificTypes` they are just there, ready to use. There is no need to define or declare your data objects or variables for use with this type system. Note that `using MLJ` implicitly loads ScientificTypes.
- The scientific type of data can be queried with `scitype` (the same way `typeof` is used on the Julia level). At this stage, ScientificTypes just tries to infer the scientific type from the Julia type.
- The scientific type of data can be changed with `coerce` or `coerce!`. The latter variant can only be applied to tabular data. For all other data, a copy (bearing the added information about the correct scientific type) is created with `coerce`.
- The inferred scientific type is not always the intended one. Data coded as integers, for example, is interpreted by ScientificTypes as being of type `Count`. Only with the appropriate domain knowledge does it become clear that the correct scientific type is, say, `OrderedFactor` (and it has to be changed explicitly using `coerce`).
- Coercion to a `Finite` type (`Multiclass` or `OrderedFactor`) means that the data concerned gets automatically converted to `CategoricalValue`s (and an array containing such values will be converted to a `CategoricalArray`).

Learning models outside and inside of MLJ:

- A model need not use `CategoricalValue`s or `CategoricalArray`s when used with its native interface (i.e. outside of MLJ). I.e. it may be implemented so that it can process e.g. arrays of `String`.
- Its MLJ interface may nonetheless declare the input or target scitype to be `Finite`. Such a model, even though it natively accepts e.g. arrays of `String` (which have scientific type `Textual`), won't accept this data when used via its MLJ interface.

@roland-KA Thanks indeed for taking the time to document your experience!
I think the synopsis is generally correct. I wouldn't say that `scitype` tries to "guess" the scientific type of data. Rather, it associates a scientific type to each Julia type according to a specific convention that was decided upon by mostly matching common usage, but which will not match usage in all cases. (And a developer can in principle implement a different convention using ScientificTypesBase.jl.)

As you correctly explain, when you implement an interface for an MLJ model, the interface must make type adjustments to account for any mismatch between what a model expects and conceptualises as a "multiclass vector", say, and what objects actually have `AbstractVector{<:Multiclass}` as scitype under the convention. If the core model expects `Vector{<:String}`, say (whose instances have scitype `AbstractVector{Textual}`), then the interface, having declared the expected scitype (eg, `target_scitype`) to be `AbstractVector{<:Multiclass}`, will need to convert the data received by `MLJBase.fit` and `MLJBase.predict` (typically an unordered categorical vector) into the `Vector{String}` type the internal model requires. (An exception occurs if an implementation overloads an additional "data front end".)
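A minimal sketch of such a conversion (the model and function names here are hypothetical; only `string.(...)` and `levels` are real CategoricalArrays API):

```julia
using CategoricalArrays

# What an MLJ-style fit might do with a categorical target `y` before
# handing it to an internal model that expects a Vector{String}:
function prepare_target(y::CategoricalVector)
    classes = levels(y)   # the complete class pool, even if y omits some classes
    y_str   = string.(y)  # CategoricalVector -> Vector{String} for the internal model
    return y_str, classes
end

y = categorical(["yes", "no", "yes"])
y_str, classes = prepare_target(y)
```

The pool returned by `levels(y)` is what `predict` would later need to reconstruct a categorical vector covering all classes.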
In this way, scientific types are simply a way to: (i) enforce uniformity in the data types that MLJ users present to their models, and (ii) allow the user to focus on the purpose (scientific type) of data rather than on the specific machine representation.
Thanks for your clarifying feedback! Step by step I'm working towards a full understanding of the concepts 🤓.
I'm trying to adapt a model for use with MLJ. The features as well as the target used in this model are categorical data.

MLJ uses ScientificTypes for all data (and `CategoricalArray`s for categorical data). Therefore I'm thinking about using these constructs already in the native model. But I didn't find any existing native models using these constructs. So I'm wondering if there are any disadvantages associated with this approach. What are the pros and cons of using ScientificTypes and `CategoricalArray`s already in a native model (if the model should be integrated with MLJ)?