JuliaAI / MLJ.jl

A Julia machine learning framework
https://juliaai.github.io/MLJ.jl/

Use of `ScientificTypes` and `CategoricalArrays` in native model #907

Closed roland-KA closed 2 years ago

roland-KA commented 2 years ago

I'm trying to adapt a model for use with MLJ. Both the features and the target used in this model are categorical data.

MLJ uses ScientificTypes for all data (and CategoricalArrays for categorical data). Therefore I'm considering using these constructs in the native model itself. But I didn't find any existing native models that do so, and I'm wondering whether there are any disadvantages to this approach. What are the pros and cons of using ScientificTypes and CategoricalArrays already in a native model (if the model is to be integrated with MLJ)?

ablaom commented 2 years ago

Thanks for your query.

There are of course plenty of models that specify Multiclass, OrderedFactor (or Finite, which covers either) for the target, and in those cases you are correct that the user passes a categorical vector (single-target case) or a table of categorical vectors (multi-target case). For example, to see all the models that can handle a single Multiclass target with, say, 3 classes, do:

julia> models() do m
         AbstractVector{Finite{3}} <: m.target_scitype
       end

There aren't many models that handle a table of Multiclass features, but there are some:

julia> models() do m
         Table(Finite{3}) <: m.input_scitype
       end
13-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype), T} where T<:Tuple}:
 (name = ConstantClassifier, package_name = MLJModels, ... )
 (name = ConstantRegressor, package_name = MLJModels, ... )
 (name = ContinuousEncoder, package_name = MLJModels, ... )
 (name = DecisionTreeClassifier, package_name = BetaML, ... )
 (name = DecisionTreeRegressor, package_name = BetaML, ... )
 (name = DeterministicConstantClassifier, package_name = MLJModels, ... )
 (name = DeterministicConstantRegressor, package_name = MLJModels, ... )
 (name = FeatureSelector, package_name = MLJModels, ... )
 (name = FillImputer, package_name = MLJModels, ... )
 (name = OneHotEncoder, package_name = MLJModels, ... )
 (name = RandomForestClassifier, package_name = BetaML, ... )
 (name = RandomForestRegressor, package_name = BetaML, ... )
 (name = Standardizer, package_name = MLJModels, ... )

Under the hood some of these models convert the categorical vectors to integer vectors (and back again), but as a categorical array is essentially an array of integers plus metadata, I don't think there's a big performance cost. (You can reduce the cost further by implementing a data "front-end", but I doubt it's worth it unless your model has an iteration parameter, maybe.) The DecisionTree.jl models (not listed) support features that are OrderedFactor, and I don't think there is any conversion, because the algorithm only needs the order operator <.
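For illustration, the integer round-trip mentioned above can be sketched with CategoricalArrays.jl like this (a minimal example; the variable names and data are made up):

```julia
using CategoricalArrays

v = categorical(["low", "high", "low", "medium"]; ordered=true,
                levels=["low", "medium", "high"])

codes = levelcode.(v)    # integer codes: [1, 3, 1, 2]
pool  = levels(v)        # ["low", "medium", "high"]

# reconstruct the categorical vector from codes plus pool metadata
v2 = categorical(pool[codes]; ordered=true, levels=pool)
```

Because the codes and the pool together carry all the information, the reconstruction is lossless.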

The advantage of having categorical data arrive as a CategoricalArray is that you always get the complete pool of classes, even if resampling has hidden some of them. If you haven't already, have a look at this section of "Working with categorical data".
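The point about the complete pool can be seen in a small example (a sketch with made-up data): subsetting a CategoricalArray, as resampling does, keeps the full set of levels even when some classes are absent from the subset.

```julia
using CategoricalArrays

y = categorical(["yes", "no", "maybe", "yes"])

ytrain = y[1:2]      # a "resampled" subset containing only "yes" and "no"

levels(ytrain)       # still ["maybe", "no", "yes"] — the full pool survives
```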

Does this adequately address your query?

roland-KA commented 2 years ago

Thanks for the comprehensive answer! In the meantime I've learned that I didn't have a complete understanding of how ScientificTypes works behind the scenes. Therefore part of my question probably didn't make much sense 😬.

But there are still a few details about handling types outside and inside of MLJ which I don't understand yet:

  1. You write that users pass CategoricalArrays if the target is declared to be of type Multiclass or OrderedFactor. Is the use of a CategoricalArray mandatory in this case, or is it just more efficient (since the different classes are stored only once) and more informative (since a CategoricalArray knows all the classes)? ... or would it also be possible to pass a normal Array?
  2. My second question is about the situation where the features are categorical data and thus in MLJ are declared to be of type Multiclass (or OrderedFactor). If I have a model MyModel which implements its fit function my_fit as follows, with a relatively specific data type like AbstractDataFrame:
module MyModel

using DataFrames

function my_fit(X::AbstractDataFrame, y::AbstractVector)
    ...
end

end # module

... and I want to integrate this model into MLJ (i.e. register it as a new model in MLJ). Am I running into any trouble because the type of X might be too specific, or does this work as long as I'm using a type that conforms to the Tables.jl interface? Many models I've seen so far use an AbstractMatrix at this place. Are there any assumptions made in MLJ about this situation?

ablaom commented 2 years ago

> Thanks for the comprehensive answer! In the meantime I've learned that I didn't have a complete understanding of how ScientificTypes works behind the scenes. Therefore part of my question probably didn't make much sense 😬.
>
> But there are still a few details about handling types outside and inside of MLJ which I don't understand yet:
>
> 1. You write that users pass CategoricalArrays if the target is declared to be of type Multiclass or OrderedFactor. Is the use of a CategoricalArray mandatory in this case, or is it just more efficient (since the different classes are stored only once) and more informative (since a CategoricalArray knows all the classes)? ... or would it also be possible to pass a normal Array?

In case it's not clear (it probably is): under the hood you are free to use whatever types you like. What I think we are discussing here is the form in which the data arrives (for training) and the form in which data leaves (eg, prediction), which should match where appropriate.

I suppose it's not strictly mandatory to require the target to come in as a CategoricalArray. You just need to be able to articulate your data requirements using scientific types. So you could declare, say, the target scitype to be AbstractVector{Count} (or a union of types, to allow more than one kind), which would imply the user passes an AbstractVector{<:Integer}. But that has three problems: (i) MLJ propaganda is that Count is for discrete, typically unbounded "frequency" data, so there is a danger of the user misinterpreting the kind of modelling that is happening; (ii) you need a separate mechanism for conveying information about the complete class pool (eg, a separate training argument to fit); and (iii) user confusion around the fact that all the other MLJ classifiers declare the target scitype to be AbstractVector{<:OrderedFactor}, AbstractVector{<:Multiclass}, or similar.
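To make the distinction concrete, here is how the default convention assigns scitypes to a few machine types (a minimal sketch; the data is made up):

```julia
using ScientificTypes, CategoricalArrays

# plain integers fall under Count per the default convention
scitype([1, 2, 3])                          # AbstractVector{Count}

# unordered categorical data falls under Multiclass{N}
scitype(categorical(["a", "b", "a"]))       # AbstractVector{Multiclass{2}}

# ordered categorical data falls under OrderedFactor{N}
scitype(categorical([1, 2]; ordered=true))  # AbstractVector{OrderedFactor{2}}
```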

Could you say more about why you might not want to force MLJ users to use CategoricalArrays?

And by the way, there's nothing to stop you having a "local" interface which completely dodges the scitype issue, and a separate MLJ interface.

> 2. My second question is about the situation where the features are categorical data and thus in MLJ are declared to be of type Multiclass (or OrderedFactor). If I have a model MyModel which implements its fit function my_fit as follows, with a relatively specific data type like AbstractDataFrame:
>
> module MyModel
>
> function my_fit(X::AbstractDataFrame, y::AbstractVector)
>     ...
> end
>
> ... and I want to integrate this model into MLJ (i.e. register it as a new model in MLJ). Am I running into any trouble because the type of X might be too specific, or does this work as long as I'm using a type that conforms to the Tables.jl interface? Many models I've seen so far use an AbstractMatrix at this place. Are there any assumptions made in MLJ about this situation?

As I say, you can use whatever type you like under the hood. However, if you are using AbstractDataFrame, there's a chance your model works for any Tables.jl-compatible table: you just need to drop the type annotation AbstractDataFrame and declare an input_scitype of MLJModelInterface.Table(Finite), if all columns need to be CategoricalArrays, say.
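Dropping the annotation might look something like this (a hedged sketch, not actual MLJ interface code; the function name and body are hypothetical, and Tables.columntable is the generic column accessor that works for DataFrames and any other Tables.jl source):

```julia
using Tables

# Accept any Tables.jl-compatible table, not just a DataFrame
function my_fit(X, y::AbstractVector)
    cols  = Tables.columntable(X)   # NamedTuple of column vectors
    names = keys(cols)
    # ... fit using cols[name] for each feature ...
    return (names = names, n = length(y))
end
```

A NamedTuple of vectors is itself a valid Tables.jl table, so the same function also works for quick tests without loading DataFrames.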

Perhaps you want to share more detail about the model you have in mind?

roland-KA commented 2 years ago

Thanks for your explanations! I've just finished an update to the model that I would like to register with MLJ (roland-KA/OneRule). So we now have an example to look at.

This model uses categorical data for the features as well as for the target. Its internal fit function (get_best_tree in trees.jl) uses the data types I mentioned above, as follows:

function get_best_tree(X::AbstractDataFrame, y::AbstractVector)
    trees = all_trees(X, y)
    return(trees[argmin(trees)])
end

But I have defined the data types for use with MLJ as (in OneRule_MLJ.jl):

MMI.metadata_model(OneRuleClassifier,
    input_scitype    = MMI.Table(MMI.Finite),
    target_scitype   = AbstractVector{<: MMI.Finite},
...

So a user of the MLJ interface should pass a DataFrame with columns of CategoricalArrays, and the target as well as the predictions should be CategoricalArrays too.

I hope this makes sense? With my questions above, I just wanted to make sure that the first version isn't a complete mess 🤓.

The tests in runtests.jl show more or less how the model can be used via the MLJ interface. Essentially it is:

using DataFrames
using OneRule
using MLJ
using CategoricalArrays

### create test data 

weather = DataFrame(
    outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast", "sunny", "sunny", "rainy",  "sunny", "overcast", "overcast", "rainy"],
    temperature = ["hot", "hot", "hot", "mild", "cool", "cool", "cool", "mild", "cool", "mild", "mild", "mild", "hot", "mild"],
    humidity = ["high", "high", "high", "high", "normal", "normal", "normal", "high", "normal", "normal", "normal", "high", "normal", "high"],
    windy = ["false", "true", "false", "false", "false", "true", "true", "false", "false", "false", "true", "true", "false", "true"]
)

play = ["no", "no", "yes", "yes", "yes", "no", "yes", "no", "yes", "yes", "yes", "yes", "yes", "no"]

# create/adapt test data for use via MLJ interface
coerce!(weather, Textual => Multiclass)
play_cat = categorical(play)

# ML workflow
orc = OneRuleClassifier()
mach = machine(orc, weather, play_cat)
fit!(mach)
yhat_cat = MLJ.predict(mach, weather)
fitted_tree = report(mach).tree

ablaom commented 2 years ago

Cool. Hopefully, I can take a look next week.

ablaom commented 2 years ago

https://github.com/roland-KA/OneRule.jl/issues/2

roland-KA commented 2 years ago

So I'll close this issue, as the remaining questions are better addressed in the issue you opened on OneRule.

roland-KA commented 2 years ago

The discussion here and on roland-KA/OneRule.jl#2 helped me to understand several aspects of using ScientificTypes and CategoricalArrays inside and outside of MLJ much better. Therefore I will summarize here the take-aways for the points where I had difficulties in the beginning. Perhaps it will help other users get started faster with these topics.

Learning models outside and inside of MLJ:

ablaom commented 2 years ago

@roland-KA Thanks indeed for taking the time to document your experience!

I think the synopsis is generally correct. I wouldn't say that scitype tries to "guess" the scientific type of data. Rather, it associates a scientific type to each Julia type according to a specific convention, decided upon mostly by matching common usage, but which will not match usage in all cases. (And a developer can, in principle, implement a different convention using ScientificTypesBase.jl.)

As you correctly explain, when you implement an interface for an MLJ model, the interface must make type adjustments to account for any mismatch between what a model expects and conceptualises as a "multiclass vector", say, and what objects actually have AbstractVector{<:Multiclass} as their scitype under the convention. If the core model expects a Vector{<:String}, say (whose instances have scitype AbstractVector{Textual}), then the interface, having declared the expected scitype (eg, target_scitype) to be AbstractVector{<:Multiclass}, will need to convert the data received by MLJBase.fit and MLJBase.predict (typically an unordered categorical vector) into the Vector{String} the internal model requires. (An exception occurs if an implementation overloads an additional "data front end".)
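Such a conversion layer might be sketched like this (a hedged illustration with made-up data, not actual interface code from any model; note how the level pool is reused so predictions come back as categorical vectors with the complete class pool):

```julia
using CategoricalArrays

# MLJ side: the target arrives as an unordered categorical vector
y_cat = categorical(["yes", "no", "yes"])

# convert to the Vector{String} a (hypothetical) core model expects
y_str = string.(y_cat)

# convert core-model predictions back, restoring the full class pool
yhat_str = ["no", "no", "yes"]
yhat_cat = categorical(yhat_str; levels = levels(y_cat))
```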

In this way, scientific types are simply a way to: (i) enforce uniformity in the data types that MLJ users present to their models, and (ii) allow the user to focus on the purpose (scientific type) of the data rather than on its specific machine representation.

roland-KA commented 2 years ago

Thanks for your clarifying feedback! Step by step I'm working towards a full understanding of the concepts 🤓.