JuliaAI / MLJModels.jl

Home of the MLJ model registry and tools for model queries and model code loading
MIT License

Syntax for feature engineering #314

Open tlienart opened 4 years ago

tlienart commented 4 years ago

I stumbled upon https://github.com/matthieugomez/PairsMacros.jl today and it seems to be close to what we discussed with @vollmersj with respect to defining new columns with a formula-like syntax.

@matthieugomez sorry to ping you here but would you be interested in something like PairsMacros for general-purpose feature engineering to work with MLJ?

AriMKatz commented 3 years ago

There's also this: https://github.com/joshday/Telperion.jl

ablaom commented 2 years ago

Continuing the discussion started by @indymnv at https://github.com/alan-turing-institute/MLJ.jl/issues/970:

Existing MLJ transformers are documented here, with the exception of InteractionTransformer, which was recently added to MLJModels but is not yet documented or re-exported by MLJ.jl. Here's the list:

julia> using MLJModels

julia> models() do m
       m.package_name == "MLJModels" &&
       !m.is_supervised
       end
11-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
 (name = ContinuousEncoder, package_name = MLJModels, ... )
 (name = FeatureSelector, package_name = MLJModels, ... )
 (name = FillImputer, package_name = MLJModels, ... )
 (name = InteractionTransformer, package_name = MLJModels, ... )
 (name = OneHotEncoder, package_name = MLJModels, ... )
 (name = Standardizer, package_name = MLJModels, ... )
 (name = UnivariateBoxCoxTransformer, package_name = MLJModels, ... )
 (name = UnivariateDiscretizer, package_name = MLJModels, ... )
 (name = UnivariateFillImputer, package_name = MLJModels, ... )
 (name = UnivariateStandardizer, package_name = MLJModels, ... )
 (name = UnivariateTimeTypeToContinuous, package_name = MLJModels, ... )
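
Any of these can be used like other MLJ models. A minimal sketch for Standardizer, assuming MLJ.jl itself is installed and loaded:

```julia
using MLJ, Statistics

# A small column table; Standardizer rescales Continuous columns
# to zero mean and unit (corrected) standard deviation.
X = (x1 = [1.0, 2.0, 3.0], x2 = [10.0, 20.0, 30.0])

mach = fit!(machine(Standardizer(), X))
Xt = transform(mach, X)

# each transformed column now has mean ≈ 0 and std ≈ 1
```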

A "fancier" version of InteractionTransformer, based on R-style "formulas", has been planned, but no-one has yet found the time to work on it.

There is a project in progress to roll out a feature_importance method for models that support it, with the idea of then adding feature selection tools, such as recursive feature elimination (RFE).

TableTransforms.jl, referenced by @juliohm, is very active but not yet integrated with MLJ, although we are working towards that (integration is at least several months off). I think it is a good place to contribute generic table transformers, such as encoders. Some feature engineering tools, such as RFE, will probably not make sense there, as they require supervised learners.

@indymnv It would be helpful if you can identify specific encoders or other tools you use frequently that are missing from MLJ (or TableTransforms.jl) so they can be prioritised.

indymnv commented 2 years ago

@ablaom Thanks for all the information. In general, in my ML work I use the following encoders a lot.

  1. For categorical variables

    • Ordinal Encoding: replaces categories by numbers, arbitrarily or ordered by the target @ablaom says: done - use coerce from ScientificTypes.jl
    • Frequency Encoder: replaces categories by the observation count or percentage
    • One-Hot Encoder: done.
    • Grouped tail encoder: groups infrequent categories
  2. For dates and other cyclic variables:

    • Cyclical encoder: creates variables using sine and cosine
  3. For some numerical variables:

    • Equal Frequency Discretiser: sorts variables into equal frequency intervals @ablaom says: done - UnivariateDiscretizer
    • Equal Width Discretiser: sorts variables into equal-width intervals.
  4. Transformations:

    • Logarithm @ablaom says: done - any kind of ordinary function can be inserted in a pipeline or used in the TransformedTargetModel wrapper
    • Box-Cox @ablaom says: done (with learned exponent) - UnivariateBoxCoxTransformer
    • Yeo-Johnson
  5. Standardization and Normalization @ablaom says: done - Standardizer

  6. Feature Selection:

    • I use the feature selection built into scikit-learn's ML models, or Boruta.
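
Two of the missing encoders above are simple enough to sketch directly. These are hypothetical helpers for illustration, not MLJ API:

```julia
# Hypothetical helpers, not part of MLJ.

# Cyclical encoder: map a periodic value (hour of day, month, ...)
# to a point on the unit circle, so that e.g. 23h and 0h end up
# close together.
cyclic_encode(x, period) = (s = sin(2π * x / period), c = cos(2π * x / period))

# Frequency encoder: replace each category by its relative
# frequency in the training vector.
function frequency_encode(v)
    counts = Dict{eltype(v), Int}()
    for x in v
        counts[x] = get(counts, x, 0) + 1
    end
    return [counts[x] / length(v) for x in v]
end
```

For hour-of-day, `cyclic_encode(6, 24)` lands at the top of the unit circle, and `frequency_encode(["a", "a", "b"])` gives `[2/3, 2/3, 1/3]`.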

For now, in Julia I have only used the one-hot encoder; I have not tried the transformations.

[Edit]: For context, I frequently work with linear/logistic regression models, decision trees, random forests and GBMs.

ablaom commented 2 years ago

Thanks @indymnv. That's most helpful. PRs for missing items welcome 😉