Improvement in the Preparing Data part

Hello guys. First, I want to thank you for the work you are doing here. I'd like to suggest an improvement to the Preparing Data part of the MLJ documentation: I am missing some of the functionality that scikit-learn covers.

Topics covered in the MLJ documentation are:

Common data preprocessing workflows
- Scientific type coercion
- Data transformations
Scientific type coercion
Data transformation

In scikit-learn they are:

Standardization, or mean removal and variance scaling
- Scaling features to a range
- Scaling sparse data
- Scaling data with outliers
- Centering kernel matrices
Non-linear transformation
- Mapping to a Uniform distribution
- Mapping to a Gaussian distribution
Normalization
Encoding categorical features
- Infrequent categories
Discretization
- K-bins discretization
- Feature binarization
Imputation of missing values
Generating polynomial features
- Polynomial features
- Spline transformer
Custom transformers

Thanks for the positive feedback.

Most of these are actually implemented and documented here:

https://github.com/alan-turing-institute/MLJ.jl/blob/master/docs/src/transformers.md

There is an active PR to generate polynomial features (https://github.com/JuliaAI/MLJModels.jl/pull/478).

For an up-to-date list of built-in preprocessing transformers, follow this workflow:

julia> using MLJModels
julia> models() do m
       !m.is_supervised && m.package_name=="MLJModels"
       end
10-element Vector{NamedTuple{(:name, :package_name, :is_supervised, :abstract_type, :deep_properties, :docstring, :fit_data_scitype, :human_name, :hyperparameter_ranges, :hyperparameter_types, :hyperparameters, :implemented_methods, :inverse_transform_scitype, :is_pure_julia, :is_wrapper, :iteration_parameter, :load_path, :package_license, :package_url, :package_uuid, :predict_scitype, :prediction_type, :reporting_operations, :reports_feature_importances, :supports_class_weights, :supports_online, :supports_training_losses, :supports_weights, :transform_scitype, :input_scitype, :target_scitype, :output_scitype)}}:
 (name = ContinuousEncoder, package_name = MLJModels, ... )
 (name = FeatureSelector, package_name = MLJModels, ... )
 (name = FillImputer, package_name = MLJModels, ... )
 (name = OneHotEncoder, package_name = MLJModels, ... )
 (name = Standardizer, package_name = MLJModels, ... )
 (name = UnivariateBoxCoxTransformer, package_name = MLJModels, ... )
 (name = UnivariateDiscretizer, package_name = MLJModels, ... )
 (name = UnivariateFillImputer, package_name = MLJModels, ... )
 (name = UnivariateStandardizer, package_name = MLJModels, ... )
 (name = UnivariateTimeTypeToContinuous, package_name = MLJModels, ... )

julia> doc("OneHotEncoder") # to get the detailed docstring
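
Once you have picked a transformer from that list, using it follows the usual MLJ machine workflow. Here is a minimal sketch with `Standardizer`, assuming MLJ.jl is installed; the table `X` and its column names are made up for illustration:

```julia
using MLJ  # assumes MLJ.jl is installed; it re-exports the built-in transformers

# Illustrative data: a column table (named tuple of vectors) with two Continuous features.
X = (height = [1.60, 1.75, 1.82, 1.68],
     weight = [55.0, 72.0, 90.0, 63.0])

# Bind the transformer to the data, then learn each column's mean and standard deviation.
mach = machine(Standardizer(), X)
fit!(mach, verbosity=0)

# Apply the learned rescaling; each column should end up with mean ≈ 0 and std ≈ 1.
W = transform(mach, X)
```

The other transformers in the list (`FillImputer`, `OneHotEncoder`, and so on) should slot into the same `machine` / `fit!` / `transform` pattern, though each has its own hyperparameters, so check its docstring first.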

Feel free to open separate request issues for missing items.

Improvement in the Preparing Data part #964