The predict_ml() function automates and simplifies data preprocessing, model selection, and model tuning, resulting in a recommendation of the best model given the user's data. It can operate in two modes: machine learning pipeline mode for predictive model selection and hyperparameter tuning using scikit-learn, and inference pipeline mode for detailed statistical analysis and model fitting using statsmodels.
Docstring Sections Preview:
Description
"""
Automates and simplifies data preprocessing, model selection and model tuning, culminating in a recommendation of the best model given the user's data.
Depending on the inputs, this function can either perform statistical inference or predictive model selection using machine learning.
- **Machine Learning Pipeline**: Focuses on predictive model selection and hyperparameter tuning using scikit-learn. It includes preprocessing, model recommendation based on specified metrics, and tuning using grid search, random search, or Bayesian optimization.
- **Inference Pipeline**: Utilizes statsmodels for detailed statistical analysis and model fitting based on a specified formula. This pipeline is tailored for users seeking statistical inference, providing metrics such as AIC, BIC, and R-squared.
"""
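The tuning strategies named in the description map onto standard scikit-learn search classes. The sketch below is not the internals of predict_ml(); it only illustrates grid search and random search on a synthetic dataset. The estimator, the parameter grid, and the scoring choice are assumptions made for illustration, and Bayesian optimization is left out because it needs an additional library.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

# Small synthetic classification problem so the sketch runs quickly.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

# Hypothetical grid, similar in spirit to an entry of `custom_param_grids`.
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Grid search: evaluates every combination in the grid with 5-fold CV.
grid = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    cv=5,
    scoring="accuracy",
    n_jobs=1,
)
grid.fit(X, y)

# Random search: evaluates only `n_iter` sampled combinations.
random_search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions=param_grid,
    n_iter=3,
    cv=5,
    scoring="accuracy",
    random_state=0,
    n_jobs=1,
)
random_search.fit(X, y)

print("grid search:", grid.best_params_, round(grid.best_score_, 3))
print("random search:", random_search.best_params_, round(random_search.best_score_, 3))
```

Random search trades exhaustiveness for speed, which is presumably why the function exposes a separate iteration count (n_iter_random) for it.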
Parameters
"""
Parameters
----------
df : pd.DataFrame
    The DataFrame containing the dataset to be analyzed.
x_cols : List[str], optional
    List of column names to be used as features for machine learning model recommendation.
y_col : str, optional
    Column name to be used as the target for machine learning model recommendation.
formula : str, optional
    A Patsy formula for specifying the model in the case of statistical inference.
data_state : str, optional
    Specifies the initial state of the data ('unprocessed' or 'preprocessed'). Default is 'unprocessed'.
test_size : float, optional
    Proportion of the dataset to be used as the test set. Default is 0.2.
cv : int, optional
    Number of cross-validation folds. Default is 5.
random_state : int, optional
    Controls the shuffling applied to the data before applying the split.
priority_metrics : List[str], optional
    Metrics to prioritize in model evaluation in the machine learning pipeline.
refit_metric : Optional[Union[str, Callable]], optional
    Metric to use for refitting the models in the machine learning pipeline.
priority_tuners : List[str], optional
    Tuners to use for hyperparameter tuning in the machine learning pipeline.
custom_param_grids : dict, optional
    Custom parameter grids for tuning in the machine learning pipeline.
n_jobs : int, optional
    Number of jobs to run in parallel. -1 means using all processors. Default is -1.
n_iter_random : int, optional
    Number of iterations for random search tuning in the machine learning pipeline.
n_iter_bayesian : int, optional
    Number of iterations for Bayesian optimization in the machine learning pipeline.
n_top_models : int, optional
    Number of top models to recommend from the evaluation.
priority_models : List[str], optional
    Specific models to evaluate in the inference pipeline.
model_kwargs : dict, optional
    Keyword arguments to pass to model constructors in the inference pipeline.
verbose : int, optional
    Level of verbosity in output.
numeric_imputer : TransformerMixin, optional
    Imputer for handling missing values in numerical data.
numeric_scaler : TransformerMixin, optional
    Scaler for numerical data.
categorical_imputer : TransformerMixin, optional
    Imputer for handling missing values in categorical data.
categorical_encoder : TransformerMixin, optional
    Encoder for categorical data.
text_vectorizer : TransformerMixin, optional
    Vectorizer for text data.
datetime_transformer : callable, optional
    Transformer for datetime data.
"""
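The preprocessing parameters are typed as TransformerMixin, so standard scikit-learn transformers are natural candidates. The sketch below does not call predict_ml(); it only shows, under the assumption that such objects are accepted as-is, what a numeric and a categorical branch do when combined on a toy DataFrame. The column names and the chosen strategies are illustrative.

```python
import numpy as np
import pandas as pd
from sklearn.base import TransformerMixin
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Candidate objects for the preprocessing parameters (illustrative choices).
numeric_imputer = SimpleImputer(strategy="median")
numeric_scaler = StandardScaler()
categorical_imputer = SimpleImputer(strategy="most_frequent")
categorical_encoder = OneHotEncoder(handle_unknown="ignore")

# Toy data with one missing value in each column.
df = pd.DataFrame({
    "age": [25.0, np.nan, 40.0, 31.0],
    "city": ["Oslo", "Rome", np.nan, "Oslo"],
})

# Impute then scale numeric columns; impute then one-hot encode categorical ones.
preprocessor = ColumnTransformer([
    ("num", Pipeline([("impute", numeric_imputer), ("scale", numeric_scaler)]), ["age"]),
    ("cat", Pipeline([("impute", categorical_imputer), ("encode", categorical_encoder)]), ["city"]),
])

features = preprocessor.fit_transform(df)
print(features.shape)
```

Any object passed this way should at least satisfy `isinstance(obj, TransformerMixin)`, which matches the type check listed under Raises.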
Raises
"""
Raises
------
TypeError
    - If 'df' is not a pandas DataFrame, ensuring that the input data structure is correct for model fitting.
    - If 'x_cols' is provided and is not a list of strings.
    - If 'y_col' is provided and is not a string.
    - If 'formula' is provided and is not a string.
    - If 'data_state' is provided and is not a string.
    - If 'priority_metrics' is provided and is not a list of strings.
    - If 'priority_tuners' is provided and is not a list of strings.
    - If 'custom_param_grids' is provided and is not a dictionary.
    - If 'priority_models' is provided and is not a list of strings.
    - If 'model_kwargs' is provided and is not a dictionary.
    - If 'verbose' is provided and is not an integer.
    - If any of the transformer parameters is provided and is not an instance of TransformerMixin or callable.
    ...
"""
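Type checks of this kind are usually a series of isinstance tests run before any work starts. The helper below is hypothetical (check_inputs is not part of the library) and covers only four of the listed conditions; the error messages are also assumptions, not the ones predict_ml() emits.

```python
from typing import Any, List, Optional

import pandas as pd


def check_inputs(
    df: Any,
    x_cols: Optional[List[str]] = None,
    y_col: Optional[str] = None,
    formula: Optional[str] = None,
) -> None:
    """Hypothetical helper mirroring a few of the documented TypeError checks."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("'df' must be a pandas DataFrame.")
    if x_cols is not None and not (
        isinstance(x_cols, list) and all(isinstance(c, str) for c in x_cols)
    ):
        raise TypeError("'x_cols' must be a list of strings.")
    if y_col is not None and not isinstance(y_col, str):
        raise TypeError("'y_col' must be a string.")
    if formula is not None and not isinstance(formula, str):
        raise TypeError("'formula' must be a string.")


frame = pd.DataFrame({"a": [1, 2], "b": [3, 4]})
check_inputs(frame, x_cols=["a"], y_col="b")  # valid inputs pass silently

try:
    check_inputs(frame, x_cols="a", y_col="b")  # a bare string is not a list
except TypeError as exc:
    print(exc)
```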
Returns
"""
Returns
-------
Dict[str, Any]
    Depending on the operation mode, returns either:
    - a dictionary of top machine learning models and their evaluation metrics, or
    - a dictionary of statistical models along with their fit statistics.
"""
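For the inference mode, the fit statistics mentioned in the description (AIC, BIC, and R-squared) are available directly on a fitted statsmodels result. The sketch below fits one ordinary least squares model from a Patsy formula and collects those values in a dictionary. The key names and the nesting are assumptions for illustration; the exact layout returned by predict_ml() is not specified in this preview.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data: y depends linearly on x, plus noise.
rng = np.random.default_rng(0)
data = pd.DataFrame({"x": rng.normal(size=100)})
data["y"] = 2.0 * data["x"] + rng.normal(scale=0.5, size=100)

# Fit an OLS model specified by a Patsy formula.
fitted = smf.ols("y ~ x", data=data).fit()

# Hypothetical result layout: model name mapped to its fit statistics.
fit_stats = {
    "OLS": {
        "AIC": fitted.aic,
        "BIC": fitted.bic,
        "R-squared": fitted.rsquared,
    }
}
print(fit_stats)
```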
Notes
"""
Notes
-----
1. Machine Learning Pipeline
1.1. Data Preprocessing (optional): prepares a dataset for machine learning by handling numerical, categorical, text, and datetime data. It supports flexible imputation, scaling, encoding, and vectorization methods to cater to a wide range of preprocessing needs. The function automatically splits the data into training and test sets and applies the preprocessing steps defined by the user. It accommodates custom preprocessing steps for various data types, enhancing flexibility and control over the preprocessing pipeline.
1.2. Evaluation of Untuned Models: leverages a composite score for model evaluation, which synthesizes scores across multiple metrics, weighted by the specified priorities. This method enables a holistic and nuanced model comparison, taking into account the multidimensional aspects of model performance.
- Priority Metrics: Assigning weights (default: 5 for prioritized metrics, 1 for others) allows users to emphasize metrics they find most relevant, affecting the composite score calculation.
- Composite Score: Calculated as a weighted average of metric scores, normalized by the total weight. This score serves as a basis for ranking models.
1.3. Model Tuning: Tunes the top N models from the untuned evaluation. Systematically applies grid search, random search, or Bayesian optimization to explore the hyperparameter space of the selected models. It supports customization of the tuning process through various parameters and outputs the best configurations found.
2. Statistical Inference Pipeline
- Recommends top statistical models for inference based on user-specified preferences and formula.
- This function evaluates various statistical models from statsmodels, each suitable for either regression or classification tasks determined dynamically by the nature of the target variable.
"""
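The composite score described in note 1.2 can be worked through by hand. The sketch below assumes that every metric is already on a comparable scale where higher is better; how predict_ml() handles error metrics or differing scales is not stated here. The function name, the model names, and the scores are hypothetical; only the weights (5 for prioritized metrics, 1 for others) come from the notes.

```python
from typing import Dict, Iterable


def composite_score(
    metric_scores: Dict[str, float],
    priority_metrics: Iterable[str],
    priority_weight: float = 5.0,
    default_weight: float = 1.0,
) -> float:
    """Weighted average of metric scores, normalized by the total weight."""
    if not metric_scores:
        raise ValueError("At least one metric score is required.")
    prioritized = set(priority_metrics)
    weights = {
        name: priority_weight if name in prioritized else default_weight
        for name in metric_scores
    }
    total_weight = sum(weights.values())
    weighted_sum = sum(metric_scores[name] * weights[name] for name in metric_scores)
    return weighted_sum / total_weight


# Hypothetical scores for two untuned models (higher is better for every metric).
scores = {
    "model_a": {"accuracy": 0.95, "recall": 0.60, "precision": 0.90},
    "model_b": {"accuracy": 0.80, "recall": 0.85, "precision": 0.75},
}

# Rank the models with recall prioritized.
ranking = sorted(
    scores,
    key=lambda name: composite_score(scores[name], priority_metrics=["recall"]),
    reverse=True,
)
print(ranking)
```

With recall prioritized, model_a scores (0.95 + 5 × 0.60 + 0.90) / 7 ≈ 0.693 and model_b scores (0.80 + 5 × 0.85 + 0.75) / 7 ≈ 0.829, so model_b ranks first. With equal weights the order flips (about 0.817 versus 0.800), which shows how the priority weights steer the recommendation.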
Written and accessible:
Summary:
The predict_ml() function automates and simplifies data preprocessing, model selection, and model tuning, resulting in a recommendation of the best model given the user's data. It can operate in two modes: machine learning pipeline mode for predictive model selection and hyperparameter tuning using scikit-learn, and inference pipeline mode for detailed statistical analysis and model fitting using statsmodels.
Docstring Sections Preview:
Description
Parameters
Raises
Returns
Examples
Notes