ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Write NumPy docstring for predict_ml() #51

Closed ETA444 closed 2 months ago

ETA444 commented 2 months ago

The docstring is written and accessible via:

help(predict_ml)

Summary:

The predict_ml() function automates and simplifies data preprocessing, model selection, and model tuning, resulting in a recommendation of the best model given the user's data. It can operate in two modes: machine learning pipeline mode for predictive model selection and hyperparameter tuning using scikit-learn, and inference pipeline mode for detailed statistical analysis and model fitting using statsmodels.

Docstring Sections Preview:

Description

"""
Automates and simplifies data preprocessing, model selection and model tuning, culminating in a recommendation of the best model given the user's data.

Depending on the inputs, this function can either perform statistical inference or predictive model selection using machine learning.
    - **Machine Learning Pipeline**: Focuses on predictive model selection and hyperparameter tuning using scikit-learn. It includes preprocessing, model recommendation based on specified metrics, and tuning using grid search, random search, or Bayesian optimization.
    - **Inference Pipeline**: Utilizes statsmodels for detailed statistical analysis and model fitting based on a specified formula. This pipeline is tailored for users seeking statistical inference, providing metrics such as AIC, BIC, and R-squared.
"""
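The description says the pipeline depends on the inputs. A minimal sketch of that dispatch might look like the following. It assumes `x_cols` plus `y_col` selects the machine learning pipeline and `formula` selects the inference pipeline, as the docstring examples suggest. The helper name `select_pipeline` and the handling of ambiguous or missing inputs are hypothetical, not taken from the library.

```python
from typing import List, Optional


def select_pipeline(
    x_cols: Optional[List[str]] = None,
    y_col: Optional[str] = None,
    formula: Optional[str] = None,
) -> str:
    """Hypothetical helper: decide which pipeline a call would use."""
    wants_ml = x_cols is not None and y_col is not None
    wants_inference = formula is not None

    # Assumption: supplying both input styles is treated as ambiguous here.
    if wants_ml and wants_inference:
        raise ValueError("Provide either x_cols and y_col, or formula, not both.")
    if wants_ml:
        return "machine_learning"
    if wants_inference:
        return "inference"
    raise ValueError("Provide x_cols and y_col, or a formula.")


print(select_pipeline(x_cols=["Age", "Department"], y_col="Salary"))
print(select_pipeline(formula="Salary ~ Age + C(Department)"))
```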

Parameters

"""
Parameters
----------
df : pd.DataFrame
    The DataFrame containing the dataset to be analyzed.
x_cols : List[str], optional
    List of column names to be used as features for machine learning model recommendation.
y_col : str, optional
    Column name to be used as the target for machine learning model recommendation.
formula : str, optional
    A Patsy formula for specifying the model in the case of statistical inference.
data_state : str, optional
    Specifies the initial state of the data ('unprocessed' or 'preprocessed'). Default is 'unprocessed'.
test_size : float, optional
    Proportion of the dataset to be used as the test set. Default is 0.2.
cv : int, optional
    Number of cross-validation folds. Default is 5.
random_state : int, optional
    Controls the shuffling applied to the data before applying the split.
priority_metrics : List[str], optional
    Metrics to prioritize in model evaluation in the machine learning pipeline.
refit_metric : Optional[Union[str, Callable]], optional
    Metric to use for refitting the models in the machine learning pipeline.
priority_tuners : List[str], optional
    Tuners to use for hyperparameter tuning in the machine learning pipeline.
custom_param_grids : dict, optional
    Custom parameter grids for tuning in the machine learning pipeline.
n_jobs : int, optional
    Number of jobs to run in parallel. -1 means using all processors. Default is -1.
n_iter_random : int, optional
    Number of iterations for random search tuning in the machine learning pipeline.
n_iter_bayesian : int, optional
    Number of iterations for Bayesian optimization in the machine learning pipeline.
n_top_models : int, optional
    Number of top models to recommend from the evaluation.
priority_models : List[str], optional
    Specific models to evaluate in the inference pipeline.
model_kwargs : dict, optional
    Keyword arguments to pass to model constructors in the inference pipeline.
verbose : int, optional
    Level of verbosity in output.
numeric_imputer : TransformerMixin, optional
    Imputer for handling missing values in numerical data.
numeric_scaler : TransformerMixin, optional
    Scaler for numerical data.
categorical_imputer : TransformerMixin, optional
    Imputer for handling missing values in categorical data.
categorical_encoder : TransformerMixin, optional
    Encoder for categorical data.
text_vectorizer : TransformerMixin, optional
    Vectorizer for text data.
datetime_transformer : callable, optional
    Transformer for datetime data.
"""
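To illustrate the transformer parameters, here is a minimal sketch of how custom preprocessing steps might be assembled, assuming scikit-learn is installed. The specific transformers and their settings are illustrative choices, not the library's defaults. The `predict_ml` call is left commented out because it needs datasafari and a DataFrame.

```python
from sklearn.base import TransformerMixin
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative choices only; the library's own defaults are not shown here.
preprocessing_kwargs = {
    "numeric_imputer": SimpleImputer(strategy="median"),
    "numeric_scaler": StandardScaler(),
    "categorical_imputer": SimpleImputer(strategy="most_frequent"),
    "categorical_encoder": OneHotEncoder(handle_unknown="ignore"),
}

# Mirror the documented expectation that these are TransformerMixin instances.
for name, transformer in preprocessing_kwargs.items():
    if not isinstance(transformer, TransformerMixin):
        raise TypeError(f"'{name}' must be an instance of TransformerMixin.")

# ml_models = predict_ml(df, x_cols=x_cols, y_col=y_col, **preprocessing_kwargs)
```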

Raises

"""
Raises
------
TypeError
    - If 'df' is not a pandas DataFrame, ensuring that the input data structure is correct for model fitting.
    - If 'x_cols' is provided and is not a list of strings.
    - If 'y_col' is provided and is not a string.
    - If 'formula' is provided and is not a string.
    - If 'data_state' is provided and is not a string.
    - If 'priority_metrics' is provided and is not a list of strings.
    - If 'priority_tuners' is provided and is not a list of strings.
    - If 'custom_param_grids' is provided and is not a dictionary.
    - If 'priority_models' is provided and is not a list of strings.
    - If 'model_kwargs' is provided and is not a dictionary.
    - If 'verbose' is provided and is not an integer.
    - If any of the transformer parameters is provided and is not an instance of TransformerMixin or callable.
...
"""
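The type checks listed above can be sketched as follows. This is an illustration of the pattern under stated assumptions, not the library's actual validation code. The helper names, the error messages, and the default for `verbose` are hypothetical, and the `df` check is omitted so the sketch runs without pandas.

```python
from typing import Any, List, Optional


def _is_list_of_str(value: Any) -> bool:
    """Return True only for a list whose items are all strings."""
    return isinstance(value, list) and all(isinstance(item, str) for item in value)


def check_argument_types(
    x_cols: Optional[List[str]] = None,
    y_col: Optional[str] = None,
    formula: Optional[str] = None,
    custom_param_grids: Optional[dict] = None,
    verbose: int = 1,  # hypothetical default
) -> None:
    """Hypothetical validator covering a subset of the documented checks."""
    if x_cols is not None and not _is_list_of_str(x_cols):
        raise TypeError("'x_cols' must be a list of strings.")
    if y_col is not None and not isinstance(y_col, str):
        raise TypeError("'y_col' must be a string.")
    if formula is not None and not isinstance(formula, str):
        raise TypeError("'formula' must be a string.")
    if custom_param_grids is not None and not isinstance(custom_param_grids, dict):
        raise TypeError("'custom_param_grids' must be a dictionary.")
    if not isinstance(verbose, int):
        raise TypeError("'verbose' must be an integer.")


# A bare string instead of a list of strings is rejected.
try:
    check_argument_types(x_cols="Age")
except TypeError as error:
    message = str(error)
    print(message)
```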

Returns

"""
Returns
-------
Dict[str, Any]
    Depending on the operation mode, returns either:
    - a dictionary of top machine learning models and their evaluation metrics, or
    - a dictionary of statistical models along with their fit statistics.
"""

Examples

"""
Examples
--------
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'Age': np.random.randint(18, 35, size=100),
...     'Salary': np.random.normal(50000, 12000, size=100),
...     'Department': np.random.choice(['HR', 'Tech', 'Marketing'], size=100),
...     'Review': ['Good review']*50 + ['Bad review']*50,
...     'Employment Date': pd.date_range(start='2010-01-01', periods=100, freq='MS')
... })
>>> # Machine Learning Pipeline
>>> x_cols = ['Age', 'Department', 'Review', 'Employment Date']
>>> y_col = 'Salary'
>>> ml_models = predict_ml(df, x_cols=x_cols, y_col=y_col, verbose=2)
>>> # Inference Pipeline
>>> formula = 'Salary ~ Age + C(Department)'
>>> inference_models = predict_ml(df, formula=formula, verbose=2)
"""

Notes

"""
Notes
-----
    1. Machine Learning Pipeline
        1.1. Data Preprocessing (optional): prepares a dataset for machine learning by handling numerical, categorical, text, and datetime data. It supports flexible imputation, scaling, encoding, and vectorization methods to cater to a wide range of preprocessing needs. The function automatically splits the data into training and test sets and applies the preprocessing steps defined by the user. It accommodates custom preprocessing steps for various data types, enhancing flexibility and control over the preprocessing pipeline.
        1.2. Evaluation of Untuned models: leverages a composite score for model evaluation, which synthesizes scores across multiple metrics, weighted by the specified priorities. This method enables a holistic and nuanced model comparison, taking into account the multidimensional aspects of model performance.
            - Priority Metrics: Assigning weights (default: 5 for prioritized metrics, 1 for others) allows users to emphasize metrics they find most relevant, affecting the composite score calculation.
            - Composite Score: Calculated as a weighted average of metric scores, normalized by the total weight. This score serves as a basis for ranking models.
        1.3. Model Tuning: Tunes the top N models from the untuned evaluation. Systematically applies grid search, random search, or Bayesian optimization to explore the hyperparameter space of the given models. It supports customization of the tuning process through various parameters and outputs the best configurations found.
    2. Statistical Inference Pipeline
        - Recommends top statistical models for inference based on user-specified preferences and formula.
        - This function evaluates various statistical models from statsmodels, each suitable for either regression or classification tasks determined dynamically by the nature of the target variable.
"""
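The composite score described in note 1.2 can be sketched as a weighted average. This is a minimal illustration, not the library's implementation. It assumes every metric is on a comparable scale where higher is better. The function name, the model names, and the example scores are hypothetical.

```python
from typing import Dict, Iterable


def composite_score(
    metric_scores: Dict[str, float],
    priority_metrics: Iterable[str] = (),
    priority_weight: float = 5.0,
    default_weight: float = 1.0,
) -> float:
    """Weighted average of metric scores, normalized by the total weight."""
    prioritized = set(priority_metrics)
    weights = {
        metric: priority_weight if metric in prioritized else default_weight
        for metric in metric_scores
    }
    total_weight = sum(weights.values())
    if total_weight == 0:
        raise ValueError("At least one metric with a non-zero weight is required.")
    weighted_sum = sum(score * weights[m] for m, score in metric_scores.items())
    return weighted_sum / total_weight


# Hypothetical untuned-model scores (higher is better for both metrics).
scores = {
    "model_a": {"accuracy": 0.95, "recall": 0.60},
    "model_b": {"accuracy": 0.80, "recall": 0.70},
}


def rank(priority_metrics: Iterable[str]) -> list:
    """Order model names from best to worst composite score."""
    priorities = list(priority_metrics)
    return sorted(
        scores,
        key=lambda name: composite_score(scores[name], priorities),
        reverse=True,
    )


print(rank([]))          # equal weights for every metric
print(rank(["recall"]))  # recall weighted 5, accuracy weighted 1
```

With equal weights `model_a` ranks first in this made-up data. Prioritizing recall moves `model_b` to the top, which is the effect the priority weights are meant to have on the ranking.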