ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Write NumPy docstring for model_tuning_core() #113

Closed ETA444 closed 4 months ago

ETA444 commented 4 months ago

Written and accessible:

help(model_tuning_core)

This solution addresses the issue "Write NumPy docstring for model_tuning_core()" by providing a detailed NumPy-style docstring for the model_tuning_core() function.

Summary:

The function model_tuning_core() conducts hyperparameter tuning on a set of models using specified tuning methods and parameter grids, returning the best-tuned models along with their scores. It systematically applies grid search, random search, or Bayesian optimization to explore the hyperparameter space of given models. The docstring follows the NumPy format and includes details on the parameters, return values, exceptions, and examples.
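
For orientation, the overall flow can be pictured as a loop over the supplied models that dispatches each requested tuner and keeps the best result. The following is a minimal, hypothetical sketch of that idea; names such as tune_models_sketch and param_grids are illustrative and not part of datasafari:

# Illustrative sketch only; not the actual model_tuning_core() implementation.
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV

def tune_models_sketch(x_train, y_train, models, param_grids,
                       priority_tuners=('grid',), cv=5, n_jobs=-1, random_state=42):
    """Fit each requested tuner for every model and keep the best result found."""
    results = {}
    for name, model in models.items():
        best_model, best_score = None, float('-inf')
        for tuner in priority_tuners:
            if tuner == 'grid':
                search = GridSearchCV(model, param_grids[name], cv=cv, n_jobs=n_jobs)
            elif tuner == 'random':
                search = RandomizedSearchCV(model, param_grids[name], n_iter=10,
                                            cv=cv, n_jobs=n_jobs, random_state=random_state)
            else:
                continue  # 'bayesian' would use skopt.BayesSearchCV (see Notes)
            search.fit(x_train, y_train)
            if search.best_score_ > best_score:
                best_model, best_score = search.best_estimator_, search.best_score_
        results[name] = {'best_model': best_model, 'best_score': best_score}
    return results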

Docstring Sections Preview:

Description

"""
Conducts hyperparameter tuning on a set of models using specified tuning methods and parameter grids,
and returns the best-tuned models along with their scores.

This function systematically applies grid search, random search, or Bayesian optimization to explore the
hyperparameter space of given models. It supports customization of the tuning process through various parameters
and outputs the best configurations found.
"""

Parameters

"""
Parameters
----------
x_train : Union[pd.DataFrame, np.ndarray]
    Training feature dataset.
y_train : Union[pd.Series, np.ndarray]
    Training target variable.
task_type : str
    Specifies the type of machine learning task: 'classification' or 'regression'.
models : dict
    Dictionary with model names as keys and model instances as values.
priority_metrics : List[str], optional
    List of metric names given priority in model scoring. Default is None, in which case the function's default metrics are used.
refit_metric : Optional[Union[str, Callable]], optional
    Metric to use for refitting the models. A string (name of the metric) or a scorer callable object/function
    with signature scorer(estimator, X, y). If None, the first metric listed in priority_metrics is used.
priority_tuners : List[str], optional
    List of tuner names to use for hyperparameter tuning. Valid tuners are 'grid', 'random', 'bayesian'.
custom_param_grids : dict, optional
    Custom parameter grids to use, overriding the default grids if provided. Each entry should be a model name
    mapped to its corresponding parameter grid.
n_jobs : int, optional
    Number of jobs to run in parallel. -1 means using all processors. Default is -1.
cv : int, optional
    Number of cross-validation folds. Default is 5.
n_iter_random : int, optional
    Number of iterations for random search. If None, default is set to 10.
n_iter_bayesian : int, optional
    Number of iterations for Bayesian optimization. If None, default is set to 50.
verbose : int, optional
    Level of verbosity. The higher the number, the more detailed the logging. Default is 1.
random_state : int, optional
    Seed used by the random number generator. Default is 42.
"""

Returns

"""
Returns
-------
Dict[str, Any]
    A dictionary keyed by the provided model names. Each value is a dictionary with keys 'best_model',
    storing the best estimator found for that model, and 'best_score', storing its corresponding score.
"""

Raises

"""
Raises
------
TypeError
    - If 'x_train' is not a pandas DataFrame or NumPy ndarray.
    - If 'y_train' is not a pandas Series or NumPy ndarray.
    - If 'task_type' is not a string.
    - If 'models' is not a dictionary with model names as keys and model instances as values.
    - If 'priority_metrics' is not None and is not a list of strings.
    - If 'priority_tuners' is not None and is not a list of strings.
    - If 'custom_param_grids' is not None and is not a dictionary.
    - If 'n_jobs', 'cv', 'n_iter_random', 'n_iter_bayesian', 'verbose', or 'random_state' is not an integer.
    - If 'cv' is less than 1.
    - If 'n_iter_random' or 'n_iter_bayesian' is less than 1 when not None.
    - If 'refit_metric' is neither a callable nor a string naming a recognized metric.
ValueError
    - If 'task_type' is not 'classification' or 'regression'.
    - If 'x_train' and 'y_train' do not have the same number of rows.
    - If 'x_train' or 'y_train' is empty (has zero elements).
    - If any element in 'priority_metrics' or 'priority_tuners' is not a string.
    - If 'priority_metrics' contains duplicate values.
    - If 'priority_tuners' contains unrecognized tuner names, not part of the expected tuners ('grid', 'random', 'bayesian').
    - If the specified 'refit_metric' is not applicable to the provided 'task_type' (e.g., using a regression metric for classification).
    - If 'n_iter_random_adjusted' or 'n_iter_bayesian_adjusted' drops to zero because all parameter combinations have already been tested, leaving no new combinations to explore.
    - If 'n_iter_random' or 'n_iter_bayesian' is set to zero or a negative number.
"""

Examples

"""
Examples
--------
>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import train_test_split
>>> from sklearn.linear_model import LogisticRegression
>>> from sklearn.ensemble import RandomForestClassifier
>>> X, y = load_iris(return_X_y=True)
>>> x_train, x_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
>>> models = {'logistic_regression': LogisticRegression(), 'random_forest': RandomForestClassifier()}
>>> tuned_models = model_tuning_core(x_train, y_train, 'classification', models, priority_metrics=['accuracy', 'f1'], priority_tuners=['bayesian'], n_iter_random=20, verbose=2)
"""

Notes

"""
Notes
-----
- **Integration with Tuning Methods**: This function utilizes scikit-learn's `GridSearchCV` and `RandomizedSearchCV`, along with scikit-optimize's `BayesSearchCV` for hyperparameter tuning. The choice of tuning method (`grid`, `random`, or `bayesian`) depends on the entries provided in the `priority_tuners` list.

- **Skipping Repeated Combinations**: For Bayesian optimization (`BayesSearchCV`), the function is designed to skip evaluations of previously tested parameter combinations. This approach aims to enhance the efficiency and performance of the tuning process by reducing redundant computations.

- **Parameter Grids Importance**: The quality and range of the parameter grids significantly influence the effectiveness of the tuning process. While default parameter grids are provided for convenience, it is recommended to supply customized parameter grids via `custom_param_grids` to ensure a thorough exploration of meaningful parameter combinations.

- **Handling of User Warnings**: The `BayesSearchCV` from scikit-optimize occasionally emits warnings about the evaluation of repeated parameter points. This is a known issue within the library, unresolved for several years, which pertains to its internal handling of random state and the stochastic nature of the search algorithm. Although the function attempts to mitigate this by filtering out previously tested combinations, some warnings might still appear, especially under constraints of limited parameter grids or high `n_iter_bayesian` values.

- **Refit Consideration**: The refit process, which re-trains the best estimator on the full dataset using the best-found parameters, is governed by the `refit_metric`. This metric should be carefully chosen to align with the overall objective and the specifics of the task at hand (classification or regression).

- **Random State Usage**: The `random_state` parameter ensures reproducibility in the results of randomized search methods and Bayesian optimization, making the tuning outputs deterministic and easier to debug or review.

- **Parallel Processing Capability**: Setting `n_jobs=-1` enables the function to use all available CPU cores for parallel processing, which speeds up computation and is especially beneficial when working with large datasets or complex models.
"""