ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
GNU General Public License v3.0

Write NumPy docstring for model_recommendation_core_inference() #116

Closed ETA444 closed 5 months ago

ETA444 commented 5 months ago

Written and accessible:


This issue adds a NumPy-style docstring to model_recommendation_core_inference(), covering its purpose, parameters, return values, exceptions, examples, and additional notes.


The function model_recommendation_core_inference() recommends top statistical models for inference based on user-specified preferences and a formula. It evaluates various statistical models from statsmodels, suitable for either regression or classification tasks determined dynamically by the nature of the target variable.
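The task selection described above could look roughly like the following. This is a minimal sketch, not DataSafari's actual code: `infer_task` is a hypothetical helper, and the rule "numeric target means regression, anything else means classification" is an assumption about how the nature of the target variable is read.

```python
import pandas as pd


def infer_task(df: pd.DataFrame, formula: str) -> str:
    """Hypothetical helper: guess the task type from the target's dtype."""
    # The target variable is whatever sits to the left of '~' in the formula.
    target = formula.split('~')[0].strip()
    # Assumption: a numeric target is treated as regression,
    # any other dtype (strings, categories) as classification.
    if pd.api.types.is_numeric_dtype(df[target]):
        return 'regression'
    return 'classification'


# Small illustrative frame, not data from the project.
example_df = pd.DataFrame({
    'Salary': [42000.0, 55000.0, 61000.0],
    'Age': [25, 38, 47],
    'Department': ['sales', 'ops', 'sales'],
})
salary_task = infer_task(example_df, 'Salary ~ Age')
department_task = infer_task(example_df, 'Department ~ Age')
```

A real implementation would likely need extra rules, for example for integer-coded class labels, which this dtype check would treat as regression.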

Docstring Sections Preview:


Recommends top statistical models for inference based on user-specified preferences and formula.
This function evaluates various statistical models from statsmodels, each suitable for either
regression or classification tasks determined dynamically by the nature of the target variable.


Parameters
----------
df : pd.DataFrame
    DataFrame containing the data to fit the models.
formula : str
    A patsy formula specifying the model. The target variable is on the left of '~'.
priority_models : List[str], optional
    A list of model names to restrict the evaluation to specific models, otherwise all applicable models are evaluated.
n_top_models : int, optional
    Number of top-performing models to return based on sorted metrics. Defaults to 3.
model_kwargs : dict, optional
    Dictionary mapping model names to dictionaries of additional keyword arguments to pass to the model constructors.
    This can be used to pass additional parameters required by specific models.
verbose : int, optional
    The verbosity level: 0 means silent, 1 outputs summary results, 2 includes detailed model summaries.


Raises
------
TypeError
    - If 'df' is not a pandas DataFrame, the data structure required for model fitting.
    - If 'formula' is not a string.
    - If 'priority_models' is provided and is not a list of strings (model names).
    - If 'model_kwargs' is provided and is not a dictionary of keyword arguments for the model constructors.
    - If 'verbose' is not an integer.
ValueError
    - If the input DataFrame is empty, leaving no data for model fitting.
    - If 'formula' does not contain exactly one '~', which separates the dependent variable from the independent variables.
    - If the target variable specified in 'formula' is not a column in the DataFrame.
    - If any independent variable specified in 'formula' is not a column in the DataFrame.
    - If 'n_top_models' is not a positive integer.
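The checks listed above can be sketched as a standalone validator. This is an illustration under stated assumptions, not the function's real code: `validate_inputs` is a hypothetical name, the split into `TypeError` for type problems and `ValueError` for value problems is assumed, and the predictor check only understands plain `a + b` terms rather than full patsy syntax.

```python
import pandas as pd


def validate_inputs(df, formula, priority_models=None, n_top_models=3,
                    model_kwargs=None, verbose=1):
    """Hypothetical helper mirroring the documented input checks."""
    # Type checks (assumed to raise TypeError).
    if not isinstance(df, pd.DataFrame):
        raise TypeError("'df' must be a pandas DataFrame.")
    if not isinstance(formula, str):
        raise TypeError("'formula' must be a string.")
    if priority_models is not None and not (
        isinstance(priority_models, list)
        and all(isinstance(name, str) for name in priority_models)
    ):
        raise TypeError("'priority_models' must be a list of strings.")
    if model_kwargs is not None and not isinstance(model_kwargs, dict):
        raise TypeError("'model_kwargs' must be a dictionary.")
    if not isinstance(verbose, int):
        raise TypeError("'verbose' must be an integer.")

    # Value checks (assumed to raise ValueError).
    if df.empty:
        raise ValueError("'df' is empty.")
    if formula.count('~') != 1:
        raise ValueError("'formula' must contain exactly one '~'.")
    target, predictors = (part.strip() for part in formula.split('~'))
    if target not in df.columns:
        raise ValueError(f"Target '{target}' not found in 'df'.")
    # Simplification: handles only plain 'a + b' terms, not patsy
    # transforms, interactions, or intercept markers.
    terms = [term.strip() for term in predictors.split('+')]
    missing = [term for term in terms if term not in df.columns]
    if missing:
        raise ValueError(f"Variables not found in 'df': {missing}")
    if not isinstance(n_top_models, int) or n_top_models <= 0:
        raise ValueError("'n_top_models' must be a positive integer.")
```

Running the type checks before the value checks means a caller who passes the wrong kind of object gets a `TypeError` before any attribute such as `df.empty` is touched.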


Returns
-------
Dict[str, Any]
    A dictionary with model names as keys and dictionaries as values. Each dictionary contains the 'model' object,
    a 'metrics' dictionary with performance metrics, and potentially a 'summary' if verbose > 1.


Examples
--------
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({
...     'Age': np.random.randint(18, 70, size=100),
...     'Salary': np.random.normal(50000, 15000, size=100),
...     'Experience': np.random.randint(1, 30, size=100)
... })
>>> formula = 'Salary ~ Age + Experience'
>>> best_inference_models = model_recommendation_core_inference(
...     df,
...     formula,
...     verbose=2
... )
>>> # Accessing the best model's object
>>> best_model_name = list(best_inference_models.keys())[0]
>>> best_model = best_inference_models[best_model_name]['model']
>>> # Viewing the summary of the best model
>>> print(best_model.summary())
>>> # Extracting AIC of the best model
>>> best_model_aic = best_inference_models[best_model_name]['metrics']['AIC']
>>> print(f"The best model according to AIC is {best_model_name} with an AIC of {best_model_aic:.2f}")


Notes
-----
- **Dynamic Model Evaluation**: Depending on the datatype of the target variable specified in the formula,
  the function dynamically decides whether to treat the problem as a regression or classification task,
  using appropriate metrics and models for each.

- **Handling Model Specific Requirements**: This function allows passing custom arguments to model constructors
  to handle models that require specific parameters via `model_kwargs`.

- **Metric Adjustments**: Metrics where a lower value is better (e.g., AIC, BIC) are negated during sorting
  so that they can be compared directly alongside higher-is-better metrics like R-squared.

- **Verbose Output**: The function provides different levels of output detail which can help in diagnosing model fit
  or understanding model performance.

- **Error Handling**: The function will report and skip models that encounter errors during fitting, allowing for
  robust execution even if some models are not applicable to the provided data or formula.
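The negation trick from the Metric Adjustments note can be sketched in a few lines. This is a minimal illustration, not the library's ranking code: `rank_models` and `LOWER_IS_BETTER` are hypothetical names, ranking on a single metric is a simplification, and the numbers are made up for the example.

```python
# Assumption: the set of lower-is-better metrics; the note names
# AIC and BIC as examples.
LOWER_IS_BETTER = {"AIC", "BIC"}


def rank_models(results, metric, n_top_models=3):
    """Hypothetical helper: keep the top models according to one metric."""
    def sort_key(item):
        _, info = item
        value = info["metrics"][metric]
        # Negating lower-is-better metrics lets one descending sort
        # serve both kinds of metric.
        return -value if metric in LOWER_IS_BETTER else value

    ranked = sorted(results.items(), key=sort_key, reverse=True)
    return dict(ranked[:n_top_models])


# Illustrative numbers only, not real fit results.
results = {
    "model_a": {"metrics": {"AIC": 2210.4, "R-squared": 0.12}},
    "model_b": {"metrics": {"AIC": 2198.7, "R-squared": 0.15}},
    "model_c": {"metrics": {"AIC": 2305.1, "R-squared": 0.08}},
}
top_by_aic = rank_models(results, "AIC", n_top_models=2)
top_by_r2 = rank_models(results, "R-squared", n_top_models=2)
```

Because the returned dictionary preserves insertion order, its first key is the best model, which matches how the docstring example reads `list(best_inference_models.keys())[0]`.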