Develop model_recommendation_core_inference() for predict_ml()

ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.

GNU General Public License v3.0


Title: Develop Model Recommendation Core for Statistical Inference (second back-end pipeline in predict_ml())

Description: The proposed development involves creating an enhanced version of the model_recommendation_core_inference() function. This function recommends the top statistical models for inference based on user-specified preferences and a formula. It evaluates various statistical models from statsmodels and dynamically determines whether each suits a regression or a classification task, based on the nature of the target variable.

Proposed Changes:

Dynamic Model Evaluation: Develop functionality to dynamically decide whether to treat the problem as a regression or classification task based on the datatype of the target variable specified in the formula. This ensures the selection of appropriate metrics and models for each task type.
Model Specific Requirements: Enable passing custom arguments to model constructors via the model_kwargs parameter. This allows handling models that require specific parameters, enhancing the flexibility and applicability of the function.
Metric Adjustments: Adjust metrics where a lower value is better (e.g., AIC, BIC) to be compared directly alongside higher-is-better metrics like R-squared. This adjustment is achieved by negating their values during sorting, ensuring a consistent comparison across different metrics.
Verbose Output: Provide different levels of output detail to assist in diagnosing model fit or understanding model performance. Users can specify the verbosity level, with options to output summary results or include detailed model summaries.
Error Handling: Implement robust error handling mechanisms to report and skip models that encounter errors during fitting. This ensures robust execution even if some models are not applicable to the provided data or formula, enhancing the reliability of the function.
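
As an illustration of the Dynamic Model Evaluation item above, here is a minimal sketch of choosing the task type from the target's values. The helper name infer_task_type and the rule "numeric means regression, anything else means classification" are assumptions made for this example, not the function's confirmed logic. It works on plain Python values so it runs without pandas; the real function would presumably inspect the dtype of the DataFrame column instead.

```python
from numbers import Number


def infer_task_type(target_values):
    """Return 'regression' for numeric targets, else 'classification'.

    Hypothetical helper: the name and the rule are illustrative assumptions.
    """
    values = [v for v in target_values if v is not None]
    if not values:
        raise ValueError("The target variable has no non-missing values.")
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_numeric = all(
        isinstance(v, Number) and not isinstance(v, bool) for v in values
    )
    return "regression" if is_numeric else "classification"


print(infer_task_type([1.5, 2.0, 3.7]))       # numeric target
print(infer_task_type(["yes", "no", "yes"]))  # categorical target
```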

Expected Outcome: Upon completion of development, the enhanced model_recommendation_core_inference() function will serve as a powerful tool for recommending top statistical models for inference. By dynamically evaluating models and providing detailed insights into model performance, this enhancement will streamline the model selection process, improve user understanding, and facilitate informed decision-making in statistical inference tasks.

Additional Context: The proposed development aligns with the increasing demand for automated and efficient model selection in statistical inference tasks. By incorporating dynamic evaluation, customizable model parameters, and robust error handling, the enhanced function aims to address the diverse needs of users in statistical modeling and analysis. This development initiative underscores our commitment to advancing automation and productivity in statistical inference workflows.

Implementation Summary

model_recommendation_core_inference() is designed to recommend the best statistical models for inference tasks based on specific criteria provided by the user. It evaluates models from the statsmodels library, applicable for regression or classification based on the target variable's nature. The function leverages a patsy formula for model specification, allowing for dynamic selection of models suited to the data's characteristics.

Code Breakdown

Formula and Data Validation
- Validates the input DataFrame and formula to ensure proper specification and availability of variables. It checks that the formula includes a target and predictors separated by '~', and that these variables exist in the DataFrame.

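# The formula must have exactly one '~' separating target and predictors.
if formula.count('~') != 1:
    raise ValueError("The formula must contain exactly one '~'.")
# Extract the target name from the left-hand side before checking it.
y_col = formula.split('~')[0].strip()
if y_col not in df.columns:
    raise ValueError(f"The target variable '{y_col}' specified in the formula is not found in the DataFrame.")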
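
The snippet above checks only the target. A fuller validation that also covers the predictors might look like the sketch below. The helper split_formula is hypothetical and takes a list of column names rather than a DataFrame so it runs on its own. It assumes plain terms joined by '+', so patsy features such as interactions (a:b) or transforms (np.log(x)) would need real parsing.

```python
def split_formula(formula, columns):
    """Validate a 'target ~ predictors' formula against known column names.

    Hypothetical helper; only handles plain terms joined by '+'.
    """
    if formula.count("~") != 1:
        raise ValueError("The formula must contain exactly one '~'.")
    target, rhs = (part.strip() for part in formula.split("~"))
    predictors = [term.strip() for term in rhs.split("+") if term.strip()]
    if not target or not predictors:
        raise ValueError("The formula needs a target and at least one predictor.")
    # Report every missing variable at once rather than failing on the first.
    missing = [name for name in [target, *predictors] if name not in columns]
    if missing:
        raise ValueError(f"Variables not found in the DataFrame: {missing}")
    return target, predictors


print(split_formula("price ~ area + rooms", ["price", "area", "rooms"]))
# ('price', ['area', 'rooms'])
```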

Model Fitting
- Fits statistical models specified in the priority_models list or defaults to applicable models based on the data type of the target variable. It handles custom arguments for models through model_kwargs.

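# Restrict to the requested models when priority_models is given; otherwise evaluate all.
models_to_evaluate = {name: model for name, model in all_models.items() if name in priority_models} if priority_models else all_models
model_kwargs = model_kwargs or {}  # treat a missing model_kwargs as "no custom arguments"
model_results = {}
for model_name, model_func in models_to_evaluate.items():
    # Per-model keyword arguments fall back to an empty dict.
    model = model_func(formula, df, **model_kwargs.get(model_name, {})).fit()
    model_results[model_name] = {'model': model, 'metrics': evaluate_model(model)}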

Model Evaluation
- Evaluates models based on statistical metrics (like AIC, BIC, R-squared). Adjusts metrics for direct comparability, prioritizing lower values for AIC/BIC by negating them.

sorted_models = []
for name, details in model_results.items():
    # Negate lower-is-better metrics so that a larger value always means a better model.
    adjusted_metrics = {metric: (-value if metric in ['AIC', 'BIC'] else value) for metric, value in details['metrics'].items()}
    sorted_models.append((name, adjusted_metrics))

Top Model Selection
- Selects the top N models based on the sorted performance metrics. Provides comprehensive output depending on the verbosity level, including model summaries for detailed analysis.

top_models = sorted(sorted_models, key=lambda x: list(x[1].values()), reverse=True)[:n_top_models]
for model_name, metrics in top_models:
    print(f"Model: {model_name}, Metrics: {metrics}")
    if verbose > 1:
        print(model_results[model_name]['model'].summary())
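
Note that the sort key above compares lists of metric values lexicographically, so the first metric in each dict dominates and the rest only break ties. If that is not the intent, an alternative is to rank on a single, explicitly chosen metric, as in the sketch below. rank_models, the primary_metric parameter, and the metric values are illustrative assumptions, not part of the actual implementation.

```python
LOWER_IS_BETTER = {"AIC", "BIC"}


def rank_models(metrics_by_model, primary_metric, n_top=3):
    """Return the n_top model names, best first, by one chosen metric."""

    def score(name):
        value = metrics_by_model[name][primary_metric]
        # Negate lower-is-better metrics so a larger score always means better.
        return -value if primary_metric in LOWER_IS_BETTER else value

    return sorted(metrics_by_model, key=score, reverse=True)[:n_top]


# Made-up metric values, for illustration only.
metrics = {
    "OLS": {"AIC": 210.4, "Rsquared": 0.81},
    "WLS": {"AIC": 198.7, "Rsquared": 0.79},
    "GLS": {"AIC": 205.1, "Rsquared": 0.84},
}
print(rank_models(metrics, "AIC", n_top=2))       # ['WLS', 'GLS']
print(rank_models(metrics, "Rsquared", n_top=2))  # ['GLS', 'OLS']
```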

Output and Verbose Logging
- Outputs the top models and optionally prints detailed model information based on the verbosity setting. This includes model summaries and performance metrics.

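if verbose > 0:
    # top_models is a list of (name, adjusted_metrics) tuples, so iterate it directly.
    for name, _ in top_models:
        details = model_results[name]  # original (un-negated) metrics and fitted model
        print(f"Top Model: {name}")
        for metric, value in details['metrics'].items():
            print(f"{metric}: {value:.2f}")
        if verbose > 1:
            print(details['model'].summary())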

Develop model_recommendation_core_inference() for predict_ml() #115