Develop front-end of predict_ml()

ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.

GNU General Public License v3.0


Title: Develop Automated Predictive Model Selection and Statistical Inference Tool

Description: This project aims to develop an automated tool, predict_ml(), for predictive model selection and statistical inference. The tool streamlines the process of data preprocessing, model selection, and tuning, providing recommendations for the best model based on user data and preferences.

Proposed Changes:

- Unified Functionality: Create a single function, predict_ml(), capable of handling both machine learning (ML) and statistical inference tasks based on user inputs.
- Machine Learning Pipeline:
  - Data Preprocessing: Implement preprocessing steps for handling numerical, categorical, text, and datetime data. This includes imputation, scaling, encoding, and vectorization methods.
  - Model Evaluation: Evaluate untuned models using a composite score that synthesizes multiple metrics, with options to prioritize specific metrics based on user preferences.
  - Model Tuning: Utilize top untuned models for hyperparameter tuning via grid search, random search, or Bayesian optimization, providing flexibility in the tuning process.
- Statistical Inference Pipeline:
  - Model Recommendation: Recommend top statistical models for inference based on a specified formula, dynamically determining regression or classification tasks.
- Input Handling: Ensure robust handling of input parameters, including validation and error handling, to guide users in providing appropriate inputs.
- Output: Provide a dictionary containing either top ML models with evaluation metrics or statistical models along with fit statistics, depending on the operation mode selected by the user.

Expected Outcome: Upon completion, predict_ml() will automate predictive modeling and statistical inference in a single call. By integrating data preprocessing, model selection, and tuning into one framework, it removes the need for users to assemble these steps by hand.

Additional Context: This work addresses the need for automated tools that simplify and speed up predictive modeling and statistical analysis, so that users can analyze their data and reach conclusions with less manual setup.

Implementation Summary

predict_ml() is designed to handle both statistical inference and predictive model selection based on the provided data and user preferences. It automates the process from data preprocessing to model tuning, offering a streamlined approach to model evaluation and selection using machine learning or statistical methods.

Code Breakdown

Input Validation and Setup
- Validates the provided DataFrame, feature and target specifications, or a formula for statistical models. Ensures necessary inputs are present for the chosen analysis path.

if not df.empty and formula:
    pass  # Proceed with statistical inference
elif x_cols and y_col and not df.empty:
    pass  # Proceed with machine learning model selection
else:
    raise ValueError("Insufficient input data provided.")
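
The branching above can be read as a small dispatch step. The following is a minimal sketch of that idea, not the actual implementation: select_pipeline() is a hypothetical helper name, and a SimpleNamespace stands in for a non-empty DataFrame so the example has no third-party dependencies.

```python
from types import SimpleNamespace


def select_pipeline(df, x_cols=None, y_col=None, formula=None):
    """Return 'inference' or 'ml' depending on which inputs were supplied."""
    # Any object exposing a pandas-style `.empty` flag is accepted in this sketch.
    if df is None or getattr(df, "empty", True):
        raise ValueError("Insufficient input data provided.")
    if formula:
        return "inference"
    if x_cols and y_col:
        return "ml"
    raise ValueError("Insufficient input data provided.")


# Stand-in for a non-empty DataFrame (an assumption made to keep the sketch self-contained).
fake_df = SimpleNamespace(empty=False)
print(select_pipeline(fake_df, formula="Salary ~ Age"))
print(select_pipeline(fake_df, x_cols=["Age"], y_col="Salary"))
```

As in the snippet above, a formula takes precedence over x_cols and y_col when both are given.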

Data Preprocessing
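- Handles data preprocessing for machine learning tasks. When data_state is 'unprocessed', the data is passed through data_preprocessing_core() with the user-specified imputers, scalers, encoders, and transformers; otherwise the selected feature and target columns are used as provided.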

if data_state == 'unprocessed':
    processed_data = data_preprocessing_core(df, x_cols, y_col, test_size, random_state, numeric_imputer, numeric_scaler, categorical_imputer, categorical_encoder, text_vectorizer, datetime_transformer, verbose)
else:
    processed_data = df[x_cols + [y_col]]
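
To illustrate the kind of steps the preprocessing stage covers, here is a dependency-free sketch of mean imputation followed by standardization for a numeric column, and one-hot encoding for a categorical column. It does not reproduce data_preprocessing_core(); the helper names impute_and_scale() and one_hot() are hypothetical and the sample values are made up.

```python
from statistics import mean, pstdev


def impute_and_scale(values):
    """Mean-impute missing entries (None), then standardize to zero mean, unit variance."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    filled = [fill if v is None else v for v in values]
    mu, sigma = mean(filled), pstdev(filled)
    # Guard against constant columns, where the standard deviation is zero.
    return [(v - mu) / sigma if sigma else 0.0 for v in filled]


def one_hot(values):
    """One-hot encode a categorical column; categories are ordered alphabetically."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]


ages = impute_and_scale([30, None, 50])
departments = one_hot(["HR", "IT", "HR"])
print(ages)
print(departments)
```

In practice the imputer, scaler, and encoder would be the objects passed in through the numeric_imputer, numeric_scaler, and categorical_encoder arguments shown above.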

Model Recommendation and Evaluation
- Recommends top models based on specified metrics, applying a composite score to rank models and select the best performers for further tuning.

recommended_models = model_recommendation_core(processed_data, task_type, priority_metrics, n_top_models, cv, verbose)
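
One plausible way to build such a composite score is a weighted mean in which prioritized metrics count more heavily. The sketch below assumes that approach; the weight of 5.0, the helper names, and the metric values are all illustrative assumptions, and the actual weighting used by model_recommendation_core() may differ.

```python
def composite_score(metrics, priority_metrics=(), priority_weight=5.0):
    """Weighted mean of metric values; prioritized metrics get `priority_weight`."""
    weights = {
        name: (priority_weight if name in priority_metrics else 1.0)
        for name in metrics
    }
    total = sum(weights.values())
    return sum(metrics[name] * weights[name] for name in metrics) / total


def rank_models(results, priority_metrics=(), n_top_models=3):
    """Sort model names by composite score, best first, and keep the top n."""
    scored = {
        name: composite_score(m, priority_metrics) for name, m in results.items()
    }
    return sorted(scored, key=scored.get, reverse=True)[:n_top_models]


# Hypothetical evaluation results; every metric here is "higher is better" in [0, 1].
results = {
    "LogisticRegression": {"accuracy": 0.80, "recall": 0.60},
    "RandomForest": {"accuracy": 0.70, "recall": 0.90},
}
print(rank_models(results))
print(rank_models(results, priority_metrics=["accuracy"]))
```

With equal weights the second model ranks first; prioritizing accuracy reverses the order, which is the effect priority_metrics is meant to have. Metrics where lower is better (for example, error measures) would need to be negated or rescaled before being combined this way.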

Model Tuning
- Applies hyperparameter tuning using specified methods (grid, random, Bayesian) to the recommended models, refining them based on performance metrics.

tuned_models = model_tuning_core(recommended_models, task_type, priority_tuners, custom_param_grids, n_jobs, cv, n_iter_random, n_iter_bayesian, refit_metric, verbose, random_state)
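
Of the three tuning methods, grid search is the simplest to show in isolation: every combination in the parameter grid is scored and the best one is kept. The following is a minimal sketch under that description; grid_search() and toy_score() are hypothetical names, and the toy objective stands in for a cross-validated model score.

```python
from itertools import product


def grid_search(param_grid, score_fn):
    """Evaluate every parameter combination and return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for combo in product(*(param_grid[n] for n in names)):
        params = dict(zip(names, combo))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# Toy objective standing in for a cross-validated score; it peaks at depth=4, lr=0.1.
def toy_score(params):
    return -((params["depth"] - 4) ** 2) - abs(params["lr"] - 0.1)


best_params, best_score = grid_search(
    {"depth": [2, 4, 8], "lr": [0.01, 0.1, 1.0]}, toy_score
)
print(best_params, best_score)
```

Random search and Bayesian optimization differ only in how candidate combinations are chosen: sampling a fixed number at random, or proposing the next candidate from a model of previous scores.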

Statistical Inference
- For statistical analysis tasks, fits models based on a provided formula, evaluating them according to statistical metrics like AIC, BIC, and R-squared.

if formula:
    inference_results = model_recommendation_core_inference(df, formula, priority_models, n_top_models, model_kwargs, verbose)
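
For reference, AIC and BIC are both computed from a fitted model's log-likelihood, and lower values indicate a better trade-off between fit and complexity. The sketch below uses the standard definitions; the log-likelihoods and parameter counts are made-up values, and ranking by AIC alone is an assumption made for illustration rather than the rule model_recommendation_core_inference() applies.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike information criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_obs):
    """Bayesian information criterion: k ln(n) - 2 ln(L)."""
    return n_params * math.log(n_obs) - 2 * log_likelihood


# Hypothetical fits: (log-likelihood, number of estimated parameters).
fits = {"small_model": (-120.0, 3), "large_model": (-118.5, 6)}
n_obs = 100

# Lower AIC is better, so an ascending sort puts the preferred model first.
ranked = sorted(fits, key=lambda name: aic(*fits[name]))
print(ranked)
print({name: round(bic(ll, k, n_obs), 2) for name, (ll, k) in fits.items()})
```

Here the larger model fits slightly better, but not by enough to offset the penalty for its three extra parameters.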

Output Construction
- Constructs a dictionary output containing the best models from the machine learning or statistical inference pipelines, enriched with relevant metrics and model details.

return {
    'ML_Models': tuned_models if not formula else None,
    'Statistical_Models': inference_results if formula else None
}

Example Usage

Machine Learning Pipeline

ml_models = predict_ml(df, x_cols=['Age', 'Department'], y_col='Salary', verbose=2)

Inference Pipeline

inference_models = predict_ml(df, formula='Salary ~ Age + C(Department)', verbose=2)
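
Since the returned dictionary sets the unused pipeline's entry to None, calling code can pick out whichever entry is populated. The following sketch assumes only the two keys shown in the Output Construction step; active_models() is a hypothetical helper, and the inner structure of the example result is a placeholder rather than the real output format.

```python
def active_models(result):
    """Return (mode, models) for whichever pipeline produced output."""
    if result.get("Statistical_Models") is not None:
        return "inference", result["Statistical_Models"]
    if result.get("ML_Models") is not None:
        return "ml", result["ML_Models"]
    raise ValueError("Result contains no models.")


# Hypothetical return value; the per-model contents are placeholders.
example = {"ML_Models": {"RandomForest": {"score": 0.9}}, "Statistical_Models": None}
mode, models = active_models(example)
print(mode, list(models))
```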
