ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Develop front-end of predict_ml() #119

Closed ETA444 closed 6 months ago

ETA444 commented 6 months ago

Title: Develop Automated Predictive Model Selection and Statistical Inference Tool

Description: Develop predict_ml(), an automated tool for predictive model selection and statistical inference. The tool streamlines data preprocessing, model selection, and tuning, and recommends the best model for the user's data and preferences.

Proposed Changes:

Expected Outcome: Once complete, predict_ml() will automate predictive modeling and statistical inference tasks in a single call. Combining data preprocessing, model selection, and tuning in one framework should reduce manual work and shorten the modeling process.

Additional Context: This work addresses the need for automated tools that simplify and speed up predictive modeling and statistical analysis. The goal is a versatile, user-friendly function that lets users analyze data efficiently and derive actionable insights.

ETA444 commented 6 months ago

Implementation Summary

predict_ml() handles both statistical inference and predictive model selection, depending on the inputs provided. When a formula is supplied, it runs the statistical inference pipeline; when x_cols and y_col are supplied, it runs the machine learning pipeline, automating the steps from data preprocessing through model recommendation to model tuning.

Code Breakdown

# Validate the inputs and decide which pipeline to run
if not df.empty and formula:
    pass  # Proceed with statistical inference
elif x_cols and y_col and not df.empty:
    pass  # Proceed with machine learning model selection
else:
    raise ValueError("Insufficient input data provided.")

# Machine learning path: preprocess the data unless it is already processed
if data_state == 'unprocessed':
    processed_data = data_preprocessing_core(df, x_cols, y_col, test_size, random_state, numeric_imputer, numeric_scaler, categorical_imputer, categorical_encoder, text_vectorizer, datetime_transformer, verbose)
else:
    processed_data = df[x_cols + [y_col]]

# Machine learning path: recommend the top candidate models
recommended_models = model_recommendation_core(processed_data, task_type, priority_metrics, n_top_models, cv, verbose)

# Machine learning path: tune the recommended models
tuned_models = model_tuning_core(recommended_models, task_type, priority_tuners, custom_param_grids, n_jobs, cv, n_iter_random, n_iter_bayesian, refit_metric, verbose, random_state)

# Inference path: fit and rank statistical models from the formula
if formula:
    inference_results = model_recommendation_core_inference(df, formula, priority_models, n_top_models, model_kwargs, verbose)

# Return one populated entry; the other pipeline's entry is None
return {
    'ML_Models': tuned_models if not formula else None,
    'Statistical_Models': inference_results if formula else None
}
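
The branching in the breakdown above can be condensed into a small runnable sketch. This is an illustration under stated assumptions, not the DataSafari implementation: select_pipeline and assemble_results are hypothetical helper names, and the stand-in object below only mimics the `empty` attribute of a pandas DataFrame so the example runs without pandas.

```python
from types import SimpleNamespace


def select_pipeline(df, x_cols=None, y_col=None, formula=None):
    """Return 'inference' or 'ml' depending on which inputs were supplied.

    Hypothetical helper mirroring the validation branch; `df` only needs
    an `empty` attribute, as a pandas DataFrame has.
    """
    if df is None or df.empty:
        raise ValueError("Insufficient input data provided.")
    if formula:
        return "inference"
    if x_cols and y_col:
        return "ml"
    raise ValueError("Insufficient input data provided.")


def assemble_results(pipeline, tuned_models=None, inference_results=None):
    """Build a result dictionary in which only the chosen pipeline's entry is set."""
    return {
        "ML_Models": tuned_models if pipeline == "ml" else None,
        "Statistical_Models": inference_results if pipeline == "inference" else None,
    }


# Stand-in for a non-empty DataFrame (assumption: only `.empty` is consulted).
fake_df = SimpleNamespace(empty=False)
print(select_pipeline(fake_df, formula="Salary ~ Age + C(Department)"))
print(select_pipeline(fake_df, x_cols=["Age", "Department"], y_col="Salary"))
```

As in the excerpt, a formula takes precedence over x_cols and y_col when both are given, and an empty DataFrame is rejected in either case.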

Example Usage

Machine Learning Pipeline

# The target column ('Salary') is excluded from x_cols to avoid leaking it into the features
ml_models = predict_ml(df, x_cols=['Age', 'Department'], y_col='Salary', verbose=2)

Inference Pipeline

inference_models = predict_ml(df, formula='Salary ~ Age + C(Department)', verbose=2)
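
Since the return value shown in the code breakdown is a dictionary with the keys 'ML_Models' and 'Statistical_Models', one of which is None, callers may find a small accessor convenient. A minimal sketch, assuming exactly that return shape; active_models is a hypothetical helper and not part of DataSafari, and the payload below is a placeholder rather than real fitted models.

```python
def active_models(results):
    """Return (key, value) for the single populated entry of a result dict."""
    populated = [(key, value) for key, value in results.items() if value is not None]
    if len(populated) != 1:
        raise ValueError(f"Expected one populated entry, found {len(populated)}.")
    return populated[0]


# Placeholder payload; real values would be whatever predict_ml() returns.
example = {"ML_Models": None, "Statistical_Models": ["placeholder_model"]}
key, models = active_models(example)
print(key, len(models))
```

Raising on zero or multiple populated entries makes an unexpected return shape visible early instead of silently picking one pipeline.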

Link to Full Code