ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Develop data_preprocessing_core() for predict_ml() #101

Closed: ETA444 closed this issue 4 months ago

ETA444 commented 5 months ago

Title: Implement Comprehensive Data Preprocessing for the predict_ml() Pipeline, the first step in the back-end of predict_ml().

Description: The data_preprocessing_core() function aims to provide a comprehensive preprocessing pipeline for datasets containing mixed data types, preparing them for machine learning tasks. This enhancement proposal outlines the functionality and design of the preprocessing pipeline, including support for handling numerical, categorical, text, and datetime data.
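For illustration, here is a minimal sketch of how the envisioned interface might be used; the argument names and return values below are assumptions based on this proposal, not the final API:

# Hypothetical usage sketch of the proposed function (names and return values are assumptions)
x_train_processed, x_test_processed, y_train, y_test, task_type = data_preprocessing_core(
    df=df,                                                    # pandas DataFrame with mixed feature types
    x_cols=['age', 'income', 'review_text', 'signup_date'],   # numerical, numerical, text, and datetime columns
    y_col='churned',                                          # target column; its dtype determines regression vs. classification
    data_state='unprocessed'                                  # raw data, so the full preprocessing pipeline is applied
)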

Proposed Changes:

Expected Outcome: Upon implementation, the data_preprocessing_core function will provide users with a powerful and flexible tool for preparing datasets for machine learning tasks. By automating and standardizing the preprocessing steps, this enhancement will streamline the data preparation process, reduce manual effort, and improve the reproducibility and reliability of machine learning experiments.

Additional Context: The proposed preprocessing pipeline addresses the critical need for efficient data preparation in machine learning workflows. By encapsulating best practices and offering customizable options, it empowers users to focus on model development and analysis while ensuring the quality and consistency of input data. This enhancement aligns with our commitment to advancing machine learning research and fostering innovation in data science.

ETA444 commented 4 months ago

Implementation Summary

data_preprocessing_core() is an advanced preprocessing function designed for handling mixed data types in a machine learning pipeline. It performs critical preprocessing tasks such as imputation, scaling, encoding, and vectorization to ensure that datasets are optimally prepared for model training and prediction. This function plays a vital role within the predict_ml() pipeline.
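The excerpt below suggests that the imputers, scaler, encoder, and vectorizer are injected rather than hard-coded. As a hedged sketch, custom transformers could be supplied like this, assuming they are exposed as keyword arguments that mirror the variable names in the code breakdown:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: these keyword arguments mirror the variable names used in the excerpt below
x_train_processed, x_test_processed, y_train, y_test, task_type = data_preprocessing_core(
    df, x_cols, y_col, data_state='unprocessed',
    numeric_imputer=SimpleImputer(strategy='median'),
    numeric_scaler=RobustScaler(),
    categorical_imputer=SimpleImputer(strategy='most_frequent'),
    categorical_encoder=OneHotEncoder(handle_unknown='ignore'),
    text_vectorizer=TfidfVectorizer()
)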

Code Breakdown

# Imports used by this excerpt (added for context)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from datasafari import evaluate_dtype  # assumption: the helper is exposed at the package level; exact import path may differ

# Input validation
if not isinstance(df, pd.DataFrame):
    raise TypeError("data_preprocessing_core(): The 'df' parameter must be a pandas DataFrame.")
if not isinstance(x_cols, list) or not all(isinstance(col, str) for col in x_cols):
    raise TypeError("data_preprocessing_core(): The 'x_cols' parameter must be a list of strings representing column names.")
if not isinstance(y_col, str):
    raise TypeError("data_preprocessing_core(): The 'y_col' parameter must be a string representing the target column name.")
if not isinstance(data_state, str):
    raise TypeError("data_preprocessing_core(): The 'data_state' parameter must be a string and one of ['unprocessed', 'preprocessed'].")
# Additional type and method checks...

# Split the data, then infer the task type from the target column's dtype
x_train, x_test, y_train, y_test = train_test_split(df[x_cols], df[y_col], test_size=test_size, random_state=random_state)
y_dtype = evaluate_dtype(df, [y_col], output='dict')[y_col]
task_type = 'regression' if y_dtype == 'numerical' else 'classification'
# When the data is unprocessed, build a per-dtype preprocessing pipeline
if data_state.lower() == 'unprocessed':
    numeric_transformer = Pipeline(steps=[
        ('imputer', numeric_imputer),
        ('scaler', numeric_scaler)
    ])
    categorical_transformer = Pipeline(steps=[
        ('imputer', categorical_imputer),
        ('encoder', categorical_encoder)
    ])
    text_transformer = Pipeline(steps=[
        ('vectorizer', text_vectorizer)
    ])
    datetime_transformer = FunctionTransformer(datetime_feature_extractor, validate=False)

    # Route each feature column to its transformer based on its evaluated dtype
    preprocessor = ColumnTransformer(transformers=[
        ('num', numeric_transformer, [col for col, dtype in evaluate_dtype(df, x_cols, output='dict').items() if dtype == 'numerical']),
        ('cat', categorical_transformer, [col for col, dtype in evaluate_dtype(df, x_cols, output='dict').items() if dtype == 'categorical']),
        ('text', text_transformer, [col for col, dtype in evaluate_dtype(df, x_cols, output='dict').items() if dtype == 'text']),
        ('datetime', datetime_transformer, [col for col, dtype in evaluate_dtype(df, x_cols, output='dict').items() if dtype == 'datetime'])
    ], remainder='passthrough')

# Fit the preprocessor on the training split only, then apply the same transformations to the test split
x_train_processed = preprocessor.fit_transform(x_train)
x_test_processed = preprocessor.transform(x_test)
# Verbose reporting of which transformers were applied to which columns
if verbose > 0:
    print(f"Numerical features processed using {type(numeric_imputer).__name__} & {type(numeric_scaler).__name__}: {', '.join([col for col, dtype in evaluate_dtype(df, x_cols, output='dict').items() if dtype == 'numerical'])}")
    print(f"Categorical features processed using {type(categorical_imputer).__name__} & {type(categorical_encoder).__name__}: {', '.join([col for col, dtype in evaluate_dtype(df, x_cols, output='dict').items() if dtype == 'categorical'])}")
    print(f"Text features processed using {type(text_vectorizer).__name__}: {', '.join([col for col, dtype in evaluate_dtype(df, x_cols, output='dict').items() if dtype == 'text'])}")
    print(f"Datetime features processed using Custom DateTime Processing: {', '.join([col for col, dtype in evaluate_dtype(df, x_cols, output='dict').items() if dtype == 'datetime'])}")

Link to Full Code