ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Develop data_preprocessing_core() for predict_ml() #101

Closed: ETA444 closed this issue 4 months ago

ETA444 commented 5 months ago

Title: Implement Comprehensive Data Preprocessing for the predict_ml() Pipeline, the first step in the back-end of predict_ml().

Description: The data_preprocessing_core() function aims to provide a comprehensive preprocessing pipeline for datasets containing mixed data types, preparing them for machine learning tasks. This enhancement proposal outlines the functionality and design of the preprocessing pipeline, including support for handling numerical, categorical, text, and datetime data.
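For illustration, here is a minimal sketch of how the envisioned interface might be used; the argument names and return values below are assumptions based on this proposal, not the final API:

# Hypothetical usage sketch of the proposed function (names and return values are assumptions)
x_train_processed, x_test_processed, y_train, y_test, task_type = data_preprocessing_core(
    df=df,                                                    # pandas DataFrame with mixed feature types
    x_cols=['age', 'income', 'review_text', 'signup_date'],   # numerical, numerical, text, and datetime columns
    y_col='churned',                                          # target column; its dtype determines regression vs. classification
    data_state='unprocessed'                                  # raw data, so the full preprocessing pipeline is applied
)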

Proposed Changes:

Expected Outcome: Upon implementation, the data_preprocessing_core function will provide users with a powerful and flexible tool for preparing datasets for machine learning tasks. By automating and standardizing the preprocessing steps, this enhancement will streamline the data preparation process, reduce manual effort, and improve the reproducibility and reliability of machine learning experiments.

Additional Context: The proposed preprocessing pipeline addresses the critical need for efficient data preparation in machine learning workflows. By encapsulating best practices and offering customizable options, it empowers users to focus on model development and analysis while ensuring the quality and consistency of input data. This enhancement aligns with our commitment to advancing machine learning research and fostering innovation in data science.

ETA444 commented 4 months ago

Implementation Summary

data_preprocessing_core() is an advanced preprocessing function designed for handling mixed data types in a machine learning pipeline. It performs critical preprocessing tasks such as imputation, scaling, encoding, and vectorization to ensure that datasets are optimally prepared for model training and prediction. This function plays a vital role within the predict_ml() pipeline.
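The excerpt below suggests that the imputers, scaler, encoder, and vectorizer are injected rather than hard-coded. As a hedged sketch, custom transformers could be supplied like this, assuming they are exposed as keyword arguments that mirror the variable names in the code breakdown:

from sklearn.impute import SimpleImputer
from sklearn.preprocessing import RobustScaler, OneHotEncoder
from sklearn.feature_extraction.text import TfidfVectorizer

# Assumption: these keyword arguments mirror the variable names used in the excerpt below
x_train_processed, x_test_processed, y_train, y_test, task_type = data_preprocessing_core(
    df, x_cols, y_col, data_state='unprocessed',
    numeric_imputer=SimpleImputer(strategy='median'),
    numeric_scaler=RobustScaler(),
    categorical_imputer=SimpleImputer(strategy='most_frequent'),
    categorical_encoder=OneHotEncoder(handle_unknown='ignore'),
    text_vectorizer=TfidfVectorizer()
)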

Code Breakdown

# Imports used by this excerpt (added for context)
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer
from datasafari import evaluate_dtype  # assumption: the helper is exposed at the package level; exact import path may differ

# Input validation
if not isinstance(df, pd.DataFrame):
    raise TypeError("data_preprocessing_core(): The 'df' parameter must be a pandas DataFrame.")
if not isinstance(x_cols, list) or not all(isinstance(col, str) for col in x_cols):
    raise TypeError("data_preprocessing_core(): The 'x_cols' parameter must be a list of strings representing column names.")
if not isinstance(y_col, str):
    raise TypeError("data_preprocessing_core(): The 'y_col' parameter must be a string representing the target column name.")
if not isinstance(data_state, str):
    raise TypeError("data_preprocessing_core(): The 'data_state' parameter must be a string and one of ['unprocessed', 'preprocessed'].")
# Additional type and method checks...

# Split the data, then infer the task type from the target column's dtype
x_train, x_test, y_train, y_test = train_test_split(df[x_cols], df[y_col], test_size=test_size, random_state=random_state)
y_dtype = evaluate_dtype(df, [y_col], output='dict')[y_col]
task_type = 'regression' if y_dtype == 'numerical' else 'classification'
# When the data is unprocessed, build a per-dtype preprocessing pipeline
if data_state.lower() == 'unprocessed':
    numeric_transformer = Pipeline(steps=[
        ('imputer', numeric_imputer),
        ('scaler', numeric_scaler)
    ])
    categorical_transformer = Pipeline(steps=[
        ('imputer', categorical_imputer),
        ('encoder', categorical_encoder)
    ])
    text_transformer = Pipeline(steps=[
        ('vectorizer', text_vectorizer)
    ])
    datetime_transformer = FunctionTransformer(datetime_feature_extractor, validate=False)

    # Route each feature column to its transformer based on its evaluated dtype
    preprocessor = ColumnTransformer(transformers=[
        ('num', numeric_transformer, [col for col, dtype in evaluate_dtype(df, x_cols, output='dict').items() if dtype == 'numerical']),
        ('cat', categorical_transformer, [col for col, dtype in evaluate_dtype(df, x_cols, output='dict').items() if dtype == 'categorical']),
        ('text', text_transformer, [col for col, dtype in evaluate_dtype(df, x_cols, output='dict').items() if dtype == 'text']),
        ('datetime', datetime_transformer, [col for col, dtype in evaluate_dtype(df, x_cols, output='dict').items() if dtype == 'datetime'])
    ], remainder='passthrough')

# Fit the preprocessor on the training split only, then apply the same transformations to the test split
x_train_processed = preprocessor.fit_transform(x_train)
x_test_processed = preprocessor.transform(x_test)
# Verbose reporting of which transformers were applied to which columns
if verbose > 0:
    print(f"Numerical features processed using {type(numeric_imputer).__name__} & {type(numeric_scaler).__name__}: {', '.join([col for col, dtype in evaluate_dtype(df, x_cols, output='dict').items() if dtype == 'numerical'])}")
    print(f"Categorical features processed using {type(categorical_imputer).__name__} & {type(categorical_encoder).__name__}: {', '.join([col for col, dtype in evaluate_dtype(df, x_cols, output='dict').items() if dtype == 'categorical'])}")
    print(f"Text features processed using {type(text_vectorizer).__name__}: {', '.join([col for col, dtype in evaluate_dtype(df, x_cols, output='dict').items() if dtype == 'text'])}")
    print(f"Datetime features processed using Custom DateTime Processing: {', '.join([col for col, dtype in evaluate_dtype(df, x_cols, output='dict').items() if dtype == 'datetime'])}")

Link to Full Code