Write NumPy docstring for data_preprocessing_core()

Written and accessible:

help(data_preprocessing_core)

This resolves the issue "Write NumPy docstring for data_preprocessing_core()" by adding a detailed NumPy-style docstring to the function.

Summary:

The function data_preprocessing_core() performs comprehensive preprocessing on a dataset containing mixed data types. It prepares the data for machine learning by handling numerical, categorical, text, and datetime data, supporting flexible imputation, scaling, encoding, and vectorization methods. The function automatically splits the data into training and test sets and applies the preprocessing steps defined by the user. It accommodates custom preprocessing steps for various data types, enhancing flexibility and control over the preprocessing pipeline.

Docstring Sections Preview:

Description

"""
Performs comprehensive preprocessing on a dataset containing mixed data types.

This function prepares a dataset for machine learning by handling numerical, categorical, text, and datetime data. It supports flexible imputation, scaling, encoding, and vectorization methods to cater to a wide range of preprocessing needs. The function automatically splits the data into training and test sets and applies the preprocessing steps defined by the user. It accommodates custom preprocessing steps for various data types, enhancing flexibility and control over the preprocessing pipeline.
"""

Parameters

"""
Parameters
----------
df : pd.DataFrame
    The DataFrame to preprocess.
x_cols : list of str
    List of feature column names in `df` to include in the preprocessing.
y_col : str
    The name of the target variable column in `df`.
data_state : str, optional
    Specifies the initial state of the data ('unprocessed' or 'preprocessed'). Default is 'unprocessed'.
test_size : float, optional
    Proportion of the dataset to include in the test split. Default is 0.2.
random_state : int, optional
    Controls the shuffling applied to the data before applying the split. Default is 42.
numeric_imputer : sklearn imputer object, optional
    The imputation transformer for handling missing values in numerical data. Default is SimpleImputer(strategy='median').
numeric_scaler : sklearn scaler object, optional
    The scaling transformer for numerical data. Default is StandardScaler().
categorical_imputer : sklearn imputer object, optional
    The imputation transformer for handling missing values in categorical data. Default is SimpleImputer(strategy='constant', fill_value='missing').
categorical_encoder : sklearn encoder object, optional
    The encoding transformer for categorical data. Default is OneHotEncoder(handle_unknown='ignore').
text_vectorizer : sklearn vectorizer object, optional
    The vectorization transformer for text data. Default is CountVectorizer().
datetime_transformer : callable, optional
    The transformation operation for datetime data. Default extracts year, month, and day as separate features.
verbose : int, optional
    Verbosity level. Higher values print more output and information. Default is 1.
"""

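The default described for `datetime_transformer` (year, month, and day as separate features) can be illustrated with a small plain-Python sketch. The name `extract_date_parts` is hypothetical and is not part of datasafari. The real default presumably operates on pandas datetime columns, so this only shows the shape of the output.

```python
from datetime import date


def extract_date_parts(values):
    """Hypothetical stand-in for the default datetime transformation.

    Assumption: each input is a datetime.date (or datetime.datetime).
    """
    # Each date becomes one row of three integer features: year, month, day.
    return [(d.year, d.month, d.day) for d in values]


dates = [date(2010, 1, 31), date(2010, 2, 28)]
features = extract_date_parts(dates)
print(features)
```

A custom callable passed as `datetime_transformer` would presumably follow the same idea: take the datetime values and return numeric features derived from them.
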
Returns

"""
Returns
-------
x_train_processed : ndarray
    The preprocessed training feature set.
x_test_processed : ndarray
    The preprocessed test feature set.
y_train : Series
    The training target variable.
y_test : Series
    The test target variable.
task_type : str
    The type of machine learning task inferred from the target variable ('regression' or 'classification').
"""

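The docstring does not say how `task_type` is inferred from the target variable. One plausible heuristic, given here purely as an assumption, treats a fully numeric target as regression and anything else as classification. The helper name `infer_task_type` is hypothetical.

```python
from numbers import Real


def infer_task_type(target_values):
    """Hypothetical heuristic, not necessarily the rule datasafari uses."""
    # bool is a subclass of int, so it is excluded explicitly:
    # a True/False target is treated as a class label, not a number.
    is_numeric = all(
        isinstance(v, Real) and not isinstance(v, bool) for v in target_values
    )
    return 'regression' if is_numeric else 'classification'


print(infer_task_type([50000.0, 61250.5, 48300.0]))
print(infer_task_type(['HR', 'Tech', 'Marketing']))
```

Note that this simple rule would label integer-coded classes as regression. A real implementation may also need to look at the number of unique values.
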
Raises

"""
Raises
------
TypeError
    - If 'df' is not a pandas DataFrame.
    - If 'x_cols' is not a list of strings.
    - If 'y_col' is not a string.
    - If 'data_state' is not a string.
    - If 'test_size' is not a float.
    - If 'random_state' is not an integer.
    - If 'verbose' is not an integer.
    - If 'numeric_imputer', 'numeric_scaler', 'categorical_imputer', 'categorical_encoder', 'text_vectorizer', or 'datetime_transformer' does not support the required interface.
ValueError
    - If 'df' is empty, indicating that there's no data to preprocess.
    - If 'data_state' is not 'unprocessed' or 'preprocessed'.
    - If 'y_col' is not found in 'df'.
    - If specified 'x_cols' are not present in 'df'.
    - If 'test_size' is not between 0 and 1.
    - If 'df' does not contain enough data to split according to 'test_size'.
"""

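A few of the checks listed under Raises can be sketched as follows. The helper `validate_split_args` is hypothetical and only mirrors the documented conditions. It takes a plain list of column names instead of a DataFrame so that it runs without pandas, and it assumes that "between 0 and 1" means strictly between.

```python
def validate_split_args(columns, x_cols, y_col,
                        data_state='unprocessed', test_size=0.2):
    """Hypothetical helper mirroring some checks from the Raises section."""
    # Type checks come first so the value checks can rely on valid types.
    if not isinstance(x_cols, list) or not all(isinstance(c, str) for c in x_cols):
        raise TypeError("x_cols must be a list of strings.")
    if not isinstance(y_col, str):
        raise TypeError("y_col must be a string.")
    if not isinstance(data_state, str):
        raise TypeError("data_state must be a string.")
    if not isinstance(test_size, float):
        raise TypeError("test_size must be a float.")

    # Value checks.
    if data_state not in ('unprocessed', 'preprocessed'):
        raise ValueError("data_state must be 'unprocessed' or 'preprocessed'.")
    if y_col not in columns:
        raise ValueError(f"y_col '{y_col}' not found in df.")
    missing = [c for c in x_cols if c not in columns]
    if missing:
        raise ValueError(f"x_cols not found in df: {missing}")
    # Assumption: the bounds are exclusive.
    if not 0 < test_size < 1:
        raise ValueError("test_size must be between 0 and 1.")


# Valid arguments pass silently.
validate_split_args(['Age', 'Salary'], ['Age'], 'Salary', test_size=0.25)
```

Running the type checks before the value checks means a caller who passes, say, an integer `test_size` gets a TypeError, not a misleading range error.
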
Examples

"""
Examples
--------
>>> df = pd.DataFrame({
...     'Age': np.random.randint(18, 35, size=100),
...     'Salary': np.random.normal(50000, 12000, size=100),
...     'Department': np.random.choice(['HR', 'Tech', 'Marketing'], size=100),
...     'Review': ['Good review']*50 + ['Bad review']*50,
...     'Employment Date': pd.date_range(start='2010-01-01', periods=100, freq='M')
... })
>>> x_cols = ['Age', 'Department', 'Review', 'Employment Date']
>>> y_col = 'Salary'
>>> x_train, x_test, y_train, y_test, task_type = data_preprocessing_core(
...     df, x_cols, y_col, test_size=0.25, random_state=123)
"""

ETA444 / datasafari