This solution addresses the issue "Write NumPy docstring for data_preprocessing_core()" by providing a detailed NumPy-style docstring for the data_preprocessing_core() function.
Summary:
The function data_preprocessing_core() performs comprehensive preprocessing on a dataset containing mixed data types. It prepares the data for machine learning by handling numerical, categorical, text, and datetime data, supporting flexible imputation, scaling, encoding, and vectorization methods. The function automatically splits the data into training and test sets and applies the preprocessing steps defined by the user. It accommodates custom preprocessing steps for various data types, enhancing flexibility and control over the preprocessing pipeline.
Docstring Sections Preview:
Description
"""
Performs comprehensive preprocessing on a dataset containing mixed data types.
This function prepares a dataset for machine learning by handling numerical, categorical, text, and datetime data. It supports flexible imputation, scaling, encoding, and vectorization methods to cater to a wide range of preprocessing needs. The function automatically splits the data into training and test sets and applies the preprocessing steps defined by the user. It accommodates custom preprocessing steps for various data types, enhancing flexibility and control over the preprocessing pipeline.
"""
Parameters
"""
Parameters
----------
df : pd.DataFrame
    The DataFrame to preprocess.
x_cols : list of str
    List of feature column names in `df` to include in the preprocessing.
y_col : str
    The name of the target variable column in `df`.
data_state : str, optional
    Specifies the initial state of the data ('unprocessed' or 'preprocessed'). Default is 'unprocessed'.
test_size : float, optional
    Proportion of the dataset to include in the test split. Default is 0.2.
random_state : int, optional
    Controls the shuffling applied to the data before the split. Default is 42.
numeric_imputer : sklearn imputer object, optional
    The imputation transformer for handling missing values in numerical data. Default is SimpleImputer(strategy='median').
numeric_scaler : sklearn scaler object, optional
    The scaling transformer for numerical data. Default is StandardScaler().
categorical_imputer : sklearn imputer object, optional
    The imputation transformer for handling missing values in categorical data. Default is SimpleImputer(strategy='constant', fill_value='missing').
categorical_encoder : sklearn encoder object, optional
    The encoding transformer for categorical data. Default is OneHotEncoder(handle_unknown='ignore').
text_vectorizer : sklearn vectorizer object, optional
    The vectorization transformer for text data. Default is CountVectorizer().
datetime_transformer : callable, optional
    The transformation operation for datetime data. The default extracts year, month, and day as separate features.
verbose : int, optional
    Controls the verbosity of the output; higher values produce more information. Default is 1.
"""
Returns
"""
Returns
-------
x_train_processed : ndarray
    The preprocessed training feature set.
x_test_processed : ndarray
    The preprocessed test feature set.
y_train : Series
    The training target variable.
y_test : Series
    The test target variable.
task_type : str
    The type of machine learning task inferred from the target variable ('regression' or 'classification').
"""
Raises
"""
Raises
------
TypeError
    - If 'df' is not a pandas DataFrame.
    - If 'x_cols' is not a list of strings.
    - If 'y_col' is not a string.
    - If 'data_state' is not a string.
    - If 'test_size' is not a float.
    - If 'random_state' is not an integer.
    - If 'verbose' is not an integer.
    - If 'numeric_imputer', 'numeric_scaler', 'categorical_imputer', 'categorical_encoder', 'text_vectorizer', or 'datetime_transformer' does not support the required interface.
ValueError
    - If 'df' is empty, indicating that there is no data to preprocess.
    - If 'data_state' is not 'unprocessed' or 'preprocessed'.
    - If 'y_col' is not found in 'df'.
    - If specified 'x_cols' are not present in 'df'.
    - If 'test_size' is not between 0 and 1.
    - If 'df' does not contain enough data to split according to 'test_size'.
"""