ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Construct tests for data_preprocessing_core() #100

Closed ETA444 closed 6 months ago

ETA444 commented 6 months ago

Summary of Unit Tests for data_preprocessing_core()

The data_preprocessing_core() function prepares data for model training and evaluation in the predict_ml() machine learning workflow. It transforms data according to the specified configuration, handles missing values, encodes categorical variables, and splits the data into training and testing subsets. The unit tests check that the function handles various data types and configurations without errors and performs the expected operations correctly.
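
For orientation, the steps named above (imputing missing values, encoding categorical variables, splitting into train and test subsets) can be sketched with scikit-learn. This is a rough illustration only, not DataSafari's implementation: preprocess_sketch, its defaults, the choice of imputers, scaler, and encoder, and the sample column names are all assumptions, and the data_state argument is left out.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler


def preprocess_sketch(df, x_cols, y_col, test_size=0.2, random_state=42):
    """Illustrative stand-in, NOT DataSafari's data_preprocessing_core()."""
    X, y = df[x_cols], df[y_col]

    # Split first so the transformers are fitted on training rows only.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state
    )

    numeric_cols = X.select_dtypes(include="number").columns.tolist()
    categorical_cols = [c for c in x_cols if c not in numeric_cols]

    # Assumed component choices: mean/mode imputation, scaling, one-hot encoding.
    numeric_steps = Pipeline([
        ("impute", SimpleImputer(strategy="mean")),
        ("scale", StandardScaler()),
    ])
    categorical_steps = Pipeline([
        ("impute", SimpleImputer(strategy="most_frequent")),
        ("encode", OneHotEncoder(handle_unknown="ignore")),
    ])
    preprocessor = ColumnTransformer([
        ("num", numeric_steps, numeric_cols),
        ("cat", categorical_steps, categorical_cols),
    ])

    X_train_t = preprocessor.fit_transform(X_train)
    X_test_t = preprocessor.transform(X_test)
    return X_train_t, X_test_t, y_train, y_test


# Made-up sample data with one missing Age value.
sample_df = pd.DataFrame({
    "Age": [25, 32, float("nan"), 47, 51, 29, 38, 44, 36, 58],
    "Department": ["HR", "IT", "IT", "Ops", "HR", "IT", "Ops", "HR", "IT", "Ops"],
    "Salary": [50, 64, 58, 81, 77, 55, 69, 72, 66, 90],
})
X_train_t, X_test_t, y_train, y_test = preprocess_sketch(
    sample_df, ["Age", "Department"], "Salary"
)
```

Splitting before fitting keeps information from the test rows out of the imputer and scaler statistics; whether the real function orders the steps this way is not stated in this issue.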

Detailed Test Descriptions

  1. Type Validation Tests:

    • Invalid DataFrame Type: Checks if a TypeError is raised when the input is not a pandas DataFrame.
    • Invalid x_cols Type: Verifies that a TypeError is raised when x_cols is not a list of strings.
    • Invalid y_col Type: Ensures a TypeError is raised when y_col is not a string.
    • Invalid data_state Type: Checks for a TypeError when data_state is not a string.
    • Invalid test_size Type: Asserts that a TypeError is raised for non-float test_size.
    • Invalid random_state Type: Tests for a TypeError when random_state is not an integer.
    • Invalid verbose Type: Verifies that a TypeError is raised when verbose is not an integer.

  2. Value and Compatibility Checks:

    • Invalid test_size Value: Ensures a ValueError is raised if test_size is not between 0 and 1.
    • Empty DataFrame: Checks if a ValueError is raised when an empty DataFrame is processed.
    • Invalid Component Types: Tests various components (imputers, scalers, encoders) to ensure they comply with required interfaces like fit_transform.

  3. Existence and Integrity Checks:

    • Missing y_col: Verifies a ValueError is raised if y_col is not found in the DataFrame.
    • Missing x_cols: Checks for a ValueError when specified x_cols are not found in the DataFrame.
    • Test Size Too Large: Ensures that a ValueError is raised if there isn’t enough data to satisfy the test_size requirement.
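
To make the expected behavior concrete, here is a minimal sketch of the kind of input validation these tests exercise. The validate_inputs helper, its error messages, and the order of the checks are hypothetical and may differ from what data_preprocessing_core() actually does; the component-interface and "test size too large" checks are omitted.

```python
import pandas as pd


def validate_inputs(df, x_cols, y_col, test_size=0.2, random_state=42):
    """Hypothetical validator mirroring the checks listed above."""
    # Type checks come first, so bad types never reach the value checks.
    if not isinstance(df, pd.DataFrame):
        raise TypeError("df must be a pandas DataFrame.")
    if not isinstance(x_cols, list) or not all(isinstance(c, str) for c in x_cols):
        raise TypeError("x_cols must be a list of strings.")
    if not isinstance(y_col, str):
        raise TypeError("y_col must be a string.")
    if not isinstance(test_size, float):
        raise TypeError("test_size must be a float.")
    if not isinstance(random_state, int):
        raise TypeError("random_state must be an integer.")

    # Value and existence checks run only on correctly typed arguments.
    if not 0 < test_size < 1:
        raise ValueError("test_size must be between 0 and 1.")
    if df.empty:
        raise ValueError("df must not be empty.")
    missing = [c for c in [*x_cols, y_col] if c not in df.columns]
    if missing:
        raise ValueError(f"Columns not found in df: {missing}")
```

Raising TypeError before ValueError matches the grouping of the tests: a wrongly typed argument fails before its value is ever inspected.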

Example Test Code

Here's an example of how a type validation test is implemented:

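import pytest
from datasafari.predictor.predict_ml import data_preprocessing_core  # assumed module path

def test_invalid_df_type(sample_data_dpc):
    """Test that a TypeError is raised when df is not a DataFrame."""
    with pytest.raises(TypeError):
        data_preprocessing_core("not_a_dataframe", ["Age"], "Salary", "unprocessed")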

This test ensures that the function correctly identifies when the input df is not a pandas DataFrame and raises the appropriate TypeError.
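
Several of the type validation cases follow the same pattern, so they could also be written as one parametrized test. The sketch below is only an illustration: preprocess_stand_in is a hypothetical placeholder defined here so the example runs on its own, the keyword names df, x_cols, y_col, and data_state are assumed to match the real signature, and the sample data is made up. In the actual suite the call would go to data_preprocessing_core().

```python
import pandas as pd
import pytest


def preprocess_stand_in(df, x_cols, y_col, data_state):
    """Hypothetical stand-in; swap in the real data_preprocessing_core()."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("df must be a pandas DataFrame.")
    if not isinstance(x_cols, list) or not all(isinstance(c, str) for c in x_cols):
        raise TypeError("x_cols must be a list of strings.")
    if not isinstance(y_col, str):
        raise TypeError("y_col must be a string.")
    if not isinstance(data_state, str):
        raise TypeError("data_state must be a string.")


# A known-good argument set; each test case overrides exactly one entry.
VALID_ARGS = {
    "df": pd.DataFrame({"Age": [25, 32, 47], "Salary": [50000, 64000, 81000]}),
    "x_cols": ["Age"],
    "y_col": "Salary",
    "data_state": "unprocessed",
}


@pytest.mark.parametrize(
    "arg_name, bad_value",
    [
        ("df", "not_a_dataframe"),
        ("x_cols", "Age"),       # a bare string instead of a list of strings
        ("y_col", 123),
        ("data_state", None),
    ],
)
def test_invalid_argument_types(arg_name, bad_value):
    """Each case breaks one argument and expects a TypeError."""
    kwargs = {**VALID_ARGS, arg_name: bad_value}
    with pytest.raises(TypeError):
        preprocess_stand_in(**kwargs)
```

Starting from a valid argument set and breaking one argument per case keeps each failure attributable to a single check.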

Full Test Suite

You can access the complete suite code at: Data Preprocessing Test Suite.