ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Construct tests for evaluate_dtype() #83

Closed by ETA444 6 months ago

ETA444 commented 6 months ago

Summary of Unit Tests for evaluate_dtype()

The evaluate_dtype() function assesses the data types of specified columns in a DataFrame, which helps ensure the data is preprocessed appropriately for analytical tasks. The test suite has two parts: error-handling tests, which check that invalid inputs are rejected, and functionality tests, which verify that data types are identified accurately.

Detailed Breakdown of Tests

Error-Handling Tests

  1. Invalid DataFrame Type:

    • Ensures a TypeError is raised if the input is not a DataFrame.
  2. Invalid Column Names Type:

    • Verifies a TypeError for non-list col_names.
  3. Non-string Column Names:

    • Checks for TypeError when column names are not strings.
  4. Invalid Max Unique Values Ratio Type:

    • Ensures a TypeError for non-float max_unique_values_ratio.
  5. Invalid Min Unique Values Type:

    • Checks for TypeError when min_unique_values is not an integer.
  6. Invalid String Length Threshold Type:

    • Verifies a TypeError for non-integer string_length_threshold.
  7. Invalid Output Type:

    • Ensures a TypeError for non-string output parameter.
  8. Empty DataFrame:

    • Checks for a ValueError when the DataFrame is empty.
  9. Invalid Max Unique Values Ratio:

    • Verifies handling of out-of-bounds max_unique_values_ratio.
  10. Invalid Min Unique Values:

    • Ensures correct error handling for invalid min_unique_values.
  11. Invalid String Length Threshold:

    • Checks that invalid string_length_threshold values raise an error.
  12. Empty Column Names List:

    • Verifies a ValueError for empty col_names.
  13. Nonexistent Column:

    • Ensures correct error handling for nonexistent columns.
  14. Invalid Output Option:

    • Checks for a ValueError with invalid output options.
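
The error-handling tests above follow a common pytest pattern: call the function with one bad argument and assert that the expected exception is raised. The sketch below illustrates that pattern only. `evaluate_dtype_stub` and `VALID_OUTPUTS` are hypothetical stand-ins written for this example so it can run on its own; they are not DataSafari's implementation, and the set of valid output options is an assumption. Only the checks whose exception type is named in the list are mirrored.

```python
import pandas as pd
import pytest

# Assumed set of valid options; the summary only mentions 'dict' and list output.
VALID_OUTPUTS = {"dict", "list"}


def evaluate_dtype_stub(df, col_names, output="dict"):
    """Hypothetical stand-in for the input validation of evaluate_dtype().

    Not DataSafari's code: it mirrors only the checks whose exception
    type is stated in the list above.
    """
    if not isinstance(df, pd.DataFrame):
        raise TypeError("df must be a pandas DataFrame")
    if not isinstance(col_names, list):
        raise TypeError("col_names must be a list")
    if not all(isinstance(name, str) for name in col_names):
        raise TypeError("col_names must contain only strings")
    if not isinstance(output, str):
        raise TypeError("output must be a string")
    if df.empty:
        raise ValueError("df must not be empty")
    if not col_names:
        raise ValueError("col_names must not be empty")
    if output not in VALID_OUTPUTS:
        raise ValueError(f"output must be one of {sorted(VALID_OUTPUTS)}")
    # Placeholder result: report the raw pandas dtype of each requested column.
    return {name: str(df[name].dtype) for name in col_names}


def test_invalid_dataframe_type():
    with pytest.raises(TypeError):
        evaluate_dtype_stub("not a dataframe", ["Income"])


def test_invalid_col_names_type():
    df = pd.DataFrame({"Income": [1.0, 2.0]})
    with pytest.raises(TypeError):
        evaluate_dtype_stub(df, "Income")  # a bare string, not a list


def test_empty_dataframe():
    with pytest.raises(ValueError):
        evaluate_dtype_stub(pd.DataFrame(), ["Income"])


def test_invalid_output_option():
    df = pd.DataFrame({"Income": [1.0, 2.0]})
    with pytest.raises(ValueError):
        evaluate_dtype_stub(df, ["Income"], output="xml")
```

Keeping one invalid argument per test makes a failure point directly at the check that regressed.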

Functionality Tests

  1. Numerical Data Identification:

    • Confirms correct identification of numerical data.
  2. Categorical Identification from Numerical:

    • Verifies that numerical data is treated as categorical based on its number of unique values.
  3. Text Data Identification:

    • Checks for correct identification of text data based on string length.
  4. Datetime Data Identification:

    • Ensures correct identification of datetime columns.
  5. Multiple Columns Identification:

    • Tests accurate data type identification for multiple columns.
  6. Output for Numerical Data:

    • Confirms list output matches expected numerical data types.
  7. Output for Categorical Data:

    • Verifies list output for categorical data identification.
  8. Output for Datetime Data:

    • Ensures correct list output for datetime columns.
  9. Output for Text Data:

    • Checks list output for text data based on string length.
  10. Handling Small Categorical Data:

    • Verifies identification of categorical data in small datasets.
  11. Handling Small Text Data:

    • Ensures text identification in smaller datasets.
  12. Handling Small Numerical Data:

    • Checks numerical data identification in reduced datasets.
  13. Handling Small Mixed Data Types:

    • Tests identification across multiple data types in small datasets.
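
To make the functionality tests above more concrete, here is a rough sketch of the kind of heuristic they exercise. `classify_column` is a hypothetical helper written for this illustration, and its decision rules and default thresholds are assumptions, not DataSafari's actual logic. Only the parameter names (max_unique_values_ratio, min_unique_values, string_length_threshold) and the four labels come from the summary.

```python
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_numeric_dtype


def classify_column(series, max_unique_values_ratio=0.05,
                    min_unique_values=10, string_length_threshold=50):
    """Hypothetical heuristic; thresholds and rules are assumed for illustration."""
    if is_datetime64_any_dtype(series):
        return "datetime"
    if is_numeric_dtype(series):
        n_unique = series.nunique()
        unique_ratio = n_unique / len(series)
        # Few distinct values suggests a coded category rather than a measurement.
        if n_unique < min_unique_values or unique_ratio <= max_unique_values_ratio:
            return "categorical"
        return "numerical"
    # Non-numeric columns: long strings count as free text, short ones as categories.
    mean_length = series.astype(str).str.len().mean()
    return "text" if mean_length > string_length_threshold else "categorical"


# Illustrative data; only the column name 'Income' appears in the summary.
df = pd.DataFrame({
    "Income": [1000.5 * i for i in range(100)],        # 100 distinct floats
    "Rating": [1, 2, 3, 4, 5] * 20,                     # 5 distinct integers
    "Review": ["x" * 80 + str(i) for i in range(100)],  # long strings
    "Signup": pd.date_range("2024-01-01", periods=100, freq="D"),
})

result = {col: classify_column(df[col]) for col in df.columns}
```

A test for multiple columns would then compare `result` against the expected label for each column in one assertion.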

Example Code from the Suite

Here is an example test from the suite, covering "Numerical Data Identification":

def test_evaluate_dtype_numerical_identification(sample_dataframe):
    """Test correct identification of numerical data types."""
    result = evaluate_dtype(sample_dataframe, ['Income'], output='dict')
    assert result['Income'] == 'numerical', "Income should be identified as numerical"

This test verifies that the evaluate_dtype() function correctly identifies a column (Income) as numerical based on its contents.
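
The test receives its data through a `sample_dataframe` pytest fixture, which this summary does not show. A minimal sketch of what such a fixture might look like follows. Everything except the `Income` column name is an assumption: the `make_sample_dataframe` helper, the other columns, the row count, and the value distributions are all hypothetical.

```python
import numpy as np
import pandas as pd
import pytest


def make_sample_dataframe(n_rows=100, seed=0):
    """Build a small mixed-type DataFrame (hypothetical; only 'Income' is from the example)."""
    rng = np.random.default_rng(seed)  # fixed seed keeps test data reproducible
    return pd.DataFrame({
        "Income": rng.normal(50_000, 15_000, n_rows),
        "Category": rng.choice(["A", "B", "C"], n_rows),
        "Signup": pd.date_range("2024-01-01", periods=n_rows, freq="D"),
    })


@pytest.fixture
def sample_dataframe():
    """Injected by pytest into any test that declares a `sample_dataframe` parameter."""
    return make_sample_dataframe()
```

Splitting the builder from the fixture lets the same data be created outside pytest, since fixture functions are not meant to be called directly.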

Full Test Suite Access

The full test suite is available here: Evaluate Dtype Test Suite.