Construct tests for evaluate_dtype()

Summary of Unit Tests for `evaluate_dtype()`

The evaluate_dtype() function classifies the data types of specified columns in a DataFrame so that the data can be preprocessed appropriately for analysis. The test suite has two parts: error-handling tests, which check that invalid inputs are rejected, and functionality tests, which check that data types are identified correctly.

Detailed Breakdown of Tests

Error-Handling Tests

Invalid DataFrame Type:
- Ensures a TypeError is raised if the input is not a DataFrame.
Invalid Column Names Type:
- Verifies a TypeError for non-list col_names.
Non-string Column Names:
- Checks for TypeError when column names are not strings.
Invalid Max Unique Values Ratio Type:
- Ensures a TypeError for non-float max_unique_values_ratio.
Invalid Min Unique Values Type:
- Checks for TypeError when min_unique_values is not an integer.
Invalid String Length Threshold Type:
- Verifies a TypeError for non-integer string_length_threshold.
Invalid Output Type:
- Ensures a TypeError for non-string output parameter.
Empty DataFrame:
- Checks for a ValueError when the DataFrame is empty.
Invalid Max Unique Values Ratio:
- Verifies handling of out-of-bounds max_unique_values_ratio.
Invalid Min Unique Values:
- Ensures correct error handling for invalid min_unique_values.
Invalid String Length Threshold:
- Checks that invalid string_length_threshold values raise an error.
Empty Column Names List:
- Verifies a ValueError for empty col_names.
Nonexistent Column:
- Ensures correct error handling for nonexistent columns.
Invalid Output Option:
- Checks for a ValueError with invalid output options.
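
The error-handling tests above could be written along the following lines. This is a minimal sketch, not the suite's actual code: evaluate_dtype_stub is a hypothetical stand-in that reproduces only some of the checks listed above so the example runs on its own, and the exact order of checks and error messages in datasafari may differ. In the real suite the tests would call evaluate_dtype() directly.

```python
import pandas as pd
import pytest


def evaluate_dtype_stub(df, col_names, output="dict"):
    """Hypothetical stand-in mirroring a few of the listed input checks.

    This is NOT datasafari's implementation; it exists only so the test
    pattern below can run without the library installed.
    """
    if not isinstance(df, pd.DataFrame):
        raise TypeError("df must be a pandas DataFrame")
    if not isinstance(col_names, list):
        raise TypeError("col_names must be a list")
    if not all(isinstance(name, str) for name in col_names):
        raise TypeError("col_names must contain only strings")
    if not isinstance(output, str):
        raise TypeError("output must be a string")
    if df.empty:
        raise ValueError("df must not be empty")
    if not col_names:
        raise ValueError("col_names must not be empty")
    # Classification itself is omitted in this sketch.
    return {name: None for name in col_names}


VALID_DF = pd.DataFrame({"Income": [52000.0, 61000.0, 48500.0]})


@pytest.mark.parametrize(
    "df, col_names, output",
    [
        ("not a dataframe", ["Income"], "dict"),  # invalid DataFrame type
        (VALID_DF, "Income", "dict"),             # col_names is not a list
        (VALID_DF, [1, 2], "dict"),               # non-string column names
        (VALID_DF, ["Income"], 123),              # non-string output
    ],
)
def test_type_errors(df, col_names, output):
    """Each wrongly typed argument should raise a TypeError."""
    with pytest.raises(TypeError):
        evaluate_dtype_stub(df, col_names, output=output)


def test_empty_dataframe():
    """An empty DataFrame should raise a ValueError."""
    with pytest.raises(ValueError):
        evaluate_dtype_stub(pd.DataFrame(), ["Income"])


def test_empty_column_names():
    """An empty col_names list should raise a ValueError."""
    with pytest.raises(ValueError):
        evaluate_dtype_stub(VALID_DF, [])
```

Grouping the TypeError cases in one parametrized test keeps each invalid input on a single line, while pytest still reports each case separately.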

Functionality Tests

Numerical Data Identification:
- Confirms correct identification of numerical data.
Categorical Identification from Numerical:
- Verifies that numerical data is treated as categorical based on its number of unique values.
Text Data Identification:
- Checks for correct identification of text data based on string length.
Datetime Data Identification:
- Ensures correct identification of datetime columns.
Multiple Columns Identification:
- Tests accurate data type identification for multiple columns.
Output for Numerical Data:
- Confirms list output matches expected numerical data types.
Output for Categorical Data:
- Verifies list output for categorical data identification.
Output for Datetime Data:
- Ensures correct list output for datetime columns.
Output for Text Data:
- Checks list output for text data based on string length.
Handling Small Categorical Data:
- Verifies identification of categorical data in small datasets.
Handling Small Text Data:
- Ensures text identification in smaller datasets.
Handling Small Numerical Data:
- Checks numerical data identification in reduced datasets.
Handling Small Mixed Data Types:
- Tests identification across multiple data types in small datasets.
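
As a rough illustration of how several of the functionality tests above could share one parametrized test, consider the sketch below. The classifier classify_column_stub is a hypothetical, simplified stand-in: it borrows the parameter names min_unique_values and string_length_threshold from evaluate_dtype(), but the rules and default values shown are assumptions made for this example and are not taken from datasafari. The column names and data are likewise invented.

```python
import pandas as pd
import pytest
from pandas.api import types as ptypes


def classify_column_stub(series, min_unique_values=10, string_length_threshold=50):
    """Hypothetical, simplified classifier (NOT datasafari's logic)."""
    if ptypes.is_datetime64_any_dtype(series):
        return "datetime"
    if ptypes.is_numeric_dtype(series):
        # Assumption: few distinct values means the numbers act as categories.
        if series.nunique() >= min_unique_values:
            return "numerical"
        return "categorical"
    # Assumption: long average string length indicates free text.
    average_length = series.astype(str).str.len().mean()
    return "text" if average_length > string_length_threshold else "categorical"


def make_frame(n_rows=100):
    """Build illustrative data with one column per expected type."""
    return pd.DataFrame(
        {
            "Income": [30000.0 + 250.5 * i for i in range(n_rows)],
            "Rating": [i % 5 for i in range(n_rows)],
            "Signup": pd.date_range("2021-01-01", periods=n_rows, freq="D"),
            "Review": ["word " * 20 + str(i) for i in range(n_rows)],
            "City": ["Lisbon" if i % 2 else "Oslo" for i in range(n_rows)],
        }
    )


@pytest.mark.parametrize(
    "column, expected",
    [
        ("Income", "numerical"),
        ("Rating", "categorical"),  # numeric values, only 5 distinct
        ("Signup", "datetime"),
        ("Review", "text"),         # strings longer than the threshold
        ("City", "categorical"),
    ],
)
def test_column_identification(column, expected):
    """Each column should be classified as its expected type."""
    df = make_frame()
    assert classify_column_stub(df[column]) == expected
```

The small-dataset cases could reuse the same test by passing a smaller n_rows to make_frame() and adjusting the thresholds accordingly.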

Example Code from the Suite

Here is an example test from the suite for "Numerical Data Identification":

def test_evaluate_dtype_numerical_identification(sample_dataframe):
    """Test correct identification of numerical data types."""
    result = evaluate_dtype(sample_dataframe, ['Income'], output='dict')
    assert result['Income'] == 'numerical', "Income should be identified as numerical"

This test verifies that the evaluate_dtype() function correctly identifies a column (Income) as numerical based on its contents.
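
The test above relies on a sample_dataframe fixture that is not shown here. A fixture of that kind might look like the sketch below; the row count, the Category column, and all values are assumptions for illustration, and only the Income column name comes from the example test. The helper build_sample_dataframe is a hypothetical name.

```python
import pandas as pd
import pytest


def build_sample_dataframe(n_rows=100):
    """Plain builder so the same data can be created outside pytest."""
    return pd.DataFrame(
        {
            # Many distinct float values, so a numerical classification is plausible.
            "Income": [40000.0 + 137.5 * i for i in range(n_rows)],
            # Hypothetical extra column with three repeating labels.
            "Category": [["A", "B", "C"][i % 3] for i in range(n_rows)],
        }
    )


@pytest.fixture
def sample_dataframe():
    """Fresh DataFrame for each test, so tests cannot affect one another."""
    return build_sample_dataframe()
```

Returning a new DataFrame per test keeps tests independent even if one of them mutates the data.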

Full Test Suite Access

The full test suite is available here: Evaluate Dtype Test Suite.

ETA444 / datasafari