The evaluate_dtype() function classifies the data types of specified columns in a DataFrame, which determines how each column should be preprocessed before analysis. Its test suite has two parts: error-handling tests, which check that invalid inputs are rejected, and functionality tests, which check that data types are identified correctly.
Detailed Breakdown of Tests
Error-Handling Tests
Invalid DataFrame Type:
Ensures a TypeError is raised if the input is not a DataFrame.
Invalid Column Names Type:
Verifies a TypeError for non-list col_names.
Non-string Column Names:
Checks for TypeError when column names are not strings.
Invalid Max Unique Values Ratio Type:
Ensures a TypeError for non-float max_unique_values_ratio.
Invalid Min Unique Values Type:
Checks for TypeError when min_unique_values is not an integer.
Invalid String Length Threshold Type:
Verifies a TypeError for non-integer string_length_threshold.
Invalid Output Type:
Ensures a TypeError for non-string output parameter.
Empty DataFrame:
Checks for a ValueError when the DataFrame is empty.
Invalid Max Unique Values Ratio:
Verifies that an out-of-bounds max_unique_values_ratio is rejected.
Invalid Min Unique Values:
Ensures that an invalid min_unique_values is rejected.
Invalid String Length Threshold:
Checks that an invalid string_length_threshold is rejected.
Empty Column Names List:
Verifies a ValueError for empty col_names.
Nonexistent Column:
Ensures correct error handling for nonexistent columns.
Invalid Output Option:
Checks for a ValueError with invalid output options.
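The split between type errors and value errors listed above can be sketched as a small validator. This is a hypothetical stand-in, not the actual evaluate_dtype() code: the helper name validate_params, the accepted output options ("dict" and "list"), and every numeric bound are assumptions made for illustration. The DataFrame checks (input type, empty frame, nonexistent columns) are left out so the sketch runs without pandas.

```python
# Hypothetical sketch of the parameter checks; the real function may differ.
VALID_OUTPUTS = ("dict", "list")  # assumed option names


def validate_params(col_names, max_unique_values_ratio, min_unique_values,
                    string_length_threshold, output):
    """Raise TypeError for wrong types, then ValueError for bad values."""
    # Type checks run first, so a wrong type never reaches a value comparison.
    if not isinstance(col_names, list):
        raise TypeError("col_names must be a list")
    if not all(isinstance(name, str) for name in col_names):
        raise TypeError("every entry in col_names must be a string")
    if not isinstance(max_unique_values_ratio, float):
        raise TypeError("max_unique_values_ratio must be a float")
    for label, value in (("min_unique_values", min_unique_values),
                         ("string_length_threshold", string_length_threshold)):
        # bool is a subclass of int, so it is excluded explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError(f"{label} must be an integer")
    if not isinstance(output, str):
        raise TypeError("output must be a string")

    # Value checks; all bounds below are assumptions.
    if not col_names:
        raise ValueError("col_names must not be empty")
    if not 0.0 <= max_unique_values_ratio <= 1.0:
        raise ValueError("max_unique_values_ratio must lie in [0, 1]")
    if min_unique_values < 1:
        raise ValueError("min_unique_values must be at least 1")
    if string_length_threshold < 1:
        raise ValueError("string_length_threshold must be at least 1")
    if output not in VALID_OUTPUTS:
        raise ValueError(f"output must be one of {VALID_OUTPUTS}")


# Usage: a string instead of a list trips the first type check.
try:
    validate_params("Income", 0.1, 5, 20, "dict")
except TypeError as exc:
    print(f"rejected: {exc}")
```

Running the type checks before the value checks means each test in the list above can target exactly one exception class.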
Functionality Tests
Numerical Data Identification:
Confirms correct identification of numerical data.
Categorical Identification from Numerical:
Verifies that a numerical column is classified as categorical based on its count of unique values.
Text Data Identification:
Checks for correct identification of text data based on string length.
Datetime Data Identification:
Ensures correct identification of datetime columns.
Multiple Columns Identification:
Tests accurate data type identification for multiple columns.
Output for Numerical Data:
Confirms list output matches expected numerical data types.
Output for Categorical Data:
Verifies list output for categorical data identification.
Output for Datetime Data:
Ensures correct list output for datetime columns.
Output for Text Data:
Checks list output for text data based on string length.
Handling Small Categorical Data:
Verifies identification of categorical data in small datasets.
Handling Small Text Data:
Ensures text identification in smaller datasets.
Handling Small Numerical Data:
Checks numerical data identification in reduced datasets.
Handling Small Mixed Data Types:
Tests identification across multiple data types in small datasets.
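To make the categories above concrete, here is a minimal sketch of one way such a classifier could work. The source does not describe the rules evaluate_dtype() actually uses, so everything here is an assumption: the helper name classify_column, the way the three thresholds are combined, and the use of average string length. It operates on plain Python lists rather than DataFrame columns to stay dependency-free.

```python
from datetime import date


def classify_column(values, max_unique_values_ratio, min_unique_values,
                    string_length_threshold):
    """Label a column 'datetime', 'numerical', 'categorical', or 'text'.

    The decision rules are illustrative assumptions, not the library's logic.
    """
    non_null = [v for v in values if v is not None]
    if not non_null:
        raise ValueError("column has no non-null values")

    # datetime.datetime is a subclass of date, so this covers both.
    if all(isinstance(v, date) for v in non_null):
        return "datetime"

    n_unique = len(set(non_null))
    unique_ratio = n_unique / len(non_null)

    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool)
        for v in non_null
    )
    if is_numeric:
        # Assumed rule: few distinct values, both in absolute terms and
        # relative to the column length, means the numbers act as categories.
        if (n_unique < min_unique_values
                and unique_ratio <= max_unique_values_ratio):
            return "categorical"
        return "numerical"

    # Assumed rule: long strings on average are free text, short ones labels.
    avg_len = sum(len(str(v)) for v in non_null) / len(non_null)
    return "text" if avg_len > string_length_threshold else "categorical"


thresholds = dict(max_unique_values_ratio=0.05, min_unique_values=10,
                  string_length_threshold=20)
incomes = [30000.0 + 17 * i for i in range(100)]   # 100 distinct values
ratings = [1, 2, 3] * 40                           # 3 distinct values in 120
print(classify_column(incomes, **thresholds))      # numerical
print(classify_column(ratings, **thresholds))      # categorical
```

Requiring both conditions for the numerical-to-categorical switch keeps a short column of distinct numbers classified as numerical, which is the kind of case the small-dataset tests are concerned with.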
Example Code from the Suite
Here is an example test from the suite, covering "Numerical Data Identification":
def test_evaluate_dtype_numerical_identification(sample_dataframe):
    """Test correct identification of numerical data types."""
    result = evaluate_dtype(sample_dataframe, ['Income'], output='dict')
    assert result['Income'] == 'numerical', "Income should be identified as numerical"
This test verifies that the evaluate_dtype() function correctly identifies a column (Income) as numerical based on its contents.
Full Test Suite Access
The full test suite, with every test listed above, is available here: Evaluate Dtype Test Suite.