Implement error handling for evaluate_dtype()

Implementation Summary

The evaluate_dtype() function performs advanced data type evaluation on DataFrame columns. It goes beyond simple type checks by considering unique value distributions and other factors that affect whether a column should be treated as categorical, numerical, text, or datetime. The error handling described below validates the function's inputs so that it fails early with clear messages and guides users toward providing correct inputs.

Detailed Error Handling Breakdown

Type Validations

DataFrame Type Check
- Confirms that the input df is a pandas DataFrame. This check is critical because the function's operations are tailored to DataFrame objects.

if not isinstance(df, pd.DataFrame):
    raise TypeError("evaluate_dtype(): The 'df' parameter must be a pandas DataFrame.")

List and String Type Checks for col_names
- Validates that col_names is a list and its elements are strings, ensuring the columns can be accurately referenced within the DataFrame.

if not isinstance(col_names, list):
    raise TypeError("evaluate_dtype(): The 'col_names' parameter must be a list.")
elif not all(isinstance(col, str) for col in col_names):
    raise TypeError("evaluate_dtype(): All elements in the 'col_names' list must be strings representing column names.")
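
One edge case is worth noting: `all()` returns True for an empty iterable, so an empty list passes both checks above. The sketch below wraps the same two checks in a hypothetical helper, `check_col_names` (a name introduced here for illustration, not part of the original code), to show that behavior. Whether evaluate_dtype() rejects an empty list elsewhere is not shown in this issue.

```python
def check_col_names(col_names):
    """Hypothetical helper reproducing the two col_names checks above."""
    if not isinstance(col_names, list):
        raise TypeError("evaluate_dtype(): The 'col_names' parameter must be a list.")
    elif not all(isinstance(col, str) for col in col_names):
        raise TypeError("evaluate_dtype(): All elements in the 'col_names' list must be strings representing column names.")
    return True


# A list of strings passes.
print(check_col_names(["age", "city"]))

# An empty list also passes, because all() over an empty iterable is True.
print(check_col_names([]))

# A tuple, or a list containing a non-string, is rejected with TypeError.
for bad in (("age", "city"), ["age", 3]):
    try:
        check_col_names(bad)
    except TypeError as exc:
        print(type(exc).__name__, "-", exc)
```

If an empty list should be treated as invalid, an additional `if not col_names:` check raising ValueError would be one way to cover it.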

Type Checks for Other Parameters
- Ensures that max_unique_values_ratio, min_unique_values, string_length_threshold, small_dataset_threshold, and output are of the appropriate types for their intended uses.

if not isinstance(max_unique_values_ratio, (float, int)):
    raise TypeError("evaluate_dtype(): The 'max_unique_values_ratio' must be a float or integer.")
# Similar checks for other parameters
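
The "similar checks" could be written in a table-driven way, as in the minimal sketch below. The expected types for min_unique_values, string_length_threshold, small_dataset_threshold, and output are assumptions inferred from the parameter names, and `check_param_types` and `ASSUMED_PARAM_TYPES` are hypothetical names; the issue itself only shows the check for max_unique_values_ratio.

```python
# Assumed expected types; only the first entry is confirmed by the snippet above.
ASSUMED_PARAM_TYPES = {
    "max_unique_values_ratio": (float, int),
    "min_unique_values": (int,),
    "string_length_threshold": (int,),
    "small_dataset_threshold": (int,),
    "output": (str,),
}


def check_param_types(**params):
    """Hypothetical helper: raise TypeError for the first mistyped parameter."""
    for name, value in params.items():
        expected = ASSUMED_PARAM_TYPES[name]
        if not isinstance(value, expected):
            type_names = " or ".join(t.__name__ for t in expected)
            raise TypeError(
                f"evaluate_dtype(): The '{name}' parameter must be of type "
                f"{type_names}, got {type(value).__name__}."
            )


# Correctly typed values pass silently.
check_param_types(max_unique_values_ratio=0.1, min_unique_values=10, output="dict")

# A string where an integer is expected is rejected.
try:
    check_param_types(min_unique_values="10")
except TypeError as exc:
    print(exc)
```

Note that `bool` is a subclass of `int` in Python, so `True` and `False` pass an `isinstance(value, int)` check unless they are excluded explicitly.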

Value Validations

DataFrame Emptiness Check
- Ensures the DataFrame is not empty, which is necessary for any meaningful data type evaluation.

if df.empty:
    raise ValueError("evaluate_dtype(): The input DataFrame is empty.")
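
For reference, pandas reports `df.empty` as True when any axis has length 0, so a DataFrame that has columns but no rows also counts as empty, while a DataFrame containing only NaN values does not. The short sketch below illustrates this with made-up example data.

```python
import pandas as pd

no_axes = pd.DataFrame()                      # no rows, no columns
no_rows = pd.DataFrame({"age": []})           # one column, zero rows
only_nan = pd.DataFrame({"age": [float("nan")]})  # one row holding NaN

print(no_axes.empty)   # True
print(no_rows.empty)   # True
print(only_nan.empty)  # False: the row exists even though its value is missing
```

A column that consists entirely of missing values therefore reaches the evaluation logic and is not caught by this check.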

Validation of Numerical Ranges and Thresholds
- Checks that numerical parameters such as max_unique_values_ratio and min_unique_values fall within their logical bounds, so that invalid values are rejected up front rather than causing runtime errors during calculations.

if max_unique_values_ratio < 0 or max_unique_values_ratio > 1:
    raise ValueError("evaluate_dtype(): The 'max_unique_values_ratio' must be between 0 and 1.")
# Similar checks for other numerical parameters
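
A minimal sketch of what the remaining range checks might look like follows. The bounds for min_unique_values, string_length_threshold, and small_dataset_threshold (strictly positive) are assumptions, since the issue only shows the check for max_unique_values_ratio; `check_param_ranges` is a hypothetical helper name, and the values in the usage lines are illustrative rather than the library's defaults.

```python
def check_param_ranges(max_unique_values_ratio, min_unique_values,
                       string_length_threshold, small_dataset_threshold):
    """Hypothetical helper grouping the range checks in one place."""
    # Check shown in the issue: the ratio must lie in [0, 1].
    if max_unique_values_ratio < 0 or max_unique_values_ratio > 1:
        raise ValueError("evaluate_dtype(): The 'max_unique_values_ratio' must be between 0 and 1.")

    # Assumption: count and length thresholds must be strictly positive.
    positive_params = {
        "min_unique_values": min_unique_values,
        "string_length_threshold": string_length_threshold,
        "small_dataset_threshold": small_dataset_threshold,
    }
    for name, value in positive_params.items():
        if value <= 0:
            raise ValueError(f"evaluate_dtype(): The '{name}' must be a positive integer.")


# Illustrative values only.
check_param_ranges(0.05, 10, 50, 20)

try:
    check_param_ranges(1.5, 10, 50, 20)
except ValueError as exc:
    print(exc)
```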

Column Presence Validation
- Verifies that each column listed in col_names exists in the DataFrame, preventing errors when accessing non-existent DataFrame columns.

missing_cols = [col for col in col_names if col not in df.columns]
if missing_cols:
    raise ValueError(f"evaluate_dtype(): The following columns were not found in the DataFrame: {', '.join(missing_cols)}")
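
To make the behavior concrete, the sketch below runs the same check against a small made-up DataFrame. The column names and data are illustrative only.

```python
import pandas as pd

df = pd.DataFrame({"age": [25, 32], "city": ["Oslo", "Lima"]})
col_names = ["age", "salary", "country"]

# Same list comprehension as above: collect every requested column that is absent.
missing_cols = [col for col in col_names if col not in df.columns]
print(missing_cols)  # ['salary', 'country']

try:
    if missing_cols:
        raise ValueError(f"evaluate_dtype(): The following columns were not found in the DataFrame: {', '.join(missing_cols)}")
except ValueError as exc:
    message = str(exc)
    print(message)
```

Because all missing columns are collected before raising, the caller sees every problem in a single error message instead of fixing them one at a time.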

Output Type Check
- Ensures the output parameter is one of the allowed options, which determine the structure of the function's return value.

valid_outputs = ['dict', 'list_n', 'list_c', 'list_d', 'list_t']
if output not in valid_outputs:
    raise ValueError(f"evaluate_dtype(): Invalid output '{output}'. Valid options are: {', '.join(valid_outputs)}")
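
The following sketch shows what a caller would see for an unsupported option. `check_output` is a hypothetical wrapper around the check above, introduced only so the snippet can run on its own.

```python
valid_outputs = ['dict', 'list_n', 'list_c', 'list_d', 'list_t']


def check_output(output):
    """Hypothetical wrapper around the output check shown above."""
    if output not in valid_outputs:
        raise ValueError(f"evaluate_dtype(): Invalid output '{output}'. Valid options are: {', '.join(valid_outputs)}")


check_output("dict")  # accepted, no exception

try:
    check_output("series")
except ValueError as exc:
    message = str(exc)
    print(message)
```

The membership test is case-sensitive, so a value such as 'Dict' would be rejected as well.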
