ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0
2 stars 0 forks

Implement error handling for evaluate_dtype() #84

ETA444 closed this issue 4 months ago

ETA444 commented 4 months ago

Implementation Summary

The evaluate_dtype() function performs advanced data type evaluation on DataFrame columns. It goes beyond simple dtype checks by considering the distribution of unique values and other factors that affect whether a column should be treated as categorical, numerical, text, or datetime. The error handling described below validates the inputs up front, so the function fails early with a clear message and guides users toward correct inputs.
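The issue does not show the evaluation logic itself, so the following is only a rough sketch of how a unique-value ratio could inform the categorical-versus-numerical decision. The helper names, the 0.05 threshold, and the rule itself are assumptions for illustration, not DataSafari's actual implementation.

```python
import pandas as pd


def unique_ratio(series):
    """Share of distinct non-null values among all rows of a column."""
    # Assumes a non-empty column; an empty one would divide by zero.
    return series.nunique() / len(series)


def looks_categorical(series, max_unique_values_ratio=0.05):
    # Assumed rule: few distinct values relative to the row count
    # suggests the column encodes categories rather than measurements.
    return unique_ratio(series) <= max_unique_values_ratio


df = pd.DataFrame({
    "rating": [1, 2, 3, 2, 1, 3, 2, 1, 3, 2] * 10,     # 3 distinct values in 100 rows
    "income": [1000.0 + 7.5 * i for i in range(100)],  # 100 distinct values in 100 rows
})

print(looks_categorical(df["rating"]))
print(looks_categorical(df["income"]))
```

Under this assumed rule, a numerically encoded column such as "rating" would be flagged as a categorical candidate, while "income" would not.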

Detailed Error Handling Breakdown

Type Validations

# Excerpt from the function body; assumes `import pandas as pd`.
# 'df' must be a pandas DataFrame.
if not isinstance(df, pd.DataFrame):
    raise TypeError("evaluate_dtype(): The 'df' parameter must be a pandas DataFrame.")
# 'col_names' must be a list, and every element must be a string.
if not isinstance(col_names, list):
    raise TypeError("evaluate_dtype(): The 'col_names' parameter must be a list.")
elif not all(isinstance(col, str) for col in col_names):
    raise TypeError("evaluate_dtype(): All elements in the 'col_names' list must be strings representing column names.")
# 'max_unique_values_ratio' must be numeric.
if not isinstance(max_unique_values_ratio, (float, int)):
    raise TypeError("evaluate_dtype(): The 'max_unique_values_ratio' must be a float or integer.")
# Similar checks for other parameters

Value Validations

# The DataFrame must contain data.
if df.empty:
    raise ValueError("evaluate_dtype(): The input DataFrame is empty.")
# The ratio must lie in the closed interval [0, 1].
if not 0 <= max_unique_values_ratio <= 1:
    raise ValueError("evaluate_dtype(): The 'max_unique_values_ratio' must be between 0 and 1.")
# Similar checks for other numerical parameters
# Every requested column must exist in the DataFrame.
missing_cols = [col for col in col_names if col not in df.columns]
if missing_cols:
    raise ValueError(f"evaluate_dtype(): The following columns were not found in the DataFrame: {', '.join(missing_cols)}")
# 'output' must be one of the supported output formats.
valid_outputs = ['dict', 'list_n', 'list_c', 'list_d', 'list_t']
if output not in valid_outputs:
    raise ValueError(f"evaluate_dtype(): Invalid output '{output}'. Valid options are: {', '.join(valid_outputs)}")
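
To see how these checks behave together, the sketch below bundles a subset of them into one helper and triggers a few failures. The helper names, the default argument values, and the sample data are assumptions made for illustration; the real evaluate_dtype() signature may differ.

```python
import pandas as pd

VALID_OUTPUTS = ['dict', 'list_n', 'list_c', 'list_d', 'list_t']


def validate_inputs(df, col_names, max_unique_values_ratio=0.1, output='dict'):
    """Hypothetical helper bundling some of the checks listed in the issue."""
    # Type validations run first, so the value checks can rely on correct types.
    if not isinstance(df, pd.DataFrame):
        raise TypeError("evaluate_dtype(): The 'df' parameter must be a pandas DataFrame.")
    if not isinstance(col_names, list):
        raise TypeError("evaluate_dtype(): The 'col_names' parameter must be a list.")
    if not isinstance(max_unique_values_ratio, (float, int)):
        raise TypeError("evaluate_dtype(): The 'max_unique_values_ratio' must be a float or integer.")
    # Value validations.
    if df.empty:
        raise ValueError("evaluate_dtype(): The input DataFrame is empty.")
    if not 0 <= max_unique_values_ratio <= 1:
        raise ValueError("evaluate_dtype(): The 'max_unique_values_ratio' must be between 0 and 1.")
    missing_cols = [col for col in col_names if col not in df.columns]
    if missing_cols:
        raise ValueError(
            "evaluate_dtype(): The following columns were not found in the DataFrame: "
            + ", ".join(missing_cols)
        )
    if output not in VALID_OUTPUTS:
        raise ValueError(f"evaluate_dtype(): Invalid output '{output}'.")


def describe_failure(*args, **kwargs):
    """Run the validation and report the outcome as a short string."""
    try:
        validate_inputs(*args, **kwargs)
    except (TypeError, ValueError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return "ok"


df = pd.DataFrame({"age": [25, 32, 47], "city": ["Oslo", "Lima", "Oslo"]})

print(describe_failure(df, ["age", "city"]))          # valid call
print(describe_failure(df, "age"))                    # col_names is not a list
print(describe_failure(df, ["age", "height"]))        # unknown column
print(describe_failure(df, ["age"], output="table"))  # unsupported output
```

Running the type checks before the value checks means a call such as describe_failure(df, "age") reports the wrong argument type rather than a confusing missing-column error.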
