Closed ETA444 closed 6 months ago
Implementation Summary:
The `evaluate_dtype()` function evaluates and categorizes the data types of specified columns in a DataFrame, with special handling for numerical columns that functionally serve as categorical data. Each column is classified as numerical, categorical, text, or datetime based on its properties, and the result is returned either as a dictionary or as a list of booleans, depending on the requested output format.
Code Breakdown:
Initial Setup and Error Handling:
```python
# Error-handling #
# (requires: import warnings; import pandas as pd)
# TypeErrors
if not isinstance(df, pd.DataFrame):
    raise TypeError("evaluate_dtype(): The 'df' parameter must be a pandas DataFrame.")
if not isinstance(col_names, list):
    raise TypeError("evaluate_dtype(): The 'col_names' parameter must be a list.")
elif not all(isinstance(col, str) for col in col_names):
    raise TypeError("evaluate_dtype(): All elements in the 'col_names' list must be strings representing column names.")
if not isinstance(max_unique_values_ratio, (float, int)):
    raise TypeError("evaluate_dtype(): The 'max_unique_values_ratio' must be a float or integer.")
if not isinstance(min_unique_values, int):
    raise TypeError("evaluate_dtype(): The 'min_unique_values' must be an integer.")
if not isinstance(string_length_threshold, int):
    raise TypeError("evaluate_dtype(): The 'string_length_threshold' must be an integer.")
if not isinstance(small_dataset_threshold, int):
    raise TypeError("evaluate_dtype(): The 'small_dataset_threshold' must be an integer.")
if not isinstance(output, str):
    raise TypeError("evaluate_dtype(): The 'output' parameter must be a string.")

# ValueErrors
if df.empty:
    raise ValueError("evaluate_dtype(): The input DataFrame is empty.")
if max_unique_values_ratio < 0 or max_unique_values_ratio > 1:
    raise ValueError("evaluate_dtype(): The 'max_unique_values_ratio' must be between 0 and 1.")
if min_unique_values < 1:
    raise ValueError("evaluate_dtype(): The 'min_unique_values' must be at least 1.")
if string_length_threshold <= 0:
    raise ValueError("evaluate_dtype(): The 'string_length_threshold' must be greater than 0.")
if small_dataset_threshold <= 0:
    raise ValueError("evaluate_dtype(): The 'small_dataset_threshold' must be greater than 0.")
if len(col_names) == 0:
    raise ValueError("evaluate_dtype(): The 'col_names' list must contain at least one column name.")
missing_cols = [col for col in col_names if col not in df.columns]
if missing_cols:
    raise ValueError(f"evaluate_dtype(): The following columns were not found in the DataFrame: {', '.join(missing_cols)}")
valid_outputs = ['dict', 'list_n', 'list_c', 'list_d', 'list_t']
if output not in valid_outputs:
    raise ValueError(f"evaluate_dtype(): Invalid output '{output}'. Valid options are: {', '.join(valid_outputs)}")

# Warn about small dataset size #
if df.shape[0] < 60:
    warnings.warn(
        f"evaluate_dtype(): Dataset size ({df.shape[0]}) is smaller than 60. "
        f"Evaluation may be inaccurate.",
        UserWarning
    )
```
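To see how these guards surface to callers, here is a small self-contained sketch that mirrors a few of the checks above (the `check_ratio` helper is invented for illustration and is not part of the library):

```python
import pandas as pd

def check_ratio(df, max_unique_values_ratio):
    # Mirror three of evaluate_dtype()'s guards: type, range, and emptiness
    if not isinstance(max_unique_values_ratio, (float, int)):
        raise TypeError("'max_unique_values_ratio' must be a float or integer.")
    if max_unique_values_ratio < 0 or max_unique_values_ratio > 1:
        raise ValueError("'max_unique_values_ratio' must be between 0 and 1.")
    if df.empty:
        raise ValueError("The input DataFrame is empty.")
    return True

df = pd.DataFrame({"a": [1, 2, 3]})
print(check_ratio(df, 0.05))   # True: all guards pass

try:
    check_ratio(df, 2.0)       # ratio outside [0, 1]
except ValueError as e:
    print(e)
```

Performing all type checks before value checks, as the function does, means a caller who passes the wrong kind of object gets a `TypeError` rather than a confusing comparison failure.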
Evaluation of Data Types:
```python
# Main Function #
# (requires: from pandas.api.types import is_numeric_dtype, is_string_dtype, is_datetime64_any_dtype)
data_type_dictionary = {}
for col in col_names:
    if is_numeric_dtype(df[col]):
        # Numeric columns with few unique values are treated as categorical
        num_unique_values = df[col].nunique()
        total_values = len(df[col])
        if (num_unique_values <= min_unique_values) or (num_unique_values / total_values <= max_unique_values_ratio):
            data_type_dictionary[col] = 'categorical'
        else:
            data_type_dictionary[col] = 'numerical'
    elif is_string_dtype(df[col]):
        # Distinguish free text from categorical strings via average length and uniqueness
        avg_str_length = df[col].dropna().apply(len).mean()
        num_unique_values = df[col].nunique()
        total_values = len(df[col])
        if total_values <= small_dataset_threshold and num_unique_values < min_unique_values:
            data_type_dictionary[col] = 'categorical'
        elif avg_str_length > string_length_threshold or num_unique_values / total_values > max_unique_values_ratio:
            data_type_dictionary[col] = 'text'
        else:
            data_type_dictionary[col] = 'categorical'
    elif is_datetime64_any_dtype(df[col]):
        data_type_dictionary[col] = 'datetime'
    else:
        data_type_dictionary[col] = 'categorical'
```
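The numerical-as-categorical heuristic is the core of this enhancement. The following self-contained sketch replays that branch on a toy DataFrame (the column names and threshold values are invented for the example):

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

df = pd.DataFrame({
    "rating": [1, 2, 3, 2, 1] * 20,   # numeric dtype, but only 3 unique values
    "price": range(100),              # genuinely numerical: all values unique
})

min_unique_values = 10        # example thresholds, not library defaults
max_unique_values_ratio = 0.05

result = {}
for col in df.columns:
    if is_numeric_dtype(df[col]):
        num_unique = df[col].nunique()
        total = len(df[col])
        # Few unique values, absolutely or relative to size -> categorical
        if num_unique <= min_unique_values or num_unique / total <= max_unique_values_ratio:
            result[col] = "categorical"
        else:
            result[col] = "numerical"

print(result)  # {'rating': 'categorical', 'price': 'numerical'}
```

`rating` has only 3 distinct values in 100 rows, so it is flagged as categorical even though its dtype is numeric; `price` has 100 distinct values and stays numerical.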
Formatting the Output:
```python
if output.lower() == 'dict':
    return data_type_dictionary
elif output.lower() == 'list_n':
    numerical_type_list = [dtype == 'numerical' for dtype in data_type_dictionary.values()]
    return numerical_type_list
elif output.lower() == 'list_c':
    categorical_type_list = [dtype == 'categorical' for dtype in data_type_dictionary.values()]
    return categorical_type_list
elif output.lower() == 'list_d':
    datetime_type_list = [dtype == 'datetime' for dtype in data_type_dictionary.values()]
    return datetime_type_list
elif output.lower() == 'list_t':
    text_type_list = [dtype == 'text' for dtype in data_type_dictionary.values()]
    return text_type_list
```
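The boolean-list outputs map one-to-one onto the evaluated columns, in `col_names` order. A minimal illustration of that mapping (the column names and dtype labels here are invented for the example):

```python
# Hypothetical result of evaluate_dtype(df, col_names, output='dict')
data_type_dictionary = {"age": "numerical", "city": "categorical", "bio": "text"}

# output='list_n' marks the numerical columns
list_n = [dtype == "numerical" for dtype in data_type_dictionary.values()]
# output='list_t' marks the text columns
list_t = [dtype == "text" for dtype in data_type_dictionary.values()]

print(list_n)  # [True, False, False]
print(list_t)  # [False, False, True]
```

These boolean masks are convenient for selecting column subsets downstream, e.g. `df.loc[:, list_n]` to keep only the numerical columns.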
The `output` parameter accepts `dict`, `list_n`, `list_c`, `list_d`, or `list_t`. Link to Full Code: evaluate_dtype.py.
Description: The `evaluate_dtype` function currently categorizes columns in a DataFrame as either numerical or categorical based on their data types. However, numerical columns with a limited range of unique values often functionally serve as categorical columns. This enhancement improves the function's ability to infer categorical data from numerical columns, providing more flexibility and accuracy in data type evaluation.
Proposed Changes:
Expected Outcome: With this enhancement, users gain a more comprehensive and flexible tool for evaluating data types in their DataFrame. It enables more accurate identification of categorical data, especially where numerical columns exhibit limited variability.
Additional Context: Improving the `evaluate_dtype` function aligns with our goal of providing robust and user-friendly data analysis tools. This enhancement addresses a common need in data preprocessing and analysis workflows, improving the overall usability and effectiveness of our library.