ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Implement evaluate_dtype() enhanced functionality #81

Closed ETA444 closed 6 months ago

ETA444 commented 8 months ago

Description: The evaluate_dtype function currently categorizes columns in a DataFrame as either numerical or categorical based on their data types. However, in many scenarios, numerical columns with a limited range of unique values may functionally serve as categorical columns. This enhancement aims to improve the function's capability to infer categorical data from numerical columns, providing more flexibility and accuracy in data type evaluation.

Proposed Changes:

Expected Outcome: With this enhancement, users will have a more comprehensive and flexible tool for evaluating data types in their DataFrame. It will enable more accurate identification of categorical data, especially in cases where numerical columns exhibit limited variability.

Additional Context: Improving the functionality of the evaluate_dtype function aligns with our goal of providing robust and user-friendly data analysis tools. This enhancement addresses a common need in data preprocessing and analysis workflows, improving the overall usability and effectiveness of our library.
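The motivating scenario can be sketched in a few lines (column name and data are hypothetical): a numeric column with only a handful of distinct values reports a numeric dtype, yet its low unique-value ratio suggests it is functionally categorical.

```python
import pandas as pd

# Hypothetical survey column: integers 1-5 stored as int64,
# so dtype alone would classify it as numerical.
df = pd.DataFrame({"satisfaction": [1, 2, 3, 4, 5] * 40})

print(df["satisfaction"].dtype)  # int64

# A low ratio of unique to total values hints at categorical use.
ratio = df["satisfaction"].nunique() / len(df["satisfaction"])
print(ratio)  # → 0.025
```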

ETA444 commented 6 months ago

Implementation Summary:

The evaluate_dtype() function evaluates and categorizes data types of specified columns in a DataFrame, with enhanced handling for numerical data that may functionally serve as categorical data. The function identifies columns as numerical, categorical, text, or datetime based on their properties and returns this information either as a dictionary or a list of booleans, depending on the specified output format.

Code Breakdown:

  1. Initial Setup and Error Handling:

    • Purpose: Validate input types and values.
    # Error-handling #
    # TypeErrors
    if not isinstance(df, pd.DataFrame):
       raise TypeError("evaluate_dtype(): The 'df' parameter must be a pandas DataFrame.")
    if not isinstance(col_names, list):
       raise TypeError("evaluate_dtype(): The 'col_names' parameter must be a list.")
    elif not all(isinstance(col, str) for col in col_names):
       raise TypeError("evaluate_dtype(): All elements in the 'col_names' list must be strings representing column names.")
    if not isinstance(max_unique_values_ratio, (float, int)):
       raise TypeError("evaluate_dtype(): The 'max_unique_values_ratio' must be a float or integer.")
    if not isinstance(min_unique_values, int):
       raise TypeError("evaluate_dtype(): The 'min_unique_values' must be an integer.")
    if not isinstance(string_length_threshold, int):
       raise TypeError("evaluate_dtype(): The 'string_length_threshold' must be an integer.")
    if not isinstance(small_dataset_threshold, int):
       raise TypeError("evaluate_dtype(): The 'small_dataset_threshold' must be an integer.")
    if not isinstance(output, str):
       raise TypeError("evaluate_dtype(): The 'output' parameter must be a string.")
    
    # ValueErrors
    if df.empty:
       raise ValueError("evaluate_dtype(): The input DataFrame is empty.")
    if max_unique_values_ratio < 0 or max_unique_values_ratio > 1:
       raise ValueError("evaluate_dtype(): The 'max_unique_values_ratio' must be between 0 and 1.")
    if min_unique_values < 1:
       raise ValueError("evaluate_dtype(): The 'min_unique_values' must be at least 1.")
    if string_length_threshold <= 0:
       raise ValueError("evaluate_dtype(): The 'string_length_threshold' must be greater than 0.")
    if small_dataset_threshold <= 0:
       raise ValueError("evaluate_dtype(): The 'small_dataset_threshold' must be greater than 0.")
    if len(col_names) == 0:
       raise ValueError("evaluate_dtype(): The 'col_names' list must contain at least one column name.")
    missing_cols = [col for col in col_names if col not in df.columns]
    if missing_cols:
       raise ValueError(f"evaluate_dtype(): The following columns were not found in the DataFrame: {', '.join(missing_cols)}")
    valid_outputs = ['dict', 'list_n', 'list_c', 'list_d', 'list_t']
    if output not in valid_outputs:
       raise ValueError(f"evaluate_dtype(): Invalid output '{output}'. Valid options are: {', '.join(valid_outputs)}")
    • Explanation:
      • The function first checks if the input types are correct and then validates the input values. It ensures the parameters are appropriate for further analysis.
    # Warn about small dataset size #
    if df.shape[0] < 60:
       warnings.warn(
           f"evaluate_dtype(): Dataset size ({df.shape[0]}) is smaller than 60. "
           f"Evaluation may be inaccurate.",
           UserWarning
       )
    • Explanation:
      • A warning is issued if the dataset size is smaller than 60, cautioning that the evaluation might be inaccurate.
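The guard can be exercised in isolation (a standalone sketch of the check only, not a call into the library):

```python
import warnings
import pandas as pd

df = pd.DataFrame({"x": range(30)})  # 30 rows — below the 60-row cutoff

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    if df.shape[0] < 60:  # the same condition used in the function body
        warnings.warn(
            f"evaluate_dtype(): Dataset size ({df.shape[0]}) is smaller than 60. "
            f"Evaluation may be inaccurate.",
            UserWarning
        )

print(len(caught))                        # → 1
print(caught[0].category is UserWarning)  # → True
```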
  2. Evaluation of Data Types:

    • Purpose: Determine the data type of each specified column.
    # Main Function #
    data_type_dictionary = {}
    for col in col_names:
       if is_numeric_dtype(df[col]):
           num_unique_values = df[col].nunique()
           total_values = len(df[col])
           if (num_unique_values <= min_unique_values) or (num_unique_values / total_values <= max_unique_values_ratio):
               data_type_dictionary[col] = 'categorical'
           else:
               data_type_dictionary[col] = 'numerical'
       elif is_string_dtype(df[col]):
           avg_str_length = df[col].dropna().apply(len).mean()
           num_unique_values = df[col].nunique()
           total_values = len(df[col])
           if total_values <= small_dataset_threshold and num_unique_values < min_unique_values:
               data_type_dictionary[col] = 'categorical'
           elif avg_str_length > string_length_threshold or num_unique_values / total_values > max_unique_values_ratio:
               data_type_dictionary[col] = 'text'
           else:
               data_type_dictionary[col] = 'categorical'
       elif is_datetime64_any_dtype(df[col]):
           data_type_dictionary[col] = 'datetime'
       else:
           data_type_dictionary[col] = 'categorical'
     • Explanation:
       • The function iterates over the specified columns to determine their data types.
       • Numerical Data:
         • If the column is numerical, it's classified as categorical if it has few unique values or if the ratio of unique to total values is low.
         • Otherwise, it's classified as numerical.
       • String Data:
         • If the column is a string, it's classified as text if the average string length exceeds a threshold or if the ratio of unique to total values is high.
         • Otherwise, it's classified as categorical.
       • Datetime Data:
         • If the column is of datetime type, it's classified as datetime.
       • Other Data:
         • Columns that do not fit any of the above categories are classified as categorical.
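The classification rules above can be condensed into a runnable sketch. The sample data and threshold values are hypothetical, chosen only to exercise each branch:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_string_dtype, is_datetime64_any_dtype

df = pd.DataFrame({
    "grade": [1, 2, 3, 1, 2] * 20,                                  # numeric, 3 unique values
    "height": [150.0 + i * 0.5 for i in range(100)],                # numeric, all unique
    "comment": [f"free-form note number {i}" for i in range(100)],  # long, unique strings
    "when": pd.date_range("2024-01-01", periods=100),
})

# Hypothetical threshold values, mirroring the function's parameters.
max_unique_values_ratio = 0.05
min_unique_values = 10
string_length_threshold = 10
small_dataset_threshold = 20

result = {}
for col in df.columns:
    n_unique, total = df[col].nunique(), len(df[col])
    if is_numeric_dtype(df[col]):
        is_cat = n_unique <= min_unique_values or n_unique / total <= max_unique_values_ratio
        result[col] = "categorical" if is_cat else "numerical"
    elif is_string_dtype(df[col]):
        avg_str_length = df[col].dropna().apply(len).mean()
        if total <= small_dataset_threshold and n_unique < min_unique_values:
            result[col] = "categorical"
        elif avg_str_length > string_length_threshold or n_unique / total > max_unique_values_ratio:
            result[col] = "text"
        else:
            result[col] = "categorical"
    elif is_datetime64_any_dtype(df[col]):
        result[col] = "datetime"
    else:
        result[col] = "categorical"

print(result)
# → {'grade': 'categorical', 'height': 'numerical', 'comment': 'text', 'when': 'datetime'}
```

Note how `grade` is numeric by dtype but lands in the categorical bucket because it has only 3 unique values, while `height` stays numerical.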
  3. Formatting the Output:

    • Purpose: Format the results based on the specified output type.
    if output.lower() == 'dict':
       return data_type_dictionary
    elif output.lower() == 'list_n':
       numerical_type_list = [dtype == 'numerical' for dtype in data_type_dictionary.values()]
       return numerical_type_list
    elif output.lower() == 'list_c':
       categorical_type_list = [dtype == 'categorical' for dtype in data_type_dictionary.values()]
       return categorical_type_list
    elif output.lower() == 'list_d':
       datetime_type_list = [dtype == 'datetime' for dtype in data_type_dictionary.values()]
       return datetime_type_list
    elif output.lower() == 'list_t':
       text_type_list = [dtype == 'text' for dtype in data_type_dictionary.values()]
       return text_type_list
     • Explanation:
       • The function formats the results based on the specified output type.
       • Dictionary Output (dict):
         • Returns a dictionary mapping column names to their determined data types.
       • List Output (list_n, list_c, list_d, list_t):
         • Returns a list of booleans indicating whether each column is of a particular data type (numerical, categorical, datetime, or text, respectively).
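Given a result dictionary of the shape produced above (column names hypothetical), the list outputs are simple boolean projections over its values, in column order:

```python
# Hypothetical result of the classification step.
data_type_dictionary = {
    "age": "numerical",
    "gender": "categorical",
    "joined": "datetime",
    "bio": "text",
}

# 'list_n' marks which columns were classified as numerical.
numerical_type_list = [dtype == "numerical" for dtype in data_type_dictionary.values()]
print(numerical_type_list)  # → [True, False, False, False]

# 'list_c' does the same for categorical columns.
categorical_type_list = [dtype == "categorical" for dtype in data_type_dictionary.values()]
print(categorical_type_list)  # → [False, True, False, False]
```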

Link to Full Code: evaluate_dtype.py.