ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Implement evaluate_dtype() enhanced functionality #81

Closed ETA444 closed 6 months ago

ETA444 commented 8 months ago

Description: The evaluate_dtype function currently categorizes columns in a DataFrame as either numerical or categorical based on their data types. However, in many scenarios, numerical columns with a limited range of unique values may functionally serve as categorical columns. This enhancement aims to improve the function's capability to infer categorical data from numerical columns, providing more flexibility and accuracy in data type evaluation.

Proposed Changes:

Expected Outcome: With this enhancement, users will have a more comprehensive and flexible tool for evaluating data types in their DataFrame. It will enable more accurate identification of categorical data, especially in cases where numerical columns exhibit limited variability.

Additional Context: Improving the functionality of the evaluate_dtype function aligns with our goal of providing robust and user-friendly data analysis tools. This enhancement addresses a common need in data preprocessing and analysis workflows, improving the overall usability and effectiveness of our library.
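The motivating scenario can be sketched in a few lines (column name and data are hypothetical): a numeric column with only a handful of distinct values reports a numeric dtype, yet its low unique-value ratio suggests it is functionally categorical.

```python
import pandas as pd

# Hypothetical survey column: integers 1-5 stored as int64,
# so dtype alone would classify it as numerical.
df = pd.DataFrame({"satisfaction": [1, 2, 3, 4, 5] * 40})

print(df["satisfaction"].dtype)  # int64

# A low ratio of unique to total values hints at categorical use.
ratio = df["satisfaction"].nunique() / len(df["satisfaction"])
print(ratio)  # → 0.025
```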

ETA444 commented 6 months ago

Implementation Summary:

The evaluate_dtype() function evaluates and categorizes data types of specified columns in a DataFrame, with enhanced handling for numerical data that may functionally serve as categorical data. The function identifies columns as numerical, categorical, text, or datetime based on their properties and returns this information either as a dictionary or a list of booleans, depending on the specified output format.

Code Breakdown:

  1. Initial Setup and Error Handling:

    • Purpose: Validate input types and values.
    # Error-handling #
    # TypeErrors
    if not isinstance(df, pd.DataFrame):
       raise TypeError("evaluate_dtype(): The 'df' parameter must be a pandas DataFrame.")
    if not isinstance(col_names, list):
       raise TypeError("evaluate_dtype(): The 'col_names' parameter must be a list.")
    elif not all(isinstance(col, str) for col in col_names):
       raise TypeError("evaluate_dtype(): All elements in the 'col_names' list must be strings representing column names.")
    if not isinstance(max_unique_values_ratio, (float, int)):
       raise TypeError("evaluate_dtype(): The 'max_unique_values_ratio' must be a float or integer.")
    if not isinstance(min_unique_values, int):
       raise TypeError("evaluate_dtype(): The 'min_unique_values' must be an integer.")
    if not isinstance(string_length_threshold, int):
       raise TypeError("evaluate_dtype(): The 'string_length_threshold' must be an integer.")
    if not isinstance(small_dataset_threshold, int):
       raise TypeError("evaluate_dtype(): The 'small_dataset_threshold' must be an integer.")
    if not isinstance(output, str):
       raise TypeError("evaluate_dtype(): The 'output' parameter must be a string.")
    
    # ValueErrors
    if df.empty:
       raise ValueError("evaluate_dtype(): The input DataFrame is empty.")
    if max_unique_values_ratio < 0 or max_unique_values_ratio > 1:
       raise ValueError("evaluate_dtype(): The 'max_unique_values_ratio' must be between 0 and 1.")
    if min_unique_values < 1:
       raise ValueError("evaluate_dtype(): The 'min_unique_values' must be at least 1.")
    if string_length_threshold <= 0:
       raise ValueError("evaluate_dtype(): The 'string_length_threshold' must be greater than 0.")
    if small_dataset_threshold <= 0:
       raise ValueError("evaluate_dtype(): The 'small_dataset_threshold' must be greater than 0.")
    if len(col_names) == 0:
       raise ValueError("evaluate_dtype(): The 'col_names' list must contain at least one column name.")
    missing_cols = [col for col in col_names if col not in df.columns]
    if missing_cols:
       raise ValueError(f"evaluate_dtype(): The following columns were not found in the DataFrame: {', '.join(missing_cols)}")
    valid_outputs = ['dict', 'list_n', 'list_c', 'list_d', 'list_t']
    if output not in valid_outputs:
       raise ValueError(f"evaluate_dtype(): Invalid output '{output}'. Valid options are: {', '.join(valid_outputs)}")
    • Explanation:
      • The function first checks if the input types are correct and then validates the input values. It ensures the parameters are appropriate for further analysis.
    # Warn about small dataset size #
    if df.shape[0] < 60:
       warnings.warn(
           f"evaluate_dtype(): Dataset size ({df.shape[0]}) is smaller than 60. "
           f"Evaluation may be inaccurate.",
           UserWarning
       )
    • Explanation:
      • A warning is issued if the dataset size is smaller than 60, cautioning that the evaluation might be inaccurate.
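The guard can be exercised in isolation (a standalone sketch of the check only, not a call into the library):

```python
import warnings
import pandas as pd

df = pd.DataFrame({"x": range(30)})  # 30 rows — below the 60-row cutoff

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    if df.shape[0] < 60:  # the same condition used in the function body
        warnings.warn(
            f"evaluate_dtype(): Dataset size ({df.shape[0]}) is smaller than 60. "
            f"Evaluation may be inaccurate.",
            UserWarning
        )

print(len(caught))                        # → 1
print(caught[0].category is UserWarning)  # → True
```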
  2. Evaluation of Data Types:

    • Purpose: Determine the data type of each specified column.
    # Main Function #
    data_type_dictionary = {}
    for col in col_names:
       if is_numeric_dtype(df[col]):
           num_unique_values = df[col].nunique()
           total_values = len(df[col])
           if (num_unique_values <= min_unique_values) or (num_unique_values / total_values <= max_unique_values_ratio):
               data_type_dictionary[col] = 'categorical'
           else:
               data_type_dictionary[col] = 'numerical'
       elif is_string_dtype(df[col]):
           avg_str_length = df[col].dropna().apply(len).mean()
           num_unique_values = df[col].nunique()
           total_values = len(df[col])
           if total_values <= small_dataset_threshold and num_unique_values < min_unique_values:
               data_type_dictionary[col] = 'categorical'
           elif avg_str_length > string_length_threshold or num_unique_values / total_values > max_unique_values_ratio:
               data_type_dictionary[col] = 'text'
           else:
               data_type_dictionary[col] = 'categorical'
       elif is_datetime64_any_dtype(df[col]):
           data_type_dictionary[col] = 'datetime'
       else:
           data_type_dictionary[col] = 'categorical'
     • Explanation:
       • The function iterates over the specified columns to determine their data types.
       • Numerical Data:
         • If the column is numerical, it's classified as categorical if it has few unique values or if the ratio of unique to total values is low.
         • Otherwise, it's classified as numerical.
       • String Data:
         • If the column is a string, it's classified as text if the average string length exceeds a threshold or if the ratio of unique to total values is high.
         • Otherwise, it's classified as categorical.
       • Datetime Data:
         • If the column is of datetime type, it's classified as datetime.
       • Other Data:
         • Columns that do not fit any of the above categories are classified as categorical.
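The classification rules above can be condensed into a runnable sketch. The sample data and threshold values are hypothetical, chosen only to exercise each branch:

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype, is_string_dtype, is_datetime64_any_dtype

df = pd.DataFrame({
    "grade": [1, 2, 3, 1, 2] * 20,                                  # numeric, 3 unique values
    "height": [150.0 + i * 0.5 for i in range(100)],                # numeric, all unique
    "comment": [f"free-form note number {i}" for i in range(100)],  # long, unique strings
    "when": pd.date_range("2024-01-01", periods=100),
})

# Hypothetical threshold values, mirroring the function's parameters.
max_unique_values_ratio = 0.05
min_unique_values = 10
string_length_threshold = 10
small_dataset_threshold = 20

result = {}
for col in df.columns:
    n_unique, total = df[col].nunique(), len(df[col])
    if is_numeric_dtype(df[col]):
        is_cat = n_unique <= min_unique_values or n_unique / total <= max_unique_values_ratio
        result[col] = "categorical" if is_cat else "numerical"
    elif is_string_dtype(df[col]):
        avg_str_length = df[col].dropna().apply(len).mean()
        if total <= small_dataset_threshold and n_unique < min_unique_values:
            result[col] = "categorical"
        elif avg_str_length > string_length_threshold or n_unique / total > max_unique_values_ratio:
            result[col] = "text"
        else:
            result[col] = "categorical"
    elif is_datetime64_any_dtype(df[col]):
        result[col] = "datetime"
    else:
        result[col] = "categorical"

print(result)
# → {'grade': 'categorical', 'height': 'numerical', 'comment': 'text', 'when': 'datetime'}
```

Note how `grade` is numeric by dtype but lands in the categorical bucket because it has only 3 unique values, while `height` stays numerical.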
  3. Formatting the Output:

    • Purpose: Format the results based on the specified output type.
    if output.lower() == 'dict':
       return data_type_dictionary
    elif output.lower() == 'list_n':
       numerical_type_list = [dtype == 'numerical' for dtype in data_type_dictionary.values()]
       return numerical_type_list
    elif output.lower() == 'list_c':
       categorical_type_list = [dtype == 'categorical' for dtype in data_type_dictionary.values()]
       return categorical_type_list
    elif output.lower() == 'list_d':
       datetime_type_list = [dtype == 'datetime' for dtype in data_type_dictionary.values()]
       return datetime_type_list
    elif output.lower() == 'list_t':
       text_type_list = [dtype == 'text' for dtype in data_type_dictionary.values()]
       return text_type_list
     • Explanation:
       • The function formats the results based on the specified output type.
       • Dictionary Output (dict):
         • Returns a dictionary mapping column names to their determined data types.
       • List Output (list_n, list_c, list_d, list_t):
         • Returns a list of booleans indicating whether each column is of a particular data type (numerical, categorical, datetime, or text, respectively).
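Given a result dictionary of the shape produced above (column names hypothetical), the list outputs are simple boolean projections over its values, in column order:

```python
# Hypothetical result of the classification step.
data_type_dictionary = {
    "age": "numerical",
    "gender": "categorical",
    "joined": "datetime",
    "bio": "text",
}

# 'list_n' marks which columns were classified as numerical.
numerical_type_list = [dtype == "numerical" for dtype in data_type_dictionary.values()]
print(numerical_type_list)  # → [True, False, False, False]

# 'list_c' does the same for categorical columns.
categorical_type_list = [dtype == "categorical" for dtype in data_type_dictionary.values()]
print(categorical_type_list)  # → [False, True, False, False]
```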

Link to Full Code: evaluate_dtype.py.