ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Implement new explore_cat() method: 'entropy' #26

ETA444 closed this issue 6 months ago

ETA444 commented 9 months ago

Description:


Method Functionality Idea:

The entropy method calculates the entropy for each specified categorical variable, providing a quantitative measure of data diversity.

How it operates:

For each variable in the categorical_variables list, the method computes the entropy with the calculate_entropy function and appends the entropy value and its interpretation to the result list. The output opens with a tip on how to read the values and closes with a pointer to the calculate_entropy docstring for details on the calculation.

Usage:

To calculate and display the entropy of categorical variables using the entropy method:

explore_cat(df, ['Category1', 'Category2'], method='entropy')

This will compute the entropy for each specified categorical variable and provide insights into the diversity of data within those variables.

Notes:

The entropy value serves as a measure of unpredictability or diversity within each categorical variable. Higher entropy values indicate greater diversity (observations spread more evenly across categories), while lower values indicate that one or a few categories dominate. For further details on entropy calculation, the calculate_entropy function's docstring can be accessed by running: print(calculate_entropy.__doc__).
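To make the measure concrete, here is a minimal sketch that assumes Shannon entropy in bits (log base 2). The issue does not show calculate_entropy itself, so shannon_entropy below is a hypothetical stand-in for illustration, not DataSafari's implementation, and the base or normalization the library uses may differ.

```python
import math
from collections import Counter
from typing import Hashable, Iterable


def shannon_entropy(values: Iterable[Hashable]) -> float:
    """Hypothetical helper: Shannon entropy (bits) of the observed categories."""
    counts = Counter(values)
    total = sum(counts.values())
    if total == 0:
        return 0.0  # no observations, so nothing to be uncertain about
    # H = sum(p * log2(1 / p)), where p = n / total for each category
    return sum((n / total) * math.log2(total / n) for n in counts.values())


balanced = ['a', 'b', 'c', 'd'] * 25    # four equally frequent categories
skewed = ['a'] * 97 + ['b', 'c', 'd']   # one category dominates
constant = ['a'] * 100                  # a single category

print(f"balanced: {shannon_entropy(balanced):.3f}")
print(f"skewed:   {shannon_entropy(skewed):.3f}")
print(f"constant: {shannon_entropy(constant):.3f}")
```

Under these assumptions the balanced column reaches the maximum for four categories (2 bits), the constant column scores 0, and the skewed column falls in between.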

ETA444 commented 6 months ago

Implementation Summary:

The explore_cat() function analyzes categorical variables in a DataFrame, including calculating entropy, which measures unpredictability or diversity within a variable.

Purpose:

explore_cat() offers several analyses of categorical variables; this summary covers only the entropy analysis.

Code Breakdown:

  1. Purpose of the Function:

    • Purpose: To analyze categorical variables in a DataFrame.
    def explore_cat(
           df: pd.DataFrame,
           categorical_variables: List[str],
           method: str = 'all',
           output: str = 'print'
    ) -> Optional[str]:
    • The function provides various methods to analyze categorical data, including calculating entropy.
  2. Parameter Definitions:

    • Purpose: To define the function's parameters.
    Parameters
    ----------
    df : pd.DataFrame
       The DataFrame containing the categorical data to analyze.
    categorical_variables : list
       A list of strings representing the column names in `df` to be analyzed.
    method : str, optional, default 'all'
       Specifies the analysis method to apply. Options include:
       - 'unique_values' for listing unique values of each categorical variable.
       - 'counts_percentage' for counting frequencies and showing percentages.
       - 'entropy' for calculating the entropy of each variable.
       - 'all' to perform all available analyses sequentially.
    output : str, optional, default 'print'
       Determines the output format. Options include:
       - 'print' to print the analysis results to the console.
       - 'return' to return the analysis results as a formatted string or dictionary, depending on the analysis type.
  3. Return Definition:

    • Purpose: To define the function's return type.
    Returns
    -------
    str or None
       - For 'unique_values' and 'counts_percentage', returns a string if output is 'return'.
       - For 'entropy', returns a dictionary mapping variables to tuples of entropy values and interpretations if output is 'return'.
       - If 'output' is set to 'return' and 'method' is 'all', returns a comprehensive summary of all analyses as a string.
  4. Entropy Calculation:

    • Purpose: To calculate and interpret entropy for categorical variables.
    if method.lower() in ['entropy', 'all']:
       result.append("<<______ENTROPY OF CATEGORICAL VARIABLES______>>\n")
       result.append("Tip: Higher entropy indicates greater diversity.*\n")
    
       for variable_name in categorical_variables:
           entropy_val, interpretation = calculate_entropy(df[variable_name])
           result.append(f"Entropy of ['{variable_name}']: {entropy_val:.3f} {interpretation}\n")
    
       result.append("* For more details on entropy, run: 'print(calculate_entropy.__doc__)'.\n")
    • The entropy method calculates and interprets entropy for the specified categorical variables, providing insight into the diversity of each variable.
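The entropy branch above can be exercised on its own with a small runnable sketch. Everything outside the quoted branch is an assumption: calculate_entropy_stub is a hypothetical stand-in (Shannon entropy in bits plus a made-up interpretation string), explore_entropy_sketch is not the real explore_cat(), and the way the result list is joined and handed to 'print' or 'return' is a guess, since the issue does not show that part. A plain dict of lists stands in for the DataFrame so the example runs without pandas; df[variable_name] indexing works the same way on a DataFrame.

```python
import math
from collections import Counter
from typing import List, Mapping, Optional, Sequence, Tuple


def calculate_entropy_stub(values: Sequence) -> Tuple[float, str]:
    """Hypothetical stand-in for calculate_entropy (not the library's code)."""
    counts = Counter(values)
    total = sum(counts.values())
    entropy = float(sum((n / total) * math.log2(total / n) for n in counts.values()))
    max_entropy = math.log2(len(counts)) if counts else 0.0
    # Assumed interpretation format: compare with the maximum for this many categories
    return entropy, f"(max for {len(counts)} categories: {max_entropy:.3f})"


def explore_entropy_sketch(
    df: Mapping[str, Sequence],
    categorical_variables: List[str],
    output: str = 'print',
) -> Optional[str]:
    """Sketch of the 'entropy' branch only; output handling is assumed."""
    result = []
    result.append("<<______ENTROPY OF CATEGORICAL VARIABLES______>>\n")
    result.append("Tip: Higher entropy indicates greater diversity.*\n")

    for variable_name in categorical_variables:
        entropy_val, interpretation = calculate_entropy_stub(df[variable_name])
        result.append(f"Entropy of ['{variable_name}']: {entropy_val:.3f} {interpretation}\n")

    result.append("* For more details on entropy, run: 'print(calculate_entropy.__doc__)'.\n")

    combined = "\n".join(result)
    if output.lower() == 'print':
        print(combined)
        return None
    if output.lower() == 'return':
        return combined
    raise ValueError("output must be 'print' or 'return'")


data = {
    'Category1': ['a', 'b', 'a', 'b'],   # evenly split
    'Category2': ['x', 'x', 'x', 'y'],   # mostly one value
}
report = explore_entropy_sketch(data, ['Category1', 'Category2'], output='return')
print(report)
```

In this sketch the report is a single string with one line per variable, which matches the Optional[str] annotation in the signature.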
