ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Implement new transform_cat() method: 'encode_freq' #9

Closed ETA444 closed 7 months ago

ETA444 commented 9 months ago

Description:


Method Functionality Idea:

The encode_freq method transforms categorical variables based on the frequency of each category.

How it operates:

For each selected column, the method counts how often each category occurs and then replaces every value with its category's count. The replacement is done on a copy of the dataframe, so the original is left unmodified.
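
As a minimal sketch of the count-and-map step described above, assuming a toy dataframe whose column name `color` is purely illustrative:

```python
import pandas as pd

# Hypothetical toy data; the column name "color" is illustrative only.
df = pd.DataFrame({"color": ["red", "blue", "red", "green", "red", "blue"]})

# Count how many times each category occurs.
frequency_map = df["color"].value_counts().to_dict()

# Replace each category with its count, working on a copy.
encoded = df.copy()
encoded["color"] = encoded["color"].map(frequency_map)

print(frequency_map)
print(encoded["color"].tolist())
```

Here `red` appears three times, so every `red` becomes 3, and so on for the other categories.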

Usage:

To use the encode_freq method:

transformed_df, encoded_columns = transform_cat(your_df, your_columns, method='encode_freq')

Frequency encoding retains information about how prevalent each category is, which is useful for models where a category's frequency affects the prediction.

Sanity Check:

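A sanity check prints the shape of the original dataframe next to the shape of the transformed dataframe. The two should be identical, because frequency encoding overwrites values in existing columns and adds no rows or columns.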
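
A rough sketch of such a shape comparison, assuming hypothetical column names (`city`, `sales`) and an explicit assertion instead of a printed report:

```python
import pandas as pd

# Hypothetical input; column names are illustrative only.
original = pd.DataFrame({"city": ["Oslo", "Rome", "Oslo"], "sales": [10, 20, 30]})

# Frequency-encode one column on a copy.
transformed = original.copy()
city_counts = transformed["city"].value_counts().to_dict()
transformed["city"] = transformed["city"].map(city_counts)

# Encoding overwrites values, so the number of rows and columns is unchanged.
shapes_match = original.shape == transformed.shape
print(f"Original shape: {original.shape}, transformed shape: {transformed.shape}")
assert shapes_match, "frequency encoding should not change the dataframe shape"
```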

ETA444 commented 7 months ago

Implementation Summary:

The 'encode_freq' method within transform_cat() replaces categorical values with their frequency counts. This method is useful for models where the prevalence of categories impacts predictions.

Purpose:

The purpose of this method is to turn each category into a numeric feature that reflects how common that category is in the data.

Code Breakdown:

  1. Method Header:

    • Purpose: To clearly indicate the start of the 'encode_freq' method implementation and provide context.
    if method.lower() == 'encode_freq':
        print("< FREQUENCY ENCODING TRANSFORMATION >")
        print(" This method transforms categorical variables based on the frequency of each category.")
        print("✎ Note: Frequency encoding helps to retain the information about the category's prevalence.")
        print("☻ Tip: Useful for models where the frequency significance of categories impacts the prediction.\n")
  2. Initialize DataFrame:

    • Purpose: To prepare the necessary data structures for transformation.
    transformed_df = df.copy()
    encoded_columns = pd.DataFrame()
  3. Encode Each Variable:

    • Purpose: To iterate over each variable and apply frequency encoding.
    for variable in categorical_variables:
        # count how many times each category occurs (NaN is excluded by default)
        frequency_map = transformed_df[variable].value_counts().to_dict()

        # replace each category with its count; values missing from the map (e.g. NaN) stay NaN
        transformed_df[variable] = transformed_df[variable].map(frequency_map)
        # collect the encoded column so it can be returned separately
        encoded_columns = pd.concat([encoded_columns, transformed_df[[variable]]], axis=1)
        print(f"✔ '{variable}' has been frequency encoded.\n")
  4. Output Results:

    • Purpose: To display the results and perform a sanity check.
    print(f"✔ New transformed dataframe:\n{transformed_df.head()}\n")
    print(f"✔ Dataframe with only frequency encoded columns:\n{encoded_columns.head()}\n")
    print("☻ HOW TO - to catch the df's: `transformed_df, encoded_columns = transform_cat(your_df, your_columns, method='encode_freq')`.\n")
    print("< SANITY CHECK >")
    print(f"  ➡ Original dataframe shape: {df.shape}")
    print(f"  ➡ Transformed dataframe shape: {transformed_df.shape}\n")
    return transformed_df, encoded_columns
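
For experimenting outside the library, the four steps above can be condensed into a standalone sketch. The helper name `encode_freq_sketch` and the toy data are hypothetical and not part of DataSafari. The console output is omitted, and the encoded columns are selected once after the loop rather than concatenated inside it.

```python
import pandas as pd


def encode_freq_sketch(df, categorical_variables):
    """Hypothetical standalone helper mirroring the steps above; not DataSafari's API."""
    transformed_df = df.copy()
    for variable in categorical_variables:
        # Count occurrences of each category, then replace values with their counts.
        frequency_map = transformed_df[variable].value_counts().to_dict()
        transformed_df[variable] = transformed_df[variable].map(frequency_map)
    # Select the encoded columns once, instead of concatenating inside the loop.
    encoded_columns = transformed_df[list(categorical_variables)].copy()
    return transformed_df, encoded_columns


# Toy data for illustration only.
pets = pd.DataFrame({
    "animal": ["cat", "dog", "cat", "cat"],
    "size": ["small", "large", "small", "large"],
    "age": [2, 5, 1, 7],
})
transformed_df, encoded_columns = encode_freq_sketch(pets, ["animal", "size"])
print(transformed_df)
```

Columns not listed in `categorical_variables` (here `age`) pass through unchanged, and the input dataframe is not modified.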

See the Full Function:

You can refer to the complete implementation of the transform_cat() function, including the 'encode_freq' method, on GitHub: transform_cat().