ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Implement new transform_cat() method: 'uniform_mapping' #6

Closed ETA444 closed 7 months ago

ETA444 commented 9 months ago

Description:


Method Functionality Idea:

The uniform_mapping method allows for manual mapping of categories to address specific cases that automated transformations cannot handle, such as typos or near-duplicate category labels.

How it operates:

The method iterates over each categorical variable passed to it and checks whether that variable has an entry in the provided abbreviation mapping dictionary. If an entry is found, the mapping is applied to that column; values without a mapping rule are left unchanged. The method returns the transformed dataframe together with a dataframe containing only the mapped columns.
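
The per-column step can be sketched on a single column. This is a minimal illustration assuming pandas; the example values and the names education and mapping are made up for the sketch and are not taken from the DataSafari source.

```python
import pandas as pd

# Hypothetical column with inconsistent spellings of one category.
education = pd.Series(["high   school", "hgh schl", "college", "high school"])

# Hypothetical mapping rules for this column.
mapping = {"high   school": "high school", "hgh schl": "high school"}

# dict.get(x, x) falls back to the original value when no rule exists,
# so unmapped categories such as "college" pass through unchanged.
cleaned = education.map(lambda x: mapping.get(x, x))

print(cleaned.tolist())
# ['high school', 'high school', 'college', 'high school']
```

The fallback is the important design choice here: passing the dictionary directly to Series.map would turn every unmapped category into NaN.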

Usage:

To use the uniform_mapping method:

abbreviation_map = {
    'Category': {
        'high   school': 'high school',
        'hgh schl': 'high school'
    }
}
final_transformed_df, final_transformed_cols = transform_cat(
    smart_transformed_df,
    ['Category'],
    method='uniform_mapping',
    abbreviation_map=abbreviation_map
)

Sanity Check:

A sanity check compares the shape of the original dataframe with the shape of the transformed dataframe. Because the mapping only rewrites values, the two shapes should be identical.
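
One way to turn this check into something testable is sketched below. The helper name check_same_shape is hypothetical; the method described in this issue only prints the two shapes.

```python
import pandas as pd


def check_same_shape(original: pd.DataFrame, transformed: pd.DataFrame) -> bool:
    """Return True when both dataframes have the same (rows, columns) shape."""
    return original.shape == transformed.shape


# Hypothetical before/after pair for illustration.
before = pd.DataFrame({"Category": ["hgh schl", "college"]})
after = before.copy()
after["Category"] = after["Category"].replace({"hgh schl": "high school"})

# Value-level mapping rewrites cells only, so the shape should not change.
print(check_same_shape(before, after))
```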

ETA444 commented 7 months ago

Implementation Summary:

The 'uniform_mapping' method within transform_cat() allows for manual mapping of categories based on user-defined rules to handle specific cases where automated transformations might not suffice. This is useful for correcting typos, consolidating similar categories, or applying specific transformations.

Purpose:

The purpose of this method is to provide flexibility in transforming categorical data by allowing users to specify how certain categories should be mapped, ensuring that specific cases are addressed appropriately.

Code Breakdown:

  1. Method Header:

    • Purpose: To clearly indicate the start of the 'uniform_mapping' method implementation and provide context.
    # Run only when the method is selected and a non-empty mapping was supplied
    if method.lower() == 'uniform_mapping' and abbreviation_map:
        print("< MANUAL CATEGORY MAPPING >")
        print(" This method allows for manual mapping of categories to address specific cases:")
        print("  ✔ Maps categories based on user-defined rules.")
        print("  ✔ Useful for stubborn categories that automated methods can't uniformly transform.")
        print("✎ Note: Ensure your mapping dictionary is comprehensive for the best results.\n")
  2. Initialize DataFrame and Uniform Columns:

    • Purpose: To prepare the necessary data structures for transformation.
    transformed_df = df.copy()
    uniform_columns = pd.DataFrame()
  3. Mapping Loop:

    • Purpose: To apply the user-defined mappings for each specified categorical variable.
    for variable in categorical_variables:
        # Skip variables that have no user-defined mapping rules
        if variable in abbreviation_map:
            # Apply the mapping; values without a rule are kept as-is
            transformed_df[variable] = transformed_df[variable].map(lambda x: abbreviation_map[variable].get(x, x))
            # Collect the mapped column for the second return value
            uniform_columns = pd.concat([uniform_columns, transformed_df[[variable]]], axis=1)

            # Report the unique categories before and after mapping
            print(f"\n['{variable}'] Category Mapping\n")
            print(f"Categories BEFORE mapping ({len(df[variable].unique())}): {df[variable].unique()}\n")
            print(f"Categories AFTER mapping ({len(transformed_df[variable].unique())}): {transformed_df[variable].unique()}\n")
  4. Output Results and Sanity Check:

    • Purpose: To display the results and perform a sanity check.
    print("< SANITY CHECK >")
    print(f"  ➡ Original dataframe shape: {df.shape}")
    print(f"  ➡ Transformed dataframe shape: {transformed_df.shape}\n")
    
    return transformed_df, uniform_columns
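
Taken together, the four steps above can be condensed into a standalone sketch. This is not the DataSafari implementation: the function name uniform_mapping_sketch is hypothetical, the console output is omitted, and the sanity check is written as an assertion instead of a printout. It assumes only pandas.

```python
import pandas as pd


def uniform_mapping_sketch(df, categorical_variables, abbreviation_map):
    """Apply user-defined category mappings.

    Returns the transformed dataframe and a dataframe holding only the
    columns that had mapping rules applied.
    """
    transformed_df = df.copy()  # leave the caller's dataframe untouched
    mapped = []

    for variable in categorical_variables:
        rules = abbreviation_map.get(variable)
        if not rules:
            continue  # no rules for this variable, nothing to map
        # Values without a rule fall back to themselves.
        transformed_df[variable] = transformed_df[variable].map(
            lambda x, r=rules: r.get(x, x)
        )
        mapped.append(variable)

    uniform_columns = transformed_df[mapped].copy()

    # Sanity check: value-level mapping must not change the shape.
    assert df.shape == transformed_df.shape
    return transformed_df, uniform_columns


# Hypothetical example data.
df = pd.DataFrame({
    "Category": ["high   school", "hgh schl", "college"],
    "Score": [1, 2, 3],
})
abbreviation_map = {
    "Category": {"high   school": "high school", "hgh schl": "high school"}
}

transformed, mapped_cols = uniform_mapping_sketch(
    df, ["Category", "Score"], abbreviation_map
)
print(transformed["Category"].tolist())
print(list(mapped_cols.columns))
```

Here "Score" is listed as a variable but has no entry in the mapping dictionary, so it is left alone and does not appear in the second return value.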

See the Full Function:

You can refer to the complete implementation of the transform_cat() function, including the 'uniform_mapping' method, on GitHub: transform_cat().