ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Implement new transform_num() method: 'power' #61

Closed ETA444 closed 6 months ago

ETA444 commented 8 months ago

Description:


Method Functionality Idea:

The power transformation method raises numerical variables to specified powers, allowing for precise adjustments to the data distribution. This method provides flexibility in correcting skewness and normalizing distributions, which can improve statistical analysis and machine learning model performance. Users can choose between applying a uniform power value to all variables or specifying individual powers per variable using a power_map.
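As a rough illustration of the skewness claim, the sketch below applies a power of 0.5 to a small, made-up right-skewed sample and compares the sample skewness before and after. The data is invented for this example, and the sketch calls NumPy and pandas directly rather than transform_num().

```python
import numpy as np
import pandas as pd

# Made-up right-skewed sample: one large value stretches the right tail.
values = pd.Series([1, 1, 2, 2, 3, 4, 5, 8, 13, 40], dtype=float)

skew_before = values.skew()                 # sample skewness of the raw data
skew_after = np.power(values, 0.5).skew()   # skewness after a square-root transform

print(f"skewness before: {skew_before:.2f}")
print(f"skewness after power=0.5: {skew_after:.2f}")
```

For this sample the skewness stays positive but shrinks, which is the kind of adjustment the method is meant to make. Whether a given power helps on real data has to be checked case by case.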

How it operates:

The method iterates over the numerical variables and raises each one to the specified power. If a power_map is provided, each variable listed in it is raised to its own power; otherwise, the single power value is applied uniformly to all variables. The transformed columns replace the originals in a copy of the DataFrame, so the original DataFrame structure is preserved, and they are also collected into a separate DataFrame that contains only the transformed columns.
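The flow described above can be sketched as a small standalone function. This is a minimal sketch, not DataSafari's code: `power_transform_sketch` is a hypothetical name, it omits all printed output, and the ValueError for a missing power is an assumption added for this example.

```python
import numpy as np
import pandas as pd


def power_transform_sketch(df, numerical_variables, power=None, power_map=None):
    """Hypothetical stand-in for the 'power' branch of transform_num()."""
    if power is None and power_map is None:
        # Assumption for this sketch: refuse to guess a default power.
        raise ValueError("Provide either 'power' or 'power_map'.")

    transformed_df = df.copy()  # the caller's DataFrame is left untouched
    power_transformed_columns = pd.DataFrame(index=df.index)

    if power_map is not None:
        # Only variables present in both power_map and numerical_variables are used.
        powers = {v: p for v, p in power_map.items() if v in numerical_variables}
    else:
        powers = {v: power for v in numerical_variables}

    for variable, pwr in powers.items():
        transformed_column = np.power(transformed_df[variable], pwr)
        transformed_df[variable] = transformed_column
        power_transformed_columns[variable] = transformed_column

    return transformed_df, power_transformed_columns


df = pd.DataFrame({"Feature1": [1.0, 4.0, 9.0], "Feature2": [1.0, 2.0, 3.0]})
full, only_power = power_transform_sketch(
    df, ["Feature1", "Feature2"], power_map={"Feature1": 0.5}
)
print(full)
print(only_power)
```

In this sketch, a variable that is missing from power_map (here 'Feature2') is left unchanged and does not appear in the second DataFrame.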

Usage:

To perform power transformations on numerical variables:

transformed_df, power_transformed_columns = transform_num(your_df, your_numerical_variables, method='power', power=2)

This method returns both the DataFrame with transformed numerical variables and a DataFrame containing only the power-transformed columns.

Example with power_map:

# Perform power transformation on numerical variables 'Feature1' and 'Feature2' with individual powers specified in power_map
power_map = {'Feature1': 0.5, 'Feature2': 2}
transformed_df, power_transformed_columns = transform_num(df, ['Feature1', 'Feature2'], method='power', power_map=power_map)
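One caveat when choosing a fractional power such as the 0.5 used for 'Feature1': NumPy's np.power returns NaN for negative inputs, because they have no real square root. The short sketch below shows this on a made-up Series. It calls np.power directly and makes no claim about how transform_num() validates its inputs.

```python
import numpy as np
import pandas as pd

feature = pd.Series([-4.0, 0.0, 4.0])  # made-up values, including a negative one

with np.errstate(invalid="ignore"):    # silence NumPy's "invalid value" warning
    rooted = np.power(feature, 0.5)    # the negative entry becomes NaN

print(rooted.isna().tolist())  # → [True, False, False]
```

If a column can contain negative values, shifting it or picking an integer power first may be worth considering.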

ETA444 commented 6 months ago

Implementation Summary:

The 'power' method raises numerical variables to specified powers, allowing for precise data distribution adjustments.

Purpose:

To apply a power transformation to numerical variables, which can help correct skewness and normalize the distribution, thereby improving the performance of statistical analyses and machine learning models.

Code Breakdown:

  1. Method Header:

    • Purpose: Clearly indicate the start of the 'power' method implementation and provide context.
    if method.lower() == 'power':
        print("< POWER TRANSFORMATION >")
        print(" This method raises numerical variables to specified powers, allowing for precise data distribution adjustments.")
        print("  ✔ Individual powers can be set per variable using a 'power_map' for targeted transformations.")
        print("  ✔ Alternatively, a single 'power' value applies uniformly to all specified numerical variables.")
        print("  ✔ Facilitates skewness correction and distribution normalization to improve statistical analysis and ML model performance.\n")
        print("☻ Tip: A power of 0.5 (square root) often works well for right-skewed data, while a square (power of 2) can help with left-skewed data. Choose the power that best fits your data characteristics.\n")
  2. Initialize DataFrame and Columns:

    • Purpose: Prepare the necessary data structures for transformation.
    transformed_df = df.copy()
    power_transformed_columns = pd.DataFrame()
  3. Determine Transformation Approach:

    • Purpose: Apply power transformation using either a specified map or a uniform power value.
    if power_map is not None:
        for variable, pwr in power_map.items():
            if variable in numerical_variables:
                transformed_column = np.power(transformed_df[variable], pwr)
                transformed_df[variable] = transformed_column
                power_transformed_columns[variable] = transformed_column
                print(f"✔ '{variable}' has been transformed with a power of {pwr}.\n")
    else:
        for variable in numerical_variables:
            transformed_column = np.power(transformed_df[variable], power)
            transformed_df[variable] = transformed_column
            power_transformed_columns[variable] = transformed_column
            print(f"✔ '{variable}' uniformly transformed with a power of {power}.\n")
  4. Output Results:

    • Purpose: Display the results and perform a sanity check.
    print(f"✔ New transformed dataframe:\n{transformed_df.head()}\n")
    print(f"✔ Dataframe with only the power transformed columns:\n{power_transformed_columns.head()}\n")
    print(f"☻ HOW TO: Apply this transformation using `transformed_df, power_transformed_columns = transform_num(your_df, your_numerical_variables, method='power', power_map=your_power_map)`.\n")
    
    # sanity check
    print("< SANITY CHECK >")
    print(f"  ➡ Shape of original dataframe: {df.shape}")
    print(f"  ➡ Shape of transformed dataframe: {transformed_df.shape}\n")
    print("* Evaluate the distribution post-transformation to ensure it aligns with your analytical or modeling goals.\n")
    
    return transformed_df, power_transformed_columns
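
Because the transformed columns overwrite the originals in a copy, the shape comparison in the sanity check will normally match. A complementary check, sketched below, counts NaN and infinite values introduced by the transformation. `check_power_result` is a hypothetical helper written for this example and is not part of DataSafari.

```python
import numpy as np
import pandas as pd


def check_power_result(original_df, transformed_df, columns):
    """Hypothetical post-transformation check; not part of DataSafari."""
    same_shape = original_df.shape == transformed_df.shape
    # NaNs that exist after the transform but were not in the original column.
    new_nans = {
        col: int((transformed_df[col].isna() & original_df[col].notna()).sum())
        for col in columns
    }
    # Infinite values, e.g. from raising 0 to a negative power.
    infs = {col: int(np.isinf(transformed_df[col]).sum()) for col in columns}
    return same_shape, new_nans, infs


original = pd.DataFrame({"Feature1": [-1.0, 0.0, 4.0]})  # made-up data
with np.errstate(invalid="ignore", divide="ignore"):     # silence NumPy warnings
    transformed = original.assign(Feature1=np.power(original["Feature1"], -0.5))

same_shape, new_nans, infs = check_power_result(original, transformed, ["Feature1"])
print(same_shape, new_nans, infs)
```

Here the shapes match even though one value became NaN and another became infinite, so a shape check on its own would not reveal the problem.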

See the Full Function:

You can refer to the complete implementation of the transform_num() function, including the 'power' method, on GitHub: transform_num().