ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Implement new transform_num() method: 'polynomial' #64

ETA444 closed this issue 6 months ago

ETA444 commented 8 months ago

Description:


Method Functionality Idea:

The polynomial features transformation generates polynomial features up to a specified degree for numerical variables. The added terms let a model capture non-linear relationships between the variables and the target, which can improve model performance.

How it operates:

The method takes either a global degree (degree) or a variable-specific degree map (degree_map). For each numerical variable it creates one new column per power, from 2 up to the specified degree (degree 1 is the original variable), and appends these columns to a copy of the original DataFrame.
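
The behavior described above can be sketched in a few lines of pandas. This is a minimal illustration, not DataSafari's implementation: the function name `polynomial_features_sketch` is hypothetical, and only the column naming scheme (`<variable>_degree_<power>`) is taken from the implementation summary later in this issue.

```python
import pandas as pd


def polynomial_features_sketch(df, numerical_variables, degree=None, degree_map=None):
    """Hypothetical stand-in for the 'polynomial' method (not DataSafari's code)."""
    if degree is None and degree_map is None:
        raise ValueError("Provide either `degree` or `degree_map`.")

    # A degree_map takes precedence; otherwise every variable gets the global degree.
    degrees = degree_map if degree_map else {var: degree for var in numerical_variables}

    new_columns = {}
    for variable, d in degrees.items():
        if variable not in numerical_variables:
            continue  # skip variables that were not requested
        for power in range(2, d + 1):  # power 1 is the original column
            new_columns[f"{variable}_degree_{power}"] = df[variable] ** power

    poly_features = pd.DataFrame(new_columns, index=df.index)
    transformed_df = pd.concat([df, poly_features], axis=1)
    return transformed_df, poly_features


df = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [2.0, 4.0, 6.0]})
transformed_df, poly_features = polynomial_features_sketch(
    df, ["x", "y"], degree_map={"x": 2, "y": 3}
)
print(list(poly_features.columns))
```

Building the new columns in a dictionary and concatenating once avoids modifying the input DataFrame, which matches the copy-then-append approach described above.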

Usage:

To generate polynomial features for numerical variables:

import numpy as np
import pandas as pd
from datasafari.transformer import transform_num  # assumed import path; adjust to your installation

# data to test with
data = {
    'Feature1': np.random.normal(0, 1, 100),  # Normally distributed data
    'Feature2': np.random.exponential(1, 100),  # Exponentially distributed data (positively skewed)
    'Feature3': np.random.randint(1, 100, 100)  # Uniformly distributed integers from 1 to 99
}
df = pd.DataFrame(data)

# test polynomial with a degree_map
degree_map = {'Feature1': 2, 'Feature2': 3}
poly_transformed_df, poly_features = transform_num(df, ['Feature1', 'Feature2'], method='polynomial', degree_map=degree_map)

The method returns two objects: the transformed DataFrame, containing the original columns plus the polynomial feature columns, and a DataFrame containing only the polynomial feature columns.
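
For the usage example above, the names of the polynomial feature columns can be worked out by hand. The short sketch below assumes the `<variable>_degree_<power>` naming scheme shown in the implementation summary later in this issue; it only computes names and does not call transform_num().

```python
# degree_map from the usage example
degree_map = {'Feature1': 2, 'Feature2': 3}

# Powers start at 2 because power 1 is the original column.
expected_columns = [
    f"{variable}_degree_{power}"
    for variable, d in degree_map.items()
    for power in range(2, d + 1)
]
print(expected_columns)
# ['Feature1_degree_2', 'Feature2_degree_2', 'Feature2_degree_3']
```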

Notes:

ETA444 commented 6 months ago

Implementation Summary:

The 'polynomial' method within transform_num() generates polynomial features up to a specified degree for numerical variables. The transformation helps capture non-linear relationships and can improve model performance, at the cost of a higher risk of overfitting at high degrees.

Purpose:

This method generates new polynomial features for the specified numerical variables, which can increase the predictive power of machine learning models.

Code Breakdown:

  1. Method Header:

    • Purpose: To clearly indicate the start of the 'polynomial' method implementation and provide context.
    if method.lower() == 'polynomial' and (degree is not None or degree_map is not None):
        print("< POLYNOMIAL FEATURES TRANSFORMATION >")
        print(" This method generates polynomial features up to a specified degree for numerical variables.")
        print("  ✔ Captures non-linear relationships between variables and the target.")
        print("  ✔ Enhances model performance by adding complexity through feature engineering.")
        print("✎ Note: Specify the 'degree' for a global application or 'degree_map' for variable-specific degrees.\n")
  2. Initialize DataFrame and Columns:

    • Purpose: To prepare the necessary data structures for transformation.
    transformed_df = df.copy()
    poly_features = pd.DataFrame(index=df.index)
  3. Define Function for Applying Degrees:

    • Purpose: To apply polynomial transformation based on specified degree.
    def apply_degree(variable, d):
        for power in range(2, d + 1):  # Start from 2 as degree 1 is the original variable
            new_column_name = f"{variable}_degree_{power}"
            poly_features[new_column_name] = transformed_df[variable] ** power
            print(f"✔ Created polynomial feature '{new_column_name}' from variable '{variable}' to the power of {power}.\n")
  4. Apply Transformation:

    • Purpose: To generate polynomial features using either global degree or degree from degree_map.
    if degree_map:
        for variable, var_degree in degree_map.items():
            if variable in numerical_variables:
                apply_degree(variable, var_degree)
            else:
                print(f"⚠️ Variable '{variable}' specified in `degree_map` was not found in `numerical_variables` and has been skipped.\n")
    else:
        for variable in numerical_variables:
            apply_degree(variable, degree)
  5. Output Results:

    • Purpose: To display the results and perform a sanity check.
    transformed_df = pd.concat([transformed_df, poly_features], axis=1)
    
    print(f"✔ New transformed dataframe with polynomial features:\n{transformed_df.head()}\n")
    print(f"✔ Dataframe with only the polynomial features:\n{poly_features.head()}\n")
    print("☻ HOW TO: Apply this transformation using `transformed_df, poly_features = transform_num(your_df, your_numerical_variables, method='polynomial', degree=3)` or by specifying a `degree_map`.\n")
    
    # Sanity check
    print("< SANITY CHECK >")
    print(f"  ➡ Shape of original dataframe: {df.shape}")
    print(f"  ➡ Shape of transformed dataframe: {transformed_df.shape}\n")
    print("* After applying polynomial features, evaluate the model's performance and watch out for overfitting, especially when using high degrees.\n")
    
    return transformed_df, poly_features
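
The shapes printed by the sanity check can be predicted from the steps above: each variable with degree d contributes d - 1 new columns, and a degree below 2 contributes none because `range(2, d + 1)` is empty. The helper below is a hypothetical sketch for checking that arithmetic, not part of DataSafari, and it assumes at least one of `degree` or `degree_map` is set, as the method header requires.

```python
def expected_transformed_shape(original_shape, numerical_variables, degree=None, degree_map=None):
    """Hypothetical helper: predict the transformed shape reported by the sanity check."""
    rows, cols = original_shape
    if degree_map:
        # Variables missing from numerical_variables are skipped, as in step 4.
        degrees = [d for var, d in degree_map.items() if var in numerical_variables]
    else:
        degrees = [degree] * len(numerical_variables)
    # range(2, d + 1) yields d - 1 powers, and none when d < 2.
    added = sum(max(d - 1, 0) for d in degrees)
    return rows, cols + added


# Usage example from this issue: 3 original columns, degrees 2 and 3.
shape_with_map = expected_transformed_shape(
    (100, 3), ['Feature1', 'Feature2'], degree_map={'Feature1': 2, 'Feature2': 3}
)
# A global degree of 1 adds nothing, so the shape is unchanged.
shape_degree_one = expected_transformed_shape((100, 3), ['Feature1', 'Feature2'], degree=1)
print(shape_with_map, shape_degree_one)  # (100, 6) (100, 3)
```

If the transformed shape equals the original shape, the requested degree was probably below 2, which the excerpt above would accept without creating any features.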

See the Full Function:

You can refer to the complete implementation of the transform_num() function, including the 'polynomial' method, on GitHub: transform_num().