ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Implement new transform_num() method: 'standardize' #54

Closed by ETA444 6 months ago

ETA444 commented 9 months ago

Description:


Method Functionality Idea:

The standardize method in transform_num standardizes the specified numerical variables by centering each one on a mean of 0 and scaling it to a standard deviation of 1. Putting features on a common scale can improve the performance and stability of scale-sensitive machine learning algorithms, such as those trained with gradient descent or based on distances between points.

How it operates:

The method first creates a copy of the DataFrame to preserve the original data. It then initializes a StandardScaler object to perform the standardization. The method scales each specified numerical variable and returns both the transformed DataFrame and a DataFrame containing only the scaled columns.
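
As a rough illustration of what the scaling step computes for a single column, the sketch below uses only the Python standard library. It assumes StandardScaler's default behaviour (subtract the column mean, then divide by the population standard deviation). `standardize_values` is a hypothetical helper name and is not part of DataSafari.

```python
from statistics import fmean, pstdev


def standardize_values(values):
    """Hypothetical helper: return z-scores for a list of numbers."""
    mean = fmean(values)
    std = pstdev(values)  # population standard deviation (ddof=0)
    if std == 0:
        # A constant column has no spread; centering alone yields all zeros.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


scores = standardize_values([2.0, 4.0, 6.0, 8.0])
print(scores)
```

After this transformation the values have mean 0 and population standard deviation 1, which is the property the method relies on.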

Usage:

To standardize numerical variables in a DataFrame:

# Standardize numerical variables 'Feature1' and 'Feature2'
transformed_df, standardized_columns = transform_num(df, ['Feature1', 'Feature2'], method='standardize')

Notes:

ETA444 commented 6 months ago

Implementation Summary:

The 'standardize' method rescales each specified numerical variable to mean 0 and standard deviation 1, which can improve the performance and stability of scale-sensitive models. It is a common preprocessing step for many machine learning algorithms.

Code Breakdown:

  1. Method Header:

    • Purpose: Clearly indicate the start of the 'standardize' method implementation and provide context.
    if method.lower() == 'standardize':
        print("< STANDARDIZING DATA >")
        print(" This method centers the data around mean 0 with a standard deviation of 1, enhancing model performance and stability.")
        print("  ✔ Standardizes each numerical variable to have mean=0 and variance=1.")
        print("  ✔ Essential preprocessing step for many machine learning algorithms.\n")
        print("✎ Note: Standardization is applied only to the specified numerical variables.\n")
  2. Initialize and Apply Standardization:

    • Purpose: Apply the standardization to specified columns using StandardScaler.
    # initialize essential objects
    transformed_df = df.copy()
    scaler = StandardScaler()
    
    # scale the data
    transformed_df[numerical_variables] = scaler.fit_transform(df[numerical_variables])
    
    # isolate transformed columns to give as part of output
    standardized_columns = transformed_df[numerical_variables]
  3. Output Results:

    • Purpose: Display the results and perform a sanity check.
    print(f"✔ New transformed dataframe:\n{transformed_df.head()}\n")
    print(f"✔ Dataframe with only the transformed columns:\n{standardized_columns.head()}\n")
    print("☻ HOW TO - Apply this transformation using `transformed_df, standardized_columns = transform_num(your_df, your_numerical_variables, method='standardize')`.\n")
    
    # sanity check
    print("< SANITY CHECK >")
    print(f"  ➡ Shape of original dataframe: {df.shape}")
    print(f"  ➡ Shape of transformed dataframe: {transformed_df.shape}\n")
    
    return transformed_df, standardized_columns
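
The pieces above can be assembled into a minimal, self-contained sketch. `standardize_sketch` and the sample data are hypothetical, the printed messages are omitted, and this is not the actual transform_num implementation. It also extends the sanity check beyond shapes by confirming that the scaled columns have mean 0 and standard deviation 1. pandas `std()` defaults to the sample standard deviation (ddof=1), so `ddof=0` is passed to match what StandardScaler uses.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler


def standardize_sketch(df, numerical_variables):
    """Hypothetical stand-in for the 'standardize' branch of transform_num."""
    # Work on a copy so the caller's DataFrame is left untouched.
    transformed_df = df.copy()
    scaler = StandardScaler()

    # Fit on the selected columns and overwrite them with their z-scores.
    transformed_df[numerical_variables] = scaler.fit_transform(df[numerical_variables])

    # Isolate the scaled columns as the second return value.
    standardized_columns = transformed_df[numerical_variables]
    return transformed_df, standardized_columns


# Illustrative data only; the column names follow the usage example in the issue.
df = pd.DataFrame({
    "Feature1": [1.0, 2.0, 3.0, 4.0],
    "Feature2": [10.0, 20.0, 30.0, 50.0],
    "Label": ["a", "b", "c", "d"],
})
transformed_df, standardized_columns = standardize_sketch(df, ["Feature1", "Feature2"])

# Extended sanity check: per-column mean and population standard deviation.
means = standardized_columns.mean()
stds = standardized_columns.std(ddof=0)
print(means.round(6).to_dict())
print(stds.round(6).to_dict())
```

Columns that are not listed, such as `Label` here, pass through unchanged, and the original `df` keeps its unscaled values.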

Link to Full Code: transform_num.py