ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Implement new transform_num() method: 'quantile' #57

Closed ETA444 closed 6 months ago

ETA444 commented 8 months ago

Description:


Method Functionality Idea:

The 'quantile' method maps each numerical variable onto a target distribution using its empirical quantiles. Skewed or outlier-affected data is transformed to follow either a standard normal or a uniform distribution, which can help statistical methods and ML models that are sensitive to skew or extreme values.
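As a rough illustration of that effect, the sketch below applies scikit-learn's QuantileTransformer (the class used in the implementation further down) to synthetic right-skewed data and compares skewness before and after. The data, the column name "income", and the sample size are made up for the example and are not part of DataSafari.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer

# Synthetic right-skewed data (illustrative only, not from DataSafari)
rng = np.random.default_rng(444)
df = pd.DataFrame({"income": rng.exponential(scale=2.0, size=2000)})

# Map the column onto a standard normal distribution
qt = QuantileTransformer(output_distribution="normal", n_quantiles=1000, random_state=444)
transformed = pd.DataFrame(qt.fit_transform(df[["income"]]), columns=["income"])

# Compare sample skewness before and after the transformation
skew_before = df["income"].skew()
skew_after = transformed["income"].skew()
print(f"skewness before: {skew_before:.2f}, after: {skew_after:.2f}")
```

The transformation is rank-based, so the ordering of the values is preserved while the shape of the distribution changes.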

How it operates:

The method uses a quantile transformer to map the data to the chosen output distribution (output_distribution) with a set number of quantiles (n_quantiles). The default of 1000 quantiles approximates the empirical distribution finely enough to capture detailed structure while keeping computation manageable.
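One practical detail when choosing n_quantiles: scikit-learn's QuantileTransformer cannot use more quantiles than there are samples, so on small datasets it warns and falls back to the number of samples. The sketch below shows this on a made-up 200-row array; the variable names are hypothetical.

```python
import warnings

import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Small synthetic dataset: only 200 samples, one feature
rng = np.random.default_rng(444)
X_small = rng.lognormal(size=(200, 1))

with warnings.catch_warnings():
    # scikit-learn warns when n_quantiles exceeds the number of samples
    warnings.simplefilter("ignore")
    qt_small = QuantileTransformer(
        n_quantiles=1000, output_distribution="uniform", random_state=444
    )
    X_uniform = qt_small.fit_transform(X_small)

# n_quantiles_ holds the number of quantiles actually used after fitting
print(qt_small.n_quantiles_)
# quantiles_ has shape (n_quantiles_, n_features)
print(qt_small.quantiles_.shape)
# With a uniform target, transformed values lie in [0, 1]
print(X_uniform.min(), X_uniform.max())
```

For datasets with fewer rows than the default of 1000, passing a smaller n_quantiles explicitly avoids the warning.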

Usage:

To perform quantile transformation on numerical variables:

transformed_df, quantile_transformed_columns = transform_num(your_df, your_numerical_variables, method='quantile', output_distribution='normal', n_quantiles=1000, random_state=444)

This method returns both the DataFrame with the transformed numerical variables and a DataFrame containing only the transformed columns. You can also customize output_distribution, n_quantiles, and random_state. All three have reasonable defaults, so users who don't need this level of customization can use the method out of the box.

Notes:

ETA444 commented 6 months ago

Here’s how you can implement the 'quantile' method within the transform_num() function:


Implementation Summary:

The 'quantile' method applies a quantile transformation to numerical variables, mapping the data to a specified distribution.

Code Breakdown:

  1. Method Header:

    • Purpose: Clearly indicate the start of the 'quantile' method implementation and provide context.
    if method.lower() == 'quantile':
        print("< QUANTILE TRANSFORMATION >")
        print(f" This method maps the data to a '{output_distribution}' distribution and n_quantiles = {n_quantiles}. Random state set to {random_state}")
        print(f"  ✔ Transforms skewed or outlier-affected data to follow a standard {'normal' if output_distribution == 'normal' else 'uniform'} distribution, improving statistical analysis and ML model accuracy.")
        print(f"  ✔ Utilizes {n_quantiles} quantiles to finely approximate the empirical distribution, capturing the detailed data structure while balancing computational efficiency.\n")
        print("☻ Tip: The choice of 1000 quantiles as a default provides a good compromise between detailed distribution mapping and practical computational demands. Adjust as needed based on dataset size and specificity.\n")
  2. Initialize Quantile Transformer:

    • Purpose: Prepare the necessary data structures for transformation.
    # initialize the DataFrame to work with
    transformed_df = df.copy()
    
    # define and apply Quantile Transformer
    quantile_transformer = QuantileTransformer(output_distribution=output_distribution, n_quantiles=n_quantiles, random_state=random_state)
    transformed_df[numerical_variables] = quantile_transformer.fit_transform(df[numerical_variables])
  3. Output Results:

    • Purpose: Display the results and perform a sanity check.
    # isolate transformed columns to give as part of output
    quantile_transformed_columns = transformed_df[numerical_variables]
    
    print(f"✔ New transformed dataframe:\n{transformed_df.head()}\n")
    print(f"✔ Dataframe with only the transformed columns:\n{quantile_transformed_columns.head()}\n")
    print("☻ HOW TO: Apply this transformation using `transformed_df, quantile_transformed_columns = transform_num(your_df, your_numerical_variables, method='quantile', output_distribution='normal', n_quantiles=1000, random_state=444)`.\n")
    
    # sanity check
    print("< SANITY CHECK >")
    print(f"  ➡ Shape of original dataframe: {df.shape}")
    print(f"  ➡ Shape of transformed dataframe: {transformed_df.shape}\n")
    print("* After transformation, evaluate your data's distribution and consider its impact on your analysis or modeling approach.\n")
    
    return transformed_df, quantile_transformed_columns
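
Taken together, the three steps above can be approximated by a small stand-alone function. This is only a sketch under assumptions, not the DataSafari implementation: the name quantile_transform_sketch is hypothetical, the console output is omitted, the defaults simply mirror the values in the usage example, and the min() cap on n_quantiles is an addition that is not in the code above.

```python
import pandas as pd
from sklearn.preprocessing import QuantileTransformer


def quantile_transform_sketch(df, numerical_variables, output_distribution="normal",
                              n_quantiles=1000, random_state=444):
    """Hypothetical stand-alone version of the 'quantile' branch (no printing)."""
    # Work on a copy so the caller's DataFrame is left untouched
    transformed_df = df.copy()

    transformer = QuantileTransformer(
        output_distribution=output_distribution,
        # Assumption, not in the original code: cap n_quantiles at the row count
        n_quantiles=min(n_quantiles, len(df)),
        random_state=random_state,
    )
    transformed_df[numerical_variables] = transformer.fit_transform(df[numerical_variables])

    # Return the full DataFrame and the transformed columns on their own
    return transformed_df, transformed_df[numerical_variables]


# Made-up example data with two numerical columns and one categorical column
df = pd.DataFrame({
    "age": [23, 35, 41, 52, 67, 29, 38, 44],
    "income": [21_000, 48_000, 52_000, 61_000, 250_000, 33_000, 45_000, 58_000],
    "city": list("ABABABAB"),
})
transformed_df, quantile_cols = quantile_transform_sketch(df, ["age", "income"])
print(transformed_df.head())
print(list(quantile_cols.columns))
```

Columns not listed in numerical_variables, such as "city" here, pass through unchanged, which is why the sanity check in step 3 expects identical shapes.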

See the Full Function:

You can refer to the complete implementation of the transform_num() function, including the 'quantile' method, on GitHub: transform_num().