Implement new transform_num() method: 'quantile'

Description:

Method Functionality Idea:

The quantile transformation method maps the data to a specified distribution using quantiles. It transforms skewed or outlier-affected data to follow either a standard normal or a uniform distribution, which can help statistical analyses and ML models that are sensitive to skew or outliers.

How it operates:

The method uses scikit-learn's QuantileTransformer to map each numerical variable to the specified output distribution using a defined number of quantiles (n_quantiles). By default, it uses 1000 quantiles, which approximates the empirical distribution in detail while keeping computation manageable.
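
The mapping itself can be illustrated without any library. The following is a minimal sketch, not the datasafari or scikit-learn implementation: the helper name `quantile_map_to_normal` and the clipping constant `eps` are assumptions made for illustration. It takes evenly spaced quantile landmarks from the sorted data, interpolates each value's position on the empirical CDF, then passes that position through the inverse normal CDF.

```python
from bisect import bisect_right
from statistics import NormalDist


def quantile_map_to_normal(values, n_quantiles=1000, eps=1e-7):
    """Illustrative only: map values to an approximately standard normal shape."""
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        return []
    # More landmarks than samples adds no information, so cap the count.
    k = max(2, min(n_quantiles, n))
    # Reference probabilities from 0 to 1 and the matching empirical quantiles.
    probs = [i / (k - 1) for i in range(k)]
    landmarks = [ordered[round(p * (n - 1))] for p in probs]

    def cdf(x):
        # Linear interpolation between neighbouring landmarks.
        if x <= landmarks[0]:
            return 0.0
        if x >= landmarks[-1]:
            return 1.0
        hi = bisect_right(landmarks, x)
        lo = hi - 1
        span = landmarks[hi] - landmarks[lo]
        frac = (x - landmarks[lo]) / span if span else 0.0
        return probs[lo] + frac * (probs[hi] - probs[lo])

    normal = NormalDist()
    # Clip away from 0 and 1 so the inverse CDF stays finite.
    return [normal.inv_cdf(min(max(cdf(v), eps), 1 - eps)) for v in values]
```

Because the output depends only on each value's rank-like position, a single extreme value ends up at a bounded distance from the rest instead of stretching the scale.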

Usage:

To perform quantile transformation on numerical variables:

transformed_df, quantile_transformed_columns = transform_num(your_df, your_numerical_variables, method='quantile', output_distribution='normal', n_quantiles=1000, random_state=444)

This call returns two objects: the full DataFrame with the numerical variables transformed, and a DataFrame containing only the transformed columns. You can also customize output_distribution, n_quantiles, and random_state. All three have reasonable defaults, so users who don't need this level of customization can use the method out of the box.

Notes:

The choice of 1000 quantiles as a default provides a good compromise between detailed distribution mapping and practical computational demands. Adjust as needed based on dataset size and specificity.
After transformation, it's essential to evaluate your data's distribution and consider its impact on your analysis or modelling approach.
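
As one concrete case for adjusting n_quantiles: scikit-learn's QuantileTransformer caps n_quantiles at the number of samples and emits a warning when the requested value is larger. The sketch below shows this on a small synthetic array; the data and variable names are made up for illustration.

```python
import warnings

import numpy as np
from sklearn.preprocessing import QuantileTransformer

# Synthetic, right-skewed sample with only 50 rows (illustrative data).
rng = np.random.default_rng(444)
small_sample = rng.exponential(scale=2.0, size=(50, 1))

transformer = QuantileTransformer(
    output_distribution="normal", n_quantiles=1000, random_state=444
)

# Record warnings instead of printing them, so they can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    transformer.fit(small_sample)

# The fitted attribute reports the number of quantiles actually used.
effective_n_quantiles = transformer.n_quantiles_
print(effective_n_quantiles)
print([str(w.message) for w in caught])
```

A caller could avoid the warning by passing `min(n_quantiles, len(df))`. Whether transform_num() should do this internally is a design choice left open here.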

Here’s how you can implement the 'quantile' method within the transform_num() function:

Implementation Summary:

The 'quantile' method applies a quantile transformation to numerical variables, mapping the data to a specified distribution.

Code Breakdown:

Method Header:

Purpose: Clearly indicate the start of the 'quantile' method implementation and provide context.

if method.lower() == 'quantile':
    # announce the method and echo the settings in use
    print("< QUANTILE TRANSFORMATION >")
    print(f" This method maps the data to a '{output_distribution}' distribution using n_quantiles = {n_quantiles}. Random state set to {random_state}")
    print(f"  ✔ Transforms skewed or outlier-affected data to follow a standard {'normal' if output_distribution == 'normal' else 'uniform'} distribution, improving statistical analysis and ML model accuracy.")
    print(f"  ✔ Utilizes {n_quantiles} quantiles to finely approximate the empirical distribution, capturing the detailed data structure while balancing computational efficiency.\n")
    print("☻ Tip: The choice of 1000 quantiles as a default provides a good compromise between detailed distribution mapping and practical computational demands. Adjust as needed based on dataset size and specificity.\n")

Initialize Quantile Transformer:

Purpose: Copy the input DataFrame, then fit the QuantileTransformer and apply it to the numerical variables.

# initialize the DataFrame to work with
transformed_df = df.copy()

# define and apply Quantile Transformer
# (requires `from sklearn.preprocessing import QuantileTransformer` at module level)
quantile_transformer = QuantileTransformer(output_distribution=output_distribution, n_quantiles=n_quantiles, random_state=random_state)
transformed_df[numerical_variables] = quantile_transformer.fit_transform(df[numerical_variables])

Output Results:

Purpose: Display the results and perform a sanity check.

# isolate transformed columns to give as part of output
quantile_transformed_columns = transformed_df[numerical_variables]

print(f"✔ New transformed dataframe:\n{transformed_df.head()}\n")
print(f"✔ Dataframe with only the transformed columns:\n{quantile_transformed_columns.head()}\n")
print("☻ HOW TO: Apply this transformation using `transformed_df, quantile_transformed_columns = transform_num(your_df, your_numerical_variables, method='quantile', output_distribution='normal', n_quantiles=1000, random_state=444)`.\n")

# sanity check
print("< SANITY CHECK >")
print(f"  ➡ Shape of original dataframe: {df.shape}")
print(f"  ➡ Shape of transformed dataframe: {transformed_df.shape}\n")
print("* After transformation, evaluate your data's distribution and consider its impact on your analysis or modeling approach.\n")

return transformed_df, quantile_transformed_columns
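
For readers who want to try these steps outside the package, here is a minimal standalone sketch. It is not the datasafari implementation: the function name `quantile_transform_sketch`, the defaults for output_distribution and random_state, the row-count cap on n_quantiles, and the sample data are all assumptions for illustration, and the console messages are omitted.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import QuantileTransformer


def quantile_transform_sketch(
    df,
    numerical_variables,
    output_distribution="normal",  # assumed default
    n_quantiles=1000,
    random_state=444,  # assumed default, taken from the usage example
):
    """Hypothetical stand-in for transform_num(..., method='quantile')."""
    transformed_df = df.copy()
    transformer = QuantileTransformer(
        output_distribution=output_distribution,
        # Assumption: cap at the row count to avoid the warning on small data.
        n_quantiles=min(n_quantiles, len(df)),
        random_state=random_state,
    )
    transformed_df[numerical_variables] = transformer.fit_transform(
        df[numerical_variables]
    )
    # .copy() so later edits to one result do not affect the other.
    return transformed_df, transformed_df[numerical_variables].copy()


# Illustrative data: two right-skewed columns and one untouched label column.
rng = np.random.default_rng(444)
sample_df = pd.DataFrame(
    {
        "income": rng.lognormal(mean=10, sigma=1, size=200),
        "wait_time": rng.exponential(scale=3.0, size=200),
        "segment": ["a", "b"] * 100,
    }
)

transformed_df, quantile_columns = quantile_transform_sketch(
    sample_df, ["income", "wait_time"]
)
print(transformed_df.head())
print(quantile_columns.skew())
```

Comparing `sample_df[["income", "wait_time"]].skew()` with `quantile_columns.skew()` is one quick way to do the post-transformation evaluation that the notes recommend.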

See the Full Function:

You can refer to the complete implementation of the transform_num() function, including the 'quantile' method, on GitHub: transform_num().

ETA444 / datasafari