ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Implement new transform_num() method: 'bin' #65

Closed ETA444 closed 6 months ago

ETA444 commented 8 months ago

Description:


Method Functionality Idea:

The binning transformation method groups numerical data into bins or intervals, simplifying relationships and reducing noise. Users can specify either a uniform number of bins for all variables or define custom binning criteria per variable.

How it operates:

The method takes either a global number of bins (bins) or a variable-specific binning map (bin_map). With bins, every selected variable is cut into the same number of intervals. With bin_map, each variable gets its own specification, given either as a number of bins ('bins') or as explicit bin edges ('edges'). Binning is performed with pd.cut(), which produces equal-width intervals when given an integer and uses the supplied values as interval boundaries when given a list of edges.
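As a minimal sketch of the two pd.cut() modes described above, independent of DataSafari (the sample values are made up for illustration):

```python
import pandas as pd

values = pd.Series([1, 5, 10, 15, 20])

# Integer `bins`: pd.cut() splits the observed range into equal-width intervals.
# retbins=True also returns the computed edges.
equal_width, edges = pd.cut(values, bins=2, retbins=True)

# Sequence `bins`: the numbers are used directly as interval edges.
custom = pd.cut(values, bins=[0, 10, 20])

print(equal_width.cat.categories)  # the two equal-width intervals
print(custom.cat.categories)       # the intervals built from the explicit edges
```

In both cases the result is a categorical Series, which is what makes the binned variable usable as a discrete feature.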

Usage:

To bin numerical variables:

import numpy as np
import pandas as pd

# example data
data = {
    'Feature1': np.random.normal(0, 1, 100),  # Normally distributed data
    'Feature2': np.random.exponential(1, 100),  # Exponentially distributed data (positively skewed)
    'Feature3': np.random.randint(1, 100, 100)  # Uniformly distributed integers from 1 to 99 (upper bound is exclusive)
}
df = pd.DataFrame(data)

# bin map format using the two modes
bin_map = {
    'Feature2': {'bins': 5},  # For simplicity, specifying only the number of bins for Feature2
    'Feature3': {'edges': [1, 20, 40, 60, 80, 100]}  # Defining custom bin edges for Feature3
}

# apply binning based on bin_map 
bin_transformed_df, binned_columns = transform_num(df, ['Feature2', 'Feature3'], method='bin', bin_map=bin_map)

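# This bins Feature2 into 5 equal-width intervals spanning its observed range.
# For Feature3, it creates bins from the specified edges. With pd.cut()'s default right-closed
# intervals these are (1, 20], (20, 40], (40, 60], (60, 80], (80, 100], so a value of exactly 1
# falls outside the first bin and becomes NaN.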

The method returns two objects: a copy of the original DataFrame in which the selected variables are replaced by their binned versions, and a DataFrame containing only the binned columns.
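One detail worth checking when passing custom edges: by default pd.cut() builds right-closed intervals and does not include the lowest edge. The sketch below uses a small made-up sample to show the effect with the Feature3 edges from the usage example, and how pd.cut()'s include_lowest=True changes it. Whether transform_num() should expose that option is left open here.

```python
import pandas as pd

feature3 = pd.Series([1, 20, 21, 100])
edges = [1, 20, 40, 60, 80, 100]

# Default behaviour: intervals are (1, 20], (20, 40], ..., (80, 100].
# The value 1 equals the lowest edge, so it is not assigned to any bin.
default_bins = pd.cut(feature3, bins=edges)

# include_lowest=True makes the first interval include its left edge.
with_lowest = pd.cut(feature3, bins=edges, include_lowest=True)

print(default_bins.isna().tolist())
print(with_lowest.isna().tolist())
```

An alternative that needs no extra argument is to start the edge list below the smallest expected value, for example at 0 instead of 1.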

Notes:

ETA444 commented 6 months ago

Implementation Summary:

The 'bin' method within transform_num() groups numerical data into bins or intervals. This transformation is useful for simplifying relationships and reducing noise.

Purpose:

The purpose of this method is to discretize continuous numerical variables into categorical bins, which can be useful for various data analysis and modelling tasks.

Code Breakdown:

  1. Method Header:

    • Purpose: To clearly indicate the start of the 'bin' method implementation and provide context.
    if method.lower() == 'bin' and (bins is not None or bin_map is not None):
       print(f"< BINNING TRANSFORMATION >")
       print(f" This method groups numerical data into bins or intervals, simplifying relationships and reducing noise.")
       print(f"  ✔ Users can specify a uniform number of bins for all variables or define custom binning criteria per variable.")
       print(f"✎ Note: Binning can be specified globally with 'bins' or individually with 'bin_map'.\n")
  2. Initialize DataFrame and Columns:

    • Purpose: To prepare the necessary data structures for transformation.
    transformed_df = df.copy()
    binned_columns = pd.DataFrame()
  3. Apply Uniform Binning:

    • Purpose: To apply a uniform number of bins across all specified numerical variables.
    if bins:
       for variable in numerical_variables:
           transformed_df[variable], bin_edges = pd.cut(transformed_df[variable], bins, retbins=True, labels=range(bins))
           binned_columns = pd.concat([binned_columns, transformed_df[[variable]]], axis=1)
           print(f"✔ '{variable}' has been binned into {bins} intervals.\n")
  4. Apply Custom Binning:

    • Purpose: To apply custom binning based on the provided binning map.
    elif bin_map:
       for variable, specs in bin_map.items():
           if variable in numerical_variables:
               n_bins = specs.get('bins')
               bin_edges = specs.get('edges', None)
               if bin_edges:
                   transformed_df[variable], _ = pd.cut(transformed_df[variable], bins=bin_edges, retbins=True, labels=range(len(bin_edges) - 1))
               else:
                   transformed_df[variable], _ = pd.cut(transformed_df[variable], bins=n_bins, retbins=True, labels=range(n_bins))
               binned_columns = pd.concat([binned_columns, transformed_df[[variable]]], axis=1)
               print(f"✔ '{variable}' has been custom binned based on provided specifications.\n")
  5. Output Results:

    • Purpose: To display the results and perform a sanity check.
    print(f"✔ New transformed dataframe with binned variables:\n{transformed_df.head()}\n")
    print(f"✔ Dataframe with only the binned columns:\n{binned_columns.head()}\n")
    print("☻ HOW TO: Apply this transformation using `transformed_df, binned_columns = transform_num(your_df, your_numerical_variables, method='bin', bins=3)` or use bin_map.\n")
    
    # Sanity check
    print("< SANITY CHECK >")
    print(f"  ➡ Shape of original dataframe: {df.shape}")
    print(f"  ➡ Shape of transformed dataframe: {transformed_df.shape}\n")
    print("* Review the binned data to ensure it aligns with your analysis or modeling strategy.\n")
    
    return transformed_df, binned_columns
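
The five steps above can be condensed into a standalone sketch without the console output. The function name bin_numeric is hypothetical and not part of DataSafari. It mirrors the logic quoted above (pd.cut() with integer labels, binned values replacing the selected columns) and adds two assumptions of its own: a ValueError when neither mode is given or when a bin_map entry has neither 'bins' nor 'edges', and list(range(...)) instead of a bare range for the labels. The custom edges start at 0 rather than 1 so the smallest value is not dropped.

```python
import numpy as np
import pandas as pd


def bin_numeric(df, numerical_variables, bins=None, bin_map=None):
    """Hypothetical stand-in for the 'bin' branch of transform_num()."""
    if bins is None and bin_map is None:
        raise ValueError("Provide either `bins` or `bin_map`.")

    # Normalise both modes into one {variable: spec} mapping.
    if bins is not None:
        specs_by_variable = {v: {'bins': bins} for v in numerical_variables}
    else:
        specs_by_variable = {
            v: spec for v, spec in bin_map.items() if v in numerical_variables
        }

    transformed_df = df.copy()
    binned = {}

    for variable, spec in specs_by_variable.items():
        edges = spec.get('edges')
        n_bins = spec.get('bins')
        if edges is None and n_bins is None:
            # Assumption: fail loudly rather than pass bins=None to pd.cut().
            raise ValueError(f"'{variable}' needs either 'bins' or 'edges'.")

        cut_spec = edges if edges is not None else n_bins
        n_labels = len(edges) - 1 if edges is not None else n_bins

        # Integer labels 0..n-1, as in the quoted implementation.
        transformed_df[variable] = pd.cut(
            df[variable], bins=cut_spec, labels=list(range(n_labels))
        )
        binned[variable] = transformed_df[variable]

    return transformed_df, pd.DataFrame(binned)


# Example data, seeded so the run is reproducible.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    'Feature2': rng.exponential(1, 100),
    'Feature3': rng.integers(1, 100, 100),  # integers from 1 to 99
})

bin_map = {
    'Feature2': {'bins': 5},
    'Feature3': {'edges': [0, 20, 40, 60, 80, 100]},
}
transformed_df, binned_columns = bin_numeric(
    df, ['Feature2', 'Feature3'], bin_map=bin_map
)
uniform_df, uniform_columns = bin_numeric(df, ['Feature2'], bins=3)

print(transformed_df.head())
print(binned_columns.head())
```

Because the binned values overwrite the selected columns, the transformed DataFrame keeps the shape of the input, which is what the sanity check in step 5 reports.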

See the Full Function:

You can refer to the complete implementation of the transform_num() function, including the 'bin' method, on GitHub: transform_num().