ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Write NumPy docstring for transform_num() #50

Closed ETA444 closed 6 months ago

ETA444 commented 6 months ago

Written and accessible:

help(transform_num)

This solution addresses the issue "Write NumPy docstring for transform_num()" by providing a detailed NumPy-style docstring for the transform_num() function.

Summary:

The function transform_num() applies various numerical data transformations to improve machine learning model performance or data analysis. The updated docstring follows the NumPy format and includes details on the parameters, return values, exceptions, and examples.

Docstring Sections Preview:

Description

"""
Applies various numerical data transformations to improve machine learning model performance or data analysis.

Parameters

Parameters
----------
df : pd.DataFrame
    The DataFrame containing the numerical data to transform.
numerical_variables : list
    A list of column names in `df` that are numerical and will be transformed.
method : str
    The transformation method to apply. Valid methods include:
        - 'standardize': Rescales data to mean 0 and standard deviation 1. Suitable for algorithms sensitive to variable scales.
        - 'log': Natural logarithm transformation for positively skewed data.
        - 'normalize': Scales data to a [0, 1] range. Useful for models sensitive to variable scales.
        - 'quantile': Transforms data to follow a specified distribution, improving statistical analysis.
        - 'robust': Scales data using the median and quantile range, reducing the influence of outliers.
        - 'boxcox': Normalizes skewed data; requires positive values.
        - 'yeojohnson': Similar to Box-Cox but suitable for both positive and negative values.
        - 'power': Raises numerical variables to specified powers for distribution adjustment.
        - 'winsorization': Caps extreme values to reduce impact of outliers.
        - 'interaction': Creates new features by multiplying pairs of numerical variables.
        - 'polynomial': Generates polynomial features up to a specified degree.
        - 'bin': Groups numerical data into bins or intervals.
output_distribution : str, optional
    Specifies the output distribution for 'quantile' method ('normal' or 'uniform'). Default is 'normal'.
n_quantiles : int, optional
    Number of quantiles to use for 'quantile' method. Default is 1000.
random_state : int, optional
    Random state for 'quantile' method. Default is 444.
with_centering : bool, optional
    Whether to center data before scaling for 'robust' method. Default is True.
quantile_range : tuple, optional
    Quantile range used for 'robust' method. Default is (25.0, 75.0).
power : float, optional
    The power to raise each numerical variable for 'power' method. Default is None.
power_map : dict, optional
    A dictionary mapping variables to their respective powers for 'power' method. Default is None.
lower_percentile : float, optional
    Lower percentile for 'winsorization'. Default is 0.01.
upper_percentile : float, optional
    Upper percentile for 'winsorization'. Default is 0.99.
winsorization_map : dict, optional
    A dictionary specifying winsorization bounds per variable. Default is None.
interaction_pairs : list, optional
    List of tuples specifying pairs of variables for creating interaction terms. Default is None.
degree : int, optional
    The degree for polynomial features in 'polynomial' method. Default is None.
degree_map : dict, optional
    A dictionary mapping variables to their respective degrees for 'polynomial' method. Default is None.
bins : int, optional
    The number of equal-width bins to use for 'bin' method. Default is None.
bin_map : dict, optional
    A dictionary specifying custom binning criteria per variable for 'bin' method. Default is None.
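
The method descriptions above can be made concrete with a minimal plain-Python sketch of the arithmetic behind 'standardize' and 'normalize'. This is not DataSafari's implementation: the helper names `standardize_values` and `normalize_values` are hypothetical, and using the population standard deviation is an assumption (the library may follow a different convention).

```python
import statistics


def standardize_values(values):
    """Rescale values to mean 0 and standard deviation 1 (hypothetical helper)."""
    mean = statistics.fmean(values)
    # Assumption: population standard deviation; a library may use another convention.
    sd = statistics.pstdev(values)
    return [(v - mean) / sd for v in values]


def normalize_values(values):
    """Min-max scale values to the [0, 1] range (hypothetical helper)."""
    lowest, highest = min(values), max(values)
    return [(v - lowest) / (highest - lowest) for v in values]


data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
standardized = standardize_values(data)
normalized = normalize_values(data)
```

Both helpers assume the input holds at least two distinct values; otherwise the denominator is zero.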

Returns

Returns
-------
transformed_df : pd.DataFrame
    The DataFrame with transformed numerical variables.
transformed_columns : pd.DataFrame
    A DataFrame containing only the transformed columns.

Raises

Raises
------
TypeError
    - If `df` is not a pandas DataFrame.
    - If `numerical_variables` is not a list.
    - If `method` is not a string.
    - If `output_distribution` is provided but not a string.
    - If `n_quantiles` is not an integer.
    - If `random_state` is not an integer.
    - If `with_centering` is not a boolean.
    - If `quantile_range` is not a tuple of two floats.
    - If `power` is provided but not a float.
    - If `power_map`, `winsorization_map`, `degree_map`, or `bin_map` is provided but not a dictionary.
    - If `lower_percentile` or `upper_percentile` is not a float.
    - If `interaction_pairs` is not a list of tuples, or tuples are not of length 2.
    - If `degree` is provided but not an integer.
    - If `bins` is provided but not an integer.
ValueError
    - If the input DataFrame is empty.
    - If the `numerical_variables` list is empty.
    - If any variable listed in `numerical_variables` is not numerical.
    - If any of the specified `numerical_variables` are not found in the DataFrame's columns.
    - If the `method` specified is not one of the valid methods: 'standardize', 'log', 'normalize', 'quantile', 'robust', 'boxcox', 'yeojohnson', 'power', 'winsorization', 'interaction', 'polynomial', 'bin'.
    - If `output_distribution` is not 'normal' or 'uniform' for the 'quantile' method.
    - If `n_quantiles` is not a positive integer for the 'quantile' method.
    - If `quantile_range` does not consist of two float values in the range 0 to 100 for the 'robust' method.
    - If `power` is not provided for the 'power' method when required.
    - If `lower_percentile` or `upper_percentile` is not between 0 and 1, or if `lower_percentile` is greater than or equal to `upper_percentile` for the 'winsorization' method.
    - If `degree` is not provided or is not a positive integer for the 'polynomial' method when required.
    - If `bins` is not a positive integer for the 'bin' method when required.
    - If `method` is 'log', 'boxcox', or 'yeojohnson' and the specified columns contain NaN or infinite values, which these transformations cannot handle.
    - If specified keys in `power_map`, `winsorization_map`, `degree_map`, or `bin_map` do not match any column in the DataFrame.
    - If the `interaction_pairs` specified do not consist of columns that exist in the DataFrame.
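
As an illustration of how the TypeError and ValueError conditions listed above fit together, the sketch below mirrors the documented checks for the winsorization percentiles. The function name `check_winsorization_bounds` is hypothetical, and treating 0 and 1 as valid endpoints is an assumption; the actual validation inside transform_num() may differ.

```python
def check_winsorization_bounds(lower_percentile, upper_percentile):
    """Hypothetical validator mirroring the documented winsorization checks."""
    named_values = (
        ("lower_percentile", lower_percentile),
        ("upper_percentile", upper_percentile),
    )
    for name, value in named_values:
        # Type problems are reported first, as TypeError.
        if not isinstance(value, float):
            raise TypeError(f"{name} must be a float, got {type(value).__name__}.")
        # Assumption: the endpoints 0 and 1 are accepted.
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be between 0 and 1, got {value}.")
    # The lower bound must sit strictly below the upper bound.
    if lower_percentile >= upper_percentile:
        raise ValueError("lower_percentile must be less than upper_percentile.")


def error_name(lower_percentile, upper_percentile):
    """Return the name of the exception raised, or None if the inputs pass."""
    try:
        check_winsorization_bounds(lower_percentile, upper_percentile)
    except (TypeError, ValueError) as error:
        return type(error).__name__
    return None
```

A caller can separate the two failure modes the same way: catch TypeError for wrongly typed arguments and ValueError for values that are well typed but out of range.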

Examples

Examples
--------
>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame({'Feature1': np.random.normal(0, 1, 100), 'Feature2': np.random.exponential(1, 100), 'Feature3': np.random.randint(1, 100, 100)})
>>> num_cols = ['Feature1', 'Feature2', 'Feature3']

# Standardize
>>> standardized_data, standardized_cols = transform_num(df, num_cols, method='standardize')

# Log transformation (positive-valued columns only; Feature1 contains negative values)
>>> log_data, log_cols = transform_num(df, ['Feature2', 'Feature3'], method='log')

# Normalize
>>> normalized_data, normalized_cols = transform_num(df, num_cols, method='normalize')

# Quantile transformation
>>> quant_transformed_data, quant_transformed_cols = transform_num(df, num_cols, method='quantile', output_distribution='normal', n_quantiles=1000, random_state=444)

# Robust scaling
>>> robust_transformed_df, robust_transformed_columns = transform_num(df, num_cols, method='robust', with_centering=True, quantile_range=(25.0, 75.0))

# Box-Cox transformation (requires positive values, so Feature1 is excluded)
>>> boxcox_transformed_df, boxcox_transformed_columns = transform_num(df, ['Feature2', 'Feature3'], method='boxcox')

# Yeo-Johnson transformation
>>> yeojohnson_transformed_df, yeojohnson_transformed_columns = transform_num(df, num_cols, method='yeojohnson')

# Power transformation using a uniform power
>>> power_transformed_df1, power_transformed_columns1 = transform_num(df, num_cols, method='power', power=2)

# Power transformation using a power map
>>> power_map = {'Feature1': 2, 'Feature2': 3, 'Feature3': 4}
>>> power_transformed_df2, power_transformed_columns2 = transform_num(df, num_cols, method='power', power_map=power_map)

# Winsorization with global thresholds
>>> wins_transformed_df1, wins_transformed_columns1 = transform_num(df, num_cols, method='winsorization', lower_percentile=0.01, upper_percentile=0.99)

# Winsorization using a winsorization map
>>> win_map = {'Feature1': (0.01, 0.99), 'Feature2': (0.05, 0.95), 'Feature3': [0.10, 0.90]}
>>> wins_transformed_df2, wins_transformed_columns2 = transform_num(df, num_cols, method='winsorization', winsorization_map=win_map)

# Interaction terms
>>> interactions = [('Feature1', 'Feature2'), ('Feature2', 'Feature3')]
>>> inter_transformed_df, inter_columns = transform_num(df, num_cols, method='interaction', interaction_pairs=interactions)

# Polynomial features with a degree map
>>> degree_map = {'Feature1': 2, 'Feature2': 3}
>>> poly_transformed_df, poly_features = transform_num(df, ['Feature1', 'Feature2'], method='polynomial', degree_map=degree_map)

# Binning with a bin map
>>> bin_map = {'Feature2': {'bins': 5}, 'Feature3': {'edges': [1, 20, 40, 60, 80, 100]}}
>>> bin_transformed_df, binned_columns = transform_num(df, ['Feature2', 'Feature3'], method='bin', bin_map=bin_map)
"""
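
For readers unfamiliar with the last few methods, here is a rough plain-Python sketch of what an interaction term and equal-width binning compute. The helper names `interaction_term` and `equal_width_bins` are hypothetical, and placing the maximum value in the last bin is an assumption; this does not reproduce how transform_num() labels or stores its output.

```python
def interaction_term(first, second):
    """Multiply two equally long columns element by element (hypothetical helper)."""
    return [a * b for a, b in zip(first, second)]


def equal_width_bins(values, bins):
    """Assign each value a bin index from 0 to bins - 1 (hypothetical helper).

    Assumes max(values) > min(values) so that the bin width is non-zero.
    """
    lowest, highest = min(values), max(values)
    width = (highest - lowest) / bins
    # Assumption: the maximum falls into the last bin instead of opening a new one.
    return [min(int((v - lowest) / width), bins - 1) for v in values]


feature_a = [1.0, 2.0, 3.0]
feature_b = [4.0, 5.0, 6.0]
product = interaction_term(feature_a, feature_b)

measurements = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
bin_indices = equal_width_bins(measurements, bins=5)
```

With `bins=5` over the range 0 to 10, each bin spans two units, so neighbouring values share an index.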