This solution addresses the issue "Write NumPy docstring for transform_num()" by providing a detailed NumPy-style docstring for the transform_num() function.
Summary:
The function transform_num() applies various numerical data transformations to improve machine learning model performance or data analysis. The updated docstring follows the NumPy format and includes details on the parameters, return values, exceptions, and examples.
Docstring Sections Preview:
Description
"""
Applies various numerical data transformations to improve machine learning model performance or data analysis.
Parameters
Parameters
----------
df : pd.DataFrame
The DataFrame containing the numerical data to transform.
numerical_variables : list
A list of column names in `df` that are numerical and will be transformed.
method : str
The transformation method to apply. Valid methods include:
- 'standardize': Rescales each variable to mean 0 and standard deviation 1. Suitable for algorithms sensitive to variable scales.
- 'log': Natural logarithm transformation for positively skewed data.
- 'normalize': Scales data to a [0, 1] range. Useful for models sensitive to variable scales.
- 'quantile': Maps data onto a specified output distribution ('normal' or 'uniform') based on its quantiles.
- 'robust': Scales data using the median and quantile range, reducing the influence of outliers.
- 'boxcox': Normalizes skewed data; requires strictly positive values.
- 'yeojohnson': Similar to Box-Cox but suitable for both positive and negative values.
- 'power': Raises numerical variables to specified powers for distribution adjustment.
- 'winsorization': Caps extreme values to reduce impact of outliers.
- 'interaction': Creates new features by multiplying pairs of numerical variables.
- 'polynomial': Generates polynomial features up to a specified degree.
- 'bin': Groups numerical data into bins or intervals.
output_distribution : str, optional
Output distribution for the 'quantile' method ('normal' or 'uniform'). Default is 'normal'.
n_quantiles : int, optional
Number of quantiles to use for the 'quantile' method. Default is 1000.
random_state : int, optional
Random state for the 'quantile' method. Default is 444.
with_centering : bool, optional
Whether to center the data before scaling for the 'robust' method. Default is True.
quantile_range : tuple, optional
Quantile range used for the 'robust' method. Default is (25.0, 75.0).
power : float, optional
The power to which each numerical variable is raised for the 'power' method. Default is None.
power_map : dict, optional
A dictionary mapping variables to their respective powers for the 'power' method. Default is None.
lower_percentile : float, optional
Lower percentile for the 'winsorization' method. Default is 0.01.
upper_percentile : float, optional
Upper percentile for the 'winsorization' method. Default is 0.99.
winsorization_map : dict, optional
A dictionary specifying winsorization bounds per variable. Default is None.
interaction_pairs : list, optional
List of tuples specifying pairs of variables for creating interaction terms. Default is None.
degree : int, optional
The degree of the polynomial features for the 'polynomial' method. Default is None.
degree_map : dict, optional
A dictionary mapping variables to their respective degrees for the 'polynomial' method. Default is None.
bins : int, optional
The number of equal-width bins to use for the 'bin' method. Default is None.
bin_map : dict, optional
A dictionary specifying custom binning criteria per variable for the 'bin' method. Default is None.
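To make a few of the `method` options above concrete, here is a minimal sketch of what 'standardize', 'normalize', 'log', 'winsorization', and 'bin' might compute, written with plain pandas and NumPy. It is not the implementation of `transform_num()`; the column name `income`, the toy data, and the choice of population standard deviation (`ddof=0`) are assumptions made for illustration only.

```python
import numpy as np
import pandas as pd

# Toy data; the column name "income" is purely illustrative.
df = pd.DataFrame({"income": [20.0, 30.0, 40.0, 50.0, 1000.0]})
col = df["income"]

# 'standardize': subtract the mean and divide by the standard deviation.
# ddof=0 (population SD) is an assumption; the real function may differ.
standardized = (col - col.mean()) / col.std(ddof=0)

# 'normalize': min-max scaling into the [0, 1] range.
normalized = (col - col.min()) / (col.max() - col.min())

# 'log': natural logarithm, only defined for strictly positive values.
logged = np.log(col)

# 'winsorization': cap values at the lower and upper percentiles
# (0.01 and 0.99 mirror the documented defaults).
lower, upper = col.quantile(0.01), col.quantile(0.99)
winsorized = col.clip(lower=lower, upper=upper)

# 'bin': group values into equal-width intervals.
binned = pd.cut(col, bins=3)
```

The remaining methods ('quantile', 'robust', 'boxcox', 'yeojohnson', 'polynomial') are typically delegated to a statistics or machine learning library, so they are left out of this sketch.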
Returns
Returns
-------
transformed_df : pd.DataFrame
The DataFrame with transformed numerical variables.
transformed_columns : pd.DataFrame
A DataFrame containing only the transformed columns.
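The two return values described above can be illustrated with a small stand-in. The helper `standardize_sketch` below is hypothetical and only mimics the documented return shape for a single method. Whether `transform_num()` overwrites the original columns or adds new ones is not stated in the docstring, so overwriting them in a copy is an assumption here.

```python
import pandas as pd


def standardize_sketch(df, numerical_variables):
    """Hypothetical stand-in that mimics the documented return shape."""
    # Only the transformed columns (second return value).
    transformed_columns = df[numerical_variables].copy()
    transformed_columns = (
        transformed_columns - transformed_columns.mean()
    ) / transformed_columns.std(ddof=0)

    # Full DataFrame with the transformed columns written into a copy,
    # so the caller's DataFrame is left untouched (an assumption).
    transformed_df = df.copy()
    transformed_df[numerical_variables] = transformed_columns
    return transformed_df, transformed_columns


# Illustrative data: one numerical column and one non-numerical column.
df = pd.DataFrame({"age": [20.0, 30.0, 40.0], "city": ["A", "B", "C"]})
transformed_df, transformed_columns = standardize_sketch(df, ["age"])
```

Unpacking the tuple gives the caller both the full DataFrame and the transformed subset without a second column selection.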
Raises
Raises
------
TypeError
- If `df` is not a pandas DataFrame.
- If `numerical_variables` is not a list.
- If `method` is not a string.
- If `output_distribution` is provided but not a string.
- If `n_quantiles` is not an integer.
- If `random_state` is not an integer.
- If `with_centering` is not a boolean.
- If `quantile_range` is not a tuple of two floats.
- If `power` is provided but not a float.
- If `power_map`, `winsorization_map`, `degree_map`, or `bin_map` is provided but not a dictionary.
- If `lower_percentile` or `upper_percentile` is not a float.
- If `interaction_pairs` is not a list of tuples, or tuples are not of length 2.
- If `degree` is provided but not an integer.
- If `bins` is provided but not an integer.
ValueError
- If the input DataFrame is empty, as there is then no data to transform.
- If the `numerical_variables` list is empty.
- If any variable in `numerical_variables` is not numerical.
- If any of the specified `numerical_variables` are not found in the DataFrame's columns.
- If the `method` specified is not one of the valid methods: 'standardize', 'log', 'normalize', 'quantile', 'robust', 'boxcox', 'yeojohnson', 'power', 'winsorization', 'interaction', 'polynomial', 'bin'.
- If `output_distribution` is not 'normal' or 'uniform' for the 'quantile' method.
- If `n_quantiles` is not a positive integer for the 'quantile' method.
- If `quantile_range` does not consist of two float values in the range 0 to 100 (consistent with the default of (25.0, 75.0)) for the 'robust' method.
- If `power` is not provided for the 'power' method when required.
- If `lower_percentile` or `upper_percentile` is not between 0 and 1, or if `lower_percentile` is greater than or equal to `upper_percentile` for the 'winsorization' method.
- If `degree` is not provided or is not a positive integer for the 'polynomial' method when required.
- If `bins` is not a positive integer for the 'bin' method when required.
- If `method` is 'log', 'boxcox', or 'yeojohnson' and the provided columns contain NAs or Infs, as these methods are not compatible with missing or infinite values.
- If specified keys in `power_map`, `winsorization_map`, `degree_map`, or `bin_map` do not match any column in the DataFrame.
- If the `interaction_pairs` specified do not consist of columns that exist in the DataFrame.
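The checks listed under Raises could be sketched roughly as follows. This is a simplified illustration covering only a subset of the conditions. The function name `validate_inputs_sketch`, the error messages, and the order of the checks are assumptions and are not taken from the actual implementation.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_numeric_dtype

VALID_METHODS = {
    "standardize", "log", "normalize", "quantile", "robust", "boxcox",
    "yeojohnson", "power", "winsorization", "interaction", "polynomial", "bin",
}


def validate_inputs_sketch(df, numerical_variables, method):
    """Hypothetical subset of the documented input checks."""
    # Type checks come first and raise TypeError.
    if not isinstance(df, pd.DataFrame):
        raise TypeError("df must be a pandas DataFrame.")
    if not isinstance(numerical_variables, list):
        raise TypeError("numerical_variables must be a list.")
    if not isinstance(method, str):
        raise TypeError("method must be a string.")

    # Value checks raise ValueError.
    if df.empty:
        raise ValueError("df is empty.")
    if not numerical_variables:
        raise ValueError("numerical_variables is empty.")
    # Existence is checked before dtype so a missing column gives a clear error.
    missing = [v for v in numerical_variables if v not in df.columns]
    if missing:
        raise ValueError(f"Columns not found in df: {missing}")
    non_numeric = [v for v in numerical_variables if not is_numeric_dtype(df[v])]
    if non_numeric:
        raise ValueError(f"Columns are not numerical: {non_numeric}")
    if method not in VALID_METHODS:
        raise ValueError(f"Invalid method: {method!r}")

    # 'log', 'boxcox' and 'yeojohnson' reject missing or infinite values.
    if method in {"log", "boxcox", "yeojohnson"}:
        values = df[numerical_variables]
        if values.isna().any().any() or np.isinf(values).any().any():
            raise ValueError("Columns contain NAs or Infs.")


# Illustrative data; the column names are made up for this sketch.
sample = pd.DataFrame({"x": [1.0, 2.0, np.nan], "label": ["a", "b", "c"]})
validate_inputs_sketch(sample, ["x"], "standardize")  # passes silently
```

Running the type checks before the value checks keeps the two exception classes cleanly separated, which matches how the Raises section groups them.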