This solution addresses the issue "Write NumPy docstring for explore_num()" by providing a detailed NumPy-style docstring for the explore_num() function.
Summary:
The function explore_num() analyzes numerical variables in a DataFrame for various characteristics and issues. The updated docstring follows the NumPy format and includes details on the parameters, return values, exceptions, and examples.
Docstring Sections Preview:
Description
"""
Analyze numerical variables in a DataFrame: distribution characteristics (skewness, kurtosis, normality tests), outlier detection with multiple methods (Z-score, IQR, Mahalanobis), correlation analysis, and multicollinearity detection.
"""
Parameters
"""
Parameters
----------
df : pd.DataFrame
The DataFrame containing the numerical data to analyze.
numerical_variables : list
A list of strings representing the column names in `df` to be analyzed.
method : str, optional, default 'all'
Specifies the analysis method to apply. Options include:
- 'correlation_analysis' for analyzing the correlation between numerical variables.
- 'distribution_analysis' for distribution characteristics, including skewness and kurtosis, and normality tests (Shapiro-Wilk, Anderson-Darling).
- 'outliers_zscore' for outlier detection using the Z-score method.
- 'outliers_iqr' for outlier detection using the Interquartile Range method.
- 'outliers_mahalanobis' for outlier detection using the Mahalanobis distance.
- 'multicollinearity' for detecting multicollinearity among the numerical variables.
- 'all' to perform all available analyses. Default is 'all'.
output : str, optional, default 'print'
Determines the output format. Options include:
- 'print' to print the analysis results to the console.
- 'return' to return the analysis results as a DataFrame or dictionaries, depending on the analysis type. Default is 'print'.
threshold_z : int or float, optional, default 3
Z-score threshold applied when `method` is 'outliers_zscore'. Set a different value if the default of 3 does not fit your needs.
"""
Returns
"""
Returns
-------
Depending on the method and output chosen:
- For 'correlation_analysis', returns a DataFrame showing the correlation coefficients between variables if output is 'return'.
- For 'distribution_analysis', returns a DataFrame with distribution statistics if output is 'return'.
- For outlier detection methods ('outliers_zscore', 'outliers_iqr', 'outliers_mahalanobis'), returns a dictionary mapping variables to their outlier values and a DataFrame of rows considered outliers if output is 'return'.
- For 'multicollinearity', returns a DataFrame or a Series indicating the presence of multicollinearity, such as VIF scores, if output is 'return'.
- If 'output' is set to 'return' and 'method' is 'all', returns a comprehensive summary of all analyses as text or a combination of DataFrames and dictionaries.
"""
Raises
"""
Raises
------
TypeError
- If `df` is not a pandas DataFrame.
- If `numerical_variables` is not a list of strings.
- If `method` is not a string.
- If `output` is not a string.
- If `threshold_z` is not a float or an int.
ValueError
- If `df` is empty, so there is no data to evaluate.
- If `method` is not one of the specified valid methods ('correlation_analysis', 'distribution_analysis', 'outliers_zscore', 'outliers_iqr', 'outliers_mahalanobis', 'multicollinearity', 'all').
- If `output` is not 'print' or 'return'.
- If `numerical_variables` is an empty list.
- If any column listed in `numerical_variables` does not have a numeric dtype.
- If any specified variables in `numerical_variables` are not found in the DataFrame's columns.
"""
Examples
"""
Examples
--------
# Generating a sample DataFrame for demonstration
>>> import numpy as np
>>> import pandas as pd
>>> np.random.seed(0) # For reproducible results
>>> data = {
... 'Feature1': np.random.normal(loc=0, scale=1, size=100), # Normally distributed data
... 'Feature2': np.random.exponential(scale=2, size=100), # Exponentially distributed data
... 'Feature3': np.random.randint(low=1, high=100, size=100) # Uniformly distributed integers
... }
>>> df = pd.DataFrame(data)
# Importing the explore_num function (assuming it is defined elsewhere in your module)
# from your_module import explore_num
# Performing correlation analysis and printing the results
>>> explore_num(df, ['Feature1', 'Feature2', 'Feature3'], method='correlation_analysis', output='print')
# Conducting distribution analysis and capturing the returned DataFrame for further analysis
>>> distribution_results = explore_num(df, ['Feature1', 'Feature2', 'Feature3'], method='distribution_analysis', output='return')
>>> print(distribution_results)
# Detecting outliers using the IQR method and printing the results
>>> explore_num(df, ['Feature1', 'Feature2', 'Feature3'], method='outliers_iqr', output='print')
# Detecting outliers using the Z-score method with a custom threshold and printing the results
>>> explore_num(df, ['Feature1', 'Feature2', 'Feature3'], method='outliers_zscore', output='print', threshold_z=2.5)
# Identifying outliers using the Mahalanobis distance method and printing the results
>>> explore_num(df, ['Feature1', 'Feature2', 'Feature3'], method='outliers_mahalanobis', output='print')
# Examining multicollinearity among the numerical features and printing the VIF scores
>>> explore_num(df, ['Feature1', 'Feature2', 'Feature3'], method='multicollinearity', output='print')
# Applying all available analyses and printing the comprehensive results
>>> explore_num(df, ['Feature1', 'Feature2', 'Feature3'], method='all', output='print')
"""
Notes
"""
Notes
-----
- The function adds interpretive insights and conclusions based on the statistical tests and analyses it runs.
- Normality tests assess whether the data depart from a normal distribution, an assumption that many parametric statistical methods rely on.
- Correlation analysis examines the strength and direction of relationships between numerical variables.
- Multicollinearity detection matters for regression analysis: highly collinear predictors inflate the variance of coefficient estimates and make them unreliable.
"""