Write NumPy docstring for explore_num()

Written and accessible:

help(explore_num)

This solution resolves the issue "Write NumPy docstring for explore_num()" by adding a detailed NumPy-style docstring to explore_num().

Summary:

The function explore_num() analyzes numerical variables in a DataFrame: distribution characteristics, outliers, correlation, and multicollinearity. The updated docstring follows the NumPy format and documents the parameters, return values, exceptions, and examples.

Docstring Sections Preview:

Description

"""
Analyze numerical variables in a DataFrame for distribution characteristics, outlier detection using multiple methods (Z-score, IQR, Mahalanobis), normality tests, skewness, kurtosis, correlation analysis, and multicollinearity detection.
"""

Parameters

"""
Parameters
----------
df : pd.DataFrame
    The DataFrame containing the numerical data to analyze.
numerical_variables : list
    A list of strings representing the column names in `df` to be analyzed.
method : str, optional, default 'all'
    Specifies the analysis method to apply. Options include:
    - 'correlation_analysis' for analyzing the correlation between numerical variables.
    - 'distribution_analysis' for distribution characteristics, including skewness and kurtosis, and normality tests (Shapiro-Wilk, Anderson-Darling).
    - 'outliers_zscore' for outlier detection using the Z-score method.
    - 'outliers_iqr' for outlier detection using the Interquartile Range method.
    - 'outliers_mahalanobis' for outlier detection using the Mahalanobis distance.
    - 'multicollinearity' for detecting multicollinearity among the numerical variables.
    - 'all' to perform all available analyses. Default is 'all'.
output : str, optional, default 'print'
    Determines the output format. Options include:
    - 'print' to print the analysis results to the console.
    - 'return' to return the analysis results as a DataFrame or dictionaries, depending on the analysis type. Default is 'print'.
threshold_z : int or float, optional, default 3
    Z-score threshold used when `method` is 'outliers_zscore'. Set a different value if the default does not fit your needs.
"""
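
To make the role of `threshold_z` concrete, here is a minimal sketch of the two univariate outlier rules named in the `method` options, assuming the conventional textbook definitions. The helper names `zscore_outliers` and `iqr_outliers`, the multiplier `k=1.5`, and the use of the population standard deviation are assumptions made for illustration; they are not taken from the explore_num() source.

```python
import pandas as pd


def zscore_outliers(series: pd.Series, threshold_z: float = 3) -> pd.Series:
    """Return the values whose absolute Z-score exceeds threshold_z."""
    # Assumption: population standard deviation (ddof=0); a library may use ddof=1.
    z = (series - series.mean()) / series.std(ddof=0)
    return series[z.abs() > threshold_z]


def iqr_outliers(series: pd.Series, k: float = 1.5) -> pd.Series:
    """Return the values outside [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return series[(series < q1 - k * iqr) | (series > q3 + k * iqr)]


values = pd.Series([10, 11, 12, 11, 10, 12, 11, 10, 12, 50], name="Feature1")

default_z = zscore_outliers(values)                  # default threshold of 3
strict_z = zscore_outliers(values, threshold_z=2.5)  # lower, stricter threshold
iqr_flagged = iqr_outliers(values)
```

In this small sample the value 50 has a Z-score just under 3, so the default threshold leaves it unflagged while `threshold_z=2.5` and the IQR rule both flag it. This is the kind of case where a user would want to override the default.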

Returns

"""
Returns
-------
Depending on the method and output chosen:
- For 'correlation_analysis', returns a DataFrame showing the correlation coefficients between variables if output is 'return'.
- For 'distribution_analysis', returns a DataFrame with distribution statistics if output is 'return'.
- For outlier detection methods ('outliers_zscore', 'outliers_iqr', 'outliers_mahalanobis'), returns a dictionary mapping variables to their outlier values and a DataFrame of rows considered outliers if output is 'return'.
- For 'multicollinearity', returns a DataFrame or a Series indicating the presence of multicollinearity, such as VIF scores, if output is 'return'.
- If `output` is 'return' and `method` is 'all', returns a comprehensive summary of all analyses as text or a combination of DataFrames and dictionaries.
"""
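
As a rough illustration of what a "DataFrame with distribution statistics" could look like, the sketch below builds one row per variable using pandas alone. The column names and the choice of statistics are assumptions; the actual layout returned by explore_num() is not specified in this issue and may differ (for example, it may also carry normality-test results).

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Feature1": rng.normal(loc=0, scale=1, size=500),   # roughly symmetric
    "Feature2": rng.exponential(scale=2, size=500),     # right-skewed
})
cols = ["Feature1", "Feature2"]

# One row per variable, one column per statistic (hypothetical layout).
distribution_summary = pd.DataFrame({
    "mean": df[cols].mean(),
    "std": df[cols].std(),
    "skewness": df[cols].skew(),
    "kurtosis": df[cols].kurt(),  # excess kurtosis: about 0 for normal data
})

print(distribution_summary.round(3))
```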

Raises

"""
Raises
------
TypeError
    - If `df` is not a pandas DataFrame.
    - If `numerical_variables` is not a list of strings.
    - If `method` is not a string.
    - If `output` is not a string.
    - If `threshold_z` is not a float or an int.
ValueError
    - If `df` is empty, so there is no data to evaluate.
    - If `method` is not one of the specified valid methods ('correlation_analysis', 'distribution_analysis', 'outliers_zscore', 'outliers_iqr', 'outliers_mahalanobis', 'multicollinearity', 'all').
    - If `output` is not 'print' or 'return'.
    - If the `numerical_variables` list is empty.
    - If any variable in `numerical_variables` is not numerical.
    - If any specified variables in `numerical_variables` are not found in the DataFrame's columns.
"""
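
The checks listed under Raises could be implemented along the following lines. This is a sketch under stated assumptions, not the function's actual validation code: the helper name `validate_explore_num_args` and the error messages are hypothetical, and the column-existence check is run before the dtype check because a missing column cannot be inspected for its type.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype

VALID_METHODS = {
    "correlation_analysis", "distribution_analysis", "outliers_zscore",
    "outliers_iqr", "outliers_mahalanobis", "multicollinearity", "all",
}
VALID_OUTPUTS = {"print", "return"}


def validate_explore_num_args(df, numerical_variables, method="all",
                              output="print", threshold_z=3):
    """Hypothetical helper mirroring the documented TypeError/ValueError cases."""
    # Type checks
    if not isinstance(df, pd.DataFrame):
        raise TypeError("df must be a pandas DataFrame.")
    if not isinstance(numerical_variables, list) or not all(
        isinstance(name, str) for name in numerical_variables
    ):
        raise TypeError("numerical_variables must be a list of strings.")
    if not isinstance(method, str):
        raise TypeError("method must be a string.")
    if not isinstance(output, str):
        raise TypeError("output must be a string.")
    if not isinstance(threshold_z, (int, float)):
        raise TypeError("threshold_z must be a float or an int.")

    # Value checks
    if df.empty:
        raise ValueError("df is empty; there is no data to evaluate.")
    if method not in VALID_METHODS:
        raise ValueError(f"method must be one of {sorted(VALID_METHODS)}.")
    if output not in VALID_OUTPUTS:
        raise ValueError("output must be 'print' or 'return'.")
    if not numerical_variables:
        raise ValueError("numerical_variables must not be empty.")
    missing = [name for name in numerical_variables if name not in df.columns]
    if missing:
        raise ValueError(f"Variables not found in df: {missing}")
    non_numeric = [name for name in numerical_variables
                   if not is_numeric_dtype(df[name])]
    if non_numeric:
        raise ValueError(f"Variables are not numerical: {non_numeric}")
```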

Examples

"""
Examples
--------
# Generating a sample DataFrame for demonstration
>>> import numpy as np
>>> import pandas as pd
>>> np.random.seed(0) # For reproducible results
>>> data = {
...     'Feature1': np.random.normal(loc=0, scale=1, size=100), # Normally distributed data
...     'Feature2': np.random.exponential(scale=2, size=100),   # Exponentially distributed data
...     'Feature3': np.random.randint(low=1, high=100, size=100) # Uniformly distributed integers
... }
>>> df = pd.DataFrame(data)

# Importing the explore_num function (assuming it is defined elsewhere in your module)
# from your_module import explore_num

# Performing correlation analysis and printing the results
>>> explore_num(df, ['Feature1', 'Feature2', 'Feature3'], method='correlation_analysis', output='print')

# Conducting distribution analysis and capturing the returned DataFrame for further analysis
>>> distribution_results = explore_num(df, ['Feature1', 'Feature2', 'Feature3'], method='distribution_analysis', output='return')
>>> print(distribution_results)

# Detecting outliers using the IQR method and printing the results
>>> explore_num(df, ['Feature1', 'Feature2', 'Feature3'], method='outliers_iqr', output='print')

# Detecting outliers using the Z-score method with a custom threshold and printing the results
>>> explore_num(df, ['Feature1', 'Feature2', 'Feature3'], method='outliers_zscore', output='print', threshold_z=2.5)

# Identifying outliers using the Mahalanobis distance method and printing the results
>>> explore_num(df, ['Feature1', 'Feature2', 'Feature3'], method='outliers_mahalanobis', output='print')

# Examining multicollinearity among the numerical features and printing the VIF scores
>>> explore_num(df, ['Feature1', 'Feature2', 'Feature3'], method='multicollinearity', output='print')

# Applying all available analyses and printing the comprehensive results
>>> explore_num(df, ['Feature1', 'Feature2', 'Feature3'], method='all', output='print')
"""
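
For readers unfamiliar with the Mahalanobis method used in the examples, the following sketch computes the distance of each row from the multivariate mean with NumPy. It assumes the standard definition with the sample covariance matrix and at least two columns; the helper name `mahalanobis_distances` and the sample data are illustrative, and how explore_num() chooses its cutoff is not stated here (a chi-square quantile on the squared distance is a common choice).

```python
import numpy as np
import pandas as pd


def mahalanobis_distances(df: pd.DataFrame) -> pd.Series:
    """Distance of each row from the column means, scaled by the covariance."""
    x = df.to_numpy(dtype=float)
    centered = x - x.mean(axis=0)
    cov = np.cov(x, rowvar=False)     # sample covariance (ddof=1)
    inv_cov = np.linalg.pinv(cov)     # pseudo-inverse tolerates near-singular cov
    # Row-wise quadratic form: d2[i] = centered[i] @ inv_cov @ centered[i]
    d2 = np.einsum("ij,jk,ik->i", centered, inv_cov, centered)
    d2 = np.maximum(d2, 0.0)          # guard against tiny negative rounding errors
    return pd.Series(np.sqrt(d2), index=df.index, name="mahalanobis")


rng = np.random.default_rng(0)
points = np.vstack([rng.normal(size=(200, 2)), [[8.0, 8.0]]])  # last row is extreme
sample = pd.DataFrame(points, columns=["Feature1", "Feature2"])

distances = mahalanobis_distances(sample)
most_extreme_row = distances.idxmax()
```

Unlike the Z-score and IQR rules, this distance accounts for correlation between variables, so a row can be flagged even when each of its values looks ordinary on its own.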

Notes

"""
Notes
-----
- Enhances interpretability by providing insights and conclusions based on the statistical tests and analyses conducted.
- Normality tests assess whether the data depart from a normal distribution, which matters for statistical analyses that assume normality.
- Correlation analysis examines the strength and direction of relationships between numerical variables.
- Multicollinearity detection is essential for regression analysis, as high multicollinearity inflates the variance of coefficient estimates and makes them unreliable.
"""
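
The multicollinearity note can be illustrated with variance inflation factors (VIF). The sketch below uses the identity that the VIF of each variable equals the corresponding diagonal entry of the inverse correlation matrix, which matches 1 / (1 - R²) from regressing that variable on the others. Whether explore_num() computes VIF this way is an assumption; the helper name `vif_scores` and the sample data are illustrative only.

```python
import numpy as np
import pandas as pd


def vif_scores(df: pd.DataFrame) -> pd.Series:
    """VIF per column, taken from the diagonal of the inverse correlation matrix."""
    corr = df.corr().to_numpy()
    inv_corr = np.linalg.inv(corr)  # fails if columns are perfectly collinear
    return pd.Series(np.diag(inv_corr), index=df.columns, name="VIF")


rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)                 # independent of a
c = a + 0.1 * rng.normal(size=500)       # nearly a copy of a

vif = vif_scores(pd.DataFrame({"a": a, "b": b, "c": c}))
print(vif.round(2))
```

Here `a` and `c` receive large VIF values because each is almost fully explained by the other, while `b` stays close to 1. Cutoffs of 5 or 10 are common rules of thumb for flagging a variable.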

ETA444 / datasafari

Write NumPy docstring for explore_num() #21