ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Write NumPy docstring for explore_cat() #22

Closed. ETA444 closed this issue 7 months ago.

ETA444 commented 7 months ago

The docstring is written and accessible via:

help(explore_cat)

This solution addresses the issue "Write NumPy docstring for explore_cat()" by providing a detailed NumPy-style docstring for the explore_cat() function.

Summary:

The function explore_cat() explores categorical variables within a DataFrame, providing insights through various methods. The updated docstring follows the NumPy format and includes details on the parameters, return values, exceptions, and examples.

Docstring Sections Preview:

Description

"""
Explores categorical variables within a DataFrame, providing insights through various methods. The exploration
can yield unique values, counts and percentages of those values, and the entropy to quantify data diversity.
"""

Parameters

"""
Parameters
----------
df : pd.DataFrame
    The DataFrame containing the data to be explored.
categorical_variables : list
    A list of strings specifying the names of the categorical columns to explore.
method : str, default 'all'
    Specifies the method of exploration to apply. Options include:
        - 'unique_values': Lists unique values for each specified categorical variable.
        - 'counts_percentage': Shows counts and percentages for the unique values of each variable.
        - 'entropy': Calculates the entropy for each variable, providing a measure of data diversity. See the 'calculate_entropy' function for more details on entropy calculation.
        - 'all': Applies all the above methods sequentially.
output : str, default 'print'
    Determines how the exploration results are outputted. Options are:
        - 'print': Prints the results to the console.
        - 'return': Returns the results as a single formatted string.
"""
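For readers unfamiliar with what the 'unique_values' and 'counts_percentage' methods report, here is a minimal sketch of the underlying quantities using plain pandas calls. It is not the code inside explore_cat(); the sample column and the variable names are assumptions made for illustration.

```python
import pandas as pd

# Small deterministic sample (illustrative data, not from the issue).
df = pd.DataFrame({'Category1': ['Apple', 'Banana', 'Apple', 'Cherry']})
column = df['Category1']

# 'unique_values': the distinct labels, in order of first appearance.
unique_values = list(column.unique())

# 'counts_percentage': absolute counts and their share of all rows.
counts = column.value_counts()
percentages = column.value_counts(normalize=True) * 100

# Combine both into one table for display.
summary = pd.DataFrame({'count': counts, 'percent': percentages})
print(unique_values)
print(summary)
```

How explore_cat() formats these numbers in its printed or returned string may differ from this table.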

Returns

"""
Returns
-------
str or None
    - If output='return', a string containing the formatted exploration results is returned.
    - If output='print', results are printed to the console, and the function returns None.
"""
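The print-or-return contract described above can be sketched as a small dispatcher. The helper name `deliver` is hypothetical and only mirrors the documented behaviour; it is not taken from the DataSafari source.

```python
from typing import Optional


def deliver(result: str, output: str = 'print') -> Optional[str]:
    """Hypothetical helper: print the text or hand it back to the caller."""
    if output == 'print':
        print(result)
        return None
    if output == 'return':
        return result
    raise ValueError("output must be 'print' or 'return'")


# With output='return' the caller keeps the text, e.g. to log it or write it to a file.
captured = deliver('Category1: 3 unique values', output='return')
printed = deliver('Category1: 3 unique values', output='print')
```

Returning a single string keeps the result easy to log or save, at the cost of the caller having to parse it if individual numbers are needed.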

Raises

"""
Raises
------
TypeError
    - If `df` is not a pandas DataFrame.
    - If `categorical_variables` is not a list or contains non-string elements.
    - If `method` or `output` is not a string.
ValueError
    - If `df` is empty, indicating that there is no data to evaluate.
    - If `method` is not one of the valid options ('unique_values', 'counts_percentage', 'entropy', 'all').
    - If `output` is not one of the valid options ('print', 'return').
    - If the `categorical_variables` list is empty.
    - If any column named in `categorical_variables` is not categorical.
    - If any of the specified categorical variables are not found in the DataFrame.
"""
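As a rough illustration of the error cases listed above, the sketch below checks the three non-DataFrame arguments. The helper names and the order of the checks are assumptions, and the DataFrame-related checks (empty `df`, missing or non-categorical columns) are left out to keep the example free of dependencies.

```python
VALID_METHODS = ('unique_values', 'counts_percentage', 'entropy', 'all')
VALID_OUTPUTS = ('print', 'return')


def check_explore_cat_args(categorical_variables, method='all', output='print'):
    """Hypothetical helper mirroring the documented TypeError/ValueError cases."""
    # Type checks run first, so a wrong type is never reported as a bad value.
    if not isinstance(categorical_variables, list) or not all(
        isinstance(name, str) for name in categorical_variables
    ):
        raise TypeError("categorical_variables must be a list of strings")
    if not isinstance(method, str) or not isinstance(output, str):
        raise TypeError("method and output must be strings")
    # Value checks follow once the types are known to be correct.
    if not categorical_variables:
        raise ValueError("categorical_variables must not be empty")
    if method not in VALID_METHODS:
        raise ValueError(f"method must be one of {VALID_METHODS}")
    if output not in VALID_OUTPUTS:
        raise ValueError(f"output must be one of {VALID_OUTPUTS}")


def outcome(*args, **kwargs):
    """Return 'ok' or the name of the exception the check raised."""
    try:
        check_explore_cat_args(*args, **kwargs)
        return 'ok'
    except (TypeError, ValueError) as error:
        return type(error).__name__


print(outcome(['Category1']))                   # valid call
print(outcome('Category1'))                     # a bare string, not a list
print(outcome([]))                              # empty list
print(outcome(['Category1'], method='median'))  # unknown method
```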

Examples

"""
Examples
--------
# Create a sample DataFrame to use in the examples:
>>> import numpy as np
>>> import pandas as pd
>>> data = {
...     'Category1': np.random.choice(['Apple', 'Banana', 'Cherry'], size=100),
...     'Category2': np.random.choice(['Yes', 'No'], size=100),
...     'Category3': np.random.choice(['Low', 'Medium', 'High'], size=100)
... }
>>> df = pd.DataFrame(data)

# Display unique values for 'Category1' and 'Category2'
>>> explore_cat(df, ['Category1', 'Category2'], method='unique_values', output='print')

# Explore counts and percentages for 'Category1' and 'Category2', then print the results
>>> explore_cat(df, ['Category1', 'Category2'], method='counts_percentage', output='print')

# Calculate and return the entropy of 'Category1', 'Category2', and 'Category3'
>>> result = explore_cat(df, ['Category1', 'Category2', 'Category3'], method='entropy', output='return')
>>> print(result)

# Apply all methods to 'Category1', 'Category2', and 'Category3' and print the results to the console
>>> explore_cat(df, ['Category1', 'Category2', 'Category3'], method='all', output='print')

# Using 'all' method to explore 'Category1' and 'Category2', returning the results as a string
>>> result_str = explore_cat(df, ['Category1', 'Category2'], method='all', output='return')
>>> print(result_str)
"""

Notes

"""
Notes
-----
The 'entropy' method provides a quantitative measure of the unpredictability or
diversity within each specified categorical column, calculated as outlined in the
documentation for 'calculate_entropy'. High entropy values indicate a more uniform
distribution of categories, suggesting no single category overwhelmingly dominates.
"""
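To make the note on entropy concrete, here is a small sketch of Shannon entropy computed over category labels. It assumes base-2 logarithms (bits); the helper name is hypothetical, and the actual calculate_entropy function may use a different base or implementation.

```python
import math
from collections import Counter


def shannon_entropy_bits(values):
    """Hypothetical helper: Shannon entropy, in bits, of an iterable of labels."""
    counts = Counter(values)
    total = sum(counts.values())
    # H = -sum(p * log2(p)) over the observed categories.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


# Four equally frequent categories: the most diverse case for four labels.
uniform_entropy = shannon_entropy_bits(['A', 'B', 'C', 'D'] * 25)  # 2.0 bits

# One dominant category: far less unpredictable.
skewed_entropy = shannon_entropy_bits(['A'] * 97 + ['B', 'C', 'D'])

print(uniform_entropy, skewed_entropy)
```

Because the helper accepts any iterable of labels, a pandas column such as df['Category1'] can be passed to it directly.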