ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Write NumPy docstring for evaluate_dtype() #82

Closed ETA444 closed 6 months ago

ETA444 commented 6 months ago

Written and accessible:

help(evaluate_dtype)

This solution addresses the issue "Write NumPy docstring for evaluate_dtype()" by providing a detailed NumPy-style docstring for the evaluate_dtype() function.

Summary:

The function evaluate_dtype() categorizes the data types of specified columns in a DataFrame, with enhanced handling for numerical data that may functionally serve as categorical data. The docstring follows the NumPy format and includes details on the parameters, return values, exceptions, and examples.

Docstring Sections Preview:

Description

"""
Evaluates and categorizes data types of specified columns in a DataFrame, with enhanced handling for numerical data that may functionally serve as categorical data.

This function examines columns within a DataFrame to determine if they are numerical or categorical. It goes beyond simple data type checks by considering the distribution of unique values within numerical data. This allows for more nuanced categorization, where numerical columns with a limited range of unique values can be treated as categorical, based on specified thresholds.
"""
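To make the unique-value heuristic in the description concrete, here is a minimal sketch of how a numerical column might be flagged as functionally categorical. The helper name `looks_categorical` and the way the two thresholds are combined (either condition is enough) are assumptions for illustration; this is not the actual logic inside evaluate_dtype(). It works on a plain list so it runs without pandas.

```python
def looks_categorical(values, max_unique_values_ratio=0.05, min_unique_values=10):
    """Hypothetical helper: guess whether numeric values behave like categories.

    Assumption: a column counts as categorical if it has few distinct values
    in absolute terms OR few distinct values relative to its length.
    """
    n_obs = len(values)
    if n_obs == 0:
        raise ValueError("values must not be empty")

    n_unique = len(set(values))

    # Few distinct values in absolute terms (e.g. a 1-5 rating scale).
    if n_unique < min_unique_values:
        return True

    # Few distinct values relative to the number of observations.
    return n_unique / n_obs <= max_unique_values_ratio


ratings = [1, 2, 3, 4, 5] * 40                    # 5 distinct values in 200 rows
incomes = [30000 + 137 * i for i in range(200)]   # 200 distinct values in 200 rows

print(looks_categorical(ratings))   # True
print(looks_categorical(incomes))   # False
```

The defaults mirror the documented defaults of `max_unique_values_ratio` and `min_unique_values`; everything else is illustrative.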

Parameters

"""
Parameters
----------
df : pd.DataFrame
    The DataFrame containing the data to be evaluated.
col_names : list
    A list of column names whose data types are to be evaluated.
max_unique_values_ratio : float, optional
    The maximum ratio of unique values to total observations for a column to be considered categorical. Default is 0.05.
min_unique_values : int, optional
    The minimum number of unique values for a column to be considered continuous. Columns with fewer unique values are treated as categorical. Default is 10.
string_length_threshold : int, optional
    The average string length threshold above which a column is classified as text data. Default is 50.
small_dataset_threshold : int, optional
    The number of observations below which a dataset is considered small, in which case a column is more likely to be treated as categorical. Default is 20.
output : str, optional
    Specifies the format of the output. Options are:
    - 'dict': Returns a dictionary mapping column names to their determined data types ('numerical', 'categorical', 'text', 'datetime').
    - 'list_n': Returns a list of booleans indicating whether each column in `col_names` is numerical (True) or not (False).
    - 'list_c': Returns a list of booleans indicating whether each column in `col_names` is categorical (True) or not (False).
    - 'list_d': Returns a list of booleans indicating whether each column in `col_names` is datetime (True) or not (False).
    - 'list_t': Returns a list of booleans indicating whether each column in `col_names` is text (True) or not (False).
"""

Returns

"""
Returns
-------
dict or list
    Depending on the `output` parameter, this function returns:
    - A dictionary mapping column names to their determined data types ('numerical', 'categorical', 'text' or 'datetime').
    - A list of booleans indicating the nature of each column in `col_names`, according to the specified `output` parameter.
"""
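The relationship between the 'dict' output and the boolean-list outputs can be sketched as follows. The helper `dtype_flags` and the sample classification are hypothetical; the sample is not real output from evaluate_dtype(), it only shows how each 'list_*' option would be derived from a per-column type mapping.

```python
# Which data type each list-style output option tests for.
TARGET_BY_OUTPUT = {
    "list_n": "numerical",
    "list_c": "categorical",
    "list_d": "datetime",
    "list_t": "text",
}


def dtype_flags(dtype_by_column, col_names, output="dict"):
    """Hypothetical helper: shape a column -> type mapping into the documented formats."""
    if output == "dict":
        return {name: dtype_by_column[name] for name in col_names}
    if output not in TARGET_BY_OUTPUT:
        raise ValueError(f"Unknown output option: {output!r}")
    target = TARGET_BY_OUTPUT[output]
    # One boolean per requested column, in the order of col_names.
    return [dtype_by_column[name] == target for name in col_names]


# Assumed classification, for illustration only.
sample = {"Age": "numerical", "Income": "numerical", "Department": "categorical"}
cols = ["Age", "Income", "Department"]

print(dtype_flags(sample, cols, output="list_n"))  # [True, True, False]
print(dtype_flags(sample, cols, output="list_c"))  # [False, False, True]
```

The list outputs keep the order of `col_names`, so they can be zipped back against the column names when needed.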

Raises

"""
Raises
------
TypeError
    - If `df` is not a pandas DataFrame.
    - If `col_names` is not a list or if elements of `col_names` are not all strings.
    - If `max_unique_values_ratio` is not a float or an integer.
    - If `min_unique_values`, `string_length_threshold`, or `small_dataset_threshold` is not an integer.
    - If `output` is not a string.
ValueError
    - If `df` is empty, indicating that there is no data to evaluate.
    - If `max_unique_values_ratio` is outside the range [0, 1].
    - If `min_unique_values` is less than 1, as at least one unique value is needed to categorize a column.
    - If `string_length_threshold` is less than or equal to 0, indicating an invalid threshold for text data classification.
    - If `col_names` is empty or any column name in `col_names` is not present in the DataFrame.
    - If the `output` string does not correspond to one of the valid options ('dict', 'list_n', 'list_c', 'list_d', 'list_t').
"""
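As a rough illustration of the split between TypeError and ValueError described in the Raises section, the sketch below checks types first and values second for the scalar options. The function `validate_options` is hypothetical and covers only part of the documented checks (it skips `df`, `col_names`, and `small_dataset_threshold`); it is not taken from the datasafari source.

```python
VALID_OUTPUTS = ("dict", "list_n", "list_c", "list_d", "list_t")


def validate_options(max_unique_values_ratio=0.05, min_unique_values=10,
                     string_length_threshold=50, output="dict"):
    """Hypothetical validator: wrong types raise TypeError, bad values raise ValueError."""
    # Type checks come first, so value comparisons below are always safe.
    if not isinstance(max_unique_values_ratio, (float, int)):
        raise TypeError("max_unique_values_ratio must be a float or an integer.")
    if not isinstance(min_unique_values, int):
        raise TypeError("min_unique_values must be an integer.")
    if not isinstance(string_length_threshold, int):
        raise TypeError("string_length_threshold must be an integer.")
    if not isinstance(output, str):
        raise TypeError("output must be a string.")

    # Value checks run only once every argument has an acceptable type.
    if not 0 <= max_unique_values_ratio <= 1:
        raise ValueError("max_unique_values_ratio must be within [0, 1].")
    if min_unique_values < 1:
        raise ValueError("min_unique_values must be at least 1.")
    if string_length_threshold <= 0:
        raise ValueError("string_length_threshold must be greater than 0.")
    if output not in VALID_OUTPUTS:
        raise ValueError(f"output must be one of {VALID_OUTPUTS}.")


validate_options()  # documented defaults pass silently

try:
    validate_options(output="list_x")
except ValueError as err:
    print(f"ValueError: {err}")
```

Checking types before values means a caller who passes, say, a string for `min_unique_values` gets a TypeError rather than a confusing comparison failure.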

Examples

"""
Examples
--------
>>> import pandas as pd
>>> import numpy as np
>>> df = pd.DataFrame({
...     'Age': np.random.randint(18, 35, 100),
...     'Income': np.random.normal(50000, 15000, 100),
...     'Department': np.random.choice(['HR', 'Tech', 'Admin'], 100)
... })
>>> data_type_dict = evaluate_dtype(df, ['Age', 'Income', 'Department'], output='dict')
>>> numerical_bool_list = evaluate_dtype(df, ['Age', 'Income', 'Department'], output='list_n')
>>> categorical_bool_list = evaluate_dtype(df, ['Age', 'Income', 'Department'], output='list_c')
"""