ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.
https://datasafari.dev
GNU General Public License v3.0

Implement new explore_df() methods: 'desc', 'head', 'info' & 'na' #23

Closed ETA444 closed 7 months ago

ETA444 commented 9 months ago

Description:


Module Functionality Idea:

The explore_df function gives an overview of a pandas DataFrame by applying one of several exploration methods ('desc', 'head', 'info', 'na', or 'all'). The output parameter controls whether the result is printed or returned as a string, and method-specific options are passed through **kwargs.

How it operates:

The function runs the requested exploration method on the DataFrame: summary statistics ('desc'), the first few rows ('head'), concise structural information ('info'), or missing-value counts ('na'). It then prints or returns the result, depending on the chosen output format.

Usage:

To use the explore_df function:

explore_df(your_df, method='desc', output='print', **kwargs)

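Where:

    • your_df: the pandas DataFrame to explore.
    • method: one of 'desc', 'head', 'info', 'na' or 'all'.
    • output: 'print' to print the results, or 'return' to get them back as a string.
    • **kwargs: method-specific options, such as percentiles for 'desc', n for 'head' or verbose for 'info'.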

Methods:


1. Describe Method:

The desc method provides summary statistics of the DataFrame, including count, mean, standard deviation, minimum, 25th percentile, median, 75th percentile, and maximum. This method is useful for understanding the distribution and central tendencies of numerical data.

Usage:

explore_df(df, method='desc', output='print', percentiles=[0.05, 0.95])

This example will display summary statistics with custom percentiles (5th and 95th).
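
For reference, here is a minimal sketch of the plain pandas call that percentiles is presumably forwarded to (the implementation summary later in this thread passes filtered kwargs to df.describe). The sample DataFrame and its column names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({
    "age": [23, 35, 41, 29, 52],
    "income": [32000.0, 54000.0, 61000.0, 45000.0, 88000.0],
})

# Request the 5th and 95th percentiles instead of the default quartiles.
summary = df.describe(percentiles=[0.05, 0.95])
print(summary)
```

The percentile rows are labelled '5%' and '95%' in the resulting index, next to count, mean, std, min and max.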

2. Head Method:

The head method displays the first few rows of the DataFrame. It is helpful for quickly inspecting the structure and contents of the DataFrame.

Usage:

explore_df(df, method='head', output='print', n=3)

This example will display the first 3 rows of the DataFrame.

3. Info Method:

The info method provides concise information about the DataFrame, including the data types of each column, memory usage, and non-null counts. It is useful for understanding the structure and memory footprint of the DataFrame.

Usage:

explore_df(df, method='info', output='print', verbose=True)

This example will display detailed DataFrame information with verbose output.
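
One detail worth keeping in mind: pandas' DataFrame.info() writes its report to a stream and returns None, so the text has to be captured through the buf parameter before it can be handed back as a string. A minimal sketch of that capture in plain pandas, using a made-up DataFrame:

```python
import io

import pandas as pd

# Hypothetical sample data for illustration.
df = pd.DataFrame({"name": ["a", "b", None], "score": [1.0, 2.5, 3.0]})

# info() writes to the buffer instead of stdout and itself returns None.
buffer = io.StringIO()
returned = df.info(buf=buffer, verbose=True)

info_text = buffer.getvalue()
print(info_text)
```

This is also why a caller-supplied buf would conflict with a wrapper that needs the buffer for itself.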

4. NA Method:

The na method counts the number and percentage of missing values in each column of the DataFrame. It helps identify columns with missing data and assess data completeness.

Usage:

explore_df(df, method='na', output='print')

This example will display the count and percentage of missing values in each column of the DataFrame.
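
The count and percentage described above can be sketched in plain pandas as follows. The sample DataFrame is invented for illustration; the calculation mirrors the one shown in the implementation summary later in this thread.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data with a few missing entries.
df = pd.DataFrame({
    "city": ["Oslo", None, "Lima", "Pune"],
    "temp": [4.5, np.nan, np.nan, 31.0],
    "id": [1, 2, 3, 4],
})

# Missing values per column, then the same figure as a share of all rows.
na_count = df.isna().sum()
na_percent = na_count / len(df) * 100

print(na_count)
print(na_percent)
```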

5. All Methods:

The 'all' option executes all exploration methods sequentially, providing a comprehensive overview of the DataFrame.

Usage:

explore_df(df, method='all', output='print', n=3, percentiles=[0.25, 0.75])

This example will execute all exploration methods with custom settings: it displays the first 3 rows and computes summary statistics with custom percentiles (25th and 75th).

Example Usage:


# Summary statistics with custom percentiles
explore_df(df, method='desc', output='print', percentiles=[0.05, 0.95])

# Display the first 3 rows
explore_df(df, method='head', output='print', n=3)

# Detailed DataFrame information
explore_df(df, method='info', output='print', verbose=True)

# Count and percentage of missing values
explore_df(df, method='na', output='print')

# Comprehensive exploration with custom settings
explore_df(df, method='all', output='print', n=3, percentiles=[0.25, 0.75])

# Returning comprehensive exploration results as a string
result_str = explore_df(df, method='all', output='return', n=5)
print(result_str)

ETA444 commented 7 months ago

Implementation Summary:

The explore_df() function explores a DataFrame using one of the methods desc, head, info or na, or all of them together. It also lets the caller choose the output format ('print' or 'return') and accepts method-specific parameters through keyword arguments (kwargs).

Code Breakdown:

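  1. Function Definition and Parameters:
# Imports used by the snippets in this breakdown
import io
from typing import Optional

import pandas as pd


def explore_df(
        df: pd.DataFrame,
        method: str = 'all',
        output: str = 'print',
        **kwargs
) -> Optional[str]: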
  2. Error Handling:

    • Type Errors:
    if not isinstance(df, pd.DataFrame):
        raise TypeError("explore_df(): The df parameter must be a pandas DataFrame.")

    if not isinstance(method, str):
        raise TypeError("explore_df(): The method parameter must be a string.\nExample: method = 'all'")

    if not isinstance(output, str):
        raise TypeError("explore_df(): The output parameter must be a string.\nExample: output = 'print'")

    • Value Errors:
    if df.empty:
        raise ValueError("explore_df(): The input DataFrame is empty.")

    valid_methods = ['na', 'desc', 'head', 'info', 'all']
    if method.lower() not in valid_methods:
        raise ValueError(f"explore_df(): Invalid method '{method}'. Valid options are: {', '.join(valid_methods)}.")

    if output.lower() not in ['print', 'return']:
        raise ValueError("explore_df(): Invalid output method. Choose 'print' or 'return'.")

    if 'buf' in kwargs and method.lower() == 'info':
        raise ValueError("explore_df(): 'buf' parameter is not supported in the 'info' method within explore_df.")

  3. Main Function Logic:

    # result collects one formatted section per executed method.
    # filter_kwargs and valid_kwargs are defined elsewhere in the full function.
    result = []

    • Desc Method:
    if method.lower() in ["desc", "all"]:
        desc_kwargs = filter_kwargs('describe', kwargs, valid_kwargs)
        result.append(f"<<______DESCRIBE______>>\n{str(df.describe(**desc_kwargs))}\n")

    • Head Method:
    if method.lower() in ["head", "all"]:
        head_kwargs = filter_kwargs('head', kwargs, valid_kwargs)
        # Temporarily show every column so the preview is not truncated
        pd.set_option('display.max_columns', None)
        result.append(f"<<______HEAD______>>\n{str(df.head(**head_kwargs))}\n")
        pd.reset_option('display.max_columns')

    • Info Method:
    if method.lower() in ["info", "all"]:
        info_kwargs = filter_kwargs('info', kwargs, valid_kwargs)
        # df.info() writes to a stream, so capture it in a string buffer
        buffer = io.StringIO()
        df.info(buf=buffer, **info_kwargs)
        result.append(f"<<______INFO______>>\n{buffer.getvalue()}\n")

    • NA Method:
    if method.lower() in ["na", "all"]:
        na_count = df.isna().sum()
        na_percent = (na_count / df.shape[0]) * 100
        result.append(f"<<______NA_COUNT______>>\n{na_count}\n")
        result.append(f"<<______NA_PERCENT______>>\n{na_percent}\n")

    • Output:
    combined_result = "\n".join(result)

    if output.lower() == 'print':
        print(combined_result)
    elif output.lower() == 'return':
        return combined_result
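
A possible refinement of the head step, offered as a suggestion and not as what the repository does: pd.reset_option() restores the pandas default, not whatever value the caller had set before, and it is skipped if an exception occurs in between. The sketch below uses pd.option_context, which restores the previous value on exit. The 30-column DataFrame and the pre-set value of 10 are made up to make the effect visible.

```python
import pandas as pd

# Hypothetical wide DataFrame so that column truncation would normally kick in.
df = pd.DataFrame({f"col_{i}": range(3) for i in range(30)})

# Simulate a caller who has already customised the display option.
pd.set_option("display.max_columns", 10)

# Inside the block every column is rendered; on exit the caller's value returns.
with pd.option_context("display.max_columns", None):
    head_text = str(df.head(n=2))

restored = pd.get_option("display.max_columns")
print(restored)

# Clean up the simulated customisation.
pd.reset_option("display.max_columns")
```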

See the Full Function:

The full implementation can be found in the datasafari repository.
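
The breakdown above calls filter_kwargs() and valid_kwargs without showing them. As a rough idea of what such a helper might look like, here is a hypothetical sketch; the dictionary layout, the listed keyword names and the function body are assumptions for illustration and not the datasafari implementation.

```python
from typing import Any, Dict, List

# Assumed shape: pandas method name -> keyword arguments forwarded to it.
valid_kwargs: Dict[str, List[str]] = {
    "describe": ["percentiles", "include", "exclude"],
    "head": ["n"],
    "info": ["verbose", "max_cols", "memory_usage", "show_counts"],
}


def filter_kwargs(method: str,
                  kwargs: Dict[str, Any],
                  valid_kwargs: Dict[str, List[str]]) -> Dict[str, Any]:
    """Keep only the keyword arguments that the given pandas method accepts."""
    allowed = valid_kwargs.get(method, [])
    return {key: value for key, value in kwargs.items() if key in allowed}


# With method='all', one kwargs dict serves several pandas calls.
kwargs = {"n": 3, "percentiles": [0.25, 0.75]}
describe_kwargs = filter_kwargs("describe", kwargs, valid_kwargs)
head_kwargs = filter_kwargs("head", kwargs, valid_kwargs)
print(describe_kwargs)  # {'percentiles': [0.25, 0.75]}
print(head_kwargs)      # {'n': 3}
```

Filtering per method is what would let a call such as method='all' with both n and percentiles work without either pandas call receiving an unexpected keyword.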