Implement new explore_df() methods: 'desc', 'head', 'info' & 'na'

ETA444 / datasafari

DataSafari simplifies complex data science tasks into straightforward, powerful one-liners.

GNU General Public License v3.0


    # Summary statistics with custom percentiles
    explore_df(df, method='desc', output='print', percentiles=[0.05, 0.95])

    # Display the first 3 rows
    explore_df(df, method='head', output='print', n=3)

    # Detailed DataFrame information
    explore_df(df, method='info', output='print', verbose=True)

    # Count and percentage of missing values
    explore_df(df, method='na', output='print')

    # Comprehensive exploration with custom settings
    explore_df(df, method='all', output='print', n=3, percentiles=[0.25, 0.75])

    # Returning comprehensive exploration results as a string
    result_str = explore_df(df, method='all', output='return', n=5)
    print(result_str)

Implementation Summary:

The explore_df() function explores a DataFrame using one of four methods (desc, head, info, na) or all of them together via method='all'. The output parameter controls whether the result is printed or returned, and method-specific parameters are passed through as keyword arguments (kwargs).

Code Breakdown:

Function Definition and Parameters:

    def explore_df(
            df: pd.DataFrame,
            method: str = 'all',
            output: str = 'print',
            **kwargs
    ) -> Optional[str]:

Purpose: The explore_df() function provides a high-level overview of a DataFrame using various methods (desc, head, info, na, all).
Parameters:
- df: The DataFrame to explore.
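- method: The exploration method to apply: 'desc', 'head', 'info', 'na', or 'all' (the default). Matching is case-insensitive.
- output: Whether the results are printed ('print', the default) or returned as a string ('return').
- kwargs: Method-specific keyword arguments forwarded to the underlying pandas calls, for example n for head or percentiles for describe.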

Error Handling:
Type Errors:

    if not isinstance(df, pd.DataFrame):
        raise TypeError("explore_df(): The df parameter must be a pandas DataFrame.")

    if not isinstance(method, str):
        raise TypeError("explore_df(): The method parameter must be a string.\nExample: method = 'all'")

    if not isinstance(output, str):
        raise TypeError("explore_df(): The output parameter must be a string.\nExample: output = 'print'")

Value Errors:

    if df.empty:
        raise ValueError("explore_df(): The input DataFrame is empty.")

    valid_methods = ['na', 'desc', 'head', 'info', 'all']
    if method.lower() not in valid_methods:
        raise ValueError(f"explore_df(): Invalid method '{method}'. Valid options are: {', '.join(valid_methods)}.")

    if output.lower() not in ['print', 'return']:
        raise ValueError("explore_df(): Invalid output method. Choose 'print' or 'return'.")

    if 'buf' in kwargs and method.lower() == 'info':
        raise ValueError("explore_df(): 'buf' parameter is not supported in the 'info' method within explore_df.")

Purpose: These checks reject invalid inputs up front with a descriptive message, so a bad argument fails at the explore_df() call rather than somewhere inside a pandas method.
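
The checks above can be exercised in isolation. The following is a minimal sketch, not the repository's code: validate_explore_args and outcome are hypothetical names introduced here for illustration, and the error messages are shortened.

```python
import pandas as pd


def validate_explore_args(df, method='all', output='print', **kwargs):
    """Hypothetical helper: the same checks as above, gathered in one place."""
    if not isinstance(df, pd.DataFrame):
        raise TypeError("explore_df(): The df parameter must be a pandas DataFrame.")
    if not isinstance(method, str):
        raise TypeError("explore_df(): The method parameter must be a string.")
    if not isinstance(output, str):
        raise TypeError("explore_df(): The output parameter must be a string.")
    if df.empty:
        raise ValueError("explore_df(): The input DataFrame is empty.")
    valid_methods = ['na', 'desc', 'head', 'info', 'all']
    if method.lower() not in valid_methods:
        raise ValueError(f"explore_df(): Invalid method '{method}'.")
    if output.lower() not in ['print', 'return']:
        raise ValueError("explore_df(): Invalid output method. Choose 'print' or 'return'.")
    if 'buf' in kwargs and method.lower() == 'info':
        raise ValueError("explore_df(): 'buf' parameter is not supported in the 'info' method.")


def outcome(*args, **kwargs):
    """Return 'ok', or the name of the exception the checks raise."""
    try:
        validate_explore_args(*args, **kwargs)
        return 'ok'
    except (TypeError, ValueError) as exc:
        return type(exc).__name__


df = pd.DataFrame({'a': [1, 2, 3]})
outcomes = {
    'valid call': outcome(df, method='HEAD'),  # matching is case-insensitive
    'list as df': outcome([1, 2, 3]),
    'empty frame': outcome(pd.DataFrame()),
    'unknown method': outcome(df, method='summary'),
    'buf with info': outcome(df, method='info', buf=None),
}
for label, name in outcomes.items():
    print(f"{label}: {name}")
```

Because the type checks run first, a non-DataFrame input is reported as a TypeError before df.empty is ever evaluated.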

Main Function Logic:
Desc Method:

    if method.lower() in ["desc", "all"]:
        desc_kwargs = filter_kwargs('describe', kwargs, valid_kwargs)
        result.append(f"<<______DESCRIBE______>>\n{str(df.describe(**desc_kwargs))}\n")
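
The excerpt calls filter_kwargs and valid_kwargs without showing them. One plausible reading is that valid_kwargs maps each pandas method name to the keyword names it accepts, and filter_kwargs keeps only the matching entries. The sketch below is written under that assumption; the mapping contents and the function body are illustrative, not taken from the repository.

```python
import pandas as pd

# Assumed shape: pandas method name -> keyword names that method accepts.
VALID_KWARGS = {
    'describe': ['percentiles', 'include', 'exclude'],
    'head': ['n'],
    'info': ['verbose', 'max_cols', 'memory_usage', 'show_counts'],
}


def filter_kwargs(method_name, kwargs, valid_kwargs):
    """Keep only the keyword arguments that the named pandas method accepts."""
    allowed = valid_kwargs.get(method_name, [])
    return {key: value for key, value in kwargs.items() if key in allowed}


df = pd.DataFrame({'a': [1, 2, 3, 4]})
user_kwargs = {'n': 3, 'percentiles': [0.05, 0.95]}

# Each pandas call only sees the arguments meant for it.
desc_kwargs = filter_kwargs('describe', user_kwargs, VALID_KWARGS)
head_kwargs = filter_kwargs('head', user_kwargs, VALID_KWARGS)

print(df.describe(**desc_kwargs))
print(df.head(**head_kwargs))
```

Filtering of this kind is what would let method='all' accept n and percentiles in the same call without df.head() receiving an unexpected percentiles argument.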

Head Method:

    if method.lower() in ["head", "all"]:
        head_kwargs = filter_kwargs('head', kwargs, valid_kwargs)
        pd.set_option('display.max_columns', None)
        result.append(f"<<______HEAD______>>\n{str(df.head(**head_kwargs))}\n")
        pd.reset_option('display.max_columns')
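
Note that pd.reset_option restores the pandas default for display.max_columns, not a value the caller may have set earlier, and it is skipped if df.head() raises. A possible alternative, shown here only as a sketch and not as the repository's approach, is pd.option_context, which restores the previous value when the block exits.

```python
import pandas as pd

# A wide frame, so the column limit actually matters.
df = pd.DataFrame({f'col_{i}': range(3) for i in range(30)})

# Simulate a caller who has customised the display setting.
pd.set_option('display.max_columns', 10)

# Inside the block every column is shown; on exit the caller's value returns.
with pd.option_context('display.max_columns', None):
    head_text = str(df.head(2))

restored = pd.get_option('display.max_columns')
print(head_text)
print(f"display.max_columns after the block: {restored}")
```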

Info Method:

    if method.lower() in ["info", "all"]:
        info_kwargs = filter_kwargs('info', kwargs, valid_kwargs)
        buffer = io.StringIO()
        df.info(buf=buffer, **info_kwargs)
        result.append(f"<<______INFO______>>\n{buffer.getvalue()}\n")
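
Unlike describe() and head(), df.info() writes its report to a stream and returns None, which is why the excerpt routes it through an io.StringIO buffer. This also suggests why a user-supplied buf is rejected earlier: explore_df() already passes its own. A small standalone illustration on made-up data:

```python
import io

import pandas as pd

df = pd.DataFrame({'a': [1.0, 2.0, None], 'b': ['x', 'y', 'z']})

buffer = io.StringIO()
returned = df.info(buf=buffer, verbose=True)  # writes into buffer, returns None
info_text = buffer.getvalue()

print(f"df.info() returned: {returned}")
print(info_text)
```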

NA Method:

    if method.lower() in ["na", "all"]:
        # Missing values per column, as a count and as a share of all rows
        na_count = df.isna().sum()
        na_percent = (na_count / df.shape[0]) * 100
        result.append(f"<<______NA_COUNT______>>\n{na_count}\n")
        result.append(f"<<______NA_PERCENT______>>\n{na_percent}\n")
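
To make the two NA sections concrete, here is the same calculation on a small made-up frame. The data is illustrative only.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1.0, None, 3.0, None],   # 2 of 4 values missing
    'b': ['x', 'y', None, 'w'],    # 1 of 4 values missing
})

na_count = df.isna().sum()                    # missing values per column
na_percent = (na_count / df.shape[0]) * 100   # share of rows, in percent

print(na_count)
print(na_percent)
```

The percentage is relative to the number of rows, so it is computed per column; df.isna().mean() * 100 gives the same figures in one step.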

Combine and Output:

    combined_result = "\n".join(result)

    if output.lower() == 'print':
        print(combined_result)
    elif output.lower() == 'return':
        return combined_result

Purpose: Each selected method (desc, head, info, na) appends a labelled section to result. The sections are joined with newlines and then either printed, in which case the function returns None, or returned as a single string, depending on the output parameter.
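
The excerpts do not show where result is created. Assuming it is a plain list of strings initialised before the method blocks, the collect, join and output flow can be sketched as follows. explore_sketch is a hypothetical, cut-down stand-in that handles only head and na, with no validation or kwargs.

```python
from typing import List, Optional

import pandas as pd


def explore_sketch(df: pd.DataFrame, method: str = 'all', output: str = 'print') -> Optional[str]:
    """Cut-down illustration of the collect, join and print-or-return flow."""
    result: List[str] = []  # assumption: sections are gathered in a list

    if method.lower() in ['head', 'all']:
        result.append(f"<<______HEAD______>>\n{df.head()}\n")
    if method.lower() in ['na', 'all']:
        result.append(f"<<______NA_COUNT______>>\n{df.isna().sum()}\n")

    combined_result = "\n".join(result)

    if output.lower() == 'print':
        print(combined_result)
        return None  # printing yields no return value
    return combined_result


df = pd.DataFrame({'a': [1.0, None]})
text = explore_sketch(df, method='all', output='return')
printed = explore_sketch(df, method='head', output='print')
```

This is why the signature is annotated Optional[str]: callers get a string only when output='return'.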

See the Full Function:

The full implementation can be found in the datasafari repository.


Implement new explore_df() methods: 'desc', 'head', 'info' & 'na' #23
