new feature: export_kv_to_pandas

Description of Changes: This PR adds export_kv_to_pandas, which converts a document's key-value pairs into a pandas DataFrame, with optional confidence filtering. I find myself needing this quite often. The function imports pandas inside its body and raises a MissingDependencyException with a clear message if pandas is not installed. It also adds configurable options for confidence filtering, for trimming surrounding whitespace (and colons on keys), and for including the confidence score in the output.

Details of Changes:

Safe Import Check:
- DataFrame is imported from pandas inside the function body. If pandas is not installed, a MissingDependencyException is raised with a clear message: "pandas library is required for exporting tables to DataFrame objects or markdown."
Functionality Enhancements:
- trim: Strips surrounding whitespace from keys and values, and leading or trailing colons from keys. Defaults to True.
- confidence_threshold: Only key-value pairs whose confidence is strictly greater than this value are included in the output DataFrame. Defaults to 0.0.
- drop_confidence: Omits the Confidence column from the output DataFrame when True (the default).
Documentation and Type Annotations:
- Detailed docstring explaining parameters, return type, and conditions under which the Confidence column is included.
- Type annotations for improved readability and maintainability.
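
To make the interaction of the three options concrete, here is a minimal sketch of the filtering and trimming logic on stand-in data. `KVRow` and `build_kv_columns` are hypothetical names used only for illustration and are not part of textractor. The sketch returns the column dict that the method would pass to `DataFrame`, so it runs without pandas installed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class KVRow:
    """Hypothetical stand-in for a key-value pair (not a textractor class)."""
    key: str
    value: str
    confidence: float


def build_kv_columns(
    rows: list[KVRow],
    trim: bool = True,
    confidence_threshold: float = 0.0,
    drop_confidence: bool = True,
) -> dict[str, list]:
    """Mirror the PR's filtering logic, returning columns instead of a DataFrame."""
    keys: list[str] = []
    values: list[str] = []
    confidences: Optional[list[float]] = None if drop_confidence else []

    for row in rows:
        # Strict comparison, as in the PR: a row exactly at the threshold is dropped.
        if row.confidence > confidence_threshold:
            keys.append(row.key.strip(": ").strip() if trim else row.key)
            values.append(row.value.strip() if trim else row.value)
            if confidences is not None:
                confidences.append(row.confidence)

    columns: dict[str, list] = {"Key": keys, "Value": values}
    if confidences is not None:
        columns["Confidence"] = confidences
    return columns


rows = [
    KVRow("Name: ", " Jane Doe ", 0.98),
    KVRow("Date:", "2024-01-01", 0.40),
]

# Keep only pairs above 0.5 and include the Confidence column.
filtered = build_kv_columns(rows, confidence_threshold=0.5, drop_confidence=False)
print(filtered)
```

With these stand-in rows, only the "Name" pair survives the 0.5 threshold, and the Confidence column appears because drop_confidence is False.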

Testing:

Verified the function with and without pandas installed to ensure that the MissingDependencyException is raised appropriately.
Tested various configurations of confidence_threshold, trim, and drop_confidence to confirm they yield expected DataFrame outputs.
Validated that the function only includes the Confidence column when drop_confidence is False.
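
The missing-pandas check can be reproduced without uninstalling anything. The sketch below is one possible approach, not the test code from this PR: it places a `None` entry in `sys.modules`, which makes the import fail as if pandas were absent. `MissingDependencyException` is redefined here as a stand-in so the snippet is self-contained, and `load_dataframe_class` and `raises_without_pandas` are hypothetical helper names.

```python
import sys
from unittest import mock


class MissingDependencyException(Exception):
    """Stand-in for the exception used in the PR, so the sketch is self-contained."""


def load_dataframe_class():
    """Return pandas.DataFrame, or raise the stand-in exception if pandas is absent."""
    try:
        from pandas import DataFrame
    except ImportError:
        raise MissingDependencyException(
            "pandas library is required for exporting tables to DataFrame objects or markdown."
        )
    return DataFrame


def raises_without_pandas() -> bool:
    """Simulate a missing pandas and report whether the exception is raised."""
    # A None entry in sys.modules makes `import pandas` raise ImportError,
    # whether or not pandas is actually installed.
    with mock.patch.dict(sys.modules, {"pandas": None}):
        try:
            load_dataframe_class()
        except MissingDependencyException:
            return True
    return False


print(raises_without_pandas())
```

`mock.patch.dict` restores `sys.modules` when the block exits, so later imports of pandas are unaffected.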

Example Code: Here’s the full function code added in this PR:

def export_kv_to_pandas(
    self, 
    trim: bool = True, 
    confidence_threshold: float = 0.0, 
    drop_confidence: bool = True
) -> 'DataFrame':
    """
    Converts key-value pairs with optional confidence filtering into a pandas DataFrame.

    :param trim: Flag to strip surrounding whitespace from key and value text, and leading or trailing colons from keys. Default is True.
    :type trim: bool
    :param confidence_threshold: Confidence a key-value pair must exceed (strictly greater than) to be included in the DataFrame. Default is 0.0.
    :type confidence_threshold: float
    :param drop_confidence: Flag to exclude the confidence column from the DataFrame. Default is True.
    :type drop_confidence: bool

    :return: A pandas DataFrame containing key-value pairs, and optionally their confidence scores.
             The DataFrame will have 'Key' and 'Value' columns, and a 'Confidence' column if `drop_confidence` is False.
    :rtype: pandas.DataFrame
    """
    try:
        from pandas import DataFrame
    except ImportError:
        raise MissingDependencyException(
            "pandas library is required for exporting tables to DataFrame objects or markdown."
        )

    keys: list[str] = []
    values: list[str] = []
    confidences: Optional[list[float]] = [] if not drop_confidence else None

    # Loop through key-values and filter by confidence threshold
    for row in self.key_values:
        if row.confidence > confidence_threshold:
            keys.append(row.key.text.strip(": ").strip() if trim else row.key.text)
            values.append(row.value.text.strip() if trim else row.value.text)
            if confidences is not None:
                confidences.append(row.confidence)

    data = {'Key': keys, 'Value': values}
    if confidences is not None:
        data['Confidence'] = confidences

    return DataFrame(data)

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution under the terms of your choice.

aws-samples / amazon-textract-textractor

new feature: export_kv_to_pandas #405