aws-samples / amazon-textract-textractor

Analyze documents with Amazon Textract and generate output in multiple formats.
Apache License 2.0

new feature: export_kv_to_pandas #405

Open Chuukwudi opened 1 week ago

Chuukwudi commented 1 week ago

Description of Changes: This PR adds export_kv_to_pandas, a function that converts a document's key-value pairs into a pandas DataFrame. I find myself needing this conversion quite often. The function imports pandas inside a guarded block and raises a MissingDependencyException when pandas is not installed, so a missing installation produces a clear error instead of a bare ImportError. It also exposes three options: a confidence threshold for filtering pairs, trimming of surrounding whitespace (and colons on keys), and optional inclusion of the confidence score.
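
The safe import check described above can be illustrated in isolation. This is a minimal sketch, not the library's code: `MissingDependencyError` and `require` are hypothetical names standing in for the library's own exception and for the inline `try`/`except` used in the function.

```python
import importlib


class MissingDependencyError(ImportError):
    """Hypothetical stand-in for the library's MissingDependencyException."""


def require(module_name: str, purpose: str):
    """Import a module by name, or raise a descriptive error if it is absent."""
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        # Chain the original error so the underlying cause stays visible.
        raise MissingDependencyError(
            f"{module_name} is required for {purpose}."
        ) from exc


# A module that exists is returned as usual.
json_module = require("json", "this demonstration")

# A module that does not exist produces the descriptive error.
try:
    require("surely_not_an_installed_module_xyz", "exporting key-value pairs")
except MissingDependencyError as err:
    message = str(err)
```

Deferring the import to call time means pandas only has to be installed for callers who actually use the export.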

Details of Changes:

Testing:

Example Code: Here’s the full function code added in this PR:

def export_kv_to_pandas(
    self, 
    trim: bool = True, 
    confidence_threshold: float = 0.0, 
    drop_confidence: bool = True
) -> 'DataFrame':
    """
    Converts key-value pairs with optional confidence filtering into a pandas DataFrame.

    :param trim: Flag to strip surrounding whitespace from key and value text, and surrounding colons from key text. Default is True.
    :type trim: bool
    :param confidence_threshold: Minimum confidence level required for a key-value pair to be included in the DataFrame. Default is 0.0.
    :type confidence_threshold: float
    :param drop_confidence: Flag to exclude the confidence column from the DataFrame. Default is True.
    :type drop_confidence: bool

    :return: A pandas DataFrame containing key-value pairs, and optionally their confidence scores.
             The DataFrame will have 'Key' and 'Value' columns, and a 'Confidence' column if `drop_confidence` is False.
    :rtype: pandas.DataFrame
    """
    try:
        from pandas import DataFrame
    except ImportError as exc:
        # Chain the original error so the underlying cause stays visible
        raise MissingDependencyException(
            "pandas library is required for exporting key-value pairs to DataFrame objects."
        ) from exc

    keys: list[str] = []
    values: list[str] = []
    confidences: Optional[list[float]] = [] if not drop_confidence else None

    # Keep only key-value pairs at or above the minimum confidence
    for row in self.key_values:
        if row.confidence >= confidence_threshold:
            # Strip whitespace first so a colon followed by a newline is still removed
            keys.append(row.key.text.strip().strip(": ") if trim else row.key.text)
            values.append(row.value.text.strip() if trim else row.value.text)
            if confidences is not None:
                confidences.append(row.confidence)

    data = {'Key': keys, 'Value': values}
    if confidences is not None:
        data['Confidence'] = confidences

    return DataFrame(data)
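
To see how the options interact without calling Amazon Textract, the filtering and trimming logic can be sketched on stand-in objects. Everything named here (`FakeText`, `FakeKeyValue`, `kv_columns`, the sample rows) is hypothetical and only mimics the `key.text`, `value.text`, and `confidence` attributes the function reads. The sketch assumes the threshold is an inclusive minimum and builds the column dictionary that would be passed to `DataFrame`, so it runs even when pandas is not installed.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class FakeText:
    """Hypothetical stand-in exposing only a `.text` attribute."""
    text: str


@dataclass
class FakeKeyValue:
    """Hypothetical stand-in for one detected key-value pair."""
    key: FakeText
    value: FakeText
    confidence: float


def kv_columns(
    key_values: List[FakeKeyValue],
    trim: bool = True,
    confidence_threshold: float = 0.0,
    drop_confidence: bool = True,
) -> Dict[str, list]:
    """Build the column dictionary that would back the DataFrame."""
    keys: List[str] = []
    values: List[str] = []
    confidences: List[float] = []
    for row in key_values:
        # Assumption: the threshold is treated as an inclusive minimum.
        if row.confidence >= confidence_threshold:
            keys.append(row.key.text.strip().strip(": ") if trim else row.key.text)
            values.append(row.value.text.strip() if trim else row.value.text)
            confidences.append(row.confidence)
    data: Dict[str, list] = {"Key": keys, "Value": values}
    if not drop_confidence:
        data["Confidence"] = confidences
    return data


# Made-up sample rows for illustration only.
rows = [
    FakeKeyValue(FakeText("Name: "), FakeText(" Jane Doe "), 0.98),
    FakeKeyValue(FakeText("Date:"), FakeText("2024-01-31"), 0.91),
    FakeKeyValue(FakeText("Notes"), FakeText("smudged"), 0.42),
]

columns = kv_columns(rows, confidence_threshold=0.5, drop_confidence=False)

try:
    from pandas import DataFrame
    print(DataFrame(columns))
except ImportError:
    # Without pandas, show the raw columns instead.
    print(columns)
```

With a threshold of 0.5 the low-confidence "Notes" row is dropped, and the remaining keys lose their trailing colons and padding.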

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution under the terms of your choice.