Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

partition pdf, doc and pptx doesn't work for file bytes #3244

Closed by sixftninja 3 months ago

sixftninja commented 3 months ago

When I provide a file path to partition_pdf, partition_docx, or partition_pptx, everything works fine. However, when I read the file into bytes and pass those to the partitioner instead:

with open(file_path, 'rb') as f:
    file_content = f.read()

I get the following errors:

.pdf: local variable 'err' referenced before assignment
.docx, .pptx: 'bytes' object has no attribute 'seek'

To Reproduce

import os
import json
from tqdm import tqdm
from unstructured.partition.pdf import partition_pdf
from unstructured.partition.pptx import partition_pptx
from unstructured.partition.docx import partition_docx

partition_functions = {
    '.pdf': partition_pdf,
    '.pptx': partition_pptx,
    '.docx': partition_docx,
    # Add other partition functions here
}

params_dict = {
    '.pdf': {'include_page_breaks': True, 'strategy': 'hi_res', 'infer_table_structure': True, 'include_metadata': True, 'hi_res_model_name': 'detectron2_onnx'},
    '.pptx': {'date_from_file_object': False, 'detect_language_per_element': False, 'include_page_breaks': True, 'include_slide_notes': None, 'infer_table_structure': False, 'languages': ["auto"], 'metadata_filename': None, 'metadata_last_modified': None, 'starting_page_number': 1, 'strategy': 'fast'},
    '.docx': {'date_from_file_object': False, 'detect_language_per_element': False, 'include_page_breaks': True, 'infer_table_structure': True, 'languages': ["auto"], 'metadata_filename': None, 'metadata_last_modified': None, 'starting_page_number': 1, 'strategy': None},
    # Add other params here
}

def partition_document(file_content, file_metadata):
    file_extension = os.path.splitext(file_metadata['name'])[1].lower()
    try:
        partition_func = partition_functions[file_extension]
        params = params_dict[file_extension]
        elements = partition_func(file_content, **params)
        elements_dict = [element.to_dict() for element in elements]
        for element in elements_dict:
            if 'orig_elements' in element['metadata']:
                del element['metadata']['orig_elements']
        return True, elements_dict
    except Exception as e:
        return False, []

if __name__ == "__main__":
    file_directory = '/path/to/example_docs'
    output_file_path = '/path/to/partition_test.txt'

    all_files = [f for f in os.listdir(file_directory) if any(f.endswith(ext) for ext in partition_functions)]
    results = []

    for file in tqdm(all_files, desc="Processing files"):
        file_path = os.path.join(file_directory, file)
        with open(file_path, 'rb') as f:
            file_content = f.read()  # Read file content as bytes

        file_metadata = {'name': file}

        success, elements = partition_document(file_content, file_metadata)

        if success:
            content = f"Results for {file}:\n"
            for element in elements:
                content += json.dumps(element, indent=2) + "\n"
            results.append(content)
        else:
            results.append(f"Error processing {file}: Partitioning failed\n")

    with open(output_file_path, 'w') as f:
        for result in results:
            f.write(result)
            f.write("\n")

    print("Finished processing files.")

Expected behavior

The library should successfully partition .pdf, .docx, and .pptx files when provided as byte streams (file_content), similar to how it handles file paths.

Environment Info

I'm running unstructured open source in a Docker container.

OS version: Linux-6.6.22-linuxkit-aarch64-with-glibc2.36
Python version: 3.10.14
unstructured version: 0.14.6
unstructured-inference version: 0.7.35
pytesseract version: 0.3.10
Torch version: 2.3.1
Detectron2 is not installed
PaddleOCR is not installed

Additional info

I have not tested any other supported file extension, just these 3.

scanny commented 3 months ago

There is no option for sending the file contents as bytes to those partitioners.

You can send a path or a file-like object, like:

partition_docx(filename="document.docx")
# -- OR --
with open("document.docx", "rb") as f:
    partition_docx(file=f)

Note that in the second case you pass f, which is a file-like object (IO[bytes] type), not f.read(), which is bytes.
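To see where the 'seek' error for .docx and .pptx comes from, it can help to compare the two types directly. The following is a minimal sketch using only the standard library; it does not call unstructured, and the payload is a made-up placeholder rather than a real document.

```python
import io

# Hypothetical payload standing in for the result of f.read().
data = b"placeholder bytes, not a real .docx"

# bytes is an immutable sequence and has no file methods.
print(hasattr(data, "seek"))    # False
print(hasattr(data, "read"))    # False

# io.BytesIO wraps the same bytes in a seekable, readable file-like object.
stream = io.BytesIO(data)
print(hasattr(stream, "seek"))  # True
print(stream.read() == data)    # True

# Rewind so that a later consumer reads from the start again.
stream.seek(0)
```

Any code that expects a file-like object and calls .seek() on it will fail on plain bytes in just this way.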

If, for whatever reason, you prefer to work with the bytes in the file, you can wrap them with io.BytesIO:

import io

with open("document.docx", "rb") as f:
    file = io.BytesIO(f.read())

partition_docx(file=file)

Note you need the keyword arguments to let the partitioner know which of these two file-source options you're choosing.
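Applied to the reproduction script above, the change could look roughly like the sketch below. partition_bytes is a hypothetical helper name, not part of unstructured; it assumes only that the partitioner accepts a file= keyword argument, as described above.

```python
import io
from typing import Any, Callable


def partition_bytes(partition_func: Callable[..., Any], data: bytes, **params: Any) -> Any:
    """Wrap raw bytes in a file-like object and pass it via the `file` keyword.

    Hypothetical helper for illustration; `partition_func` is any callable
    that takes a `file=` keyword argument (e.g. partition_docx).
    """
    return partition_func(file=io.BytesIO(data), **params)


# In the reproduction script, the call
#     elements = partition_func(file_content, **params)
# would become
#     elements = partition_bytes(partition_func, file_content, **params)
```

Passing the partitioner in as an argument keeps the helper independent of any one file type, so the same function can serve the .pdf, .docx, and .pptx entries in partition_functions.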

sixftninja commented 3 months ago

Thank you for the explanation!