Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
Apache License 2.0

partition pdf, doc and pptx doesn't work for file bytes #3244

Closed sixftninja closed 3 months ago

sixftninja commented 3 months ago

When I provide a file path to partition_pdf, partition_docx, or partition_pptx, everything works fine. However, when I read the file and pass its contents as bytes:

with open(file_path, 'rb') as f:
    file_content = f.read()

I get the following errors:

.pdf: local variable 'err' referenced before assignment
.docx, .pptx: 'bytes' object has no attribute 'seek'

To Reproduce

import os
import json
from tqdm import tqdm
from unstructured.partition.pdf import partition_pdf
from unstructured.partition.pptx import partition_pptx
from unstructured.partition.docx import partition_docx

partition_functions = {
    '.pdf': partition_pdf,
    '.pptx': partition_pptx,
    '.docx': partition_docx,
    # Add other partition functions here
}

params_dict = {
    '.pdf': {'include_page_breaks': True, 'strategy': 'hi_res', 'infer_table_structure': True, 'include_metadata': True, 'hi_res_model_name': 'detectron2_onnx'},
    '.pptx': {'date_from_file_object': False, 'detect_language_per_element': False, 'include_page_breaks': True, 'include_slide_notes': None, 'infer_table_structure': False, 'languages': ["auto"], 'metadata_filename': None, 'metadata_last_modified': None, 'starting_page_number': 1, 'strategy': 'fast'},
    '.docx': {'date_from_file_object': False, 'detect_language_per_element': False, 'include_page_breaks': True, 'infer_table_structure': True, 'languages': ["auto"], 'metadata_filename': None, 'metadata_last_modified': None, 'starting_page_number': 1, 'strategy': None},
    # Add other params here
}

def partition_document(file_content, file_metadata):
    file_extension = os.path.splitext(file_metadata['name'])[1].lower()
    try:
        partition_func = partition_functions[file_extension]
        params = params_dict[file_extension]
        elements = partition_func(file_content, **params)
        elements_dict = [element.to_dict() for element in elements]
        for element in elements_dict:
            if 'orig_elements' in element['metadata']:
                del element['metadata']['orig_elements']
        return True, elements_dict
    except Exception as e:
        return False, []

if __name__ == "__main__":
    file_directory = '/path/to/example_docs'
    output_file_path = '/path/to/partition_test.txt'

    all_files = [f for f in os.listdir(file_directory) if any(f.endswith(ext) for ext in partition_functions)]
    results = []

    for file in tqdm(all_files, desc="Processing files"):
        file_path = os.path.join(file_directory, file)
        with open(file_path, 'rb') as f:
            file_content = f.read()  # Read file content as bytes

        file_metadata = {'name': file}

        success, elements = partition_document(file_content, file_metadata)

        if success:
            content = f"Results for {file}:\n"
            for element in elements:
                content += json.dumps(element, indent=2) + "\n"
            results.append(content)
        else:
            results.append(f"Error processing {file}: Partitioning failed\n")

    with open(output_file_path, 'w') as f:
        for result in results:
            f.write(result)

    print("Finished processing files.")

Expected behavior

The library should successfully partition .pdf, .docx, and .pptx files when provided as byte streams (file_content), similar to how it handles file paths.

Environment Info

I'm running unstructured open source in a docker container.

OS version: Linux-6.6.22-linuxkit-aarch64-with-glibc2.36
Python version: 3.10.14
unstructured version: 0.14.6
unstructured-inference version: 0.7.35
pytesseract version: 0.3.10
Torch version: 2.3.1
Detectron2 is not installed
PaddleOCR is not installed

Additional info

I have not tested any other supported file extensions, only these three.

scanny commented 3 months ago

There is no option for sending the file contents as bytes to those partitioners.

You can send a path or a file-like object, like:

elements = partition_docx(filename="document.docx")
# -- OR --
with open("document.docx", "rb") as f:
    elements = partition_docx(file=f)
Note that in the second case you use f, which is a file-like object (IO[bytes] type), not f.read() which is bytes.
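The distinction between bytes and a file-like object can be checked with the standard library alone. The following is a minimal sketch, not library code: the payload is made up, and it only illustrates why a 'seek' error would be raised on a bytes value.

```python
import io

data = b"example payload"      # what f.read() returns: a plain bytes value
stream = io.BytesIO(data)      # an in-memory file-like object over the same bytes

# bytes has no file API, which is what the 'seek' error points at.
print(hasattr(data, "seek"))    # False
print(hasattr(stream, "seek"))  # True

# A file-like object can be read, rewound, and read again.
stream.read()
stream.seek(0)
print(stream.read() == data)    # True
```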

If for whatever reason you prefer to work with the bytes in the file you can wrap those with io.BytesIO:

import io

with open("document.docx", "rb") as f:
    file = io.BytesIO(f.read())

elements = partition_docx(file=file)

Note that you need the keyword arguments (filename= or file=) to let the partitioner know which of these two file-source options you're choosing.
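For code that already holds the document as bytes, as in the reproduction script, the advice above could be folded into a small wrapper. This is a sketch under assumptions: partition_bytes and fake_partition are hypothetical names introduced here, and the wrapper assumes the partitioner it is given accepts a file= keyword as described in this thread.

```python
import io
from typing import Any, Callable, Union


def partition_bytes(partition_func: Callable[..., Any],
                    content: Union[bytes, bytearray],
                    **params: Any) -> Any:
    """Wrap raw bytes in BytesIO and pass them via the file= keyword."""
    if not isinstance(content, (bytes, bytearray)):
        raise TypeError(f"expected bytes, got {type(content).__name__}")
    return partition_func(file=io.BytesIO(content), **params)


# Hypothetical stand-in partitioner so the sketch runs without unstructured installed.
def fake_partition(filename=None, file=None, **kwargs):
    return [file.read().decode("utf-8")]


print(partition_bytes(fake_partition, b"hello"))  # ['hello']
```

With the real library, the same call shape would presumably be partition_bytes(partition_docx, file_content, **params), which replaces the positional partition_func(file_content, **params) call in the reproduction script.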

sixftninja commented 3 months ago

Thank you for the explanation!