Unstructured-IO / unstructured

Open source libraries and APIs to build custom preprocessing pipelines for labeling, training, or production machine learning pipelines.
https://www.unstructured.io/
Apache License 2.0

bug/windows reopen temp file (pdf hi_res) #3076

Open KristianMischke opened 2 months ago

KristianMischke commented 2 months ago

Describe the bug This is the same issue as https://github.com/Unstructured-IO/unstructured-inference/issues/303, but I couldn't find an equivalent ticket on this project. On Windows, a temp file cannot be reopened by name while it is still open within the scope of the NamedTemporaryFile() context.

At this line: https://github.com/Unstructured-IO/unstructured/blob/d3a404cfb541dae8e16956f096bac99fc05c985b/unstructured/partition/pdf_image/ocr.py#L79

a temp file is created and its name is passed as the filename to process_file_with_ocr -> pdf2image.convert_from_path, which then invokes pdfinfo on the temp file, yielding an error like:

pdf2image.exceptions.PDFPageCountError: Unable to get page count.
I/O Error: Couldn't open file <temp file path here>: No error.

To Reproduce On Windows

Note: the issue outlined in https://github.com/Unstructured-IO/unstructured-inference/issues/303 will occur first, but once that is fixed (e.g. by applying https://github.com/Unstructured-IO/unstructured-inference/pull/323), the OCR code will error as described above.

import tempfile

# print operating system name
import os
print(os.name)

# Create a temporary file
with tempfile.NamedTemporaryFile() as tmp_file:
    # Write some data to the file
    tmp_file.write(b'Hello, world!')
    tmp_file.flush()  # Flush the buffer to make sure data is written

    # Get the name of the file
    file_name = tmp_file.name

    # Reopen the file by name while the NamedTemporaryFile handle is still open.
    # This works on Unix but fails on Windows.
    with open(file_name, 'r') as file:
        # Read the data from the file
        content = file.read()
        print("Content of the temp file:", content)

Expected behavior The code should not error, and temp files should be supported on Windows.
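
For illustration only, here is a minimal sketch of one common workaround pattern; it is not the fix adopted by this project. It assumes the temp file can be closed before it is reopened by name. The helper name read_via_closed_temp_file is hypothetical. Passing delete=False keeps the file on disk after close(), so cleanup has to be done manually.

```python
import os
import tempfile


def read_via_closed_temp_file(data: bytes) -> bytes:
    """Hypothetical helper: write data to a temp file, close it, reopen it by name."""
    # delete=False keeps the file on disk after close(), so a second open
    # (or an external tool given the path) is not blocked by the first handle.
    tmp_file = tempfile.NamedTemporaryFile(delete=False)
    try:
        tmp_file.write(data)
        tmp_file.close()  # release the handle before reopening the path
        with open(tmp_file.name, "rb") as reopened:
            return reopened.read()
    finally:
        tmp_file.close()  # no-op if the file is already closed
        os.unlink(tmp_file.name)  # manual cleanup, since delete=False


print(read_via_closed_temp_file(b"Hello, world!"))
```

The trade-off is that the caller owns cleanup, hence the try/finally. Creating the file inside a tempfile.TemporaryDirectory() is another option that leaves deletion to the context manager.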

MthwRobinson commented 2 months ago

Thanks @KristianMischke! We'll take a look at this.