Closed by sixftninja 3 months ago
There is no option for sending the file contents as bytes to those partitioners.
You can send a path or a file-like object, like:
from unstructured.partition.docx import partition_docx

partition_docx(filename="document.docx")
# -- OR --
with open("document.docx", "rb") as f:
    partition_docx(file=f)
Note that in the second case you pass f, which is a file-like object (IO[bytes] type), not f.read(), which is bytes.
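The difference can be checked with the standard library alone. The sketch below is only an illustration: `data` is a stand-in payload rather than a real .docx file, and no partitioner is called. It shows that a bytes value has neither `read` nor `seek`, while an `io.BytesIO` wrapping the same bytes has both:

```python
import io

data = b"example payload"   # stand-in for f.read(); not a real .docx
stream = io.BytesIO(data)   # file-like wrapper around the same bytes

# A file-like object is expected to offer at least read() and seek().
bytes_is_file_like = hasattr(data, "read") and hasattr(data, "seek")
stream_is_file_like = hasattr(stream, "read") and hasattr(stream, "seek")

# Calling seek() on raw bytes fails with AttributeError.
try:
    data.seek(0)
    seek_error = None
except AttributeError as exc:
    seek_error = str(exc)

print(bytes_is_file_like, stream_is_file_like)
print(seek_error)
```

Any code that calls `seek` on its `file` argument will therefore fail when handed raw bytes, and work when handed the wrapped stream.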
If for whatever reason you prefer to work with the bytes in the file, you can wrap them with io.BytesIO:
import io
with open("document.docx", "rb") as f:
    file = io.BytesIO(f.read())
partition_docx(file=file)
Note that you need the keyword arguments (filename= or file=) to let the partitioner know which of these two file-source options you're choosing.
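If the calling code may receive a path, raw bytes, or an open file, one way to keep the choice of keyword in a single place is a small helper. This is a minimal sketch, not part of unstructured: `source_kwargs` and the `Source` alias are hypothetical names introduced here, and the only assumption about the partitioners is the `filename=` / `file=` keywords described above.

```python
import io
from pathlib import Path
from typing import IO, Union

# Hypothetical alias for the inputs this helper accepts.
Source = Union[str, Path, bytes, bytearray, IO[bytes]]


def source_kwargs(source: Source) -> dict:
    """Return {"filename": ...} or {"file": ...} for a partitioner call."""
    if isinstance(source, (str, Path)):
        # Paths are passed through as the filename keyword.
        return {"filename": str(source)}
    if isinstance(source, (bytes, bytearray)):
        # Raw bytes are wrapped so the result supports read() and seek().
        return {"file": io.BytesIO(bytes(source))}
    if hasattr(source, "read") and hasattr(source, "seek"):
        # Already file-like: pass it along unchanged.
        return {"file": source}
    raise TypeError(f"unsupported source type: {type(source).__name__}")
```

With unstructured installed, a call would then look like partition_docx(**source_kwargs(content)), where content is whatever the caller holds: a path string, bytes, or an open binary file.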
Thank you for the explanation!
When I provide a file path to partition .pdf, .docx, or .pptx files, everything works fine. However, when I do:
I get the following errors:
.pdf: local variable 'err' referenced before assignment
.docx, .pptx: 'bytes' object has no attribute 'seek'
To Reproduce
Expected behavior
The library should successfully partition .pdf, .docx, and .pptx files when provided as byte streams (file_content), similar to how it handles file paths.
Environment Info
I'm running unstructured open source in a Docker container.
OS version: Linux-6.6.22-linuxkit-aarch64-with-glibc2.36
Python version: 3.10.14
unstructured version: 0.14.6
unstructured-inference version: 0.7.35
pytesseract version: 0.3.10
Torch version: 2.3.1
Detectron2 is not installed
PaddleOCR is not installed
Additional info
I have not tested any other supported file extension, just these 3.