Open acsankar opened 13 hours ago
@acsankar I want to help you here, but I think we need a bit more context. Can you give us the full stacktrace?
Thanks for the reply. The full stack trace is below. I am using Colab and trying to access documents in a Google Cloud Storage bucket. As with python-docx, I first tried io.BytesIO, but Docling expects a .docx file, so I write the blob to a temp file and load that into Docling. This works for a few other documents but fails for one 200-page document, so I suspect the pictures in that document are the cause. Is there any way to skip images for the .docx format, the way Docling can suppress image reading when converting to a Docling document? Please let me know if this helps.
OSError Traceback (most recent call last)
18 frames
/usr/local/lib/python3.10/dist-packages/PIL/ImageFile.py in load(self)
    375         if loader is None:
    376             msg = f"cannot find loader for this {self.format} file"
--> 377             raise OSError(msg)
    378         image = loader.load(self)
    379         assert image is not None
OSError: cannot find loader for this WMF file
Below is the code:
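import tempfile

from google.cloud import storage
from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, WordFormatOption
from docling.pipeline.simple_pipeline import SimplePipeline

storage_client = storage.Client()

# file_path, bucket_name, source_folder and file_name are assumed to be set earlier in the notebook
gcs_path = file_path
bucket_name = bucket_name
blob_name = f"{source_folder}/{file_name}"

bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(blob_name)

print(bucket)
print(blob_name)

# Docling needs a real .docx path, so download the blob to a temp file first
temp_file = tempfile.NamedTemporaryFile(delete=False, suffix='.docx')
print(temp_file)
blob.download_to_filename(temp_file.name)

doc_converter = DocumentConverter(
    allowed_formats=[InputFormat.DOCX],
    format_options={
        InputFormat.DOCX: WordFormatOption(pipeline_cls=SimplePipeline),
    },
)

result = doc_converter.convert(temp_file.name)
markdown_file_np = result.document.export_to_markdown()
print(markdown_file_np)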
I am trying to use this in the cloud and just want to convert the document to Markdown without images. I assume the error below occurs when the document contains images. Any suggestions to fix this?
doc_converter = DocumentConverter(
    allowed_formats=[InputFormat.DOCX],
    format_options={
        InputFormat.DOCX: WordFormatOption(pipeline_cls=SimplePipeline),
    },
)
I am getting the error below:
---> 30 result = doc_converter.convert(temp_file.name)

18 frames
/usr/local/lib/python3.10/dist-packages/PIL/ImageFile.py in load(self)
    375         if loader is None:
    376             msg = f"cannot find loader for this {self.format} file"
--> 377             raise OSError(msg)
    378         image = loader.load(self)
    379         assert image is not None
OSError: cannot find loader for this WMF file
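One way to test the image hypothesis before converting, as a minimal sketch: a .docx file is a ZIP archive, and embedded pictures normally sit under word/media/, so listing the .wmf and .emf entries shows whether the failing document carries metafile images. Pillow documents WMF/EMF reading as available on Windows only, which would be consistent with this error on Colab (Linux). The helper name find_metafile_images is hypothetical and is not part of Docling or Pillow; it only inspects the archive and does not change the conversion.

```python
import zipfile

# Metafile extensions that Pillow can identify but, outside Windows,
# cannot load without an externally registered handler.
METAFILE_EXTENSIONS = (".wmf", ".emf")


def find_metafile_images(docx_source):
    """Return the sorted names of WMF/EMF parts under word/media/ in a .docx.

    docx_source may be a file path or a binary file-like object.
    (Hypothetical helper for diagnosis only.)
    """
    with zipfile.ZipFile(docx_source) as archive:
        return sorted(
            name
            for name in archive.namelist()
            if name.startswith("word/media/")
            and name.lower().endswith(METAFILE_EXTENSIONS)
        )
```

Calling find_metafile_images(temp_file.name) on the 200-page document and on one of the documents that converts cleanly should show whether metafile images are the difference between them.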