DS4SD / docling

Get your documents ready for gen AI
https://ds4sd.github.io/docling
MIT License

Using .DOCX format in cloud - suggestion on the below error? #410

Open acsankar opened 13 hours ago

acsankar commented 13 hours ago

I am trying to use Docling in the cloud to convert a .docx file to Markdown without images. I assume the error below occurs when the document contains images. Any suggestions on how to fix this?

doc_converter = DocumentConverter(
    allowed_formats=[InputFormat.DOCX],
    format_options={
        InputFormat.DOCX: WordFormatOption(pipeline_cls=SimplePipeline),
    },
)

I am getting the error below:

---> 30 result = doc_converter.convert(temp_file.name)

18 frames
/usr/local/lib/python3.10/dist-packages/PIL/ImageFile.py in load(self)
    375         if loader is None:
    376             msg = f"cannot find loader for this {self.format} file"
--> 377             raise OSError(msg)
    378         image = loader.load(self)
    379         assert image is not None

OSError: cannot find loader for this WMF file

PeterStaar-IBM commented 8 hours ago

@acsankar I want to help you here, but I think we need a bit more context. Can you give us the full stacktrace?

acsankar commented 7 hours ago

Thanks for the reply. The whole stack trace is below. I am using Colab and trying to access documents in a Google Cloud Storage bucket. As with python-docx, I first tried passing an in-memory byte stream (io.BytesIO), but Docling expects a .docx file, so I write the bytes to a temp file and load that in Docling. This works for a few other documents but fails for one 200-page document, so I suspect the pictures in that document are the cause. Is there any way to skip images for the .docx format, i.e. to suppress image reading while converting to a Docling document? Please let me know if this helps.

<tempfile._TemporaryFileWrapper object at 0x7bfe9c3b00d0> /tmp/tmpxytgsj20.docx

OSError                                   Traceback (most recent call last)
in <cell line: 46>()
     44
     45
---> 46 result = doc_converter.convert(temp_file.name)
     47 markdown_file_np = result.document.export_to_markdown()
     48 print(markdown_file_np)

18 frames
/usr/local/lib/python3.10/dist-packages/PIL/ImageFile.py in load(self)
    375         if loader is None:
    376             msg = f"cannot find loader for this {self.format} file"
--> 377             raise OSError(msg)
    378         image = loader.load(self)
    379         assert image is not None

OSError: cannot find loader for this WMF file
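
One way to test the guess that embedded images trigger this error: a .docx file is a ZIP archive, and embedded media normally sits under word/media/. The sketch below uses only the Python standard library to list entries with a .wmf or .emf extension. The function name find_metafile_images is hypothetical, and this is a diagnostic aid, not a Docling feature.

```python
import zipfile
from pathlib import PurePosixPath

# Assumption: Windows metafile images are the formats of interest here.
METAFILE_SUFFIXES = {".wmf", ".emf"}


def find_metafile_images(docx_path):
    """Return archive member names under word/media/ that look like WMF/EMF images.

    docx_path may be a filesystem path or a binary file-like object.
    """
    with zipfile.ZipFile(docx_path) as archive:
        return [
            name
            for name in archive.namelist()
            if name.startswith("word/media/")
            and PurePosixPath(name).suffix.lower() in METAFILE_SUFFIXES
        ]


# Hypothetical usage with the temp file from this thread:
# print(find_metafile_images("/tmp/tmpxytgsj20.docx"))
```

A non-empty result for the failing document would support the WMF hypothesis. As far as I know, Pillow can identify WMF and EMF files on any platform but can only load them on Windows, which would match the "cannot find loader" message on Colab.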

Below is the code:

# convert to markdown

import tempfile

from google.cloud import storage

from docling.datamodel.base_models import InputFormat
from docling.document_converter import DocumentConverter, WordFormatOption
from docling.pipeline.simple_pipeline import SimplePipeline

storage_client = storage.Client()

# file_path, bucket_name, source_folder and file_name are defined earlier in the notebook
gcs_path = file_path
blob_name = f"{source_folder}/{file_name}"

bucket = storage_client.bucket(bucket_name)
blob = bucket.blob(blob_name)

print(bucket)
print(blob_name)

# Docling is given a file path here, so download the blob to a temporary .docx file
temp_file = tempfile.NamedTemporaryFile(delete=False, suffix='.docx')
print(temp_file)
blob.download_to_filename(temp_file.name)

doc_converter = DocumentConverter(
    allowed_formats=[InputFormat.DOCX],
    format_options={
        InputFormat.DOCX: WordFormatOption(pipeline_cls=SimplePipeline),
    },
)

result = doc_converter.convert(temp_file.name)
markdown_file_np = result.document.export_to_markdown()
print(markdown_file_np)
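
The code above creates the temp file with delete=False and never removes it, so each run leaves a file behind in /tmp. A minimal sketch of one way to tidy this up, assuming the document bytes are already in memory: the helper temp_docx is hypothetical and uses only the standard library.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_docx(data: bytes):
    """Write bytes to a temporary .docx file, yield its path, then delete the file."""
    tmp = tempfile.NamedTemporaryFile(delete=False, suffix=".docx")
    try:
        tmp.write(data)
        tmp.close()  # flush and release the handle so other code can open the path
        yield tmp.name
    finally:
        tmp.close()  # harmless if the file is already closed
        os.unlink(tmp.name)


# Hypothetical usage with the objects from this thread:
# with temp_docx(blob.download_as_bytes()) as path:
#     result = doc_converter.convert(path)
```

This does not address the WMF error itself; it only keeps the temp-file step self-cleaning.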