PeterStaar-IBM opened 2 weeks ago
Checking the attached PDF, the very long conversion time is no surprise: it is fully scanned and has many pages, which is slow on CPU at least.
Generally, there are multiple strategies to avoid such samples clogging a bulk conversion pipeline. One is a per-document timeout after which the conversion returns with status `PARTIAL_SUCCESS`. User code could either export the partial result or drop the document.

I am interested in this issue. Can you please assign this to me? Thanks :)
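One such bulk-pipeline strategy can be sketched with a deadline around each conversion. This is a minimal illustration with a hypothetical `convert` stand-in; docling itself is not invoked here:

```python
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FuturesTimeout

def convert(path):
    # Stand-in for the real conversion call (hypothetical).
    return f"converted:{path}"

def convert_with_deadline(pool, path, timeout_s):
    """Wait at most timeout_s seconds for one document's conversion.

    Note: on timeout the worker thread keeps running in the background;
    reclaiming the CPU as well would require a separate process that can
    be terminated.
    """
    future = pool.submit(convert, path)
    try:
        return future.result(timeout=timeout_s)
    except FuturesTimeout:
        return None  # drop the document, or export a partial result

with ThreadPoolExecutor(max_workers=4) as pool:
    results = [convert_with_deadline(pool, p, 60.0) for p in ["a.pdf", "b.pdf"]]
```

A timeout built into the pipeline itself, as discussed below, is cleaner because it can return a well-defined partial result instead of abandoning a worker.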
Are you working on this, @nikos-livathinos?
@ab-shrek great to see you are interested in helping out on this issue. Please submit a PR for our review. Here are some hints:

- Introduce a new parameter (e.g. `pdf_document_timeout`) in `PdfPipelineOptions` (https://github.com/DS4SD/docling/blob/c6b3763ecb6ef862840a30978ee177b907f86505/docling/datamodel/pipeline_options.py#L71).
- In `PaginatedPipeline._build_document()` (https://github.com/DS4SD/docling/blob/c6b3763ecb6ef862840a30978ee177b907f86505/docling/pipeline/base_pipeline.py#L118), check the elapsed time; when the timeout is exceeded, `conv_res.status` should be set to `ConversionStatus.PARTIAL_SUCCESS`.
- Add a CLI option (e.g. `--document-timeout`) that sets the `pdf_document_timeout` inside the `PdfPipelineOptions`.

Great; thanks @nikos-livathinos. Let me get on this asap :)
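The hints above could be sketched roughly as follows. This is a self-contained toy, not docling's actual code: `ConversionStatus`, `PdfPipelineOptions`, and `build_document` are simplified stand-ins for the real classes named in the hints.

```python
import time
from dataclasses import dataclass
from enum import Enum, auto
from typing import Optional

class ConversionStatus(Enum):
    SUCCESS = auto()
    PARTIAL_SUCCESS = auto()

@dataclass
class PdfPipelineOptions:
    # Hypothetical option from the hints above; None disables the timeout.
    pdf_document_timeout: Optional[float] = None

def build_document(pages, options):
    """Toy analogue of PaginatedPipeline._build_document(): parse pages
    one by one and stop early once the document timeout is exceeded."""
    start = time.monotonic()
    parsed, status = [], ConversionStatus.SUCCESS
    for page in pages:
        parsed.append(f"parsed:{page}")  # stand-in for real page parsing
        timeout = options.pdf_document_timeout
        if timeout is not None and time.monotonic() - start > timeout:
            status = ConversionStatus.PARTIAL_SUCCESS  # keep partial result
            break
    return parsed, status
```

Checking the clock once per page (or per page batch) keeps the overhead negligible while still bounding how far past the deadline a conversion can run.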
Requested feature
We need a way to set a timeout when processing a document. Currently, in very rare cases, certain documents take very long to convert. In a batch processing job, this might become problematic.
Example use case:
temp.pdf
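As a rough illustration of how such a flag might look on the command line (hypothetical parser, not docling's actual CLI; the flag name follows the discussion above):

```python
import argparse

def parse_args(argv=None):
    # Illustrative parser only.
    parser = argparse.ArgumentParser(prog="docling")
    parser.add_argument("source", help="input document")
    parser.add_argument(
        "--document-timeout", type=float, default=None,
        help="abort converting a single document after this many seconds "
             "and keep the partial result",
    )
    return parser.parse_args(argv)

args = parse_args(["temp.pdf", "--document-timeout", "120"])
```

The parsed value would then be copied into the pipeline options before conversion starts.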