DS4SD / docling

Get your documents ready for gen AI
https://ds4sd.github.io/docling
MIT License

Add timeout limit to document parsing job. #270

Open PeterStaar-IBM opened 2 weeks ago

PeterStaar-IBM commented 2 weeks ago

Requested feature

We need a way to set a timeout parameter when processing a document. In very rare cases, certain documents take a very long time to convert. In a batch processing job, this can become a problem.

Example use case:

temp.pdf

cau-git commented 2 weeks ago

Having checked the attached PDF, the very long conversion time is not a surprise. The document is fully scanned and has a lot of pages, so it needs OCR throughout, which is very slow, at least on CPU.

Generally, there are multiple strategies to avoid such samples clogging a bulk conversion pipeline.

  1. Run over all docs with OCR off, then rerun only those docs whose conversion result is empty (i.e. they may need OCR). This is already possible with the current version.
  2. We can extend docling to optionally stop converting a doc when a timeout is reached. The timeout can only be checked after each page batch completes (i.e. after every multiple of 4 pages with the defaults). A timed-out conversion would be reported with the status PARTIAL_SUCCESS. User code could then either export the partial result or drop the document.
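
The first strategy can be sketched as a two-pass loop. This is a minimal illustration, not docling code: `two_pass_convert` and the `convert(path, ocr)` callable are hypothetical names standing in for whatever wraps the real conversion with OCR switched off or on, and "empty result" is assumed to mean whitespace-only text.

```python
from typing import Callable, Dict, Iterable, List

# Hypothetical stand-in: convert(path, ocr) returns the extracted text.
# In real code this would wrap a docling conversion with OCR off or on.
ConvertFn = Callable[[str, bool], str]


def two_pass_convert(paths: Iterable[str], convert: ConvertFn) -> Dict[str, str]:
    """Convert everything without OCR, then rerun only the empty results with OCR."""
    results: Dict[str, str] = {}
    needs_ocr: List[str] = []

    # Pass 1: fast path with OCR disabled.
    for path in paths:
        text = convert(path, False)
        if text.strip():
            results[path] = text
        else:
            # Empty output: the document is probably scanned.
            needs_ocr.append(path)

    # Pass 2: slow path, only for documents that produced nothing.
    for path in needs_ocr:
        results[path] = convert(path, True)

    return results
```

With this split, the expensive OCR pass only touches the documents that actually need it, and it can be scheduled separately from the bulk job.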

ab-shrek commented 1 week ago

I am interested in this issue. Can you please assign this to me? Thanks :)

ab-shrek commented 1 week ago

Are you working on this, @nikos-livathinos?

nikos-livathinos commented 1 week ago

@ab-shrek great to see you are interested in helping out on this issue. Please submit a PR for our review. Here are some hints:

  1. Introduce a new parameter (e.g. pdf_document_timeout) in PdfPipelineOptions (https://github.com/DS4SD/docling/blob/c6b3763ecb6ef862840a30978ee177b907f86505/docling/datamodel/pipeline_options.py#L71)
  2. Implement the timeout logic in the PaginatedPipeline._build_document() (https://github.com/DS4SD/docling/blob/c6b3763ecb6ef862840a30978ee177b907f86505/docling/pipeline/base_pipeline.py#L118)
    • The timeout should apply to the PDF pipeline and cover the time needed to convert the entire document.
    • We should check for a timeout after the conversion of each page chunk (the check measures the time spent on the whole document, not only on the current page chunk).
    • When a timeout happens, the loop exits and conv_res.status should be set to ConversionStatus.PARTIAL_SUCCESS.
  3. Extend the docling CLI (https://github.com/DS4SD/docling/blob/main/docling/cli/main.py) to expose a cmd argument (e.g. --document-timeout) that sets the pdf_document_timeout inside the PdfPipelineOptions.
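
To make the three hints concrete, here is a rough, self-contained sketch. It does not use docling itself: `PdfPipelineOptions` and `ConversionStatus` below are simplified stand-ins that only borrow the names from the hints, `build_document`, `process_batch`, `clock` and `options_from_cli` are hypothetical, and `argparse` is used purely for illustration rather than to mirror how the real CLI is built. The sketch also assumes that a timeout noticed after the last chunk still counts as a full success, since nothing was skipped.

```python
import argparse
import time
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List, Optional, Sequence, Tuple


class ConversionStatus(Enum):  # stand-in for the enum named in the hints
    SUCCESS = "success"
    PARTIAL_SUCCESS = "partial_success"


@dataclass
class PdfPipelineOptions:  # stand-in; only the proposed field is modelled
    pdf_document_timeout: Optional[float] = None  # seconds; None disables the limit


def build_document(
    pages: Sequence[int],
    process_batch: Callable[[Sequence[int]], List[str]],
    options: PdfPipelineOptions,
    page_batch_size: int = 4,
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[ConversionStatus, List[str]]:
    """Convert pages chunk by chunk; stop once the document-level budget is spent."""
    start = clock()  # the budget covers the whole document, not a single chunk
    converted: List[str] = []
    for i in range(0, len(pages), page_batch_size):
        converted.extend(process_batch(pages[i : i + page_batch_size]))
        # Checked only between chunks, so one slow chunk can overshoot the limit.
        elapsed = clock() - start
        timed_out = (
            options.pdf_document_timeout is not None
            and elapsed > options.pdf_document_timeout
        )
        pages_left = i + page_batch_size < len(pages)
        if timed_out and pages_left:
            return ConversionStatus.PARTIAL_SUCCESS, converted
    return ConversionStatus.SUCCESS, converted


def options_from_cli(argv: Sequence[str]) -> PdfPipelineOptions:
    """Map a hypothetical --document-timeout flag onto the options object."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--document-timeout", type=float, default=None)
    args = parser.parse_args(list(argv))
    return PdfPipelineOptions(pdf_document_timeout=args.document_timeout)
```

The injectable `clock` is only there so the timeout path can be tested without sleeping. Callers that receive `PARTIAL_SUCCESS` can decide whether to export the pages converted so far or drop the document.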

ab-shrek commented 1 week ago

Great; thanks @nikos-livathinos. Let me get on this asap :)