DS4SD / docling

Get your documents ready for gen AI
https://ds4sd.github.io/docling
MIT License

invalid string: control character U+0018 #445

Open gadgetlabs opened 5 days ago

gadgetlabs commented 5 days ago

Bug

The following exception is raised by the call parsed_page = parser.parse_pdf_from_key_on_page(document_hash, page_no):

RuntimeError: [json.exception.parse_error.101] parse error at line 1246, column 42: syntax error while parsing value - invalid string: control character U+0018 (CAN) must be escaped to \u0018; last read: '"- <U+0018>'

I think the issue occurs because docling-parse does not handle empty values when producing JSON.
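For context, the error message reflects a general JSON rule: control characters U+0000 through U+001F must be escaped inside strings, so a raw U+0018 byte makes the document invalid. The sketch below illustrates that rule with Python's standard json module. It does not reproduce docling-parse's own serializer, and the "text" key is only an illustrative assumption.

```python
import json

CAN = "\x18"  # U+0018 (CANCEL), the control character named in the error

# A JSON document whose string value contains the raw, unescaped control character.
raw_document = '{"text": "- ' + CAN + '"}'


def parses(document: str) -> bool:
    """Return True if Python's strict JSON parser accepts the document."""
    try:
        json.loads(document)
        return True
    except json.JSONDecodeError:
        return False


# The raw control character is rejected by a strict parser.
raw_ok = parses(raw_document)

# json.dumps escapes control characters, so its output parses and round-trips.
escaped_document = json.dumps({"text": "- " + CAN})
escaped_ok = parses(escaped_document)
round_trip = json.loads(escaped_document)["text"]

print(raw_ok, escaped_ok, escaped_document)
```

If the JSON were produced with escaping like this, the parser on the reading side would accept it and recover the original character.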

An example PDF can be found here (also attached): https://etc.usf.edu/lit2go/pdf/passage/348/the-adventures-of-sherlock-holmes-004-adventure-4-the-boscombe-valley-mystery.pdf

the-adventures-of-sherlock-holmes-004-adventure-4-the-boscombe-valley-mystery.pdf

It is unclear to me whether the parser should fail gracefully or ignore the offending character, so I am reporting this as a bug rather than submitting a fix.
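To make the two options concrete, here is a minimal sketch of both policies applied to extracted text before it is serialized. The function name clean_text and the policy names "strip" and "fail" are hypothetical and are not part of docling or docling-parse.

```python
import re

# Control characters in U+0000..U+001F, except tab, newline, and carriage return.
_CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")


def clean_text(text: str, policy: str = "strip") -> str:
    """Apply a hypothetical policy to control characters in extracted text.

    "strip" drops them silently; "fail" raises a descriptive ValueError.
    """
    match = _CONTROL_CHARS.search(text)
    if match is None:
        return text  # nothing to do
    if policy == "fail":
        code = ord(match.group())
        raise ValueError(
            f"control character U+{code:04X} at offset {match.start()}"
        )
    if policy == "strip":
        return _CONTROL_CHARS.sub("", text)
    raise ValueError(f"unknown policy: {policy!r}")
```

With "strip", the string from the error message ("- " followed by U+0018) becomes "- ". With "fail", the caller gets an error that names the character and its position instead of a parse error further downstream.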

cau-git commented 4 days ago

@gadgetlabs I can reproduce this error with the default docling settings. However, the document converts successfully when I switch to the docling-parse-v2 backend:

docling --pdf-backend=dlparse_v2 the-adventures-of-sherlock-holmes-004-adventure-4-the-boscombe-valley-mystery.pdf
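If the workaround needs to be applied to several files, the command above can be scripted. This is only a sketch: it assumes the docling CLI is installed and on PATH, and the helper names docling_command and convert are hypothetical.

```python
import subprocess
from typing import List


def docling_command(pdf_path: str, backend: str = "dlparse_v2") -> List[str]:
    """Build the argument list for the docling CLI call shown above."""
    return ["docling", f"--pdf-backend={backend}", pdf_path]


def convert(pdf_path: str) -> int:
    """Run the conversion and return the CLI's exit code."""
    return subprocess.run(docling_command(pdf_path), check=False).returncode
```

Calling convert on each affected PDF runs the same command as above and returns the exit code, so a failing file can be logged without stopping the rest of the batch.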