Open gadgetlabs opened 5 days ago
@gadgetlabs I can reproduce this error with the default docling settings.
However, I can convert it successfully by switching to the docling-parse-v2
backend:
docling --pdf-backend=dlparse_v2 the-adventures-of-sherlock-holmes-004-adventure-4-the-boscombe-valley-mystery.pdf
Bug
The following exception is raised: RuntimeError: [json.exception.parse_error.101] parse error at line 1246, column 42: syntax error while parsing value - invalid string: control character U+0018 (CAN) must be escaped to \u0018; last read: '"- <U+0018>'. It comes from the call parsed_page = parser.parse_pdf_from_key_on_page(document_hash, page_no).
I think the issue occurs because docling-parse does not handle empty values when producing JSON.
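The error message points at a raw control character (U+0018) inside a JSON string. The following is a minimal sketch using only Python's standard json module, not docling-parse's own code, that reproduces this class of failure and shows that escaping the character avoids it. The sample string and variable names are illustrative assumptions.

```python
import json

# U+0018 (CAN) is the control character named in the error message.
raw_text = "- \u0018"

# Building the JSON by hand leaves the control character unescaped.
broken = '{"text": "' + raw_text + '"}'

try:
    json.loads(broken)
    strict_parse_ok = True
except json.JSONDecodeError as exc:
    # A strict parser rejects raw control characters inside strings.
    strict_parse_ok = False
    print("strict parse failed:", exc)

# A proper serializer escapes the character as \u0018, which round-trips.
escaped = json.dumps({"text": raw_text})
round_trip = json.loads(escaped)

# Python's parser can also be told to tolerate raw control characters.
lenient = json.loads(broken, strict=False)
```

Whether the fix belongs in the code that writes the JSON (escape on output) or in the code that reads it (lenient parsing) is a design question this sketch does not settle.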
An example PDF can be found here (also attached): https://etc.usf.edu/lit2go/pdf/passage/348/the-adventures-of-sherlock-holmes-004-adventure-4-the-boscombe-valley-mystery.pdf
the-adventures-of-sherlock-holmes-004-adventure-4-the-boscombe-valley-mystery.pdf
It is unclear to me whether the parser should fail gracefully or ignore the offending content, so I am reporting this as a bug rather than proposing a fix.
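To make the two options concrete, here is a rough sketch of a per-page wrapper that either skips pages that fail to parse or re-raises with more context. It is not docling's implementation: parse_pages, its skip_failed flag, and fake_parse_page are hypothetical names, and fake_parse_page merely stands in for the real per-page parsing call.

```python
from typing import Callable, Dict, List, Tuple


def parse_pages(
    parse_page: Callable[[int], dict],
    page_count: int,
    skip_failed: bool = True,
) -> Tuple[Dict[int, dict], List[int]]:
    """Parse each page; skip failures or fail with context (hypothetical helper)."""
    pages: Dict[int, dict] = {}
    failed: List[int] = []
    for page_no in range(page_count):
        try:
            pages[page_no] = parse_page(page_no)
        except RuntimeError as exc:
            if not skip_failed:
                # "Fail gracefully": surface which page broke, keep the cause.
                raise RuntimeError(f"page {page_no} could not be parsed") from exc
            # "Ignore": record the page number and continue with the rest.
            failed.append(page_no)
    return pages, failed


def fake_parse_page(page_no: int) -> dict:
    """Stand-in parser that fails on page 1, mimicking the reported error."""
    if page_no == 1:
        raise RuntimeError("[json.exception.parse_error.101] parse error")
    return {"page": page_no}


pages, failed = parse_pages(fake_parse_page, page_count=3)
print("parsed pages:", sorted(pages), "failed pages:", failed)
```

Skipping keeps the rest of the document usable but silently drops content, so reporting the failed page numbers to the caller seems like the minimum either way.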