DS4SD / docling

Get your documents ready for gen AI
https://ds4sd.github.io/docling
MIT License
10.48k stars, 507 forks

Syntax error while parsing object key (pdf with Chinese characters) #351

Closed: danielkorzekwa closed this issue 4 days ago

danielkorzekwa commented 6 days ago

Bug

Converting a PDF with Chinese characters fails with a "syntax error while parsing object key" exception.

...
ArxivService.extract_text_from_pdf(self, arxiv_id)
     41 source = self._get_pdf_filepath(arxiv_id)
     42 converter = DocumentConverter()
---> 43 result = converter.convert(source)
     44 pdf_text = result.document.export_to_markdown()
...
miniconda3/envs/bfs/lib/python3.12/site-packages/docling/backend/docling_parse_backend.py, in DoclingParsePageBackend.__init__(self, parser, document_hash, page_no, page_obj)
     21 def __init__(
     22     self, parser: pdf_parser_v1, document_hash: str, page_no: int, page_obj: PdfPage
     23 ):
     24     self._ppage = page_obj
---> 25     parsed_page = parser.parse_pdf_from_key_on_page(document_hash, page_no)
     27     self.valid = "pages" in parsed_page
     28     if self.valid:

RuntimeError: [json.exception.parse_error.101] parse error at line 13, column 36: syntax error while parsing object key - invalid string: control character U+001F (US) must be escaped to \u001F; last read: '"/PVOXJK+FlexiFontBZ?<U+001F>'; expected string literal

PDF being converted: https://arxiv.org/pdf/2410.06488
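For readers unfamiliar with this error class: JSON does not allow raw control characters (U+0000 to U+001F) inside string literals, so a font name containing U+001F has to be written as `\u001F` before it can serve as an object key. The sketch below reproduces the same rule with Python's standard `json` module, which is only a stand-in here; the exception in the traceback is raised by the JSON library inside the parser, not by Python. The key is modelled on the font name in the error message and the value `1` is a hypothetical placeholder.

```python
import json

# Hypothetical JSON text: an object key ending in a raw U+001F control
# character, modelled on the font name shown in the error message.
raw = '{"/PVOXJK+FlexiFontBZ?\x1f": 1}'

try:
    json.loads(raw)  # strict parsing rejects unescaped control characters
    strict_ok = True
except json.JSONDecodeError as exc:
    strict_ok = False
    strict_error = str(exc)

# Serialising the same key through a JSON encoder escapes the control
# character, and the escaped text parses back to the original key.
escaped = json.dumps({"/PVOXJK+FlexiFontBZ?\x1f": 1})
round_trip = json.loads(escaped)
```

This suggests the failure comes from how the font name is written into the intermediate JSON rather than from the Chinese text itself, although the thread does not confirm the root cause.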

Steps to reproduce

Convert this PDF: https://arxiv.org/pdf/2410.06488

Docling version

pip list | grep docling
docling            2.4.2
docling-core       2.3.1
docling-ibm-models 2.0.3
docling-parse      2.0.3

Python version

Python 3.12.4

dolfim-ibm commented 4 days ago

I suggest using the v2 parse backend, which will soon become the default. I can confirm that the paper is processed correctly with the following option:

docling --pdf-backend dlparse_v2 https://arxiv.org/pdf/2410.06488

To use it from code, see https://github.com/DS4SD/docling/blob/main/docling/cli/main.py#L265
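A minimal sketch of what selecting the v2 backend from Python might look like, following the pattern the CLI uses. The import paths and the `PdfFormatOption(backend=...)` argument are assumed to match the docling 2.x releases around the time of this issue and may differ in other versions; `build_v2_converter` and `pdf_to_markdown` are hypothetical helper names introduced only for illustration.

```python
def build_v2_converter():
    """Return a DocumentConverter that parses PDFs with the v2 backend."""
    # Imports are kept local so the sketch can be loaded even where docling
    # is not installed; the module paths are assumptions (see lead-in).
    from docling.backend.docling_parse_v2_backend import (
        DoclingParseV2DocumentBackend,
    )
    from docling.datamodel.base_models import InputFormat
    from docling.datamodel.pipeline_options import PdfPipelineOptions
    from docling.document_converter import DocumentConverter, PdfFormatOption

    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(
                pipeline_options=PdfPipelineOptions(),
                # Pass the backend class itself, not an instance.
                backend=DoclingParseV2DocumentBackend,
            )
        }
    )


def pdf_to_markdown(source: str) -> str:
    """Convert a PDF (local path or URL) and export it as Markdown."""
    result = build_v2_converter().convert(source)
    return result.document.export_to_markdown()
```

Calling `pdf_to_markdown("https://arxiv.org/pdf/2410.06488")` would then take the place of the `converter.convert(source)` and `export_to_markdown()` calls shown in the traceback.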