Closed danielkorzekwa closed 4 days ago
I suggest using the v2 parser backend (dlparse_v2), which will soon become the default. I can confirm that the paper is processed correctly with the following option:
docling --pdf-backend dlparse_v2 https://arxiv.org/pdf/2410.06488
To select the backend from code, see https://github.com/DS4SD/docling/blob/main/docling/cli/main.py#L265
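As a rough sketch of what the CLI does for `--pdf-backend dlparse_v2`, the backend can be passed to the converter through the PDF format options. This assumes a docling 2.4.x install in which `DoclingParseV2DocumentBackend` and `PdfFormatOption` are importable from the modules shown; the helper name `build_v2_converter` is made up for illustration and is not part of docling.

```python
def build_v2_converter():
    """Return a DocumentConverter that parses PDFs with the dlparse_v2 backend.

    `build_v2_converter` is a hypothetical helper name, not a docling API.
    """
    # Imported inside the function so this sketch can be loaded
    # even where docling is not installed.
    from docling.backend.docling_parse_v2_backend import DoclingParseV2DocumentBackend
    from docling.datamodel.base_models import InputFormat
    from docling.document_converter import DocumentConverter, PdfFormatOption

    # Override only the PDF backend; other pipeline options keep their defaults.
    return DocumentConverter(
        format_options={
            InputFormat.PDF: PdfFormatOption(backend=DoclingParseV2DocumentBackend)
        }
    )
```

With docling installed, `build_v2_converter().convert("https://arxiv.org/pdf/2410.06488")` should return a result whose `document.export_to_markdown()` gives the text, mirroring the calls in the traceback below.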
Bug
Converting a PDF that contains Chinese characters fails with a "syntax error while parsing object key" exception.
...
ArxivService.extract_text_from_pdf(self, arxiv_id)
     41 source = self._get_pdf_filepath(arxiv_id)
     42 converter = DocumentConverter()
---> 43 result = converter.convert(source)
     44 pdf_text = result.document.export_to_markdown()
...
miniconda3/envs/bfs/lib/python3.12/site-packages/docling/backend/docling_parse_backend.py, in DoclingParsePageBackend.__init__(self, parser, document_hash, page_no, page_obj)
     21 def __init__(
     22     self, parser: pdf_parser_v1, document_hash: str, page_no: int, page_obj: PdfPage
     23 ):
     24     self._ppage = page_obj
---> 25     parsed_page = parser.parse_pdf_from_key_on_page(document_hash, page_no)
     27     self.valid = "pages" in parsed_page
     28     if self.valid:

RuntimeError: [json.exception.parse_error.101] parse error at line 13, column 36: syntax error while parsing object key - invalid string: control character U+001F (US) must be escaped to \u001F; last read: '"/PVOXJK+FlexiFontBZ?<U+001F>'; expected string literal

https://arxiv.org/pdf/2410.06488
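The error message says a raw U+001F control character ended up inside a JSON object key without being escaped. The snippet below reproduces that class of failure with Python's standard `json` module. It is only a stand-in for illustration: it is not the parser docling-parse uses, and the JSON document is hand-built around the font name from the error message.

```python
import json

# Key modelled on the font name in the error message, ending in a raw U+001F.
font_key = "/PVOXJK+FlexiFontBZ?\x1f"

# Hand-built JSON that leaves the control character unescaped.
raw_json = '{"' + font_key + '": 1}'

# json.dumps escapes control characters, producing valid JSON.
escaped_json = json.dumps({font_key: 1})


def parses_strictly(text: str) -> bool:
    """Return True if `text` is accepted by a strict JSON parser."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


print(parses_strictly(raw_json))      # unescaped control character is rejected
print(parses_strictly(escaped_json))  # escaped form is accepted
print(escaped_json)
```

The point of the sketch is that the fix belongs on the writing side: whatever serializes the font name has to escape control characters before the JSON is parsed.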
Steps to reproduce
Parse this PDF: https://arxiv.org/pdf/2410.06488
Docling version
pip list | grep docling
docling 2.4.2
docling-core 2.3.1
docling-ibm-models 2.0.3
docling-parse 2.0.3
Python version
Python 3.12.4