Research > Document issues with current pdf parsing with examples

Currently, the prototype extracts all PDF text elements, segments them into pages, spans and sentences, and attempts to retain the natural reading order. This means that text in headers, footers, tables, and figures is also extracted, but it is often inserted in the wrong place. The prototype cannot reliably extract multi-column layouts because it depends on heuristic rules to separate columns. Contiguous text spans are often split up, which results in spans being ordered incorrectly.
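
To illustrate the reading-order problem, here is a minimal sketch. It is not the prototype's actual code: `Span`, `naive_order`, `column_order` and the fixed `split_x` threshold are hypothetical names chosen for illustration. It shows how sorting spans purely by position interleaves two columns, and how a simple x-threshold heuristic recovers the order only when the column boundary is known in advance.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    """Hypothetical text span with the top-left corner of its bounding box."""
    text: str
    x0: float  # left edge
    y0: float  # top edge


def naive_order(spans: List[Span]) -> List[str]:
    """Sort top-to-bottom, then left-to-right, ignoring columns."""
    return [s.text for s in sorted(spans, key=lambda s: (s.y0, s.x0))]


def column_order(spans: List[Span], split_x: float) -> List[str]:
    """Heuristic: assign each span to a column using a fixed x threshold."""
    left = sorted((s for s in spans if s.x0 < split_x), key=lambda s: s.y0)
    right = sorted((s for s in spans if s.x0 >= split_x), key=lambda s: s.y0)
    return [s.text for s in left + right]


# Two-column page: L1 and L2 are in the left column, R1 and R2 in the right.
page = [
    Span("L1", 50, 100), Span("R1", 320, 100),
    Span("L2", 50, 120), Span("R2", 320, 120),
]

print(naive_order(page))        # columns are interleaved
print(column_order(page, 300))  # left column first, then right column
```

A fixed threshold like this breaks as soon as a page mixes full-width text with columns, or the gutter position varies between pages, which is the kind of failure this issue aims to document.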

Document the specific issues with the current PDF parser with examples. This will enable us to design an improved solution to PDF parsing that overcomes these issues.

Issue #2 contains a list of current issues that should be expanded.

[ ] Review examples and document issues
[ ] Document requirements for improved PDF parsing

Table of contents pages contain headings separated from page numbers by repeated symbols such as ".". The repeated characters and page numbers are removed, but the titles remain in the extracted text. Example: cclw-10086-357cef7658b8440b823e4c76c0b09745.pdf

In some PDFs, there is hidden text on a page that is not removed by the PyMuPDF scrub() method. This results in repeated text, and text from other pages, being output. Example: cclw-10086-357cef7658b8440b823e4c76c0b09745.pdf. This PyMuPDF issue might be relevant.
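
One possible way to handle the table of contents case, sketched here as an assumption rather than a description of the current parser, is to detect lines with dot leaders before any characters are stripped and flag the whole page. The names `TOC_LINE`, `is_toc_line`, `toc_fraction` and the idea of a per-page threshold are hypothetical, and the pattern only covers "." leaders followed by Arabic page numbers.

```python
import re
from typing import List

# A title, then a run of three or more "." leaders (optionally spaced),
# then an Arabic page number at the end of the line.
TOC_LINE = re.compile(r"^(?P<title>.+?)\s*(?:\.\s?){3,}\s*(?P<page>\d+)\s*$")


def is_toc_line(line: str) -> bool:
    """Return True if the line looks like 'Title ..... 12'."""
    return TOC_LINE.match(line) is not None


def toc_fraction(lines: List[str]) -> float:
    """Fraction of non-empty lines on a page that look like TOC entries."""
    non_empty = [ln for ln in lines if ln.strip()]
    if not non_empty:
        return 0.0
    return sum(is_toc_line(ln) for ln in non_empty) / len(non_empty)


page_lines = [
    "Contents",
    "1. Introduction ........ 3",
    "2. Methods . . . . . 12",
    "3. Results....27",
]

print(toc_fraction(page_lines))
```

A page whose fraction exceeds some threshold could then be tagged as a table of contents and excluded from the extracted text, instead of leaving orphaned titles behind. The threshold would need tuning against real documents.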

climatepolicyradar / policy-search

Research > Document issues with current pdf parsing with examples #91