climatepolicyradar / policy-search


Research > Document issues with current pdf parsing with examples #91

Open chrisaballard opened 2 years ago

chrisaballard commented 2 years ago

Currently, the prototype extracts all PDF text elements, segments them into pages, spans and sentences, and attempts to retain natural reading order. This means that text in headers, footers, tables, and figures is also extracted, but is often inserted in the wrong place. The prototype cannot reliably extract multi-column layouts because it depends on heuristic rules to separate columns. Contiguous text spans are often split up, resulting in incorrect ordering of spans.
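To illustrate why heuristic column separation is fragile, here is a minimal sketch of a midpoint rule for a two-column page. It is not the prototype's actual logic: the `Block` type, the `naive_two_column_order` function, and the coordinates are all hypothetical and only show the kind of failure described above.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """A text block with its bounding box (hypothetical type for illustration)."""
    x0: float
    y0: float
    x1: float
    y1: float
    text: str


def naive_two_column_order(blocks: list[Block], page_width: float) -> list[str]:
    """Assign each block to a column by its horizontal centre, then read the
    left column top to bottom, followed by the right column."""
    mid = page_width / 2
    left = [b for b in blocks if (b.x0 + b.x1) / 2 < mid]
    right = [b for b in blocks if (b.x0 + b.x1) / 2 >= mid]
    ordered = sorted(left, key=lambda b: b.y0) + sorted(right, key=lambda b: b.y0)
    return [b.text for b in ordered]


# A full-width heading above two columns of body text.
page = [
    Block(50, 40, 550, 70, "heading"),
    Block(50, 100, 290, 300, "left A"),
    Block(310, 100, 550, 300, "right A"),
]

# The heading spans both columns, so the midpoint rule places it in the
# right-hand column and it ends up between the two body columns.
print(naive_two_column_order(page, page_width=600))
# → ['left A', 'heading', 'right A']
```

The natural reading order here would be heading, left A, right A. Any element that spans columns (headers, footers, tables, figures) can be misplaced in the same way.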

Document the specific issues with the current PDF parser with examples. This will enable us to design an improved solution to PDF parsing that overcomes these issues.

Issue #2 contains a list of current issues that should be expanded.

chrisaballard commented 2 years ago

This issue collects the various problems with the current heuristic-based PDF parsing, which is implemented using PyMuPDF.

Table of contents pages list headings separated from page numbers by runs of a repeated symbol such as ".". The repeated characters and page numbers are removed, but the titles remain in the extracted text. Example: cclw-10086-357cef7658b8440b823e4c76c0b09745.pdf
In some PDFs, a page contains hidden text that is not removed by the PyMuPDF scrub() method. This results in repeated text, and text from other pages, being output. Example: cclw-10086-357cef7658b8440b823e4c76c0b09745.pdf
    this PyMuPDF issue might be relevant
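
One possible way to handle the table of contents problem is to detect whole entries before the leader characters are stripped, and drop the line including its title. The sketch below uses only the standard library and assumes the raw lines are still available at that stage. The `TOC_LINE` pattern and `drop_toc_lines` helper are hypothetical and are not part of the current prototype.

```python
import re

# A title, then a run of at least three dots (optionally separated by
# spaces), then a page number at the end of the line.
TOC_LINE = re.compile(r"^\s*(?P<title>.*?\S)\s*(?:\.\s*){3,}(?P<page>\d+)\s*$")


def drop_toc_lines(lines: list[str]) -> list[str]:
    """Return the lines with table-of-contents entries removed entirely."""
    return [line for line in lines if not TOC_LINE.match(line)]


sample = [
    "1. Introduction ........ 3",
    "2.1 Policy context . . . . . 12",
    "Emissions fell by 3.5 percent in 2019.",
]
print(drop_toc_lines(sample))
# → ['Emissions fell by 3.5 percent in 2019.']
```

This is still a heuristic: it only covers "." leaders, and a body line that ends in an ellipsis followed by a number would also be dropped.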