PDF text extraction is missing pages - Githubissues

1jamesthompson1 / TAIC-report-summary

Using LLM technologies to analyze transport accident investigation reports

https://taic-viewer-72e8675c1c03.herokuapp.com/

GNU General Public License v3.0

PDF text extraction is missing pages #164

Open 1jamesthompson1 opened 1 month ago

1jamesthompson1 commented 1 month ago

Problem

Currently the PDFParser parses the PDFs into text, but some of the pages are missing from the output.

This affects section extraction for #146, for two reasons:

Some of the sections are missed entirely because they are not in the txt file.
The section before a missing page runs on until the next higher-level section, because it can't find the end of its own section (it finds the end of a section by looking for the start of the next section).
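
To illustrate both failure modes, here is a minimal sketch. It is not the repository's actual extraction code: `heading_start`, `extract_section`, the sample pages and the assumption that headings appear at the start of a line as a section number followed by whitespace are all hypothetical and only meant to reproduce the behaviour described above.

```python
import re


def heading_start(text, number, pos=0):
    """Return the index of the heading for `number` at or after `pos`, or -1.

    Assumption: a heading is a section number at the start of a line,
    followed by whitespace (e.g. "3.2 Crew").
    """
    pattern = re.compile(rf"^{re.escape(number)}\s", re.MULTILINE)
    match = pattern.search(text, pos)
    return match.start() if match else -1


def extract_section(text, number, following):
    """Extract section `number`, ending at the first heading in `following`
    that can be found after it. Returns None if the heading itself is absent.
    """
    start = heading_start(text, number)
    if start == -1:
        return None  # the heading was on a page that is missing from the text
    for candidate in following:
        end = heading_start(text, candidate, start + 1)
        if end != -1:
            return text[start:end]
    return text[start:]  # no later heading found: run to the end of the text


# Hypothetical three-page report; section 3.2 starts on page 2.
pages = [
    "3.1 Weather\nClear skies.\n",
    "3.2 Crew\nTwo pilots.\n",
    "The captain had 5000 hours.\n4 Findings\nNone.\n",
]
full_text = "".join(pages)
text_missing_page_2 = pages[0] + pages[2]  # simulate the dropped page

good = extract_section(full_text, "3.1", ["3.2", "4"])
overrun = extract_section(text_missing_page_2, "3.1", ["3.2", "4"])
lost = extract_section(text_missing_page_2, "3.2", ["4"])

print(repr(good))
print(repr(overrun))
print(repr(lost))
```

With the full text, section 3.1 stops at the 3.2 heading. With page 2 dropped, section 3.2 cannot be found at all, and section 3.1 swallows the remainder of 3.2 from page 3 before stopping at section 4.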

Ideas and suggestions

Links and references