daliposc / osu-salaries

parse salary .pdf into pandas df to analyze salary by department type
Entries at page ends are sometimes missed #1

Closed Neato-Nick closed 9 months ago

Neato-Nick commented 9 months ago

An edge case in parsing causes some entries to be skipped, so not every entry in the PDF gets read.

I think the specific cause is that, at page ends, an entry's opening and closing hyphen lines can fall on different pages.

Here are two examples with the CSV open side-by-side with the PDF:

[Screenshot 2024-02-04 at 9:58:01 AM]
[Screenshot 2024-02-04 at 9:55:18 AM]
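
To illustrate the suspected cause, here is a minimal sketch that does not use the repository's actual parser. It assumes each entry is enclosed between lines made up only of hyphens, and that the text of each page has already been extracted as a string. The names `DELIM`, `split_records`, and the sample `pages` are hypothetical and exist only for this illustration. It shows how splitting page by page drops an entry that straddles a page break, while joining the lines of all pages before splitting keeps it.

```python
import re

# Assumption: an entry is bounded by lines consisting only of hyphens.
DELIM = re.compile(r"^\s*-{3,}\s*$")


def split_records(lines):
    """Return the non-empty blocks of lines enclosed between two delimiter lines."""
    records = []
    current = None  # None until the first delimiter line has been seen
    for line in lines:
        if DELIM.match(line):
            if current:  # a non-empty block was just closed
                records.append(current)
            current = []  # this delimiter also opens the next block
        elif current is not None and line.strip():
            current.append(line.strip())
    # A trailing block with no closing delimiter is dropped.
    return records


# Made-up sample text: "Entry B" starts on page 1 and ends on page 2.
pages = [
    "-----\nEntry A\n-----\nEntry B, part 1",
    "Entry B, part 2\n-----\nEntry C\n-----",
]

# Splitting each page separately loses the entry that crosses the page break.
per_page = [rec for page in pages for rec in split_records(page.splitlines())]

# Joining all pages' lines first keeps the opening and closing delimiters together.
merged = split_records([line for page in pages for line in page.splitlines()])
```

In this sample, `per_page` contains only the entries for A and C, while `merged` also contains the two-line entry for B. If the real PDF repeats headers or footers on each page, those lines would need to be filtered out before joining.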