arisp8 / gazette-analysis

PDF text extraction - Improving accuracy for gazette documents #11

Open arisp8 opened 6 years ago

arisp8 commented 6 years ago

Text extraction from the PDFs is not always 100% accurate. The gazette documents always have two columns of text, and when the columns are too close to each other, sentences or words from one column can get mixed up with words from the other.
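To illustrate the column problem, here is a minimal sketch of one possible workaround: assign each word to a column by its x-coordinate, then read the left column fully before the right one. This is not the parser used in this repository. The `(x0, top, text)` word format, the function names, and the assumption that the gutter sits at the page midpoint are all hypothetical.

```python
from typing import List, Tuple

# Hypothetical word-box format: (x0, top, text), with the origin at the top left.
Word = Tuple[float, float, str]


def group_lines(column: List[Word], line_tol: float) -> List[str]:
    """Group the words of one column into text lines, top to bottom."""
    lines: List[List[Word]] = []
    for word in sorted(column, key=lambda w: w[1]):
        # A word joins the current line if its top is close to the line's first word.
        if lines and abs(word[1] - lines[-1][0][1]) <= line_tol:
            lines[-1].append(word)
        else:
            lines.append([word])
    # Within each line, order the words left to right.
    return [" ".join(w[2] for w in sorted(line, key=lambda w: w[0])) for line in lines]


def read_two_columns(words: List[Word], page_width: float, line_tol: float = 2.0) -> str:
    """Return the page text with the left column read before the right one."""
    mid = page_width / 2  # assumption: the gutter is at the page midpoint
    left = [w for w in words if w[0] < mid]
    right = [w for w in words if w[0] >= mid]
    return "\n".join(group_lines(left, line_tol) + group_lines(right, line_tol))


words = [
    (50, 100, "The"), (80, 100.5, "Minister"), (320, 100, "Article"), (350, 100, "2"),
    (50, 115, "of"), (75, 115, "Finance"), (320, 115, "applies"),
]
print(read_two_columns(words, page_width=600))
```

A fixed midpoint will fail on pages where the columns are uneven or a heading spans both, so a real fix would probably need to detect the gutter per page.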

For that reason, I have created unit tests in pdf_parsers_tests.py that check the accuracy of the names extracted from signatures. The goal is 100% accurate data extraction for signatures.
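As a rough idea of what such an accuracy check could look like, the sketch below compares extracted names against a hand-verified list. It does not reproduce the contents of pdf_parsers_tests.py. The helper `name_accuracy`, the test class, and the sample names are hypothetical and only show the shape of the check.

```python
import unittest
from typing import Iterable


def _normalize(name: str) -> str:
    """Collapse whitespace and ignore case so cosmetic differences do not count."""
    return " ".join(name.split()).casefold()


def name_accuracy(extracted: Iterable[str], expected: Iterable[str]) -> float:
    """Return the fraction of expected names found among the extracted names."""
    expected_set = {_normalize(n) for n in expected}
    if not expected_set:
        return 1.0  # nothing to find, so nothing was missed
    extracted_set = {_normalize(n) for n in extracted}
    return len(expected_set & extracted_set) / len(expected_set)


class SignatureNamesTest(unittest.TestCase):
    # Hypothetical data: in practice `extracted` would come from the PDF parser.
    def test_all_signature_names_are_extracted(self):
        expected = ["Jane Doe", "John Smith"]
        extracted = ["Jane  Doe", "JOHN SMITH"]
        self.assertEqual(name_accuracy(extracted, expected), 1.0)
```

If saved as its own module, the test case can be run with `python -m unittest`. A name split across the two columns would show up as a score below 1.0.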

gthd commented 6 years ago

Hi @arisp8, I would like to work on this issue.