deanmalmgren / textract

extract text from any document. no muss. no fuss.
http://textract.readthedocs.io
MIT License

Different result compared to when extracting directly with pdftotext #330

Open filipopo opened 4 years ago

filipopo commented 4 years ago

Don't know if it's an update issue or what, but bytes aren't the only problem.

textract:

91. Registration number is the official set of numbers and letters shown on the front and back of vehicle on the
__________________________________.
а)
b)
c)
d)

Licence plate.
Number board.
Register table.
Number place.

It's like this in the bytes too: \xd0\xb0)\nb)\nc)\nd)\n\nLicence plate.\nNumber board.\nRegister table.\nNumber place.

pdftotext:

91. Registration number is the official set of numbers and letters shown on the front and back of vehicle on the
    __________________________________.
      а)   Licence plate.
     b)    Number board.
      c)   Register table.
     d)    Number place.
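
To make the difference concrete, here is a minimal sketch that decodes the textract bytes and checks whether the two results contain the same words in a different order. The strings are copied from the outputs above. The helper name `same_words_different_order` is made up for this illustration and is not part of textract or pdftotext.

```python
from collections import Counter

# Outputs copied from this report. The first answer marker is the
# Cyrillic letter "а" (U+0430), which is b"\xd0\xb0" in UTF-8.
TEXTRACT_BYTES = (
    b"\xd0\xb0)\nb)\nc)\nd)\n\n"
    b"Licence plate.\nNumber board.\nRegister table.\nNumber place."
)
PDFTOTEXT_TEXT = (
    "      \u0430)   Licence plate.\n"
    "     b)    Number board.\n"
    "      c)   Register table.\n"
    "     d)    Number place.\n"
)


def same_words_different_order(a: str, b: str) -> bool:
    """Hypothetical helper: True when both texts hold the same
    whitespace-separated tokens but list them in a different order."""
    tokens_a, tokens_b = a.split(), b.split()
    return Counter(tokens_a) == Counter(tokens_b) and tokens_a != tokens_b


# textract returns bytes, so decode before comparing with a str.
textract_text = TEXTRACT_BYTES.decode("utf-8")

print(repr(textract_text[0]))
print(same_words_different_order(textract_text, PDFTOTEXT_TEXT))
```

For this sample the check comes out true, which suggests that nothing is lost and only the reading order differs: textract lists all the markers first and then all the answers, while pdftotext keeps each marker next to its answer.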

Here, try it yourself:

Python program that saves the results of converting files using pdftotext and textract into different files: https://github.com/Filip98/congenial-bassoon/blob/master/a.py

Sample file: https://nissrednjastrucna.edu.rs/data/documents/Opsti-deo.pdf
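
For anyone who wants to reproduce this without opening the link, here is a rough sketch of such a comparison. It is not the linked script. It assumes the `pdftotext` command-line tool is on the PATH and that textract is installed, and the function and output file names are made up for this illustration. It also runs pdftotext both with and without `-layout`, on the guess (not verified here) that the layout mode is what makes the difference.

```python
import subprocess
from pathlib import Path


def pdftotext_command(pdf_path, layout):
    """Build the pdftotext argument list; the trailing "-" sends text to stdout."""
    args = ["pdftotext"]
    if layout:
        args.append("-layout")  # keep the physical layout of the page
    args += [str(pdf_path), "-"]
    return args


def run_pdftotext(pdf_path, layout):
    """Run the pdftotext CLI and return its raw output as bytes."""
    result = subprocess.run(
        pdftotext_command(pdf_path, layout), capture_output=True, check=True
    )
    return result.stdout


def run_textract(pdf_path):
    """Extract with textract's pdftotext method; the result is bytes."""
    import textract  # imported lazily so the helpers above work without it

    return textract.process(str(pdf_path), method="pdftotext")


def save_all(pdf_path, out_dir):
    """Write each variant to its own file so they can be diffed afterwards."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    outputs = {
        "textract.txt": run_textract(pdf_path),
        "pdftotext_plain.txt": run_pdftotext(pdf_path, layout=False),
        "pdftotext_layout.txt": run_pdftotext(pdf_path, layout=True),
    }
    for name, data in outputs.items():
        (out / name).write_bytes(data)
    return sorted(outputs)
```

Calling `save_all("Opsti-deo.pdf", "out")` on the downloaded sample file and diffing the three files should show which pdftotext mode, if any, matches the textract result.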