euske / pdfminer

Python PDF Parser (Not actively maintained). Check out pdfminer.six.
https://github.com/pdfminer/pdfminer.six
MIT License
5.25k stars 1.13k forks

Getting data in CID Fonts #214

Open abhishek-jain-infrrd opened 6 years ago

abhishek-jain-infrrd commented 6 years ago

When I use pdfminer to extract text from a PDF, every character comes out CID-encoded instead of as readable text. However, if I open the PDF and select the text, I can copy it and use it normally.

Attaching the sample PDF: sample.pdf

h2ri commented 6 years ago

I had a similar issue with some PDFs while parsing: (cid:160) and (cid:173) appeared in place of the spaces between words. I fixed it by adding the entries ('space', None, 202, 160, None) and ('space', None, 202, 173, None) to the latin_enc.py file.

Hope it helps
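
If patching latin_enc.py is not an option, a rough alternative is to clean up the leftover tokens after extraction. The snippet below is only a sketch: the function name replace_cid_tokens and the CID_TO_TEXT table are made up for illustration, and the mapping assumes that 160 and 173 stand for spaces, as in the PDFs described above. It will probably not help when every character comes out as a CID code, as in the original report, since no small lookup table can cover that case.

```python
import re

# Assumption: in the affected PDFs these CID values stand for spaces,
# as reported above for (cid:160) and (cid:173). Adjust for your documents.
CID_TO_TEXT = {160: " ", 173: " "}

# Matches tokens such as "(cid:160)" in the extracted text.
_CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def replace_cid_tokens(text, mapping=CID_TO_TEXT):
    """Replace known (cid:N) tokens; leave unknown ones untouched."""

    def _substitute(match):
        cid = int(match.group(1))
        # Fall back to the original token when the CID is not in the mapping.
        return mapping.get(cid, match.group(0))

    return _CID_TOKEN.sub(_substitute, text)
```

Run it on the string pdfminer returns, for example replace_cid_tokens(extracted_text). Unknown CIDs are left in place on purpose, so anything still unmapped stays visible.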