pdfminer can't parse characters outside the ASCII encoding

marianorodriguez commented 5 years ago

Attached is a document in Spanish showing that pdfminer can't process accented Latin characters such as á, é, í, ó, ú, ñ, etc.

peter-vandenabeele-axa commented 5 years ago

Sorry, I fail to understand the title, relative to the problem statement.

I understand this as "pdfminer can't parse characters outside the ASCII encoding" (because the accented characters from French, Spanish, etc. that you refer to are part of UTF-8).

Probably I am just misunderstanding?
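To make the ASCII versus UTF-8 distinction concrete, here is a minimal sketch using only the Python standard library. The helper name `is_ascii` is made up for illustration and is not part of pdfminer or Parsr.

```python
def is_ascii(text: str) -> bool:
    """Return True if every character in `text` can be encoded as ASCII."""
    try:
        text.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False


accented = "á, é, í, ó, ú, ñ"

# The accented characters are outside ASCII...
print(is_ascii(accented))      # False
print(is_ascii("a, e, i"))     # True

# ...but they are ordinary Unicode characters that UTF-8 encodes in two bytes.
print("ñ".encode("utf-8"))     # b'\xc3\xb1'
```

So the characters themselves are representable; whether an extractor can recover them from a given PDF is a separate question.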

royjohal commented 5 years ago

> Attached is a document in Spanish showing that pdfminer can't process accented Latin characters such as á, é, í, ó, ú, ñ, etc.

It's actually more complicated than that. This behavior can also show up with standard Latin characters; it all depends on how the PDF document was encoded. A PDF can lack a complete text mapping simply by omission, since providing a textual equivalent is not an inherent purpose of the PDF format.

Lines producing the '?': https://github.com/axa-group/Parsr/blob/658695392ae1ebb78bc0e16986445749c78bcacf/server/src/input/pdfminer/pdfminer.ts#L196-L205
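As a rough way to check whether a document is affected, one can look for the placeholders the extractor emits for glyphs it could not map to Unicode. The sketch below assumes text output from pdfminer.six, which writes `(cid:N)` for such glyphs. The function name `replace_unmapped` and the sample string (including its CID numbers) are invented for illustration; this is not the Parsr code linked above.

```python
import re
from typing import List, Tuple

# pdfminer.six writes "(cid:N)" where a glyph has no Unicode mapping.
CID_PLACEHOLDER = re.compile(r"\(cid:(\d+)\)")


def replace_unmapped(text: str, replacement: str = "?") -> Tuple[str, List[int]]:
    """Replace "(cid:N)" placeholders and report which CIDs were unmapped.

    Returns the cleaned text and the list of CID numbers, in order of appearance.
    """
    cids = [int(match.group(1)) for match in CID_PLACEHOLDER.finditer(text)]
    cleaned = CID_PLACEHOLDER.sub(replacement, text)
    return cleaned, cids


# Hypothetical extractor output for a PDF whose font lacks a Unicode mapping.
sample = "Informaci(cid:243)n t(cid:233)cnica"
cleaned, cids = replace_unmapped(sample)
print(cleaned)  # Informaci?n t?cnica
print(cids)     # [243, 233]
```

A non-empty CID list suggests the problem lies in the PDF's encoding rather than in the characters being accented, which is consistent with it also appearing on plain Latin text.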

Related Issues:

#75
aarohijohal/pdfminer.six#1
pdfminer/pdfminer.six#130

royjohal commented 4 years ago

An interesting related patent: https://patents.google.com/patent/US20060288281

royjohal commented 4 years ago

This is a general problem with the extractors used rather than one specific to Parsr.

pdfminer can't parse characters outside the ASCII encoding #136