axa-group / Parsr

Transforms PDF, Documents and Images into Enriched Structured Data
Apache License 2.0
5.81k stars 310 forks

pdfminer can't parse characters outside the ASCII encoding #136

Closed marianorodriguez closed 4 years ago

marianorodriguez commented 5 years ago

Attached is a document in Spanish that shows that pdfminer can't process Latin characters such as á, é, í, ó, ú, ñ, etc.

Screenshot 2019-10-17 at 16 33 40

caixa-one-page-spanish.pdf

peter-vandenabeele-axa commented 5 years ago

Sorry, I fail to understand the title in relation to the problem statement.

I understand this as "pdfminer can't parse characters outside the ASCII encoding" (because the accented characters from French, Spanish, etc. that you refer to are part of UTF-8).

Probably I am just misunderstanding?

royjohal commented 5 years ago

> Attached is a document in Spanish that shows that pdfminer can't process Latin characters such as á, é, í, ó, ú, ñ, etc.

It's actually more complicated than that. This behavior can also show up with standard Latin characters. It all depends on how the PDF document was encoded: a PDF can lack a complete mapping from its glyphs to text, because providing a textual equivalent is not an inherent purpose of the PDF format.

Lines producing the '?': https://github.com/axa-group/Parsr/blob/658695392ae1ebb78bc0e16986445749c78bcacf/server/src/input/pdfminer/pdfminer.ts#L196-L205
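
To make the '?' substitution concrete, here is a minimal sketch of the general idea. It is not the code at the linked lines. It assumes the extractor marks glyphs it cannot map to Unicode with a `(cid:N)` placeholder, which is how pdfminer.six is known to report undefined characters. The names `replaceUnmappedGlyphs`, `SanitizedText`, and `CID_PLACEHOLDER` are hypothetical and used only for illustration.

```typescript
// Hypothetical helper, not the code at the linked Parsr lines.
// Assumption: glyphs without a Unicode mapping appear in the extractor
// output as "(cid:N)" placeholders, where N is the glyph's character id.
const CID_PLACEHOLDER = /\(cid:(\d+)\)/g;

interface SanitizedText {
  text: string;           // extracted text with each placeholder replaced by '?'
  unmappedCids: number[]; // character ids that had no Unicode mapping
}

function replaceUnmappedGlyphs(raw: string): SanitizedText {
  const unmappedCids: number[] = [];
  const text = raw.replace(CID_PLACEHOLDER, (_match: string, cid: string) => {
    unmappedCids.push(Number(cid)); // remember which glyph could not be mapped
    return "?";                     // visible marker for the lost character
  });
  return { text, unmappedCids };
}

const sample = replaceUnmappedGlyphs("Informaci(cid:243)n");
console.log(sample.text);         // Informaci?n
console.log(sample.unmappedCids); // [ 243 ]
```

Keeping the list of unmapped ids alongside the cleaned text makes it possible to tell whether a '?' came from the document itself or from a missing mapping, which the bare substitution cannot distinguish.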

Related Issues:

  1. #75
  2. aarohijohal/pdfminer.six#1
  3. pdfminer/pdfminer.six#130

royjohal commented 4 years ago

An interesting related patent: https://patents.google.com/patent/US20060288281

royjohal commented 4 years ago

This is a general problem with the extractors used, rather than a problem specific to Parsr.