Closed by marianorodriguez 4 years ago
Sorry, I fail to understand the title, relative to the problem statement.
I read it as "pdfminer can't parse characters outside the ASCII range" (the accented characters from French, Spanish, etc. that you refer to are not ASCII, but they are ordinary Unicode characters that UTF-8 encodes without trouble).
Perhaps I am just misunderstanding?
Attached is a document in Spanish that shows that pdfminer can't process accented Latin characters such as á, é, í, ó, ú, ñ, etc.
It's actually more complicated than that. The same behavior can also show up on standard Latin characters; it all depends on how the PDF document was encoded. A PDF can simply omit a complete mapping from its glyphs to text, because providing a textual equivalent is not an inherent purpose of the PDF format.
Lines producing the '?': https://github.com/axa-group/Parsr/blob/658695392ae1ebb78bc0e16986445749c78bcacf/server/src/input/pdfminer/pdfminer.ts#L196-L205
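To illustrate the missing-mapping case, here is a minimal sketch in Python. It relies on one behavior I believe is accurate: when pdfminer meets a glyph with no Unicode mapping, it writes a `(cid:N)` token into the extracted text. Everything else is an assumption for illustration only: the function name `summarize_unmapped`, the sample string, and the choice of '?' as the placeholder. This is not the code at the link above, just a way to see how many glyphs in an extracted string were never mapped.

```python
import re
from typing import List, Tuple

# pdfminer emits "(cid:N)" in place of a glyph it cannot map to Unicode.
CID_TOKEN = re.compile(r"\(cid:(\d+)\)")


def summarize_unmapped(text: str, placeholder: str = "?") -> Tuple[str, List[int]]:
    """Hypothetical helper: swap each "(cid:N)" token for a placeholder.

    Returns the cleaned text and the list of unmapped CIDs, in order of
    appearance, so the caller can tell how much of the text was lost.
    """
    cids = [int(match.group(1)) for match in CID_TOKEN.finditer(text)]
    cleaned = CID_TOKEN.sub(placeholder, text)
    return cleaned, cids


# Made-up sample: what extracted text might look like when "ó" has no mapping.
sample = "Informaci(cid:243)n de la cuenta"
cleaned, cids = summarize_unmapped(sample)
print(cleaned)  # Informaci?n de la cuenta
print(cids)     # [243]
```

Note that the CID number is only an index into the font in question, so it cannot in general be converted back to the original character without the font's own mapping. That is why a placeholder is about the best a post-processing step can do.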
Related Issues:
An interesting related patent: https://patents.google.com/patent/US20060288281
This is a general problem with the extractors used, rather than a problem specific to Parsr.
caixa-one-page-spanish.pdf