aws-samples / amazon-textract-transformer-pipeline

Post-process Amazon Textract results with Hugging Face transformer models for document understanding
MIT No Attribution

Japanese text not rendering in A2I review UI #30

Closed athewsey closed 1 year ago

athewsey commented 1 year ago

When processing PDFs containing digital text in Japanese (e.g. save this AWS JP ML blog post as a PDF and upload it to the pipeline), the Japanese text does not render on the document in the A2I review UI: only Latin text is preserved, with blank space where the Japanese text should be.

In the browser console there are repeated messages like:

Warning: loadFont - translateFont failed: "Error: fetchBuiltInCMap: failed to fetch file "undefinedAdobe-Japan1-UCS2.bcmap" with "Forbidden".".
Warning: Error during font loading: fetchBuiltInCMap: failed to fetch file "undefinedAdobe-Japan1-UCS2.bcmap" with "Forbidden".".

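It looks like we may need to enable character maps (CMaps) in the pdf.js getDocument call, so that character sets can be mapped/translated into available fonts? The "undefined" prefix on the requested file name suggests that no CMap base URL is currently being passed.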
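
As a rough sketch of what that could look like (not the fix that was actually applied): pdf.js `getDocument` accepts `cMapUrl` and `cMapPacked` parameters, and the CMap file URL is formed by appending the CMap name to `cMapUrl`, which would explain the `undefinedAdobe-Japan1-UCS2.bcmap` request when `cMapUrl` is unset. The helper names and example URLs below are hypothetical, and the CMap files (shipped in the `cmaps/` folder of the pdfjs-dist package) are assumed to be hosted somewhere the review UI is allowed to fetch from.

```typescript
// Sketch only: helper names and the example.com URLs are hypothetical.

/** Subset of the pdf.js getDocument() parameters that matter here. */
interface CMapDocumentParams {
  url: string;         // location of the PDF to render
  cMapUrl: string;     // base URL of the predefined CMap files, with trailing slash
  cMapPacked: boolean; // true when the CMaps are the binary-packed .bcmap files
}

/** Mimics how a packed CMap file URL is formed: base URL + name + ".bcmap". */
function cMapFileUrl(cMapUrl: string | undefined, name: string): string {
  // An unset base URL is stringified as "undefined", as seen in the warning.
  return `${cMapUrl}${name}.bcmap`;
}

/** Builds getDocument() parameters with CMap loading enabled. */
function buildDocumentParams(pdfUrl: string, cMapBaseUrl: string): CMapDocumentParams {
  // The base URL is concatenated directly with the file name, so it needs a trailing slash.
  const cMapUrl = cMapBaseUrl.endsWith("/") ? cMapBaseUrl : `${cMapBaseUrl}/`;
  return { url: pdfUrl, cMapUrl, cMapPacked: true };
}

const params = buildDocumentParams(
  "https://example.com/doc.pdf",
  "https://example.com/pdfjs/cmaps",
);
// With pdf.js loaded, this object would be passed as: pdfjsLib.getDocument(params)
```

The "Forbidden" response suggests the request was going to a location that does not serve the CMaps, so where to host them for the A2I template remains an open question here.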