Alfresco / alfresco-transform-core

GNU Lesser General Public License v3.0

Unicode transformation from txt to pdf not supported #152

Open justemu opened 5 years ago

justemu commented 5 years ago

T-Engine: transform-misc
Version: 2.1.0
Task: transformation of a txt file (Unicode) to pdf failed
Error information:

31 Oct 2019 11:16:56 510 txt pdf ERROR 本文档库子文件夹权限设置规范 688 bytes 274 ms textToPdf Failed 09311139 textToPdf returned a 400 status Miscellaneous Transformers - U+76EE ('.notdef') is not available in this font Helvetica encoding: WinAnsiEncoding http://transform-misc:8090/transform targetExtension=pdf sourceMimetype=text/plain sourceExtension=txt targetMimetype=application/pdf

Analysis: The text file contains Unicode characters, which lead to the error. I have confirmed this with other Unicode characters, which result in the same error.

How can the Unicode compatibility of the transform services be improved? Switch the font file, or change the WinAnsiEncoding?
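
The analysis above can be checked before a file is sent to the transformer. The following is a minimal sketch, not code from alfresco-transform-core: it assumes that Java's `windows-1252` charset is a close enough stand-in for the PDF WinAnsiEncoding named in the error message (the two are similar but not identical, for example in how control characters are treated), and the class and method names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;

// Hypothetical helper, for illustration only.
final class WinAnsiPrecheck {

    // Assumption: windows-1252 approximates the PDF WinAnsiEncoding.
    private static final Charset CP1252 = Charset.forName("windows-1252");

    /**
     * Returns the index (in UTF-16 code units) of the first code point that
     * windows-1252 cannot encode, or -1 if the whole text is encodable.
     */
    static int firstUnencodable(String text) {
        CharsetEncoder encoder = CP1252.newEncoder();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            int length = Character.charCount(codePoint);
            // Test one whole code point at a time so surrogate pairs stay intact.
            if (!encoder.canEncode(text.subSequence(i, i + length))) {
                return i;
            }
            i += length;
        }
        return -1;
    }
}
```

A result other than -1 suggests that a transformer limited to Helvetica with WinAnsiEncoding is likely to reject the text, as it did for U+76EE in the log above.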

justemu commented 5 years ago

The problem is probably that the font "Helvetica" does not contain Asian Unicode characters. A solution would be to replace "Helvetica" with "FZLTXHJW.TTF".
The font "FZLTXHJW.TTF" is a superset of "Helvetica": it covers both the Western alphabet and Asian characters.

I have uploaded the font file as an attachment. Could somebody help fix the source code?

FZLTXHJW.TTF.zip

ariksidney commented 3 years ago

It's probably linked to this issue I discovered yesterday: https://github.com/Alfresco/alfresco-docker-base-java/issues/60

montgolfiere commented 3 years ago

Thanks @ariksidney

I'm not sure whether the OP is using the Docker image or not. JFYI: T-Core 2.1.0 was based on CentOS 7 (rather than CentOS 8).

https://github.com/Alfresco/alfresco-transform-core/blob/2.1.0/alfresco-docker-transform-misc/Dockerfile

In any case, we just released T-Core 2.5.1, which should include the updated Java base image (11.0.11 / CentOS 8, including the UTF-8 fix on CentOS 8).

hi-ko commented 2 years ago

This all seems to be related somehow: MNT-22398 Transform Services AIO Engine Not Handling CSV Files with umlauts

This time WinAnsiEncoding is the issue.

I tried a workaround that forces LibreOffice to be picked: I overrode the textToPdf transformer in shared/classes/alfresco/extension/transform/pipelines so that it no longer lists pdf as a targetMediaType. Now I get the same exception as described in MNT-22398, this time from the libreoffice transformer:

Caused by: org.alfresco.error.AlfrescoRuntimeException: 03050020 libreoffice returned a 400 status All in One Transformer - U+FEFF ('zerowidthjoiner') is not available in the font Helvetica, encoding: WinAnsiEncoding http://localhost:8090/transform targetExtension=pdf sourceEncoding=UTF-8 sourceMimetype=text/csv sourceExtension=csv targetMimetype=application/pdf
Apr 05 09:37:09 alf-test-72 alfresco[44731]:         at org.alfresco.repo.content.transform.RemoteTransformerClient.request(RemoteTransformerClient.java:193)
Apr 05 09:37:09 alf-test-72 alfresco[44731]:         at org.alfresco.repo.content.transform.RemoteTransformerClient.request(RemoteTransformerClient.java:99)
Apr 05 09:37:09 alf-test-72 alfresco[44731]:         at org.alfresco.repo.content.transform.LocalTransformImpl.transformImpl(LocalTransformImpl.java:193)
Apr 05 09:37:09 alf-test-72 alfresco[44731]:         at org.alfresco.repo.content.transform.AbstractLocalTransform.transformWithDebug(AbstractLocalTransform.java:160)
Apr 05 09:37:09 alf-test-72 alfresco[44731]:         ... 11 more
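
U+FEFF, the code point reported in the exception above, is the Unicode byte order mark, which some tools write at the start of UTF-8 text and CSV files. One possible pre-processing step is to strip it before the content is handed to a transformer. The following is a minimal sketch under that assumption; the thread does not confirm that this resolves the failure, and the class and method names are hypothetical rather than part of Alfresco.

```java
// Hypothetical helper, for illustration only.
final class BomStripper {

    // U+FEFF: the Unicode byte order mark (BOM).
    private static final char BOM = '\uFEFF';

    /** Returns the text without a leading BOM; other input is returned unchanged. */
    static String stripLeadingBom(String text) {
        if (text != null && !text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }
}
```

Only a leading U+FEFF is removed. Characters outside WinAnsiEncoding elsewhere in the file, such as the CJK text in the original report, would still need a font that covers them.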

hi-ko commented 2 years ago

I suggest renaming the ticket to something like: transformation from txt to pdf does not support common encodings

I tested without Docker on Ubuntu 20.04 with the locale en_US.UTF-8.