elabftw / elabftw

:notebook: eLabFTW is the most popular open source electronic lab notebook for research labs.
https://www.elabftw.net
GNU Affero General Public License v3.0

CJK characters are printed as boxes in the main text in pdf files #3418

Closed whooooong closed 1 year ago

whooooong commented 2 years ago

Describe the bug

In the generated PDF files, Chinese characters are printed as boxes (▯) in the main text, while those in the titles and in the names of linked items are printed correctly. An example as shown on the web page: (screenshot)

The same example as shown in the PDF: (screenshot)

Steps to reproduce

Just put the following Chinese characters in the title and main text, and then try to make a pdf: 试试这几个字能否正确显示

Information

NicolasCARPi commented 2 years ago

Hello, did you check the box "Enable Chinese, Japanese and Korean fonts in PDF generation" in your user control panel under the "PDF options" part?

whooooong commented 2 years ago

Yes, I did. (screenshots)

NicolasCARPi commented 2 years ago

It seems to work in dev and in version 4.2.4. Did you try different PDF readers? I'm using Firefox to open the PDF (screenshot: 2022-03-24-100226_923x451_scrot).

whooooong commented 2 years ago

I tried PDF-XChange Viewer, firefox, foxit pdf reader and Adobe Acrobat Reader DC. All of these pdf readers show the characters in main text as boxes.
The calibre e-book viewer can show the characters correctly.
As the characters of other regions on the same pdf page are shown correctly, it is probably not a problem with the pdf reader.

NicolasCARPi commented 2 years ago

Check the box for PDF/A so the font is embedded in the PDF. Then it should display correctly everywhere.

whooooong commented 2 years ago

I also tried that, with the same result: the characters are still not shown correctly.

Maybe the main text part is not using the correct font.

In the pdf file, the font for the title part is: <</Type /FontDescriptor /FontName /MPDFAA+Sun-ExtA

but the main text part is: <</Type /FontDescriptor /FontName /MPDFAA+DejaVuSansCondensed

NicolasCARPi commented 2 years ago

But in my example it is in the main part (body), not title. I'm attaching the pdf so you can try to open it, too.

with the dev version: 2022-03-24 - cjk-test.pdf

with 4.2.4: 4.2.4.pdf

Also you said it works with calibre, so the way I see it the pdf is alright, but the software you use doesn't have the correct fonts accessible (is the full operating system in chinese?). This would not be a problem with PDF/A, as the font is embedded. But maybe https://github.com/elabftw/elabftw/issues/3211 will help in this issue.

How do you see which part has which font? It is true that what you describe would explain the issue, but there is no reason for a different font for title and body, as the font family is applied to the whole document.

whooooong commented 2 years ago

I found DejaVuSansCondensed in the file /elabftw/vendor/mpdf/mpdf/src/Config/FontVariables.php. The characters are shown correctly after DejaVuSansCondensed is replaced by Sun-ExtA:

sed -i s/DejaVuSansCondensed.ttf/Sun-ExtA.ttf/g FontVariables.php
sed -i s/DejaVuSansCondensed-Bold.ttf/Sun-ExtA.ttf/g FontVariables.php
sed -i s/DejaVuSansCondensed-Oblique.ttf/Sun-ExtA.ttf/g FontVariables.php
sed -i s/DejaVuSansCondensed-BoldOblique.ttf/Sun-ExtA.ttf/g FontVariables.php

This cannot be reproduced. These changes may cause the PDF generation to get stuck, with the web page showing nothing.

whooooong commented 2 years ago

Open a pdf file with notepad++, we can see the font information.
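
The same check can be done from a shell. This is a minimal sketch, assuming the font descriptors are stored uncompressed in the PDF (as in the excerpts quoted above) and that grep supports the -a and -o options. list_pdf_fonts is a hypothetical helper name, not part of eLabFTW or mPDF.

```shell
# List the distinct font names found in a PDF's font descriptors.
# Assumption: the descriptors are plain text inside the file, not in a compressed stream.
list_pdf_fonts() {
  # -a treats the binary PDF as text; -o prints only the matching part of each line.
  grep -a -o '/FontName /[A-Za-z0-9+_-]*' "$1" \
    | sed 's|^/FontName /||' \
    | LC_ALL=C sort -u
}
```

Running it on an exported file, for example `list_pdf_fonts export.pdf` (file name made up), prints one font name per line. A DejaVuSansCondensed entry next to Sun-ExtA would point to the kind of mismatch described earlier in the thread.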

NicolasCARPi commented 2 years ago

> Open a pdf file with notepad++, we can see the font information.

Yeah, that's what I did with vim but I guess I mistyped the search string. Now doing it again, I only have one <</Type /FontDescriptor /FontName /MPDFAA+Sun-ExtA and no DejaVu anywhere.

whooooong commented 2 years ago

Most probably, the font-family styles from the editor cause this problem. These styles, if any, override the font-family set in the template. In View > Source code, I found that there were font-family styles. With the font-family style removed, the characters are shown correctly in the PDF. Alternatively, putting sun-exta in the style also works.

Adding the line

$body = preg_replace('#font-family:[^;]+?;#i', '', $body);

before line 259,

return str_replace('src="app/download.php?f=', 'src="' . dirname(__DIR__, 2) . '/uploads/', $body);

in the file /elabftw/src/services/MakePdf.php gets it working as well.
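
To try the pattern on a piece of HTML outside of eLabFTW, a rough shell equivalent can help. This is a sketch, not the MakePdf.php code: strip_font_family is a made-up name, the I flag assumes GNU sed, and sed works line by line, so a declaration split across two lines would be missed.

```shell
# Remove every inline font-family declaration from the HTML read on stdin.
# [^;]+; matches the same text as the lazy [^;]+?; in the PHP pattern,
# because [^;] can never run past the terminating semicolon.
strip_font_family() {
  # g: all occurrences on a line; I: case-insensitive match (GNU sed extension).
  sed -E 's/font-family:[^;]+;//Ig'
}
```

For instance, piping a span that carries `font-family: arial, sans-serif; color: red;` through strip_font_family keeps the color declaration and drops only the font-family one.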

NicolasCARPi commented 2 years ago

I don't quite like this approach, as generating big zip archives, and thus a lot of PDFs, is already a big resource hog, and I'm afraid that adding such a regex will impact performance too much (of course this would need to be tested!).

Another approach would be to disable the Fonts menu in the editor. What are your thoughts on this option?

whooooong commented 2 years ago

Adding such regex is not a good idea, as it removes all font-family settings from the whole body part. The CJK users may also want to keep other font-family settings for non-CJK characters.

Another approach would be to add Sun-ExtA to the Fonts menu, or to include it in the font-family style of all lines for CJK users. It works even if it is the last one in the font-family style. A check box could be added for choosing a default font to be included as the last one. For non-CJK users, if sun-exta were the last one in the font-family style, it would not be embedded in the PDF, as the fonts ahead of it are enough to support all the letters.
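
As a rough illustration of the fallback idea, the sketch below appends sun-exta to the end of each inline font-family declaration. It assumes GNU sed and simple declarations without embedded semicolons or double quotes (so no &quot; entities inside the font list). add_cjk_fallback is a made-up function name, and nothing here reflects how eLabFTW itself handles fonts.

```shell
# Append sun-exta as the last fallback font of each inline font-family declaration on stdin.
add_cjk_fallback() {
  # Lines that already mention sun-exta (in any case) are passed through unchanged.
  # On other lines, the font list up to the next ';' or '"' gets ", sun-exta" appended.
  sed -E '/sun-exta/I!s/(font-family:[^;"]+)/\1, sun-exta/Ig'
}
```

With this, `font-family: arial, sans-serif;` becomes `font-family: arial, sans-serif, sun-exta;`, so the fonts chosen in the editor keep priority and the fallback only comes into play for characters they cannot render.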

Even without checking PDF/A, the fonts are embedded, but only the subset of each font that is actually used. The file size can therefore be kept small with Sun-ExtA, as only a small subset of its characters is frequently used. A check box could be added to let users decide whether to embed the full character set, if that is not already the whole point of PDF/A.

NicolasCARPi commented 2 years ago

Yes, adding Sun-ExtA as a fallback font could also be a valid approach.

whooooong commented 2 years ago

Sounds good. Many thanks!

NicolasCARPi commented 1 year ago

I'm going to close this. The solution is simply to not use a custom font for letters that are CJK characters.