flyingsaucerproject / flyingsaucer

XML/XHTML and CSS 2.1 renderer in pure Java

CJK characters not rendered from Unicode font #251

Closed: emcintyre-hpe closed this issue 6 months ago

emcintyre-hpe commented 6 months ago

Hello,

It seems that I'm not able to generate a PDF where the HTML content contains mixed western/CJK characters. I'm using a Unicode font (Arial Unicode MS), which contains all of the glyphs, but the resulting PDF does not display the CJK characters. I can't tell if it's dropping them or replacing them with something else.

I have a full example project here: https://github.com/emcintyre-hpe/flying-saucer-cjk. HOWEVER, because it uses a proprietary font, the repo is private. Please let me know which contributors here should have access and I will add them.

The example HTML looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html lang="en" xml:lang="en" xmlns="http://www.w3.org/1999/xhtml">
<head>
  <meta content="text/html"/>
  <style type="text/css">
      body {
          font-family: "ArialUnicodeMS", sans-serif;
      }

      p.cjk {
          font-family: "HeiseiKakuGo-W5-H";
      }
  </style>
  <title>Simple Document</title>
</head>
<body>
<h1>Some Lorem Jeffsum</h1>
<p>Yeah, but John, if The Pirates of the Caribbean breaks down, the pirates don't eat the tourists. God creates
  dinosaurs. God destroys dinosaurs. God creates Man. Man destroys God. Man creates Dinosaurs. I was part of something
  special. Jaguar shark! So tell me - does it really exist?</p>
<p>Checkmate... This thing comes fully loaded. AM/FM radio, reclining bucket seats, and... power windows. Do you have
  any idea how long it takes those cups to decompose. I was part of something special. Eventually, you do plan to have
  dinosaurs on your dinosaur tour, right?</p>
<h2>Some Chinese text from Google Translate</h2>
<p>这东西满载而归。 AM/FM 收音机、斜躺桶形座椅和……电动车窗。 我们必须焚烧雨林、倾倒有毒废物、污染空气、破坏臭氧层!
  因为也许如果我们把这个星球搞得够糟,他们就不再想要它了!</p>
<h2>Some Japanese text from Google Translate</h2>
<p>太った女性のことは忘れてください! あなたは太った女性に夢中です!English text interspersed with Japanese. 私たちをここから追い出してください!
  もっと早く行かなければ...行け、行け、行け、行け、行け! 父がかつて私にこう言いました、笑えば世界もあなたと一緒に笑います、泣きなさい、そうすればこの野郎のことで泣けるようにしてあげますよ!
  最終的には、恐竜ツアーに恐竜を参加させる予定ですよね?</p>
<h2>Some Korean text from Google Translate</h2>
<p>결국에는 공룡 투어에 공룡도 함께 할 계획이시죠? 운이 좋아서 얼음이 없어요. 이게 내 에스프레소 머신인가요? 뭐, 뭐야, 뭐야, 내 에스프레소 머신은 어떻게 샀어? 응, 하지만 존, 캐리비안의 해적이 망하면
  해적들은 관광객을 잡아먹지 않아.</p>
<h2>The Japanese text below uses a dedicated Japanese font</h2>
<p class="cjk">太った女性のことは忘れてください! あなたは太った女性に夢中です! 私たちをここから追い出してください!
  もっと早く行かなければ...行け、行け、行け、行け、行け! 父がかつて私にこう言いました、笑えば世界もあなたと一緒に笑います、泣きなさい、そうすればこの野郎のことで泣けるようにしてあげますよ!
  最終的には、恐竜ツアーに恐竜を参加させる予定ですよね?</p>
</body>
</html>

The transformation code:

import static org.apache.commons.io.IOUtils.resourceToString;

import java.io.IOException;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.xhtmlrenderer.pdf.CJKFontResolver;
import org.xhtmlrenderer.pdf.ITextFontResolver;
import org.xhtmlrenderer.pdf.ITextRenderer;
import org.xhtmlrenderer.simple.xhtml.XhtmlNamespaceHandler;

import com.lowagie.text.pdf.BaseFont;

/**
 * Run with {@code mvn compile exec:java -Dexec.mainClass="CjkExample"}
 */
public class CjkExample {
  public static void main(String[] args) throws IOException {
    // Create renderer
    ITextRenderer pdfRenderer = initRenderer();
    // Create output stream
    try (OutputStream pdf = Files.newOutputStream(Paths.get("target/example.pdf"))) {
      // Read HTML file
      String html = resourceToString("/example.html", StandardCharsets.UTF_8);
      pdfRenderer.setDocumentFromString(html);
      pdfRenderer.layout();
      pdfRenderer.createPDF(pdf, true);
    }
  }

  private static ITextRenderer initRenderer() throws IOException {
    ITextFontResolver fonts = new CJKFontResolver();
    ITextRenderer renderer = new ITextRenderer(fonts);
    fonts.addFont("/fonts/ArialUnicodeMS.ttf", "ArialUnicodeMS", BaseFont.CP1252, true, null);
    renderer.getSharedContext().setNamespaceHandler(new XhtmlNamespaceHandler());
    return renderer;
  }
}

And here is the resulting PDF

asolntsev commented 6 months ago

@emcintyre-hpe Please grant me access to your private repo.

emcintyre-hpe commented 6 months ago

@asolntsev Done!

asolntsev commented 6 months ago

@emcintyre-hpe I am not an expert in CJK fonts, but at least I can confirm two things:

  1. This problem is also reproducible with FlyingSaucer 9.2.2 (the last version before my refactoring of CJK fonts). So it's not a new issue.
  2. I enabled debug logs, and I can see that FlyingSaucer resolves all the requested fonts:
22:15:49:549 [main] DEBUG ITextFontResolver - Resolved font ArialUnicodeMS/bold/normal: Font ArialUnicodeMS/400/640.0
22:15:49:550 [main] DEBUG ITextFontResolver - Resolved font ArialUnicodeMS/bold/normal: Font ArialUnicodeMS/400
22:15:49:550 [main] DEBUG ITextFontResolver - Resolved font ArialUnicodeMS/bold/normal: Font ArialUnicodeMS/400/640.0
22:15:49:555 [main] DEBUG ITextFontResolver - Resolved font ArialUnicodeMS/normal/normal: Font ArialUnicodeMS/400/320.0
22:15:49:555 [main] DEBUG ITextFontResolver - Resolved font ArialUnicodeMS/normal/normal: Font ArialUnicodeMS/400
22:15:49:555 [main] DEBUG ITextFontResolver - Resolved font ArialUnicodeMS/normal/normal: Font ArialUnicodeMS/400/320.0
22:15:49:556 [main] DEBUG ITextFontResolver - Resolved font ArialUnicodeMS/bold/normal: Font ArialUnicodeMS/400
22:15:49:557 [main] DEBUG ITextFontResolver - Resolved font ArialUnicodeMS/bold/normal: Font ArialUnicodeMS/400/480.0
22:15:49:557 [main] DEBUG ITextFontResolver - Resolved font ArialUnicodeMS/bold/normal: Font ArialUnicodeMS/400
22:15:49:557 [main] DEBUG ITextFontResolver - Resolved font ArialUnicodeMS/bold/normal: Font ArialUnicodeMS/400/480.0
22:15:49:560 [main] DEBUG ITextFontResolver - Resolved font HeiseiKakuGo-W5-H/normal/normal: Font HeiseiKakuGo-W5/400/320.0
22:15:49:560 [main] DEBUG ITextFontResolver - Resolved font HeiseiKakuGo-W5-H/normal/normal: Font HeiseiKakuGo-W5/400
22:15:49:560 [main] DEBUG ITextFontResolver - Resolved font HeiseiKakuGo-W5-H/normal/normal: Font HeiseiKakuGo-W5/400/320.0

So I assume it's some problem with your HTML or font.

emcintyre-hpe commented 6 months ago

Thanks for looking into it @asolntsev. I've tried everything I can think of, including changing the CJK characters to HTML entities, but it always comes out the same. As for the font, I'm 99.9% certain it contains at least some of the characters I'm trying to render. So I really can't pinpoint where the characters are getting lost or mis-translated.
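
Note for later readers: that the entities made no difference is probably to be expected, since an XML parser resolves numeric character references to the same code points as the literal characters before any font is consulted. The JDK-only sketch below illustrates this with the first three Chinese characters after "AM/FM" in the example (U+6536, U+97F3, U+673A). It does not use Flying Saucer, and the class EntityCheck and its method are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.xml.sax.InputSource;

// Hypothetical helper for illustration; not part of Flying Saucer.
final class EntityCheck {
  /** Parses a small XML snippet and returns the text of its root element. */
  static String textOf(String xml) {
    try {
      return DocumentBuilderFactory.newInstance()
          .newDocumentBuilder()
          .parse(new InputSource(new StringReader(xml)))
          .getDocumentElement()
          .getTextContent();
    } catch (Exception e) {
      // Wrap checked parser exceptions so callers stay simple.
      throw new IllegalStateException("Could not parse: " + xml, e);
    }
  }

  public static void main(String[] args) {
    // The same three characters, once as literals (written as Java
    // escapes to keep this source file ASCII) and once as XML entities.
    String literal = textOf("<p>\u6536\u97F3\u673A</p>");
    String entities = textOf("<p>&#x6536;&#x97F3;&#x673A;</p>");
    System.out.println(literal.equals(entities)); // prints true
  }
}
```

Either way the renderer receives identical text, so the loss has to happen later, when the text is mapped onto the font.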

emcintyre-hpe commented 6 months ago

Oh my gosh, I swear I tried this already and it had no effect! The encoding of the font has to be BaseFont.IDENTITY_H:

    fonts.addFont("/fonts/ArialUnicodeMS.ttf", "ArialUnicodeMS", BaseFont.IDENTITY_H, true, null);

With that set, the characters are correctly rendered. 🤦
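
Note for later readers: a plausible explanation, not confirmed in this thread, is that BaseFont.CP1252 names the single-byte Windows Western European code page, which has no positions for CJK characters, whereas BaseFont.IDENTITY_H is described in the BaseFont documentation as the Unicode encoding with horizontal writing. The JDK-only sketch below illustrates the first half of that claim by listing which characters windows-1252 cannot encode. It does not touch Flying Saucer, and the class Cp1252Check and its method are made up for illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.util.StringJoiner;

// Hypothetical helper for illustration; not part of Flying Saucer or OpenPDF.
final class Cp1252Check {
  /** Returns the characters windows-1252 cannot encode, as "U+XXXX" tokens. */
  static String unencodable(String text) {
    CharsetEncoder encoder = Charset.forName("windows-1252").newEncoder();
    StringJoiner missing = new StringJoiner(" ");
    for (char c : text.toCharArray()) {
      if (!encoder.canEncode(c)) {
        missing.add(String.format("U+%04X", (int) c));
      }
    }
    return missing.toString();
  }

  public static void main(String[] args) {
    // Western text fits into the single-byte code page.
    System.out.println(unencodable("AM/FM radio")); // prints an empty line
    // The three Chinese characters after "AM/FM" in the example do not
    // (written as Java escapes to keep this source file ASCII).
    System.out.println(unencodable("AM/FM \u6536\u97F3\u673A")); // prints U+6536 U+97F3 U+673A
  }
}
```

This would also be consistent with the debug log above: the font itself resolves fine, and only the encoding chosen at registration limits which characters can reach it.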

@asolntsev, please let me know if it would be appropriate to add something about this to the documentation or examples. I'd be happy to contribute a PR.

asolntsev commented 6 months ago

Thank you for sharing! Yes, we could describe it in documentation or Readme.

Or we could even create a test showing how to use a CJK font to generate a PDF.