Arabic and hebrew texts not supporting

Srimathi-Thirumoorthy commented 5 months ago

mohamnag commented 4 months ago

what kind of help is required here? I'm a native Farsi speaker who can help change code and verify the results, but I'm not really sure which piece of code needs to change. I took a short look at the latest version and couldn't spot the place where an element with unicode text gets drawn.

mohamnag commented 4 months ago

FYI, I tracked it down to the method com.lowagie.text.pdf.BaseFont#convertToBytes(java.lang.String). it looks like the encoding is always set to Cp1252, which I would not expect to render any non-latin chars. maybe properly setting the charset there (I don't know how) will fix the issue, possibly together with using a font that actually has the needed characters.
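to illustrate why Cp1252 can't work here, this is a minimal JDK-only sketch. it does not call Flying Saucer or OpenPDF at all, and Cp1252Check is just a made-up class name for this example. it only shows that the charset itself has no mapping for Farsi/Arabic letters, so any code path that encodes text this way will lose them:

```java
import java.nio.charset.Charset;

// Illustrative only: Cp1252Check is a hypothetical helper, not part of Flying Saucer or OpenPDF.
final class Cp1252Check {
    private static final Charset CP1252 = Charset.forName("windows-1252");

    // True if every character of the text has a mapping in Cp1252.
    static boolean fitsCp1252(String text) {
        return CP1252.newEncoder().canEncode(text);
    }

    // Encodes to Cp1252 bytes and decodes again; unmappable characters come back as '?'.
    static String roundTrip(String text) {
        return new String(text.getBytes(CP1252), CP1252);
    }

    public static void main(String[] args) {
        String farsi = "\u062A\u0633\u062A"; // the word "تست"
        System.out.println(fitsCp1252("test")); // latin text is fine
        System.out.println(fitsCp1252(farsi));  // Farsi letters are not representable
        System.out.println(roundTrip(farsi));   // the letters are replaced during encoding
    }
}
```

so as long as the text goes through a Cp1252 conversion, the original letters are gone before any font is even consulted.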

asolntsev commented 4 months ago

@mohamnag Hi. Wow, thank you for debugging this problem with fonts. Yes, now I see: FS always uses encoding winansi (which I guess means Cp1252). I don't know why, but it was used from the very beginning 01.02.2006 :)

I think we can change this encoding. Can you provide a simple example of such html and font, so we could add this example to FS tests?

mohamnag commented 4 months ago

well I went on and used a custom font where I can set the encoding. the result was unfortunately still problematic.

lets take this sample HTML:

<html lang="fa">
<head>
    <meta charset="UTF-8"/>
    <title>Title</title>
    <style>
        .rtl-font {
            font-family: Vazirmatn;
            direction: rtl;
        }
    </style>
</head>
<body>
<div style="background-color: blue">
    تست فارسی
</div>
<div class="rtl-font" style="background-color: green">
    تست فارسی
</div>
<div dir="rtl" style="background-color: red; font-family: Vazirmatn">
    تست فارسی
</div>
</body>
</html>

I have the font (available for free from https://github.com/rastikerdar/vazirmatn/releases/tag/v33.003) unzipped into the resources directory, and this is my Java code:

        try (OutputStream outputStream = new FileOutputStream("build/pdf/method4.pdf")) {
            // parse and improve HTML
            Document document = Jsoup.parse(new File(inputHtml.getFile()), "UTF-8");
            document.outputSettings().syntax(Document.OutputSettings.Syntax.xml);
            var htmlString = document.html();

            // initialize Flying Saucer
            ITextRenderer renderer = new ITextRenderer();
            SharedContext sharedContext = renderer.getSharedContext();
            sharedContext.setPrint(true);
            sharedContext.setInteractive(false);

            renderer
                    .getFontResolver()
                    .addFont(
                            Main.class.getClassLoader().getResource("Vazirmatn/ttf/Vazirmatn-Regular.ttf").toString(),
                            BaseFont.IDENTITY_H,
                            true
                    );

            renderer.setDocumentFromString(htmlString);

            renderer.layout();
            renderer.createPDF(outputStream);
            // relative resources: see https://www.baeldung.com/java-html-to-pdf#dependencies-4
        }

now this is the output that FS is giving me:

and this is what a browser gives me (ignoring the font not being applied):

there are two problems here:

1. the connection between letters: Farsi/Arabic letters get connected and change shape based on their position and neighbouring letters. this is not handled, so every letter is drawn in its isolated form.
2. the RTL orientation is not applied. the first letter ت should be positioned rightmost but ends up leftmost.
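both problems can be made visible with plain JDK classes. the sketch below is only an illustration and not what FS or OpenPDF do internally; RtlShapingDemo is a made-up class name. java.text.Bidi reports that the sample text is purely right-to-left, and Unicode has separate presentation-form code points for the isolated, final, initial and medial shapes of ت, which NFKC normalization folds back into the single base letter:

```java
import java.text.Bidi;
import java.text.Normalizer;

// Illustrative only: RtlShapingDemo is a hypothetical helper, not part of Flying Saucer or OpenPDF.
final class RtlShapingDemo {

    // True if the whole text resolves to right-to-left under the Unicode bidi algorithm.
    static boolean isPureRtl(String text) {
        Bidi bidi = new Bidi(text, Bidi.DIRECTION_DEFAULT_LEFT_TO_RIGHT);
        return bidi.isRightToLeft();
    }

    // Folds Arabic presentation forms (positional shapes) back to their base letters.
    static String toBaseLetters(String presentationForms) {
        return Normalizer.normalize(presentationForms, Normalizer.Form.NFKC);
    }

    public static void main(String[] args) {
        // "تست فارسی" written with escapes to avoid source-encoding issues
        String sample = "\u062A\u0633\u062A \u0641\u0627\u0631\u0633\u06CC";
        System.out.println(isPureRtl(sample)); // layout has to start from the right edge
        System.out.println(isPureRtl("test")); // latin text is left-to-right

        // isolated, final, initial and medial shapes of the letter TEH (ت)
        String shapes = "\uFE95\uFE96\uFE97\uFE98";
        System.out.println(toBaseLetters(shapes).equals("\u062A\u062A\u062A\u062A"));
    }
}
```

a renderer has to do the opposite of toBaseLetters: pick the right positional shape for each base letter (normally through the font's shaping tables) and then place the glyphs from right to left.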

in general I would first solve this problem with a custom font (which for sure has all the chars) and then maybe look into fixing the charset for the default font.

mohamnag commented 4 months ago

btw, you have probably seen this example of RTL rendering using OpenPDF, but I just want to mention it: https://github.com/LibrePDF/OpenPDF/blob/master/pdf-toolbox/src/test/java/com/lowagie/examples/fonts/styles/RightToLeft.java

I don't know if this is different than what FS is doing under the hood when working with OpenPDF but I couldn't find any of those methods being called.

mohamnag commented 4 months ago

I also found this post: https://groups.google.com/g/flying-saucer-users/c/n0CfuYfpQ6I/m/3iJIaZ4IAAAJ and a whole thread there that is related to this ticket.

flyingsaucerproject / flyingsaucer

Arabic and hebrew texts not supporting #270