page.drawText() inserts spaces when using Thai font

robin-dunn commented 2 years ago

What were you trying to do?

I am trying to use the page.drawText() function to render text in the Thai language

Why were you trying to do this?

To build an application that creates PDF files containing text written in the Thai language

How did you attempt to do it?

The steps I followed are:

Download Google Noto Sans Thai font
Embed the font in the pdf-lib PDF document
Invoke the page.drawText() function passing in the text in Thai

See code example provided in reproduction steps section below.

What actually happened?

The PDF file was created successfully, but large spaces appear to have been inserted into the Thai text in the PDF.

I've copied the text from the PDF and pasted it below; notice the strange block characters that have been inserted.

แห่งได้เป􏰀ดขึ􏰁นแล้วในการขยายรถไฟใต้ดินลอนดอนครั􏰁งใหญ่ครั􏰁งแรกในศตวรรษนี

Those strange characters appear visually as large blank spaces in the PDF, e.g. like this:

แห่งได้เป ดขึ นแล้วในการขยายรถไฟใต้ดินลอนดอนครั งใหญ่ครั งแรกในศตวรรษนี

What did you expect to happen?

I expected the Thai text to be rendered as one continuous string without any strange characters or spaces inserted:

แห่งได้เปดขึนแล้วในการขยายรถไฟใต้ดินลอนดอนครังใหญ่ครังแรกในศตวรรษนี

How can we reproduce the issue?

Create a Node JS project folder e.g. called 'pdf-test'
cd pdf-test
npm init -y
npm i pdf-lib
npm i @pdf-lib/fontkit
Download Noto Sans Thai font from https://fonts.google.com/download?family=Noto%20Sans%20Thai
Unzip the font, copy the TTF file from Noto_Sans_Thai/static/NotoSansThai/NotoSansThai-Regular.ttf, and paste it into the project folder pdf-test so it can be loaded by the index.js script below
Create a file called index.js and paste the code from below
Run the index.js file using the command node index.js which will create the PDF file containing some Thai text
Use a PDF viewer/browser e.g. Google Chrome to view the rendered PDF
Notice the spacing between some of the Thai text

const fs = require('fs');
const path = require('path');
const { PDFDocument, rgb } = require('pdf-lib');
const fontkit = require('@pdf-lib/fontkit');

(async function run() {

    const pdfDoc = await PDFDocument.create()
    pdfDoc.registerFontkit(fontkit)

    // Font downloaded from https://fonts.google.com/download?family=Noto%20Sans%20Thai
    // See also https://fonts.google.com/noto/specimen/Noto+Sans+Thai?query=thai
    const thaiFontBytes = fs.readFileSync(path.join(__dirname, './NotoSansThai-Regular.ttf'))

    const thaiFont = await pdfDoc.embedFont(thaiFontBytes)
    const page = pdfDoc.addPage()
    const { width, height } = page.getSize()

    const fontSize = 11
    page.drawText('แห่งได้เปิดขึ้นแล้วในการขยายรถไฟใต้ดินลอนดอนครั้งใหญ่ครั้งแรกในศตวรรษนี้', {
        x: 50,
        y: height - 2 * fontSize,
        size: fontSize,
        font: thaiFont,
        color: rgb(0, 0.53, 0.71),
    })

    const pdfBytes = await pdfDoc.save()
    // Write synchronously so a failed write throws instead of being reported as saved
    fs.writeFileSync('thai-test.pdf', pdfBytes)
    console.log('PDF file saved.')
})()

Version

1.16.0

What environment are you running pdf-lib in?

Node

Required Reading

[X] I have read www.sscce.org.
[X] My report includes a Short, Self Contained, Correct (Compilable) Example.
[X] I have read Smart Questions.
[X] I have read 45 GitHub Issues Dos and Don'ts.

Additional Notes

No response

hlab-pawat commented 2 years ago

I am also facing this problem. I guess the bug is in the UnicodeLayoutEngine class in the @pdf-lib/fontkit library.

chacal88 commented 2 years ago

Same for me, with many fonts.

pfmartins commented 2 years ago

Hey, I see the same issue here. When I write text to a document using fonts from the Google Fonts API, extra spaces " " are sometimes added to my text.

I'm looking for light 💡

cassilup commented 2 years ago

@tudor-sandu, is this the issue you guys are experiencing?

akomm commented 2 years ago

Same here with Helvetica Neue Roman and Helvetica Neue Condensed. It inserts spaces, for example after the sequence fi, but not after i or f on its own. For example, Backoffice becomes Backoffi ce and fifi becomes fi fi.

MetheeS commented 2 years ago

(For Thai fonts) the issue can be resolved by calling .embedFont(fontBytes, { subset: true }). I don't know why this helps.
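
As a rough illustration of the workaround above, the sketch below wraps the embedFont call with subsetting enabled. The helper name embedThaiFontSubset is made up for this example and is not part of pdf-lib. It assumes pdfDoc is a PDFDocument on which registerFontkit(fontkit) has already been called, as in the reproduction script in the first post. Why subsetting avoids the extra spaces is not established in this thread.

```javascript
// Hypothetical helper (not part of pdf-lib): embed a custom font with
// subsetting enabled, as suggested in the comment above.
// Assumes `pdfDoc` is a pdf-lib PDFDocument with fontkit already registered.
function embedThaiFontSubset(pdfDoc, fontBytes) {
  // Pass the `subset: true` option to embedFont; returns a Promise for the embedded font.
  return pdfDoc.embedFont(fontBytes, { subset: true });
}

// Usage in the reproduction script, replacing the plain embedFont call:
//   const thaiFont = await embedThaiFontSubset(pdfDoc, thaiFontBytes);
```

The rest of the reproduction script stays unchanged; only the options passed to embedFont differ.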

akomm commented 2 years ago

The effect in the first post comes from extra bytes being added to the text that fall outside the valid range for the character set. In a PDF, if there is no character for that byte sequence (UTF-8 is a variable-length, multi-byte encoding), the reader renders it as a space. When you copy the text, however, the actual data including the added bytes is copied. If you paste it into a program that renders invalid or non-printable characters as placeholder glyphs (the squares in the first post), the data is displayed as hex (for example 10F0C1) instead of as a space.

Also, neither the examples above nor my case look like the font simply lacks a proper glyph for a character.

I also ruled out non-printable bytes being present in the source beforehand. They are added when the PDF is rendered.

https://unicode-table.com/en/search/?q=10F0C1

https://www.unicode.org/charts/PDF/U100000.pdf Quote:

The Supplementary Private Use Area-B block encompasses the entire range of Plane 16. The range U+100000..U+10FFFD is entirely designated for private use. The last two code points on the plane, U+10FFFE..U+10FFFF, are designated noncharacters. Consequently, no character code charts or names lists are provided for the majority of this block, except that a chart and names list are provided for the last 128 code points, to show the location of the noncharacters
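
To check whether text copied out of a generated PDF contains such private-use code points, something like the following sketch could be used. The function name findPrivateUseCodePoints is invented for this example. The ranges checked are the three Unicode Private Use Areas (U+E000..U+F8FF, U+F0000..U+FFFFD, U+100000..U+10FFFD), and it is only a diagnostic aid, not a fix.

```javascript
// Hypothetical diagnostic helper: report every Private Use Area code point
// in a string, with its UTF-16 offset and its value as uppercase hex.
function findPrivateUseCodePoints(text) {
  const found = [];
  let index = 0; // offset in UTF-16 code units
  // for...of iterates by code point, so surrogate pairs stay together
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    const isPrivateUse =
      (cp >= 0xe000 && cp <= 0xf8ff) ||     // Private Use Area (BMP)
      (cp >= 0xf0000 && cp <= 0xffffd) ||   // Supplementary Private Use Area-A
      (cp >= 0x100000 && cp <= 0x10fffd);   // Supplementary Private Use Area-B
    if (isPrivateUse) {
      found.push({ index, hex: cp.toString(16).toUpperCase() });
    }
    index += ch.length;
  }
  return found;
}

// Usage: paste the text copied from the PDF into a string and inspect the result.
//   console.log(findPrivateUseCodePoints(copiedText));
```

An empty array means the pasted text contains no private-use code points; any entries point at the positions where extra characters were inserted.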

ponnreay commented 2 years ago

(For Thai fonts) the issue can be resolved by calling .embedFont(fontBytes, { subset: true }). I don't know why this helps.

This solution also works for Khmer fonts.

AgileEduLabs commented 2 years ago

@akomm

Same here with Helvetica Neue Roman and Helvetica Neue Condensed. It inserts spaces, for example after the sequence fi, but not after i or f on its own. For example, Backoffice becomes Backoffi ce and fifi becomes fi fi.

Try the following: await pdfDoc.embedFont(YOURFONT, { features: { liga: false } });

It definitely is a bug and in my opinion is an issue that should be fixed: https://github.com/Hopding/pdf-lib/issues/490
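
In a script, the suggestion above might be wrapped as in the sketch below. The helper name embedFontWithoutLigatures is made up for this example and is not part of pdf-lib; it assumes pdfDoc is a PDFDocument with fontkit already registered and fontBytes holds the font file contents.

```javascript
// Hypothetical helper (not part of pdf-lib): embed a custom font with the
// standard-ligature feature ("liga") turned off, so sequences such as "fi"
// are not replaced by a ligature glyph during layout.
// Assumes `pdfDoc` is a pdf-lib PDFDocument with fontkit already registered.
function embedFontWithoutLigatures(pdfDoc, fontBytes) {
  return pdfDoc.embedFont(fontBytes, { features: { liga: false } });
}

// Usage:
//   const font = await embedFontWithoutLigatures(pdfDoc, fontBytes);
//   page.drawText('Backoffice', { font, size: 11 });
```

This only addresses the ligature case described for the Helvetica Neue fonts; the Thai case in the first post was reported to respond to the subset option instead.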

c-sanchez-fd commented 7 months ago

(For Thai fonts) the issue can be resolved by calling .embedFont(fontBytes, { subset: true }). I don't know why this helps.

This solution also works for Calibri fonts

xetadeveloper commented 1 month ago

(For Thai fonts) the issue can be resolved by calling .embedFont(fontBytes, { subset: true }). I don't know why this helps.

This also worked for Noto Sans Thai. I'm not sure why it works either, but I'll look into it.
