Hopding / pdf-lib

Create and modify PDF documents in any JavaScript environment
https://pdf-lib.js.org
MIT License

page.drawText() inserts spaces when using Thai font #1010

Open robin-dunn opened 2 years ago

robin-dunn commented 2 years ago

What were you trying to do?

I am trying to use the page.drawText() function to render text in the Thai language

Why were you trying to do this?

To build an application that creates PDF files containing text written in the Thai language

How did you attempt to do it?

The steps I followed are:

See code example provided in reproduction steps section below.

What actually happened?

The PDF file was successfully created but it seems some large spaces have been inserted into the Thai text in the PDF.

I've copied the text from the PDF and pasted it below. Notice the strange block characters that have been inserted.

แห่งได้เป􏰀ดขึ􏰁นแล้วในการขยายรถไฟใต้ดินลอนดอนครั􏰁งใหญ่ครั􏰁งแรกในศตวรรษนี

Those strange characters appear visually as large blank spaces in the PDF, e.g. like this:

แห่งได้เป ดขึ นแล้วในการขยายรถไฟใต้ดินลอนดอนครั งใหญ่ครั งแรกในศตวรรษนี

What did you expect to happen?

I expected the Thai text to be rendered as one continuous string without any strange characters or spaces inserted:

แห่งได้เปดขึนแล้วในการขยายรถไฟใต้ดินลอนดอนครังใหญ่ครังแรกในศตวรรษนี

How can we reproduce the issue?

const fs = require('fs');
const path = require('path');
const { PDFDocument, rgb } = require('pdf-lib');
const fontkit = require('@pdf-lib/fontkit');

(async function run() {

    const pdfDoc = await PDFDocument.create()
    pdfDoc.registerFontkit(fontkit)

    // Font downloaded from https://fonts.google.com/download?family=Noto%20Sans%20Thai
    // See also https://fonts.google.com/noto/specimen/Noto+Sans+Thai?query=thai
    const thaiFontBytes = fs.readFileSync(path.join(__dirname, './NotoSansThai-Regular.ttf'))

    const thaiFont = await pdfDoc.embedFont(thaiFontBytes)
    const page = pdfDoc.addPage()
    const { width, height } = page.getSize()

    const fontSize = 11
    page.drawText('แห่งได้เปิดขึ้นแล้วในการขยายรถไฟใต้ดินลอนดอนครั้งใหญ่ครั้งแรกในศตวรรษนี้', {
        x: 50,
        y: height - 2 * fontSize,
        size: fontSize,
        font: thaiFont,
        color: rgb(0, 0.53, 0.71),
    })

    const pdfBytes = await pdfDoc.save()
    fs.writeFile('thai-test.pdf', pdfBytes, () => console.log('PDF file saved.'))
})()

Version

1.16.0

What environment are you running pdf-lib in?

Node

Additional Notes

No response

hlab-pawat commented 2 years ago

I also face this problem. I guess the bug is in the UnicodeLayoutEngine class in the @pdf-lib/fontkit library.

chacal88 commented 2 years ago

Same for me, with many fonts.

pfmartins commented 2 years ago

Hey, I see the same issue here. When I write to a document using fonts from the Google API, a space " " is sometimes added to my text. Like this: (image)

I'm looking for light 💡

cassilup commented 2 years ago

@tudor-sandu, is this the issue you guys are experiencing?

akomm commented 2 years ago

Same here with Helvetica Neue Roman and Helvetica Neue Condensed. It inserts spaces, for example after the sequence fi, but not after i or f by itself. For example, Backoffice becomes Backoffi ce and fifi becomes fi fi.

MetheeS commented 2 years ago

(For Thai fonts) the issue can be resolved by using .embedFont(fontBytes, { subset: true }). I don't know why this helps.
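
For anyone who wants to try this, here is a minimal sketch of the workaround. It assumes pdfDoc is a pdf-lib PDFDocument with fontkit already registered, as in the reproduction code. The helper name embedSubsetFont is made up for illustration. This reflects what was reported to work in this thread, not a confirmed fix for the underlying bug.

```javascript
// Workaround sketch based on the comment above, not a confirmed fix.
// `embedSubsetFont` is a hypothetical helper name, not part of pdf-lib.
// `pdfDoc` is expected to be a pdf-lib PDFDocument with fontkit registered;
// `fontBytes` holds the contents of the font file (e.g. a .ttf).
function embedSubsetFont(pdfDoc, fontBytes) {
  // `subset: true` asks pdf-lib to embed only a subset of the font
  // instead of the whole font file.
  return pdfDoc.embedFont(fontBytes, { subset: true });
}
```

In the reproduction code, that would mean replacing the embedFont line with: const thaiFont = await embedSubsetFont(pdfDoc, thaiFontBytes)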

akomm commented 2 years ago

The effect in the first post is that some bytes are added to the text which fall outside the valid range for the character set. In a PDF, if there is no character for that byte sequence (UTF-8 is multi-byte with variable length), a reader renders it as a space. When you copy the text, however, the actual data including the added bytes is copied. If you then paste it into a program that renders invalid/non-printable "chars" as placeholder "glyphs" (the squares in the first post), it displays the data as hex (for example 10F0C1) instead of rendering a space.

Also, neither the examples above nor my case look like the font simply lacks a proper glyph for a character.

I also ruled out that there were non-printable bytes in the source beforehand. They are being added when the PDF is rendered.

https://unicode-table.com/en/search/?q=10F0C1

https://www.unicode.org/charts/PDF/U100000.pdf Quote:

The Supplementary Private Use Area-B block encompasses the entire range of Plane 16. The range U+100000..U+10FFFD is entirely designated for private use. The last two code points on the plane, U+10FFFE..U+10FFFF, are designated noncharacters. Consequently, no character code charts or names lists are provided for the majority of this block, except that a chart and names list are provided for the last 128 code points, to show the location of the noncharacters.

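
To check whether text copied out of a generated PDF contains such private-use code points, something like the following could be used. It is only a sketch: the function name and the shape of the returned objects are made up for illustration, and it inspects an already extracted string, not the PDF itself.

```javascript
// Sketch: list the code points in a string that fall in Unicode private-use ranges.
// `findPrivateUseCodePoints` is a hypothetical helper, not part of pdf-lib.
const PRIVATE_USE_RANGES = [
  [0xe000, 0xf8ff],     // Private Use Area (Basic Multilingual Plane)
  [0xf0000, 0xffffd],   // Supplementary Private Use Area-A (Plane 15)
  [0x100000, 0x10fffd], // Supplementary Private Use Area-B (Plane 16)
];

function findPrivateUseCodePoints(text) {
  const hits = [];
  let index = 0; // offset in UTF-16 code units
  // for...of iterates by code point, so surrogate pairs stay together.
  for (const ch of text) {
    const cp = ch.codePointAt(0);
    if (PRIVATE_USE_RANGES.some(([lo, hi]) => cp >= lo && cp <= hi)) {
      hits.push({ index, hex: cp.toString(16).toUpperCase() });
    }
    index += ch.length;
  }
  return hits;
}

// Example with the code point mentioned above.
const sample = 'ab' + String.fromCodePoint(0x10f0c1) + 'c';
console.log(findPrivateUseCodePoints(sample)); // [ { index: 2, hex: '10F0C1' } ]
```

An empty result for the source string and a non-empty one for the text copied from the PDF would support the observation that the extra code points are introduced while the PDF is generated.
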
ponnreay commented 2 years ago

(For Thai fonts) the issue can be resolved by using .embedFont(fontBytes, { subset: true }). I don't know why this helps.

This solution also works for Khmer fonts.

AgileEduLabs commented 2 years ago

@akomm

Same here with Helvetica Neue Roman and Helvetica Neue Condensed. It inserts spaces, for example after the sequence fi, but not after i or f by itself. For example, Backoffice becomes Backoffi ce and fifi becomes fi fi.

Try the following: await pdfDoc.embedFont(YOURFONT, { features: { liga: false } });

It definitely is a bug and in my opinion is an issue that should be fixed: https://github.com/Hopding/pdf-lib/issues/490
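
As a small sketch of that suggestion, the call can be wrapped like this. It assumes pdfDoc is a pdf-lib PDFDocument with fontkit registered. The helper name embedFontWithoutLigatures is hypothetical. Whether disabling ligatures removes the extra space for a given font is something to verify case by case.

```javascript
// Sketch of the suggestion above; `embedFontWithoutLigatures` is a hypothetical name.
// `pdfDoc` is expected to be a pdf-lib PDFDocument with fontkit registered;
// `fontBytes` holds the contents of the font file.
function embedFontWithoutLigatures(pdfDoc, fontBytes) {
  // `features: { liga: false }` turns off the standard-ligatures OpenType feature,
  // so sequences such as "fi" should be laid out as separate glyphs.
  return pdfDoc.embedFont(fontBytes, { features: { liga: false } });
}
```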

c-sanchez-fd commented 7 months ago

(For Thai fonts) the issue can be resolved by using .embedFont(fontBytes, { subset: true }). I don't know why this helps.

This solution also works for Calibri fonts

xetadeveloper commented 1 month ago

(For Thai fonts) the issue can be resolved by using .embedFont(fontBytes, { subset: true }). I don't know why this helps.

This also worked for Noto Sans Thai, not sure why it works either, but I'll look into this.