Hopding / pdf-lib

Create and modify PDF documents in any JavaScript environment
https://pdf-lib.js.org
MIT License

[Question] Add Lang entry to the PDF catalog #236

Closed ggrossetie closed 4 years ago

ggrossetie commented 4 years ago

According to the specification it's possible to define the Lang in the PDF catalog: https://www.adobe.com/content/dam/acom/en/devnet/pdf/pdfs/PDF32000_2008.pdf

I'm using the following code:

pdfDoc.catalog.set(PDFName.of('Lang'), PDFString.of('en'))

When using exiftool, I can see that the language is present:

$ exiftool -a -G1 examples/document/basic-example.pdf           
[ExifTool]      ExifTool Version Number         : 10.80
[System]        File Name                       : basic-example.pdf
[System]        Directory                       : examples/document
[System]        File Size                       : 84 kB
[System]        File Modification Date/Time     : 2019:11:12 15:46:46+01:00
[System]        File Access Date/Time           : 2019:11:12 15:46:46+01:00
[System]        File Inode Change Date/Time     : 2019:11:12 15:46:46+01:00
[System]        File Permissions                : rw-rw-r--
[File]          File Type                       : PDF
[File]          File Type Extension             : pdf
[File]          MIME Type                       : application/pdf
[PDF]           PDF Version                     : 1.7
[PDF]           Linearized                      : No
[PDF]           Page Count                      : 7
[PDF]           Language                        : en
[PDF]           Creator                         : Asciidoctor PDF 1.0.0-alpha.3
[PDF]           Producer                        : Doc Writer
[PDF]           Create Date                     : 1970:01:01 00:00:00Z
[PDF]           Modify Date                     : 1970:01:01 00:00:00Z
[PDF]           Title                           : Document Title
[PDF]           Author                          : Doc Writer
[PDF]           Subject                         : 

But when using Acrobat Reader, the value is empty:

[Screenshot: Acrobat Reader document properties, with the language value empty]

Am I doing something wrong?

Hopding commented 4 years ago

Hello @Mogztter! I dug into this and discovered the following:

All of this leads me to believe that the inability to view a document's language metadata in Adobe Acrobat Reader DC may be a bug in Acrobat Reader itself. Or perhaps it's intended, for some strange reason. But it does not appear to have anything to do with the PDF document itself, because the issue persists even when viewing the language for a document created by Adobe Acrobat itself, without using any third-party libraries.

I hope this helps. If you dig into this any further and discover why Acrobat Reader doesn't render the language field, I'd be interested to know what you find!

ggrossetie commented 4 years ago

Thank you very much for digging into this! Maybe it's a paid feature 😉

But it does not appear to have anything to do with the PDF document itself. Because the issue persists even when viewing the language for a document created by Adobe Acrobat itself, without using any third party libraries.

I'm reassured 👍

One last thing: do you think we should add a tiny function to set the language on the document, similar to setTitle, setAuthor, setSubject, and so on, on PDFDocument?

Hopding commented 4 years ago

Sure. I'd be willing to accept a PR for a PDFDocument.setLanguage method.

ggrossetie commented 9 months ago

Hey! This issue was reported to me again, and a user was able to provide a PDF where the language displays correctly and is properly recognized: accessible-pdf-example.pdf

When I create a PDF using the following code:

import fs from 'node:fs'
import { PDFDocument, PDFString, PDFName } from 'pdf-lib'

// Create a new PDFDocument
const pdfDoc = await PDFDocument.create()

pdfDoc.catalog.set(PDFName.of('Lang'), PDFString.of('fr'))

// Serialize the PDFDocument to bytes (a Uint8Array)
const pdfBytes = await pdfDoc.save()

fs.writeFileSync('out.pdf', pdfBytes)

out.pdf

I cannot find the /Lang entry in plain text when I open out.pdf in a text editor, whereas in accessible-pdf-example.pdf I can see the following object:

482 0 obj
<</Lang(en)/MarkInfo<</Marked true/Suspects false>>/Metadata 6 0 R/Names 530 0 R/Outlines 12 0 R/PageLabels 230 0 R/Pages 232 0 R/StructTreeRoot 21 0 R/Type/Catalog/ViewerPreferences 531 0 R>>
endobj

Maybe the /Lang entry must be written in plain text, rather than encoded, to maximize compatibility across PDF readers? Does that make sense?

@Hopding I can open a new issue if needed.
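
A possible explanation, offered as an assumption rather than a confirmed diagnosis: pdf-lib's save() packs most objects into compressed object streams by default (its useObjectStreams option defaults to true), so the catalog and its /Lang entry would not be readable as plain text in out.pdf. Saving with pdfDoc.save({ useObjectStreams: false }) should write the catalog uncompressed. Whether that changes what Acrobat Reader displays is not established in this thread. To check a saved file without opening a text editor, a minimal sketch like the following could be used. findPlainTextLang is a hypothetical helper, not part of pdf-lib, and it only recognizes the literal-string form /Lang(...):

```javascript
// Hypothetical helper (not part of pdf-lib): returns the value of an
// uncompressed /Lang(...) entry found in raw PDF bytes, or null if none is visible.
function findPlainTextLang(pdfBytes) {
  // latin1 maps each byte to exactly one character, so binary stream
  // data in the file cannot cause decoding errors.
  const text = Buffer.from(pdfBytes).toString('latin1')
  // Match a literal string value such as /Lang(en) or /Lang (fr).
  const match = /\/Lang\s*\(([^)]*)\)/.exec(text)
  return match ? match[1] : null
}

// Usage with a catalog shaped like the one in accessible-pdf-example.pdf
const catalog = '482 0 obj\n<</Lang(en)/Type/Catalog>>\nendobj\n'
console.log(findPlainTextLang(Buffer.from(catalog, 'latin1'))) // prints "en"
```

With a real file, the bytes would come from fs.readFileSync('out.pdf'). A null result only means no plain-text /Lang was found. The entry may still exist inside a compressed object stream, which is why exiftool can report a language that a text editor does not show.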