Hopding / pdf-lib

Create and modify PDF documents in any JavaScript environment
https://pdf-lib.js.org
MIT License

PDFString supports only one-byte characters #1649

Open m-kemarskyi opened 3 months ago

m-kemarskyi commented 3 months ago

What were you trying to do?

I was trying to add a comment containing Cyrillic letters to a PDF.

How did you attempt to do it?

const commentAnnotRef = this.pdfDocument.context.register(
  this.pdfDocument.context.obj({
    Type: 'Annot',
    Subtype: 'Text',
    Open: true,
    Name: 'Comment', // Determines the icon to place in the document
    T: PDFString.of('abc абві äüöß'), // Comment title
    Contents: PDFString.of('abc абві äüöß'), // Comment main text
    // The position of the annotation
    Rect: [
      xCoordinate,
      pageHeight - yCoordinate,
      xCoordinate,
      pageHeight - yCoordinate,
    ],
  })
)

What actually happened?

It turned out that one byte per character is used under the hood (see the result in the screenshot).

Screenshot 2024-07-04 at 13 45 17
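
For readers following along, here is a minimal sketch of what "one byte per character" means in practice. It does not use pdf-lib; `copyOneBytePerChar` is a hypothetical helper that only mimics copying each UTF-16 code unit of a string into a byte buffer, which is an assumption about how the output is produced rather than something confirmed in this thread.

```typescript
// Hypothetical helper (not part of pdf-lib): stores each UTF-16 code unit
// of the string in a single byte of the output buffer.
function copyOneBytePerChar(text: string): Uint8Array {
  const bytes = new Uint8Array(text.length);
  for (let i = 0; i < text.length; i++) {
    // A Uint8Array element keeps only the low 8 bits of the assigned value,
    // so any code unit above 0xFF loses its high byte here.
    bytes[i] = text.charCodeAt(i);
  }
  return bytes;
}

// 'a' is U+0061 and fits in one byte; 'б' is U+0431 and does not.
const sample = copyOneBytePerChar('aб');
console.log(Array.from(sample, (b) => b.toString(16))); // [ '61', '31' ]
```

Under that assumption, the Cyrillic letter 'б' ends up as the byte 0x31, which a viewer would show as the digit '1' instead of the original letter.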

What did you expect to happen?

I expected UTF-8 characters to work correctly.

How can we reproduce the issue?

Add a comment to a PDF file using the code provided above.

Version

1.17.1

What environment are you running pdf-lib in?

Node

Additional Notes

No response

m-kemarskyi commented 3 months ago

I tried to write a custom PDFUnicodeString class, but it didn't work out:

import { PDFObject } from 'pdf-lib';

export class PDFUnicodeString extends PDFObject {
  // The PDF spec allows newlines and parens to appear directly within a literal
  // string. These characters _may_ be escaped. But they do not _have_ to be. So
  // for simplicity, we will not bother escaping them.
  static of = (value: string) => new PDFUnicodeString(value);

  private readonly value: string;

  private constructor(value: string) {
    super();
    this.value = value;
  }

  asBytes(): Uint8Array {
    return new TextEncoder().encode(this.value)
  }

  asString(): string {
    return this.value;
  }

  clone(): PDFUnicodeString {
    return PDFUnicodeString.of(this.value);
  }

  toString(): string {
    return `(${this.value})`;
  }

  sizeInBytes(): number {
    return new TextEncoder().encode(this.value).length + 2;
  }

  copyBytesInto(buffer: Uint8Array, offset: number): number {
    buffer[offset++] = 40;
    const encodedValue = new TextEncoder().encode(this.value);
    buffer.set(encodedValue, offset);
    offset += encodedValue.length;
    buffer[offset++] = 41;

    return encodedValue.length + 2;
  }
}

m-kemarskyi commented 2 months ago

Update: the PDFHexString class solves the problem: PDFHexString.fromText(YOUR_TEXT)
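
To illustrate why a hex string can carry non-Latin text, here is a small standalone sketch. It assumes the text is written as UTF-16BE code units prefixed with the FEFF byte order mark and serialized as hex digits, which is one of the text string encodings PDF allows. `toUtf16BeHex` is a hypothetical helper written for this example, not a pdf-lib function, and this thread does not confirm that PDFHexString.fromText works exactly this way.

```typescript
// Hypothetical helper (not part of pdf-lib): encodes a string as UTF-16BE
// hex digits with a leading byte order mark.
function toUtf16BeHex(text: string): string {
  let hex = 'FEFF'; // byte order mark that marks the string as UTF-16BE
  for (let i = 0; i < text.length; i++) {
    // JavaScript strings are already sequences of UTF-16 code units, so each
    // unit is written as four hex digits, high byte first.
    const unit = text.charCodeAt(i).toString(16).toUpperCase();
    hex += ('000' + unit).slice(-4);
  }
  return hex;
}

// PDF hex strings are delimited by angle brackets.
console.log(`<${toUtf16BeHex('aб')}>`); // <FEFF00610431>
```

Applied to the snippet from the first comment, the fix would be to pass PDFHexString.fromText('abc абві äüöß') for both T and Contents in place of PDFString.of(...).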