J-F-Liu / lopdf

A Rust library for PDF document manipulation.
MIT License

Text with special chars does not render correctly in most pdf readers #327

Closed (ctaque closed this issue 2 months ago)

ctaque commented 2 months ago

https://stackoverflow.com/questions/78983568/question-about-special-characters-and-font-rendering-in-pdf?noredirect=1#comment139263849_78983568

I'm adding text to a PDF using the Tj operator with the Courier font, encoding the text as a hexadecimal string as documented:


/* function provided by lopdf */
pub fn encode_utf16_be(text: &str) -> Vec<u8> {
    // Prepend BOM to the mark string as UTF-16BE encoded.
    let bom: u16 = 0xFEFF;
    let mut bytes = vec![];
    bytes.extend([bom].iter().flat_map(|b| b.to_be_bytes()));
    bytes.extend(text.encode_utf16().flat_map(|b| b.to_be_bytes()));
    bytes
}

fn write_text_to_pdf(
    mut doc: Document,
    page: u32,
    text: String,
    position: (f32, f32),
    cmyb: (f32, f32, f32, f32),
    font_size: f32,
    font: &str,
) -> Result<Document, ErrorResponse> {
    let encoded_text: Vec<u8> = encode_utf16_be(text.as_str());
    dbg!(encoded_text.clone());
    let content = vec![
        Operation::new("BT", vec![]), // Begin text object
        Operation::new(
            "k",
            vec![
                Object::Real(cmyb.0), // Cyan
                Object::Real(cmyb.1), // Magenta
                Object::Real(cmyb.2), // Yellow
                Object::Real(cmyb.3), // Black
            ],
        ),
        Operation::new("Tc", vec![Object::Real(-1.5)]),
        Operation::new("Tf", vec![font.into(), font_size.into()]),
        Operation::new(
            "Td",
            vec![Object::Real(position.0), Object::Real(position.1)],
        ),
        Operation::new(
            "Tj",
            vec![Object::String(
                encoded_text.clone(),
                StringFormat::Hexadecimal,
            )],
        ),
        Operation::new("ET", vec![]), // End text object
    ];

    let pages = doc.get_pages();
    let mb_page_id = pages.get(&page);

    if let Some(page_id) = mb_page_id {
        let content_stream = doc.get_page_content(*page_id).unwrap();
        let mut original_content = Content::decode(&content_stream).unwrap();
        original_content.operations.extend_from_slice(&content);
        let modified_content = Content::encode(&original_content).unwrap();

        let _ = doc.change_page_content(*page_id, modified_content);

        Ok(doc)
    } else {
        Err(ErrorResponse::new(Some("page does not exist".to_string())))
    }
}

The document renders correctly in Brave: special characters such as é, è or ° are correct.

However, in Firefox, Evince or Okular the text added to the PDF is displayed as rectangles.

I tried Ghostscript to embed the fonts as a post-processing step.

pdffonts before ghostscript:

/tmp » pdffonts XXwvcj9eEkrqKts1Cmhf.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------
XHDUSZ+NunitoSans-Bold               TrueType          WinAnsi          yes yes yes    201  0
NSDMIT+NunitoSans-Black              TrueType          WinAnsi          yes yes yes    398  0
TJYROH+NunitoSans-Regular            TrueType          WinAnsi          yes yes yes    396  0
BIXNWR+NunitoSans-ExtraBold          TrueType          WinAnsi          yes yes yes    385  0
BVZSWR+NunitoSans-Regular            CID TrueType      Identity-H       yes yes yes    621  0
NSDMIT+NunitoSans-Black              TrueType          WinAnsi          yes yes yes    614  0
BIXNWR+NunitoSans-ExtraBold          TrueType          WinAnsi          yes yes yes    601  0
TJYROH+NunitoSans-Regular            TrueType          WinAnsi          yes yes yes    612  0
PLYTKP+NunitoSans-SemiBoldItalic     TrueType          WinAnsi          yes yes yes    620  0
NSDMIT+NunitoSans-Black              TrueType          WinAnsi          yes yes yes    829  0
BIXNWR+NunitoSans-ExtraBold          TrueType          WinAnsi          yes yes yes    816  0
XHDUSZ+NunitoSans-Bold               TrueType          WinAnsi          yes yes yes    849  0
HZXXCF+NunitoSans-SemiBold           CID TrueType      Identity-H       yes yes yes   1054  0
BIXNWR+NunitoSans-ExtraBold          TrueType          WinAnsi          yes yes yes   1033  0
XHDUSZ+NunitoSans-Bold               TrueType          WinAnsi          yes yes yes   1065  0
NSDMIT+NunitoSans-Black              TrueType          WinAnsi          yes yes yes   1046  0
TJYROH+NunitoSans-Regular            TrueType          WinAnsi          yes yes yes   1044  0
------------------------

pdffonts after ghostscript:

/tmp » pdffonts test.pdf
name                                 type              encoding         emb sub uni object ID
------------------------------------ ----------------- ---------------- --- --- --- ---------

The ghostscript call from my Rust code:

    let _out = Command::new("ghostscript")
        .args(&[
            "-o",
            output.to_string().as_str(),
            "-sDEVICE=pdfwrite",
            "-dPDFSETTINGS=/prepress",
            "-dEmbedAllFonts=true",
            "-dSubsetFonts=false",
            "-dCompressFonts=true",
            "-dNOPAUSE",
            "-dBATCH",
            "-dPDFA",
            "-sFONTPATH=/usr/share/fonts",
            "-f",
            input.to_string().as_str(), // Replace with your input PDF
        ])
        .output(); // Executes the command and captures output

Output of the ghostscript command when run in the console:

/tmp » ghostscript -o test.pdf -sDEVICE=pdfwrite -dPDFSETTINGS=/prepress -dEmbedAllFonts=true -dSubsetFonts=false -dCompressFonts=true -dNOPAUSE -dBATCH -dPDFA -sFONTPATH=/usr/share/fonts -f jijEq9kwnLjgx4krtXso.pdf
GPL Ghostscript 10.02.1 (2023-11-01)
Copyright (C) 2023 Artifex Software, Inc.  All rights reserved.
This software is supplied under the GNU AGPLv3 and comes with NO WARRANTY:
see the file COPYING for details.
Processing pages 1 through 5.
Page 1
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Page 2
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Page 3
Page 4
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Page 5
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular
Loading font Courier (or substitute) from /usr/share/ghostscript/Resource/Font/NimbusMonoPS-Regular

The following warnings were encountered at least once while processing this file:
    Couldn't find a named resource

   **** This file had errors that were repaired or ignored.
   **** Please notify the author of the software that produced this
   **** file that it does not conform to Adobe's published PDF
   **** specification.

Now it looks almost correct in Firefox, Evince or Okular.

However, special characters are still missing or incorrect.

Please provide a working example with special characters such as é, è or °.

Heinenen commented 2 months ago

encode_utf16_be can only be used for text that isn't rendered as page content (notes, table-of-contents entries). Rendering regular text is more complicated and requires selecting the correct glyphs from the font. I don't know how to do it myself, so sadly I can't help you.
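
To illustrate the point about glyph selection: a simple (non-CID) font such as the standard Courier treats every byte of a Tj string as one character code. The four bytes that encode_utf16_be produces for é (FE FF 00 E9) are therefore read as four separate codes rather than one character. The sketch below assumes the font resource is declared with /Encoding /WinAnsiEncoding, which the issue does not show. encode_win_ansi is a hypothetical helper, not part of lopdf, and it covers only part of the WinAnsi table.

```rust
/// Same behavior as the function quoted in the issue: UTF-16BE with a
/// leading byte-order mark (FE FF), two bytes per UTF-16 code unit.
pub fn encode_utf16_be(text: &str) -> Vec<u8> {
    let mut bytes = vec![0xFE, 0xFF];
    bytes.extend(text.encode_utf16().flat_map(|unit| unit.to_be_bytes()));
    bytes
}

/// Hypothetical helper (an assumption, not a lopdf API): encode text as
/// one byte per character for a simple font that uses WinAnsiEncoding.
/// Returns None if any character is outside the subset handled here.
pub fn encode_win_ansi(text: &str) -> Option<Vec<u8>> {
    text.chars()
        .map(|c| match c {
            // Printable ASCII has the same codes in WinAnsi.
            ' '..='~' => Some(c as u8),
            // U+00A0..=U+00FF (includes é, è and °) also map directly.
            '\u{00A0}'..='\u{00FF}' => Some(c as u8),
            // Two examples from the 0x80..=0x9F block, which does NOT
            // match Unicode code points; the rest is omitted here.
            '\u{20AC}' => Some(0x80), // euro sign
            '\u{2019}' => Some(0x92), // right single quotation mark
            _ => None,
        })
        .collect()
}
```

Under that assumption, the bytes returned by encode_win_ansi could be passed to Object::String in place of encoded_text in the code from the first comment. Characters outside WinAnsi would still need an embedded font with its own encoding.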

ctaque commented 2 months ago

Okay, I understand. I will close this since I found a solution using PyPDF2 and pyo3 bindings to Python.