Kozea / WeasyPrint

The awesome document factory
https://weasyprint.org
BSD 3-Clause "New" or "Revised" License

Version 61.2 PDFs with form input are unreasonably HUGE #2119

Closed Salamek closed 5 months ago

Salamek commented 6 months ago

Hi, I was using WeasyPrint 54.2 with my own custom extension (weasyform==0.0.7) to add support for a signature field. My empty PDF with a signature field was ~1.2 KiB and worked fine.

A few days ago I migrated to WeasyPrint 61.2, which has the form support that my extension used to provide built in, except that it is missing the signature field (PR incoming soon). So I created a finisher (weasyform==0.0.10) that adds it, but now the same PDF is 362.2 KiB!

I did some tests, and this size issue is not coming from my code: the size is about the same when not using the finisher (the input field is just not converted to a signature field). It only happens when using form inputs:

Weasyprint 54.2 test code:

requirements.txt

weasyprint==54.2
weasyform==0.0.7

main.py

from weasyform.FormFinisher import FormFinisher
from weasyform import HTML

html = """
<style>
    .signature {
        display: block;
        width: 100%;
        height: 100px;
        border: 1px solid black;
        appearance: auto;
    }
</style>

<input type="signature" class="signature" name="signature_employee">"""

pdf = HTML(string=html).render()
pdf.write_pdf(
    finisher=FormFinisher(inject_empty_cryptographic_signature=False),
    target='out54.pdf',
)

Weasyprint 61.2 test code:

requirements.txt

weasyprint==61.2
weasyform==0.0.10

main.py

from weasyprint import HTML
from weasyform.FormFinisher import FormFinisher

html = """
<style>
    .signature {
        display: block;
        width: 100%;
        height: 100px;
        border: 1px solid black;
        appearance: auto;
    }
</style>

<input type="signature" class="signature" name="signature_employee">"""

pdf = HTML(string=html).render()
pdf.write_pdf(
    finisher=FormFinisher(inject_empty_cryptographic_signature=False), 
    target='out61.pdf',
)
pdf.write_pdf(
    target='out61-no-signature.pdf',
)

Here are generated files:

out54.pdf (1.2 KiB), out61.pdf (362.2 KiB), out61-no-signature.pdf (362.1 KiB)

As you can see, my finisher adds only 0.1 KiB, while WeasyPrint 61.2 adds 361 KiB compared with WeasyPrint 54.2. Where is this bloat coming from? When looking at the 61.2 PDFs in an editor, the added bloat is in binary form, and debugging the file in PDFBox did not reveal anything pointing to what it is. The only thing I found is one added font (Zapf Dingbats Regular), but that should add 50 KiB at most... so where are the other 300 KiB? From what I can see in https://github.com/Kozea/WeasyPrint/blob/78d864bdc40716b43d47b778177de43da16ddb8d/weasyprint/pdf/anchors.py#L94 some font is added, probably to render the fields correctly? Shouldn't styling these be the PDF reader's responsibility?
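One way to see where the bytes go is to list the largest stream objects in the file. The following is a minimal sketch using only the standard library; the function name `largest_streams` is made up for this example, and the regex-based scan is an assumption that only holds for simple PDFs (it does not parse cross-reference tables or object streams, and it could be fooled by binary data that happens to contain `endobj`).

```python
import re

# "N G obj ... endobj" blocks; group 1 is the object number, group 2 the body.
_OBJECT_RE = re.compile(rb"(\d+)\s+\d+\s+obj\b(.*?)\bendobj\b", re.DOTALL)
# The raw bytes between "stream" and "endstream" inside one object body.
_STREAM_RE = re.compile(rb"\bstream\r?\n(.*?)\r?\n?endstream", re.DOTALL)


def largest_streams(pdf_bytes, top=5):
    """Return (object_number, stream_size_in_bytes) pairs, largest first.

    Hypothetical helper for illustration only, not part of WeasyPrint or pydyf.
    """
    sizes = []
    for obj in _OBJECT_RE.finditer(pdf_bytes):
        stream = _STREAM_RE.search(obj.group(2))
        if stream:
            sizes.append((int(obj.group(1)), len(stream.group(1))))
    return sorted(sizes, key=lambda item: item[1], reverse=True)[:top]
```

Calling it on the bytes of out61.pdf (for example `largest_streams(open('out61.pdf', 'rb').read())`) should point at the object that holds most of the extra size, which can then be looked up in the file to see whether it is a font program.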

Also, input_name is wrapped in pydyf.String twice, first here: https://github.com/Kozea/WeasyPrint/blob/78d864bdc40716b43d47b778177de43da16ddb8d/weasyprint/pdf/anchors.py#L119

and then again in every field dictionary definition:

'T': pydyf.String(input_name),

I don't think that is correct?

PS: You could hide a whole app in those 361 KiB... (I'm kinda sus after the whole xz thing...)

liZe commented 5 months ago

Hi!

> There is some font added, probably to render the fields correctly? Shouldn't styling these be the PDF reader's responsibility?

Yes, the PDF creator has to include the font, and the extra kilobytes are caused by this font.

When a text input field is included, there’s a font associated with it: the text you write in the field will be displayed with this font in the PDF. By default, WeasyPrint removes from the embedded font the characters that are not used in the PDF: if your PDF only contains the "ABC" characters, there’s no need to include all the other characters of the font. But when there’s an input field, we don’t know which text will be written in it, so we have to include the whole font.
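The effect of this rule on which characters end up in the embedded font can be sketched in a few lines. This is a toy model, not WeasyPrint's code: the function `characters_to_embed` is hypothetical, and only the flag name `used_in_forms` is taken from the discussion.

```python
def characters_to_embed(font_characters, document_text, used_in_forms=False):
    """Return the set of characters whose glyphs would stay in the embedded font.

    Hypothetical helper for illustration; it is not WeasyPrint's API.
    """
    if used_in_forms:
        # Anything may be typed into the field later, so keep the whole font.
        return set(font_characters)
    # Otherwise keep only the characters the document actually displays.
    return set(font_characters) & set(document_text)


font = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
subset = characters_to_embed(font, "ABC")
full = characters_to_embed(font, "ABC", used_in_forms=True)
print(len(subset), len(full))
```

With a real font the second set contains every glyph in the font rather than 26 letters, which is where the size difference between the two files comes from.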

Your 1.2 KiB file works because your PDF reader uses a fallback font. You can try using a specific font for your document and adding the text ABC before your input. Here’s what you get if you don’t include the whole font (i.e. with 54.2):

[Screenshot from 2024-04-09 09-06-58]

ABC is displayed with the right font, DEF is displayed with a fallback font.

Here’s what I get with 61.2:

[Screenshot from 2024-04-09 09-10-23]

That’s better! But the PDF is a bit larger because it includes the whole font.

You can search for the used_in_forms attribute in the code to understand how it works.

Salamek commented 5 months ago

Hmm, so I need to rewrite add_input to not include a font when no text input field is used. Also, what about this:

input_name = pydyf.String(element.attrib.get('name', default_name))  

'T': pydyf.String(input_name),

Is this correct? Shouldn't it be just:

'T': input_name,

?

liZe commented 5 months ago

> Is this correct? Shouldn't it be just:

Thanks for the report, it’s fixed (it didn’t generate broken PDFs, but it was useless).

> Hmm, so I need to rewrite add_input to not include a font when no text input field is used

It works if you don’t set used_in_forms=True, as is done for checkboxes.
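As a rough sketch of that decision, a finisher or patched add_input could set the flag only for fields where the reader types free text. Everything here is an assumption made for illustration: the helper `needs_full_font` and the `TEXT_LIKE_TYPES` set are hypothetical and do not mirror WeasyPrint's actual list of supported input types.

```python
# Input types whose value is free text typed by the reader.
# This set is an assumption for the sketch, not WeasyPrint's own list.
TEXT_LIKE_TYPES = {"text", "password", "email", "number", "tel", "url"}


def needs_full_font(input_type="text"):
    """Return True if the field's font should be embedded without subsetting.

    Hypothetical helper: its result is the value one would use for
    used_in_forms on the font attached to the field.
    """
    return input_type.lower() in TEXT_LIKE_TYPES


for kind in ("text", "checkbox", "signature"):
    print(kind, needs_full_font(kind))
```

Under this assumption a document that contains only checkbox or signature fields never triggers full embedding, so its fonts can still be subset as usual.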