Kozea / WeasyPrint

The awesome document factory
https://weasyprint.org
BSD 3-Clause "New" or "Revised" License

Data after comma in meta data is removed #1584

Closed: Spenhouet closed 2 years ago

Spenhouet commented 2 years ago

I'm unable to explain what is going on here. I'm generating a PDF from an HTML template, and in the head I'm including the PDF metadata field "keywords".

<head>
    <meta name=keywords content='{
          "0": [38, 16],
          "1": [38, 16],
          "2": [38, 16],
          "3": [38, 16],
          "4": [38, 16],
          "5": [38, 16],
          "6": [38, 16],
    }'>
</head>

But when I check the PDF, the metadata actually stored is:

b'{\n        "0": [38, 16], "1": [38, "2": [38, "3": [38, "4": [38, "5": [38, "6": [38, }'

As you can see, this is far from what I provided. The line breaks have changed, but ignoring that, why is the 16], part not stored after the first entry? Where does it go?

If I change the content to (notice the different number per line):

<meta name=keywords content='{
        "0": [38, 11],
        "1": [38, 12],
        "2": [38, 13],
        "3": [38, 14],
        "4": [38, 15],
        "5": [38, 16],
        "6": [38, 17],
    }'>

Then suddenly the output is correct:

b'{\n        "0": [38, 11], "1": [38, 12], "2": [38, 13], "3": [38, 14], "4": [38, 15], "5": [38, 16], "6": [38, 17], }'

Any idea what is going on?

liZe commented 2 years ago

Hello!

Keywords are a list of fields separated by commas. WeasyPrint removes duplicates, which is why the extra 16] values are dropped in the first example.

So… You shouldn’t use commas if you want a single keyword!
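
To make the effect concrete, here is a minimal sketch of the behavior described above: split on commas, strip whitespace, drop repeated entries, rejoin. It is an illustration only, not WeasyPrint's actual code; the function name dedupe_keywords and the ", " separator used for rejoining are assumptions.

```python
def dedupe_keywords(content):
    """Split on commas, strip whitespace, drop repeats, rejoin with ', '."""
    seen = []
    for keyword in (part.strip() for part in content.split(',')):
        if keyword not in seen:
            seen.append(keyword)
    return ', '.join(seen)


# Every "16]" fragment after the first one counts as a duplicate keyword.
same = '{"0": [38, 16], "1": [38, 16], "2": [38, 16]}'
# Here each fragment is unique, so nothing is dropped.
distinct = '{"0": [38, 11], "1": [38, 12]}'

print(dedupe_keywords(same))
print(dedupe_keywords(distinct))
```

With identical second numbers the fragments between commas repeat and are removed; with distinct numbers every fragment is unique, so the string survives unchanged.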

Spenhouet commented 2 years ago

I want to store JSON in the PDF metadata, so this makes it hard. Is there another way to do this? It seems it is not possible to set XMP data with WeasyPrint. Any other option?

liZe commented 2 years ago

You can use another field that’s not comma-separated, such as "description".

Otherwise, you can attach the JSON files to the PDF.
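
For the "description" route, a rough sketch of building the meta tag could look like the following. It only covers the HTML side: the JSON is escaped so that its double quotes cannot end the attribute early. That the description field is stored without comma processing comes from the comment above; whether whitespace is preserved exactly is not verified here, and the variable names are made up for illustration.

```python
import html
import json
import re

data = {"0": [38, 16], "1": [38, 16], "2": [38, 16]}

# Serialize the data and escape it so the JSON's double quotes are safe
# inside a double-quoted HTML attribute.
content = html.escape(json.dumps(data), quote=True)
meta_tag = f'<meta name="description" content="{content}">'

# Round trip on the HTML side only: pull the attribute value back out,
# unescape it and parse it. Reading the field from the PDF is not shown.
match = re.search(r'content="([^"]*)"', meta_tag)
recovered = json.loads(html.unescape(match.group(1)))

print(meta_tag)
print(recovered == data)
```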

Spenhouet commented 2 years ago

Thanks @liZe for helping me figure this out. I guess doing it via attachments is cleaner.

I included the file like this:

pdf = HTML(string=finished_html).write_pdf(
    optimize_size=('fonts', 'images'),
    attachments=[Attachment(filename=temppath / 'data.json')])

But how do I read this back now? We are using PyPDF4 for some things, so I tried it as explained here: https://kevinmloeffler.com/2018/07/08/how-to-extract-pdf-file-attachments-using-python-and-pypdf2/

That script looks up the file names here:

catalog = pdf.trailer["/Root"]
fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']

But there is no /EmbeddedFiles. Do you know where I need to look for the attachments (assuming WeasyPrint stored the attachment correctly)?

If one is not overly familiar with the PDF spec, this is really hard to navigate. Thanks for any help!
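
One way to narrow down where such a chained lookup fails is to walk the keys one at a time and report the first one that is absent. The sketch below assumes the catalog can be indexed like nested dictionaries, as the two lines above do; plain dicts with made-up content stand in for the real PDF objects, and first_missing_key is a hypothetical helper, not part of any PDF library.

```python
def first_missing_key(obj, path):
    """Return the first key in `path` that is absent, or None if all exist."""
    for key in path:
        if key not in obj:
            return key
        obj = obj[key]
    return None


# Stand-in for a catalog that has a /Names entry but no embedded files
# (hypothetical data, plain dicts instead of PDF objects).
catalog = {'/Type': '/Catalog', '/Names': {'/Dests': {}}}

missing = first_missing_key(catalog, ['/Names', '/EmbeddedFiles', '/Names'])
print(missing)
```

If the helper reports '/Names' for the very first step, the PDF has no name dictionary at all; if it reports '/EmbeddedFiles', the name dictionary exists but holds no attachments.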

Spenhouet commented 2 years ago

I cannot get the attachments to work. They also do not show up in Adobe Acrobat.

I tried with filename, file_obj and string (where string does not work at all with Attachment). I'm probably missing something. Or should I open a new bug report?

liZe commented 2 years ago

Your example works for me, and so does the PyPDF2 script. I get an "attachment0" file with the content of the JSON.

Here is my PDF sample, can you find the attachment included?

Spenhouet commented 2 years ago

Interesting, it did not work for me at all. Could you share an MWE so that I can run exactly the same code on my side and see if it works here too?

liZe commented 2 years ago

Here’s the PyPDF2 script I used (from your link):

import PyPDF2

def getAttachments(reader):
    catalog = reader.trailer["/Root"]
    fileNames = catalog['/Names']['/EmbeddedFiles']['/Names']
    attachments = {}
    for f in fileNames:
        if isinstance(f, str):
            name = f
            dataIndex = fileNames.index(f) + 1
            fDict = fileNames[dataIndex].getObject()
            fData = fDict['/EF']['/F'].getData()
            attachments[name] = fData
    return attachments

# Keep the file open while the attachments are read, then close it.
with open('meta.pdf', 'rb') as handler:
    reader = PyPDF2.PdfFileReader(handler)
    dictionary = getAttachments(reader)
print(dictionary)
for fName, fData in dictionary.items():
    print(fName)
    with open(fName, 'wb') as outfile:
        outfile.write(fData)
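
A note on the loop in getAttachments: the /Names array it walks is a flat list that alternates names and values, and fileNames.index(f) returns the first match, so two attachments with the same name would both resolve to the first entry. A possible alternative, sketched here with plain Python values instead of PDF objects, is to pair the entries by position. The helper name pair_name_tree is made up, and name trees that use /Kids instead of a direct /Names array are not handled.

```python
def pair_name_tree(names):
    """Turn a flat [name, value, name, value, ...] list into (name, value) pairs."""
    if len(names) % 2:
        raise ValueError('expected an even number of entries')
    # Even positions hold the names, odd positions the matching values.
    return list(zip(names[0::2], names[1::2]))


# Hypothetical stand-in data: strings for names, integers for file specs.
flat = ['a.json', 1, 'a.json', 2, 'b.json', 3]
pairs = pair_name_tree(flat)
print(pairs)
```

Pairing by position keeps both 'a.json' entries apart, whereas an index-based lookup would return the first value twice.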

And here’s the WeasyPrint script:

from pathlib import Path
from weasyprint import HTML, Attachment

pdf = HTML('meta.html').write_pdf(
    optimize_size=('fonts', 'images'),
    attachments=[Attachment(filename=Path('meta.js'))])
# Write the PDF bytes to disk; write_bytes closes the file afterwards.
Path('meta.pdf').write_bytes(pdf)

liZe commented 2 years ago

Don’t hesitate to reopen if needed!