Kozea / WeasyPrint

The awesome document factory
https://weasyprint.org
BSD 3-Clause "New" or "Revised" License

Per-element metadata from HTML -> PDF -> HTML (via pdf.js) #2279

jambudipa closed this issue 3 weeks ago

jambudipa commented 3 weeks ago

I need to carry some metadata, possibly just an ID, from the source HTML through the PDF generated by WeasyPrint, so that it ends up addressable in the HTML rendered by pdf.js (more specifically, react-pdf).

So, for example, if I have this element in my source HTML:

<p class="x00-chapter-title---toc-level" id="contents">Contents</p>

I would like to be able to see that id when the PDF is rendered in the browser.

It could be any element, really; I imagined a data-id attribute would do the trick. I saw this issue and the corresponding solution, which comes close to what I need. Perhaps I could fork it?

jambudipa commented 3 weeks ago

So I changed my element to this:

<p class="x00-chapter-title---toc-level" id="my-id">Contents</p>

Using qpdf, I was able to generate a text-readable version of the generated PDF, and happily found this:

<<
  /Names <<
    /Dests <<
      /Names [
        (my-id)
        [
          25 0 R
          /XYZ
          67.25
          810.889736
          0
        ]
      ]
    >>
  >>
>>

...which gives me hope!

But now I am not sure how to get pdf.js to expose these details, or even what they mean. Presumably they are coordinates on the page.

Maybe I will ask on the pdf.js GitHub...
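
For reference, the array in the dump above has the form the PDF specification gives for an explicit destination: `[page /XYZ left top zoom]`. Here `25 0 R` is an indirect reference to the page object (object 25, generation 0), `left` and `top` are coordinates in the page's default user space (points, with the origin normally at the bottom-left corner), and a zoom of 0 means "keep the current zoom". Below is a minimal sketch of filtering such destinations by page. It assumes pdf.js resolves `getDestinations()` to an object of the shape `{ name: [pageRef, { name: "XYZ" }, left, top, zoom] }`. The helper name `destinationsOnPage`, the `other-id` entry, and page object 31 are made up for illustration.

```javascript
// Sketch: pick out the named destinations that point at one page.
// ASSUMPTION: `destinations` has the shape that pdf.js's getDestinations()
// is expected to resolve to: { name: [pageRef, { name: "XYZ" }, left, top, zoom] }.
// `destinationsOnPage` is a hypothetical helper, not part of pdf.js.
function destinationsOnPage(destinations, pageRef) {
  const matches = [];
  for (const [name, dest] of Object.entries(destinations)) {
    if (!Array.isArray(dest)) continue; // skip anything that is not an explicit destination
    const [ref, kind, left, top, zoom] = dest;
    const samePage =
      ref && ref.num === pageRef.num && ref.gen === pageRef.gen;
    // Only /XYZ destinations carry left/top/zoom in these positions.
    if (samePage && kind && kind.name === "XYZ") {
      matches.push({ name, left, top, zoom });
    }
  }
  return matches;
}

// Hand-written sample mirroring the qpdf dump (not real pdf.js output).
// "other-id" and page object 31 are invented to show the filtering.
const sampleDestinations = {
  "my-id": [{ num: 25, gen: 0 }, { name: "XYZ" }, 67.25, 810.889736, 0],
  "other-id": [{ num: 31, gen: 0 }, { name: "XYZ" }, 67.25, 500, 0],
};

const found = destinationsOnPage(sampleDestinations, { num: 25, gen: 0 });
console.log(found);
```

With a real document, the inputs would presumably come from `await pdf.getDestinations()` and `page.ref`. To place an overlay in the browser, `page.getViewport({ scale }).convertToViewportPoint(left, top)` should map the PDF coordinates to viewport pixels, though that step is untested here.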

jambudipa commented 3 weeks ago

OK, I managed to coerce GPT-4o into giving me the answer:

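// Assumes `pdf` is an already loaded PDFDocumentProxy and `pageNum` is a 1-based page number.
const page = await pdf.getPage(pageNum);
const pageRef = page.ref; // The page's PDF object reference, an object of the form { num, gen }
const objectNumber = pageRef.num;
const generationNumber = pageRef.gen;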

// Get all named destinations
const destinations = await pdf.getDestinations();