Can I use mupdf.js to get the actual font name, and modify text content before it is rendered to HTML elements?

longnight commented 3 weeks ago

In PyMuPDF I can do this on PDF files that have a text layer:

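import pymupdf  # older PyMuPDF releases use: import fitz

doc = pymupdf.open("input.pdf")  # placeholder path
page = doc[0]
text_dict = page.get_text("dict")
for block in text_dict["blocks"]:
    for line in block.get("lines", []):  # image blocks have no "lines"
        for span in line.get("spans", []):
            print(span.get("font"))  # here I get the actual font name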

But PDF.js replaces the font name with an internal identifier such as "g_d0_f18". In mupdf.js, can I extract these text blocks with the actual font names, as the Python script does?

My second question, still for PDFs with a text layer:

Can I replace or modify some text content before it is rendered into page/HTML elements in the viewer? I need to replace some special symbols (they are set in a special custom font) with other characters, so that when others select and copy the text they get the modified version of the content.

jamie-lemon commented 3 weeks ago

For the first question, have you tried toStructuredText()? See https://mupdfjs.readthedocs.io/en/latest/how-to-guide/node/document/index.html#extracting-document-text

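const stext = page.toStructuredText("preserve-whitespace").asJSON()
console.log(`stext=${stext}`)
const json = JSON.parse(stext)
// Log the parsed object directly: interpolating it into a template
// string would only print "[object Object]".
console.log("json=", json)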

This gives me reasonable font names against text objects.
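To make the font lookup a bit more concrete, here is a minimal sketch that walks the parsed JSON and collects the distinct font names. It assumes the JSON has the shape blocks -> lines -> font.name; please check that against the actual output of asJSON() on your own file. The helper name collectFontNames and the sample data are made up for illustration and are not part of mupdf.js.

```javascript
// Hypothetical helper (not a mupdf.js API): collect the distinct font names
// from a structured-text JSON string.
// Assumed shape: { blocks: [ { lines: [ { font: { name }, text } ] } ] }
function collectFontNames(stextJson) {
  const parsed = JSON.parse(stextJson);
  const names = new Set();
  for (const block of parsed.blocks ?? []) {
    // Blocks without text lines (for example image blocks) are skipped.
    for (const line of block.lines ?? []) {
      if (line.font?.name) names.add(line.font.name);
    }
  }
  return [...names];
}

// Hand-written sample standing in for the string returned by asJSON().
const sampleStext = JSON.stringify({
  blocks: [
    { type: "text", lines: [
      { font: { name: "Helvetica", size: 12 }, text: "Hello" },
      { font: { name: "CustomSymbols", size: 12 }, text: "\uf0b7 item" },
    ] },
    { type: "image" },
    { type: "text", lines: [
      { font: { name: "Helvetica", size: 10 }, text: "World" },
    ] },
  ],
});

const fontNames = collectFontNames(sampleStext);
console.log(fontNames);
```

In real use you would pass the stext string from the snippet above instead of the sample.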

For the second question, I think you would need to redact the existing content and then insert your own version of the text. For redaction see https://mupdfjs.readthedocs.io/en/latest/how-to-guide/node/annotations/redactions/index.html and for adding text see https://mupdfjs.readthedocs.io/en/latest/how-to-guide/node/page/index.html#adding-text-to-pages . The API here is a bit tricky, and we will be working on providing a simpler API for this kind of thing in the future.
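Before touching the redaction API, it may help to work out which lines need replacing and what the new text should be. The sketch below does only that planning step on the parsed structured-text JSON; it does not modify the PDF. It assumes each line carries font.name, text and a bbox, and the names planReplacements, "CustomSymbols" and the character map are hypothetical.

```javascript
// Hypothetical helper (not a mupdf.js API): list the lines set in a given
// font whose text changes after applying a character map.
// Assumed line shape: { font: { name }, text, bbox }
function planReplacements(parsed, fontName, charMap) {
  const plan = [];
  for (const block of parsed.blocks ?? []) {
    for (const line of block.lines ?? []) {
      if (line.font?.name !== fontName) continue;
      const original = line.text ?? "";
      // Array.from iterates by code point, so astral symbols stay intact.
      const replaced = Array.from(original, (ch) => charMap.get(ch) ?? ch).join("");
      if (replaced !== original) {
        // bbox marks the area to redact; `to` is the text to insert there.
        plan.push({ bbox: line.bbox, from: original, to: replaced });
      }
    }
  }
  return plan;
}

// Hand-written sample data; a real document would come from asJSON().
const sampleDoc = {
  blocks: [{ type: "text", lines: [
    { font: { name: "Helvetica" }, text: "plain text", bbox: { x: 0, y: 0, w: 50, h: 12 } },
    { font: { name: "CustomSymbols" }, text: "\uf0b7 item", bbox: { x: 0, y: 14, w: 40, h: 12 } },
  ] }],
};

// Map the private-use glyph U+F0B7 to an ordinary bullet character.
const symbolMap = new Map([["\uf0b7", "\u2022"]]);
const replacementPlan = planReplacements(sampleDoc, "CustomSymbols", symbolMap);
console.log(replacementPlan);
```

Each entry in the plan could then drive one redaction over its bbox followed by one text insertion, using the two guides linked above.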

longnight commented 2 weeks ago

Thank you for your detailed answer. Migrating the PDF library of a custom viewer is not a small project, so I will keep an eye on this library until it matures. @jamie-lemon

ArtifexSoftware / mupdf.js

Can I use mupdf.js to get the actual font name, and modify text content before it is rendered to HTML elements? #109