Open longnight opened 3 weeks ago
For the first question, have you tried toStructuredText()? See https://mupdfjs.readthedocs.io/en/latest/how-to-guide/node/document/index.html#extracting-document-text
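// asJSON() returns a JSON string, not an object
const stext = page.toStructuredText("preserve-whitespace").asJSON();
console.log(`stext=${stext}`);
// Parse the string so blocks, lines, and font info can be walked
const json = JSON.parse(stext);
// Log the object itself; a template literal would print "[object Object]"
console.log("json=", json);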
This gives me reasonable font names against text objects.
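To pull the font names out of the parsed result, something like the sketch below might help. It assumes the parsed JSON has a top-level `blocks` array whose text blocks carry `lines`, each with a `font.name` and a `text` field; check that against your own output before relying on it. The `sampleStext` object and the `textByFont` helper are hypothetical and only stand in for real data.

```javascript
// Hypothetical sample shaped like the parsed asJSON() output (assumed layout).
const sampleStext = {
  blocks: [
    {
      type: "text",
      lines: [
        { font: { name: "Helvetica-Bold", size: 14 }, text: "Title" },
        { font: { name: "Times-Roman", size: 10 }, text: "Body text" },
      ],
    },
    { type: "image" }, // non-text blocks are skipped
    {
      type: "text",
      lines: [{ font: { name: "Times-Roman", size: 10 }, text: "More body" }],
    },
  ],
};

// Group line texts by font name. Returns a Map of fontName -> string[].
function textByFont(stext) {
  const result = new Map();
  for (const block of stext.blocks ?? []) {
    if (block.type !== "text") continue;
    for (const line of block.lines ?? []) {
      const name = line.font?.name ?? "(unknown)";
      if (!result.has(name)) result.set(name, []);
      result.get(name).push(line.text ?? "");
    }
  }
  return result;
}

const fonts = textByFont(sampleStext);
for (const [name, texts] of fonts) {
  console.log(name, texts);
}
```

With real data you would pass the `json` object from above instead of `sampleStext`.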
For the second question, I think you would need to redact the content and then insert your own version of the text. See redaction: https://mupdfjs.readthedocs.io/en/latest/how-to-guide/node/annotations/redactions/index.html and then adding text: https://mupdfjs.readthedocs.io/en/latest/how-to-guide/node/page/index.html#adding-text-to-pages. The API here is a bit tricky, and we will be working on providing a simpler API for this kind of thing in the future.
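Before touching the page, it may help to work out which lines need replacing. The sketch below only builds that plan from the parsed structured text; it does not call any redaction or text-insertion API, which are covered in the guides linked above. The font name `MySymbols`, the character mapping, the `symbolPage` sample, and the `planReplacements` helper are all hypothetical, and the JSON layout (`blocks`, `lines`, `font.name`, `bbox`, `text`) is an assumption to verify against your own output.

```javascript
// Hypothetical font name and symbol mapping, for illustration only.
const SYMBOL_FONT = "MySymbols";
const SYMBOL_MAP = new Map([
  ["\uF041", "→"],
  ["\uF042", "✓"],
]);

// Hypothetical sample shaped like the parsed asJSON() output (assumed layout).
const symbolPage = {
  blocks: [
    {
      type: "text",
      lines: [
        { font: { name: "MySymbols" }, bbox: { x: 10, y: 20, w: 50, h: 12 }, text: "\uF041 ok" },
        { font: { name: "Times-Roman" }, bbox: { x: 10, y: 40, w: 80, h: 12 }, text: "plain" },
      ],
    },
  ],
};

// List the lines set in fontName whose text changes under the mapping.
// Each entry holds the area to redact (bbox) and the text to insert afterwards.
function planReplacements(stext, fontName, map) {
  const plan = [];
  for (const block of stext.blocks ?? []) {
    if (block.type !== "text") continue;
    for (const line of block.lines ?? []) {
      if (line.font?.name !== fontName) continue;
      const oldText = line.text ?? "";
      const newText = Array.from(oldText, (ch) => map.get(ch) ?? ch).join("");
      if (newText !== oldText) {
        plan.push({ bbox: line.bbox, oldText, newText });
      }
    }
  }
  return plan;
}

const plan = planReplacements(symbolPage, SYMBOL_FONT, SYMBOL_MAP);
console.log(plan);
```

Each entry in `plan` would then drive one redaction over `bbox` followed by one text insertion of `newText` at roughly the same position.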
Thank you for your detailed answer. Migrating the PDF library for a custom viewer is not a small project. I will keep an eye on this library until it matures. @jamie-lemon
In PyMuPDF I can do this on PDF files that have a text layer:
But PDF.js replaces the font name with an internal identifier such as "g_d0_f18". Can mupdf.js extract these text blocks with the actual font names, as the Python script does?
And the second question, still for PDFs with a text layer:
Can I replace or modify some text content before it is rendered into page/HTML elements in the viewer? I need to replace some special symbols (set in a special custom font) with other characters, so that when others select and copy the text they get the modified version.