Open rybon opened 2 years ago
Example to test with: https://arxiv.org/pdf/1708.08021.pdf
For reference, this issue does not occur with PyMuPDF.
Hey @rybon , this is likely because the in-memory representation of the PDF page exceeds the memory allocated to the WASM worker. This limit varies depending the JS engine, see for example V8.
It might be possible to work around this using the browser's memory API but this isn't something I'm familiar with.
For reference, this issue does not occur with PyMuPDF.
This isn't surprising, as that library uses native C bindings instead of webassembly.
Thanks for providing example file, I'll use this to investigate further.
Thanks! I appreciate your efforts.
By the way, I encountered this issue by looping over the number of pages in the document and grabbing the text per page. It seems memory is not freed in C after the text is retrieved. I tried to fix it by mucking about in the C code (adding a few fz_free
and friends method calls) and recompiling to WASM, but wasn't successful.
I tried version 1.20.3
too, but it seems the API interface changed quite a bit in that version.
Fortunately there's a workaround: loading the same PDF with PDF.js and grabbing the page text via its API instead. This works in the browser, web worker and Node.js.
@rybon I've added an automated test suite on this branch.
In it, I've tested the entire API and in particular ran getPageText
with your example file, but this hasn't caused a crash and has successfully written the output to a jest snapshot.
Keen to diagnose further - could you let me know which OS, architecture and hardware you've been running this on, as well as which version of NodeJS and which browser?
Fortunately there's a workaround: loading the same PDF with PDF.js and grabbing the page text via its API instead. This works in the browser, web worker and Node.js.
That makes sense, PDF.js is potentially a better alternative for this use case. I originally created MuPDF.js as a way to render PDF files to images in the browser.
Hi, just adding onto this, similar problem in getting a memory access out of bounds error when looping through multiple pages of a pdf with getPageText() (initially triggered when I was sequentially scanning 15 different PDF's, but in my case for PDF's exceeding even one page). I'm on Windows 11, AMD64, Node 16.13.2. Ultimately the browser will be Electron 18.3.13 but the error is thrown in my testing, which is just running Node. I don't have a sample file to test at the moment since the PDF's I'm scanning are confidential but I'll see if I can find a generic example to attach
Thanks @bfarmilo ! I'll see if I can reproduce the issue on different architectures and node versions using cloud services.
It looks like WASM's promise of portability has some caveats 🤔 this package might well need compatibility information in the documentation.
I managed to resolve all these issues by reusing the build setup of this repo with some minor tweaks and compiling to WASM from the mupdf source code on GitHub. I am no longer using this package itself. The resulting build output requires some minor tweaks as well, but now works flawlessly in the browser and Node.js.
By the way, I also started seeing these issues with searchPageText
, so that's why I had to move away from using this package. Compiling to WASM myself resolved it as well.
@rybon Great to hear that you managed to get it working.
Would you mind letting me know what the minor tweaks were that you made to the build step? Or even submit a PR? I still haven't been able to replicate the issue on my side but looking at your tweaks might help me to figure it out.
Thanks!
Update:
I've been able to replicate the issue in the test suite.
Based on what you've said, it looks like the cause is the options I'm passing to emscripten/cmake, if you're having success building from the original mupdf sources instead of using my build scripts.
Thank you for the pointer - I'm confident I can resolve this issue now.
@andytango I copied the contents of the mupdf GitHub repo (after cloning and initializing its Git submodules) to the overrides
folder. The only tweaks I had to make to the platform/wasm/Makefile
was to remove the pthread
flag in 2 places and to remove mupdf-wasm.worker.js
.
I cannot make use of SharedArrayBuffer
(can't set the required HTTP headers to enable it) which is what WASM uses via its pthread
compilation flag to enable multi-threading. And I organized my Web Worker implementation in a different way than what mupdf-wasm.worker.js
prescribes. I am using a different method of multi-threading by spinning up several workers and handing each worker a slice of the PDF to process with mupdf WASM. This is very fast. I've seen scanned and OCR'd PDFs of around a 1000 pages handled in less than 20 seconds.
Further tweaks:
cp
statement in bin/build.sh
and rename libmupdf
to mupdf-wasm
bin/build.ts
Binds
change the first entry to resolve to ./overrides
because we're no longer using the tmp
folderbin/prep.ts
entirelyThe build will output a mupdf-wasm.wasm
and a mupdf-wasm.js
file in dist
. You'll also need platform/wasm/lib/mupdf.js
. Copy these 3 files to your project.
Edit mupdf.js
and at the top ensure var libmupdf = require("./mupdf-wasm.js");
, then remove the section at the bottom stating let buffer = libmupdf.wasmMemory.buffer;
and the if
and else
block beneath it. Replace with return true;
.
In your project code:
import { default as mupdf } from "./mupdf.js";
await mupdf.ready;
let openDocument = mupdf.Document.openFromJsBuffer(pdf, fileName); // pdf is for example fs.readFileSync("some.pdf").buffer; and fileName "some.pdf"
const numberOfPages = openDocument.countPages();
// perform mupdf API calls here...
openDocument?.free();
openDocument = null;
Note: some API calls require a dpi
parameter. Use 72
as a default.
Note: running this code in the browser requires using Parcel as the bundler.
Some notes for those who want to process a PDF in parallel (browser example, Node.js requires a slightly different API):
npm install --save pdf-lib
navigator.hardwareConcurrency
Spin up as many workers as CPU cores and pass each of them a slice of pages:
import { PDFDocument } from "pdf-lib";
const pdfDoc = await PDFDocument.create();
const srcDoc = await PDFDocument.load(pdf); // see example in comment above
const copiedPages = await pdfDoc.copyPages(srcDoc, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].map((p) => p - 1)); // example of partitioned pages list
copiedPages.forEach((p) => { pdfDoc.addPage(p); });
const pdfBytes = await pdfDoc.save();
cpuWorker.postMessage({ pdf: pdfBytes.buffer, fileName: "some.pdf" }, [pdfBytes.buffer]);
In the worker use the same mupdf code as stated in the previous comment.
@rybon
I managed to resolve all these issues by reusing the build setup of this repo with some minor tweaks and compiling to WASM from the mupdf source code on GitHub. I am no longer using this package itself. The resulting build output requires some minor tweaks as well, but now works flawlessly in the browser and Node.js.
I've looked into this and also ran some builds from the HEAD of the master branch on the mupdf sources on GitHub. Am I right in thinking that you also worked from the latest commit on their master branch?
This is an unstable development branch, and looking through the code, there are a number of TODOs and FIXMEs in their wrapper JS file at platform/wasm/lib/mupdf.js
. On top of this, the JS API has been completely rewritten, so I can't easily migrate mupdf-js
to use this version of mupdf yet.
What I can do in the meantime is update the documentation with your helpful comments on this thread, so that anyone else who is encountering memory or performance issues can workaround them.
It is also very useful to see the changes that will be coming to the next release of mupdf
ahead of time, because of the breaking changes and also what looks like some significant improvements to it in addition to fixing the memory issues we've been seeing.
This is particularly the case given it will take quite some time to write Typescript declarations for this new API, so again, thank you for the heads up.
Yeah, from master. I think from commit 9106030
. I realize running the bleeding edge is risky, but noticed in their commit history they improved memory management for the WASM platform quite a lot. And it definitely helped me.
I made PR https://github.com/andytango/mupdf-js/pull/58 which could fix this problem
getPageText()
crashes with aRuntimeError: memory access out of bounds
exception on PDFs larger than around 20 pages. This happens in the browser / WebWorker and Node.js. On Node.js it crashes withbus error