getPageText crashes on larger PDFs

rybon commented 2 years ago

getPageText() crashes with a RuntimeError: memory access out of bounds exception on PDFs larger than around 20 pages. This happens in the browser / WebWorker and Node.js. On Node.js it crashes with bus error

rybon commented 2 years ago

Example to test with: https://arxiv.org/pdf/1708.08021.pdf

For reference, this issue does not occur with PyMuPDF.

andytango commented 2 years ago

Hey @rybon , this is likely because the in-memory representation of the PDF page exceeds the memory allocated to the WASM worker. This limit varies depending the JS engine, see for example V8.

It might be possible to work around this using the browser's memory API but this isn't something I'm familiar with.

For reference, this issue does not occur with PyMuPDF.

This isn't surprising, as that library uses native C bindings instead of webassembly.

Thanks for providing example file, I'll use this to investigate further.

rybon commented 2 years ago

Thanks! I appreciate your efforts.

By the way, I encountered this issue by looping over the number of pages in the document and grabbing the text per page. It seems memory is not freed in C after the text is retrieved. I tried to fix it by mucking about in the C code (adding a few fz_free and friends method calls) and recompiling to WASM, but wasn't successful.

I tried version 1.20.3 too, but it seems the API interface changed quite a bit in that version.

rybon commented 2 years ago

Fortunately there's a workaround: loading the same PDF with PDF.js and grabbing the page text via its API instead. This works in the browser, web worker and Node.js.

andytango commented 2 years ago

@rybon I've added an automated test suite on this branch.

In it, I've tested the entire API and in particular ran getPageText with your example file, but this hasn't caused a crash and has successfully written the output to a jest snapshot.

Keen to diagnose further - could you let me know which OS, architecture and hardware you've been running this on, as well as which version of NodeJS and which browser?

Fortunately there's a workaround: loading the same PDF with PDF.js and grabbing the page text via its API instead. This works in the browser, web worker and Node.js.

That makes sense, PDF.js is potentially a better alternative for this use case. I originally created MuPDF.js as a way to render PDF files to images in the browser.

bfarmilo commented 2 years ago

Hi, just adding onto this, similar problem in getting a memory access out of bounds error when looping through multiple pages of a pdf with getPageText() (initially triggered when I was sequentially scanning 15 different PDF's, but in my case for PDF's exceeding even one page). I'm on Windows 11, AMD64, Node 16.13.2. Ultimately the browser will be Electron 18.3.13 but the error is thrown in my testing, which is just running Node. I don't have a sample file to test at the moment since the PDF's I'm scanning are confidential but I'll see if I can find a generic example to attach

andytango commented 2 years ago

Thanks @bfarmilo ! I'll see if I can reproduce the issue on different architectures and node versions using cloud services.

It looks like WASM's promise of portability has some caveats 🤔 this package might well need compatibility information in the documentation.

rybon commented 2 years ago

I managed to resolve all these issues by reusing the build setup of this repo with some minor tweaks and compiling to WASM from the mupdf source code on GitHub. I am no longer using this package itself. The resulting build output requires some minor tweaks as well, but now works flawlessly in the browser and Node.js.

rybon commented 2 years ago

By the way, I also started seeing these issues with searchPageText, so that's why I had to move away from using this package. Compiling to WASM myself resolved it as well.

andytango commented 2 years ago

@rybon Great to hear that you managed to get it working.

Would you mind letting me know what the minor tweaks were that you made to the build step? Or even submit a PR? I still haven't been able to replicate the issue on my side but looking at your tweaks might help me to figure it out.

~~Thanks!~~

Update:

I've been able to replicate the issue in the test suite.

Based on what you've said, it looks like the cause is the options I'm passing to emscripten/cmake, if you're having success building from the original mupdf sources instead of using my build scripts.

Thank you for the pointer - I'm confident I can resolve this issue now.

rybon commented 2 years ago

@andytango I copied the contents of the mupdf GitHub repo (after cloning and initializing its Git submodules) to the overrides folder. The only tweaks I had to make to the platform/wasm/Makefile was to remove the pthread flag in 2 places and to remove mupdf-wasm.worker.js.

I cannot make use of SharedArrayBuffer (can't set the required HTTP headers to enable it) which is what WASM uses via its pthread compilation flag to enable multi-threading. And I organized my Web Worker implementation in a different way than what mupdf-wasm.worker.js prescribes. I am using a different method of multi-threading by spinning up several workers and handing each worker a slice of the PDF to process with mupdf WASM. This is very fast. I've seen scanned and OCR'd PDFs of around a 1000 pages handled in less than 20 seconds.

Further tweaks:

remove the top cp statement in bin/build.sh and rename libmupdf to mupdf-wasm
in bin/build.ts Binds change the first entry to resolve to ./overrides because we're no longer using the tmp folder
skip bin/prep.ts entirely

The build will output a mupdf-wasm.wasm and a mupdf-wasm.js file in dist. You'll also need platform/wasm/lib/mupdf.js. Copy these 3 files to your project. Edit mupdf.js and at the top ensure var libmupdf = require("./mupdf-wasm.js");, then remove the section at the bottom stating let buffer = libmupdf.wasmMemory.buffer; and the if and else block beneath it. Replace with return true;.

In your project code:

import { default as mupdf } from "./mupdf.js"; await mupdf.ready; let openDocument = mupdf.Document.openFromJsBuffer(pdf, fileName); // pdf is for example fs.readFileSync("some.pdf").buffer; and fileName "some.pdf" const numberOfPages = openDocument.countPages(); // perform mupdf API calls here... openDocument?.free(); openDocument = null;

Note: some API calls require a dpi parameter. Use 72 as a default.

Note: running this code in the browser requires using Parcel as the bundler.

rybon commented 2 years ago

Some notes for those who want to process a PDF in parallel (browser example, Node.js requires a slightly different API):

install PDF-LIB npm install --save pdf-lib
count the number of pages of the PDF
grab the number of CPU cores available via navigator.hardwareConcurrency
partition the pages according to the number of pages and the number of CPU cores (e.g. 100 pages and 10 CPU cores means 10 pages per worker in parallel)

Spin up as many workers as CPU cores and pass each of them a slice of pages:

import { PDFDocument } from "pdf-lib"; const pdfDoc = await PDFDocument.create(); const srcDoc = await PDFDocument.load(pdf); // see example in comment above const copiedPages = await pdfDoc.copyPages(srcDoc, [1, 2, 3, 4, 5, 6, 7, 8, 9, 10].map((p) => p - 1)); // example of partitioned pages list copiedPages.forEach((p) => { pdfDoc.addPage(p); }); const pdfBytes = await pdfDoc.save(); cpuWorker.postMessage({ pdf: pdfBytes.buffer, fileName: "some.pdf" }, [pdfBytes.buffer]);

In the worker use the same mupdf code as stated in the previous comment.

andytango commented 2 years ago

@rybon

I managed to resolve all these issues by reusing the build setup of this repo with some minor tweaks and compiling to WASM from the mupdf source code on GitHub. I am no longer using this package itself. The resulting build output requires some minor tweaks as well, but now works flawlessly in the browser and Node.js.

I've looked into this and also ran some builds from the HEAD of the master branch on the mupdf sources on GitHub. Am I right in thinking that you also worked from the latest commit on their master branch?

This is an unstable development branch, and looking through the code, there are a number of TODOs and FIXMEs in their wrapper JS file at platform/wasm/lib/mupdf.js. On top of this, the JS API has been completely rewritten, so I can't easily migrate mupdf-js to use this version of mupdf yet.

What I can do in the meantime is update the documentation with your helpful comments on this thread, so that anyone else who is encountering memory or performance issues can workaround them.

It is also very useful to see the changes that will be coming to the next release of mupdf ahead of time, because of the breaking changes and also what looks like some significant improvements to it in addition to fixing the memory issues we've been seeing.

This is particularly the case given it will take quite some time to write Typescript declarations for this new API, so again, thank you for the heads up.

rybon commented 2 years ago

Yeah, from master. I think from commit 9106030. I realize running the bleeding edge is risky, but noticed in their commit history they improved memory management for the WASM platform quite a lot. And it definitely helped me.

anthrax63 commented 1 year ago

I made PR https://github.com/andytango/mupdf-js/pull/58 which could fix this problem

andytango / mupdf-js

getPageText crashes on larger PDFs #9