Use JavaScript for PDF conversion and OCR

stweil commented 2 months ago

Currently zotero-ocr requires additional installation steps for pdftoppm and tesseract.

Both could be replaced by pure JavaScript implementations, which could be bundled with zotero-ocr to simplify the installation.

In a first step we could start with pdf.js.

aborel commented 2 months ago

Interesting idea. Since the Zotero PDF viewer is based on pdf.js (as far as I understand), does this mean we're actually getting part of the pipeline out of the box?

stweil commented 2 months ago

Yes, I think so (I had not noticed that pdf.js is already part of Zotero, so the suggested first step would not increase the size of zotero-ocr). The code of pdf2text-ocr should show the required steps to get the input for Tesseract from a PDF file.

stweil commented 2 months ago

On my MacBook /Applications/Zotero_7.0.3.app/Contents/Resources/omni.ja is a ZIP file which includes pdfjs. So you are right, Zotero already provides it.

stweil commented 1 week ago

claude.ai suggests this code:

async function getPageImageData(pdfUrl) {
  try {
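    // Load the PDF document. Note: Zotero.PDF.getDocument is taken from the
    // suggestion as-is and has not been verified to exist in Zotero.
    const pdf = await Zotero.PDF.getDocument(pdfUrl);
    const numPages = pdf.numPages; // in pdf.js this is a plain property, no await needed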

    // Iterate through each page and extract the image data
    const pageImageData = [];
    for (let pageNumber = 1; pageNumber <= numPages; pageNumber++) {
      const page = await pdf.getPage(pageNumber);
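      // getViewport() is synchronous in pdf.js. scale: 1 maps one PDF point
      // (1/72 inch) to one pixel, which is probably too coarse for OCR.
      const viewport = page.getViewport({ scale: 1 });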
      const canvas = document.createElement('canvas');
      const context = canvas.getContext('2d');

      canvas.height = viewport.height;
      canvas.width = viewport.width;

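      // Render the page on the canvas
      const renderContext = {
        canvasContext: context,
        viewport: viewport
      };
      // render() returns a RenderTask; wait on its .promise so the canvas
      // is fully drawn before the pixels are read
      await page.render(renderContext).promise;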

      // Get the image data from the canvas
      const imageData = context.getImageData(0, 0, canvas.width, canvas.height);
      pageImageData.push(imageData);
    }

    return pageImageData;
  } catch (error) {
    console.error('Error extracting PDF image data:', error);
    throw error;
  }
}

Here's how the code works:

The getPageImageData function takes a PDF file URL as input.
It uses the Zotero.PDF.getDocument method to load the PDF document using Zotero's PDF.js integration.
It then iterates through each page of the PDF document, rendering the page on a canvas element.
For each page, it extracts the image data using the getImageData method of the canvas context.
The extracted image data for each page is collected and returned as an array.

You can use this function in your Zotero plugin to extract the image data for each page of a scanned PDF file. The resulting pageImageData array will contain the image data for each page, which you can then process or store as needed for your plugin's functionality.
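For comparison, here is a minimal sketch of the same loop written against the plain pdf.js API rather than a Zotero wrapper. It is not tested inside Zotero. The names `renderPagesForOcr`, `createCanvas` and the `dpi` parameter are hypothetical, and the 300 dpi default is an assumption (a resolution commonly used for OCR). The pdf.js library object and a canvas factory are passed in by the caller, because how a plugin gets hold of them inside Zotero is not settled in this thread.

```javascript
// Sketch only: `pdfjsLib` and `createCanvas` are supplied by the caller.
// Function name, parameter names and the 300 dpi default are assumptions.
const PDF_POINTS_PER_INCH = 72; // PDF user space unit is 1/72 inch

async function renderPagesForOcr(pdfjsLib, source, createCanvas, dpi = 300) {
  // getDocument() returns a loading task; the document is on its .promise
  const pdf = await pdfjsLib.getDocument(source).promise;
  // pdf.js scale 1 corresponds to 72 pixels per inch, so scale up to the target dpi
  const scale = dpi / PDF_POINTS_PER_INCH;
  const pages = [];

  for (let pageNumber = 1; pageNumber <= pdf.numPages; pageNumber++) {
    const page = await pdf.getPage(pageNumber);
    const viewport = page.getViewport({ scale }); // synchronous call
    const canvas = createCanvas(
      Math.ceil(viewport.width),
      Math.ceil(viewport.height)
    );
    const context = canvas.getContext('2d');

    // render() returns a RenderTask; wait on .promise before reading pixels
    await page.render({ canvasContext: context, viewport }).promise;
    pages.push(context.getImageData(0, 0, canvas.width, canvas.height));
  }

  return pages;
}
```

In a window context the canvas factory could be as simple as `(w, h) => Object.assign(document.createElement('canvas'), { width: w, height: h })`. Rendering every page at 300 dpi before running OCR can use a lot of memory for long documents, so processing one page at a time may be preferable.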

Remember that this code assumes you have access to the PDF file URL. If you need to work with a PDF file stored in the Zotero user's library, you'll need to use the appropriate Zotero APIs to retrieve the file path or URL.
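The last point can be sketched as follows, assuming the usual Zotero item methods `isAttachment()`, `getBestAttachment()`, `getFilePathAsync()` and the `attachmentContentType` property behave as they do in current Zotero versions. The helper name `getPdfPathForItem` is hypothetical, and the whole function should be treated as an untested starting point rather than the plugin's actual approach.

```javascript
// Sketch only: `item` is expected to be a Zotero item object. The item
// methods used here are assumptions to verify against your Zotero version.
async function getPdfPathForItem(item) {
  // For a regular item, fall back to its best attachment (may be false)
  const attachment = item.isAttachment()
    ? item
    : await item.getBestAttachment();

  if (!attachment || attachment.attachmentContentType !== 'application/pdf') {
    return null; // nothing to OCR
  }

  // Resolves to a local path, or false when the file is missing on disk
  const path = await attachment.getFilePathAsync();
  return path || null;
}
```

Returning `null` for every "no usable PDF" case keeps the caller simple: it can skip the item without distinguishing between a missing attachment, a non-PDF attachment, and a file that has not been synced to disk.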

UB-Mannheim / zotero-ocr

Use JavaScript for PDF conversion and OCR #80