Open stweil opened 2 months ago
Interesting idea. Since the Zotero PDF viewer is based on pdf.js (as far as I understand), does this mean we're actually getting part of the pipeline out of the box?
Yes, I think so (I had not noticed that pdf.js is already part of Zotero, so the suggested first step would not increase the size of zotero-ocr). The code of pdf2text-ocr should show the required steps to get the input for Tesseract from a PDF file.
On my MacBook, /Applications/Zotero_7.0.3.app/Contents/Resources/omni.ja is a ZIP file which includes pdfjs. So you are right, Zotero already provides it.
claude.ai suggests this code:
async function getPageImageData(pdfUrl) {
  try {
    // Load the PDF document using Zotero's PDF.js integration.
    // Note: Zotero.PDF.getDocument is part of the suggestion and unverified;
    // plain pdf.js exposes pdfjsLib.getDocument(url).promise instead.
    const pdf = await Zotero.PDF.getDocument(pdfUrl);
    const numPages = pdf.numPages;
    // Iterate through each page and extract the image data
    const pageImageData = [];
    for (let pageNumber = 1; pageNumber <= numPages; pageNumber++) {
      const page = await pdf.getPage(pageNumber);
      // scale 1 renders at 72 DPI; OCR will probably need a larger scale
      const viewport = page.getViewport({ scale: 1 });
      const canvas = document.createElement('canvas');
      const context = canvas.getContext('2d');
      canvas.height = viewport.height;
      canvas.width = viewport.width;
      // Render the page on the canvas
      const renderContext = {
        canvasContext: context,
        viewport: viewport
      };
      // render() returns a RenderTask; wait for its promise to settle
      await page.render(renderContext).promise;
      // Get the image data from the canvas
      const imageData = context.getImageData(0, 0, canvas.width, canvas.height);
      pageImageData.push(imageData);
    }
    return pageImageData;
  } catch (error) {
    console.error('Error extracting PDF image data:', error);
    throw error;
  }
}
Here's how the code works:
The getPageImageData function takes a PDF file URL as input.
It uses the Zotero.PDF.getDocument method to load the PDF document using Zotero's PDF.js integration.
Each page is then rendered to a canvas, and the image data is extracted with the getImageData method of the canvas context.
You can use this function in your Zotero plugin to extract the image data for each page of a scanned PDF file. The resulting pageImageData array will contain the image data for each page, which you can then process or store as needed for your plugin's functionality.
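One thing to keep in mind with the array approach: an ImageData object stores four bytes per pixel (RGBA), so collecting every page before processing can use a lot of memory. A rough estimate, as a sketch only (the helper name `imageDataBytes` and the example page size are my own assumptions, not part of the suggested code):

```javascript
// Hypothetical helper: estimate the memory held by one ImageData object.
// ImageData stores RGBA pixels, i.e. 4 bytes per pixel.
function imageDataBytes(width, height) {
  return width * height * 4;
}

// Example: a page rendered at 2550 x 3300 pixels (US Letter at 300 DPI).
const bytesPerPage = imageDataBytes(2550, 3300);
const mebibytesPerPage = bytesPerPage / (1024 * 1024);
console.log(bytesPerPage, mebibytesPerPage.toFixed(1));
```

At that size a 100-page scan would need more than 3 GB if all pages were kept in the array, so handing each page to the OCR step inside the loop may be preferable.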
Remember that this code assumes you have access to the PDF file URL. If you need to work with a PDF file stored in the Zotero user's library, you'll need to use the appropriate Zotero APIs to retrieve the file path or URL.
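For attachments in the user's library, something along these lines might work. It is an untested sketch: `getPdfPath` is a hypothetical helper name, and I am assuming the Zotero item members `isAttachment()`, `attachmentContentType` and `getFilePathAsync()` behave as I remember them, so check against the Zotero source before relying on it.

```javascript
// Hypothetical helper: return the local file path of a PDF attachment item.
// Assumes `item` is a Zotero item object (e.g. from Zotero.Items.getAsync(id)).
async function getPdfPath(item) {
  if (!item.isAttachment() || item.attachmentContentType !== 'application/pdf') {
    throw new Error('Item is not a PDF attachment');
  }
  // Assumption: getFilePathAsync() resolves to a falsy value
  // when the file is missing on disk.
  const path = await item.getFilePathAsync();
  if (!path) {
    throw new Error('PDF file not found on disk');
  }
  return path;
}
```

The returned path (or a file URL built from it) would then be the input for the PDF loading step.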
Currently zotero-ocr requires additional installation steps for pdftoppm and tesseract. Both could be replaced by pure JavaScript implementations which could be included in zotero-ocr to simplify the installation:
In a first step we could start with pdf.js.
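One detail for that first step: pdftoppm takes its resolution in DPI (the `-r` option), while pdf.js takes a scale factor relative to the PDF default of 72 units per inch. A minimal sketch of the conversion, where the function names and the 300 DPI target are assumptions for illustration:

```javascript
// PDF user space has 72 units per inch by default,
// so pdf.js scale 1 renders at 72 DPI.
const PDF_UNITS_PER_INCH = 72;

// Hypothetical helper: scale factor to pass to page.getViewport({ scale }).
function scaleForDpi(dpi) {
  return dpi / PDF_UNITS_PER_INCH;
}

// Hypothetical helper: pixel size of a page (given in PDF units) at a DPI.
function pixelSize(widthPt, heightPt, dpi) {
  return {
    width: Math.round((widthPt * dpi) / PDF_UNITS_PER_INCH),
    height: Math.round((heightPt * dpi) / PDF_UNITS_PER_INCH),
  };
}

// US Letter is 612 x 792 PDF units.
console.log(scaleForDpi(300), pixelSize(612, 792, 300));
```

With this, a DPI setting meant for pdftoppm could be reused for pdf.js by passing `scaleForDpi(dpi)` as the viewport scale.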