getomni-ai / zerox

PDF to Markdown with vision models
https://getomni.ai/ocr-demo
MIT License
6.68k stars 363 forks source link

fix: page numbering when only converting specific pages #24

Closed dfdeagle47 closed 2 months ago

dfdeagle47 commented 2 months ago

Context

Follow up of https://github.com/getomni-ai/zerox/pull/23 (cf. https://github.com/getomni-ai/zerox/pull/23#issuecomment-2337198811) to fix page numbering when not converting all pages.

When we only convert specific pages, the formattedPages.page index will not be correct.

Description

Depending on the use case, we can retrieve the proper index:

Note: pdf2pic will always return the pages in order, regardless the initial order in pagesToConvertAsImages parameter. Therefore, it's better to always sort the pagesToConvertAsImages array for consistency.

Testing notes

Here are a few tests cases (check the output to see if the page number is correct):

import { zerox } from "./src/index";
import { writeFile } from "node:fs/promises";

async function main() {
  // All pages (i.e. no value provided for `pagesToConvertAsImages`)
  const resultAllPages = await zerox({
    filePath: "https://www.sldttc.org/allpdf/21583473018.pdf",
    openaiAPIKey: process.env.OPENAI_API_KEY,
  });
  console.log(
    resultAllPages.pages.map((p) => ({
      ...p,
      content: p.content.slice(0, 100),
    }))
  );

  // Page 2
  const resultPage1 = await zerox({
    filePath: "https://www.sldttc.org/allpdf/21583473018.pdf",
    openaiAPIKey: process.env.OPENAI_API_KEY,
    pagesToConvertAsImages: 2,
  });
  console.log(
    resultPage1.pages.map((p) => ({
      ...p,
      content: p.content.slice(0, 100),
    }))
  );

  // Page 2 and 3
  const resultPage2and3 = await zerox({
    filePath: "https://www.sldttc.org/allpdf/21583473018.pdf",
    openaiAPIKey: process.env.OPENAI_API_KEY,
    pagesToConvertAsImages: [2, 3],
  });
  console.log(
    resultPage2and3.pages.map((p) => ({
      ...p,
      content: p.content.slice(0, 100),
    }))
  );

  // Page 3 and 2
  const resultPage3and2 = await zerox({
    filePath: "https://www.sldttc.org/allpdf/21583473018.pdf",
    openaiAPIKey: process.env.OPENAI_API_KEY,
    pagesToConvertAsImages: [3, 2],
  });
  console.log(
    resultPage3and2.pages.map((p) => ({
      ...p,
      content: p.content.slice(0, 100),
    }))
  );
}

main();
dfdeagle47 commented 2 months ago

@annapo23 @tylermaran

This PR is not urgent, and you might be busy with other stuff, but do let me know if I should change something or you need more context on the PR to merge this fix. Happy to iterate on it as needed.

annapo23 commented 2 months ago

@dfdeagle47 Works well! I'll go ahead and merge the PR.

pradhyumna85 commented 2 months ago

@annapo23, @dfdeagle47 I want to add this functionality in the Python SDK. I am just curious on how should the maintain_format = True should be handled where we have pagesToConvertAsImages an array where the pages might not be in continuity like [2,4,8]. Any suggestions?

dfdeagle47 commented 2 months ago

@annapo23, @dfdeagle47 I want to add this functionality in the Python SDK. I am just curious on how should the maintain_format = True should be handled where we have pagesToConvertAsImages an array where the pages might not be in continuity like [2,4,8]. Any suggestions?

@pradhyumna85 That's a good question.

As I currently understand it, this parameter is used to maintain the same output markdown format throughout the document, by feeding the previous output page in the prompt. For instance, to have the same types of headings, a consistent tabular format etc.

While it's true that two consecutive pages might be more likely to be similar with each other, I think it doesn't rely too much on any single pages actually. The source document is likely to be consistent throughout, so if you pick two random pages in the source document, the formatting should still be consistent.

So my intuition is that it doesn't matter too much which output page you feed in the context in most cases, and it should still make sense, even when converting specific pages.

Apart from this thought, I don't see an obvious solution because if we need to feed the actual previous page, we'd end up converting the whole document again.

So perhaps, if formatting is really important for the end-user and it doesn't produce the expected result when selecting a few pages, they should try converting the whole document instead. So it's more of a choice for the user of the lib, rather than something that should be handled by the lib itself.

pradhyumna85 commented 2 months ago

@dfdeagle47 that makes sense. I think even if it has to be implemented then in theory maintain_format will only be applicable in continuous subsets of the sorted array like:

@annapo23, @tylermaran, I am thinking on just implementing this PR in the python SDK for now and any further extensions to that can be done later. Let me know your thoughts and how you want this to proceed.