getomni-ai / zerox

PDF to Markdown with vision models
https://getomni.ai/ocr-demo
MIT License
6.68k stars 363 forks source link

feat: allow specifying page numbers #23

Closed dfdeagle47 closed 2 months ago

dfdeagle47 commented 2 months ago

Context

The lib uses GPT-4o / GPT-4o-mini to convert document to images. While the upside of GPT is that it has good accuracy, it can be quite slow.

In some cases, you don't want to convert the whole documents and you only need to extract the text from a few pages, and discarding the results after the processing makes it much slower than it could be.

The goal of this PR is to allow to specify which pages to convert to images.

Description

It adds a parameter pagesToConvertAsImages which is set to -1 by default. This parameter is directly passed to the call to pdf2pic which supports specifying the pages to convert.

Thus, this parameter follows the same format as pdf2pic, namely:

Types and README.md are updated accordingly.

Testing notes

Here's a sample script if you want to see how it would be used:

import { zerox } from "./src/index";
import { writeFile } from "node:fs/promises";

async function main() {
  // All pages (i.e. no value provided for `pagesToConvertAsImages`)
  const resultAllPages = await zerox({
    filePath: "https://www.sldttc.org/allpdf/21583473018.pdf",
    openaiAPIKey: process.env.OPENAI_API_KEY,
  });

  await writeFile(
    "output-all-pages.md",
    resultAllPages.pages.map((p) => p.content).join("\n")
  );

  // Page 2
  const resultPage1 = await zerox({
    filePath: "https://www.sldttc.org/allpdf/21583473018.pdf",
    openaiAPIKey: process.env.OPENAI_API_KEY,
    pagesToConvertAsImages: 2,
  });

  await writeFile(
    "output-page-2.md",
    resultPage1.pages.map((p) => p.content).join("\n")
  );

  // Page 2 and 3
  const resultPage2and3 = await zerox({
    filePath: "https://www.sldttc.org/allpdf/21583473018.pdf",
    openaiAPIKey: process.env.OPENAI_API_KEY,
    pagesToConvertAsImages: [2, 3],
  });

  await writeFile(
    "output-page-2-and-3.md",
    resultPage2and3.pages.map((p) => p.content).join("\n")
  );
}

main();

Open questions

Some questions before moving forwards:

  1. Are you open to adding this parameter to the lib?
  2. I didn't have a lot of inspiration for the parameter name, so any suggestion is welcome.
  3. Currently, if you don't convert all the pages, the result page number in the formattedPages output won't be correct (because we use the array index). Is this a problem?
  4. Currently, if the provided page numbers are out of range, it will crash when calling OpenAI. Should it be handled by the lib, and if so how the error handling should be done (if you have any examples)? Or is it reasonable to expect that the user will enter a valid page range?
  5. Should this be added to the Python lib? I'm not a Python dev, but I can try to add it.
  6. It uses the pdf2pic format to provide the pages. Is this OK, or should it be abstracted? It might also depend what format would be used by the Python lib.

Feel free to let me know if I missed anything (code-wise or documentation-wise).

tylermaran commented 2 months ago

Hey @dfdeagle47, this is a great suggestion. I'll look into this PR later this evening, but generally I think this is a good feature to add in.

pradhyumna85 commented 2 months ago

Hey @dfdeagle47, this is a great suggestion. I'll look into this PR later this evening, but generally I think this is a good feature to add in.

@tylermaran, @dfdeagle47 I can contribute on python side of the implementation. @tylermaran could you please review PR #21, so once that is merged I can raise another PR for this feature in the python SDK. Cheers!

annapo23 commented 2 months ago

@dfdeagle47 Tested it locally, and everything looks great!

annapo23 commented 2 months ago

@dfdeagle47 Just noticed that the page numbers are coming back inconsistent with what they were originally. If I select pages 3 and 5, they come back as 1 and 2. Might be nice to maintain the original page numbers (i.e. if users want to use to reference for citations).

Screen Shot 2024-09-08 at 11 01 44 PM
dfdeagle47 commented 2 months ago

@annapo23

Just noticed that the page numbers are coming back inconsistent with what they were originally.

Yes indeed, that's what I meant in point 3 of the PR description :P.

  1. Currently, if you don't convert all the pages, the result page number in the formattedPages output won't be correct (because we use the array index). Is this a problem?

But I don't know if the PR should have been merged until we've answered all 6 points.

I have an idea for fixing consistency (use the array/number passed in parameter for the mapping), but I was waiting for feedback on point 1 (i.e. if you're OK to support this feature) before spending too much time on it.

I'm OK with me if we revert the PR and I re-open it, to avoid having a half finished feature on the main branch.

dfdeagle47 commented 2 months ago

@annapo23

I made a follow-up PR https://github.com/getomni-ai/zerox/pull/24 to fix the numbering issue.


Regarding the initial open questions I had in this PR, some might need to be addressed.

  1. Are you open to adding this parameter to the lib?

=> Yes, OK to add this feature.

2 . I didn't have a lot of inspiration for the parameter name, so any suggestion is welcome.

=> (do let me know if you want any changes regarding the name)

  1. Currently, if you don't convert all the pages, the result page number in the formattedPages output won't be correct (because we use the array index). Is this a problem?

=> should be addressed by https://github.com/getomni-ai/zerox/pull/24

  1. Currently, if the provided page numbers are out of range, it will crash when calling OpenAI. Should it be handled by the lib, and if so how the error handling should be done (if you have any examples)? Or is it reasonable to expect that the user will enter a valid page range?

=> do you confirm that the behavior in this PR is OK for now?

  1. Should this be added to the Python lib? I'm not a Python dev, but I can try to add it.

=> I don't know the state of the Python lib, i.e. if it should have feature parity with the node lib. If not urgent, it could be addressed in https://github.com/getomni-ai/zerox/pull/21.

  1. It uses the pdf2pic format to provide the pages. Is this OK, or should it be abstracted? It might also depend what format would be used by the Python lib.

=> (do let me know if you have a different opinion on the current implementation)