Closed dfdeagle47 closed 2 months ago
Hey @dfdeagle47, this is a great suggestion. I'll look into this PR later this evening, but generally I think this is a good feature to add in.
@tylermaran, @dfdeagle47 I can contribute to the Python side of the implementation. @tylermaran, could you please review PR #21? Once that is merged, I can raise another PR for this feature in the Python SDK. Cheers!
@dfdeagle47 Tested it locally, and everything looks great!
@dfdeagle47 Just noticed that the page numbers are coming back inconsistent with what they were originally. If I select pages 3 and 5, they come back as 1 and 2. Might be nice to maintain the original page numbers (e.g. if users want to reference them for citations).
@annapo23
Just noticed that the page numbers are coming back inconsistent with what they were originally.
Yes indeed, that's what I meant in point 3 of the PR description :P.
- Currently, if you don't convert all the pages, the result page number in the `formattedPages` output won't be correct (because we use the array index). Is this a problem?
But I'm not sure the PR should have been merged before we'd answered all 6 points.
I have an idea for fixing consistency (use the array/number passed as the parameter for the mapping), but I was waiting for feedback on point 1 (i.e. if you're OK to support this feature) before spending too much time on it.
It's OK with me if we revert the PR and I re-open it, to avoid having a half-finished feature on the main branch.
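The mapping idea can be sketched roughly as follows. This is a minimal illustration, not the lib's actual code: `PagesToConvert` and `originalPageNumbers` are hypothetical names, and the sketch assumes the converted pages come back in the same order as they were requested.

```typescript
// Hypothetical type mirroring the option format discussed in this PR:
// -1 for all pages, a single page number, or a list of page numbers.
type PagesToConvert = number | number[];

/**
 * Returns the original page number for each converted page, in output order.
 * `resultCount` is the number of pages the conversion actually returned.
 */
function originalPageNumbers(pages: PagesToConvert, resultCount: number): number[] {
  if (pages === -1) {
    // Every page was converted, so array index + 1 is already the page number.
    return Array.from({ length: resultCount }, (_, index) => index + 1);
  }
  const requested = Array.isArray(pages) ? pages : [pages];
  // Assumption: results are returned in the order the pages were requested.
  return requested.slice(0, resultCount);
}
```

With this mapping, selecting pages 3 and 5 would report them as 3 and 5 instead of 1 and 2.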
@annapo23
I made a follow-up PR https://github.com/getomni-ai/zerox/pull/24 to fix the numbering issue.
Regarding the initial open questions I had in this PR, some might need to be addressed.
1. Are you open to adding this parameter to the lib?
=> Yes, OK to add this feature.
2. I didn't have a lot of inspiration for the parameter name, so any suggestion is welcome.
=> (do let me know if you want any changes regarding the name)
3. Currently, if you don't convert all the pages, the result page number in the `formattedPages` output won't be correct (because we use the array index). Is this a problem?
=> should be addressed by https://github.com/getomni-ai/zerox/pull/24
4. Currently, if the provided page numbers are out of range, it will crash when calling OpenAI. Should it be handled by the lib, and if so, how should the error handling be done (if you have any examples)? Or is it reasonable to expect that the user will enter a valid page range?
=> do you confirm that the behavior in this PR is OK for now?
5. Should this be added to the Python lib? I'm not a Python dev, but I can try to add it.
=> I don't know the state of the Python lib, i.e. if it should have feature parity with the node lib. If not urgent, it could be addressed in https://github.com/getomni-ai/zerox/pull/21.
6. It uses the `pdf2pic` format to provide the pages. Is this OK, or should it be abstracted? It might also depend on what format would be used by the Python lib.
=> (do let me know if you have a different opinion on the current implementation)
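For the out-of-range question, one option would be to validate the requested pages before any conversion or OpenAI call is made. The sketch below is only an illustration: `findInvalidPages` and `assertPagesInRange` are hypothetical helpers that do not exist in the lib, and it assumes the document's page count is already known at that point.

```typescript
// Hypothetical type mirroring the option format discussed in this PR.
type PagesToConvert = number | number[];

/** Returns the requested page numbers that fall outside 1..pageCount. */
function findInvalidPages(pages: PagesToConvert, pageCount: number): number[] {
  if (pages === -1) {
    return []; // -1 means "all pages", which is always valid.
  }
  const requested = Array.isArray(pages) ? pages : [pages];
  return requested.filter(
    (page) => !Number.isInteger(page) || page < 1 || page > pageCount
  );
}

/** Throws a RangeError early instead of failing later in the pipeline. */
function assertPagesInRange(pages: PagesToConvert, pageCount: number): void {
  const invalid = findInvalidPages(pages, pageCount);
  if (invalid.length > 0) {
    throw new RangeError(
      `Invalid page number(s) ${invalid.join(", ")}: document has ${pageCount} page(s)`
    );
  }
}
```

Failing early with a descriptive error is one design choice; silently dropping invalid pages would be another, at the cost of hiding user mistakes.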
Context
The lib converts document pages to images and uses GPT-4o / GPT-4o-mini to extract their text. While the upside of GPT is that it has good accuracy, it can be quite slow.
In some cases, you don't want to convert the whole document: you only need the text from a few pages. Converting every page and discarding most of the results afterwards makes the processing much slower than it needs to be.
The goal of this PR is to let you specify which pages to convert to images.
Description
It adds a parameter `pagesToConvertAsImages`, which is set to `-1` by default. This parameter is passed directly to the call to `pdf2pic`, which supports specifying the pages to convert.
Thus, this parameter follows the same format as `pdf2pic`, namely:
- `-1` to convert all pages
- `number` (e.g. `1`) to convert a single page
- `number[]` (e.g. `[1, 2, 3]`) to convert multiple pages

Types and README.md are updated accordingly.
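To make the three accepted forms concrete, here is a small sketch of a type and a runtime check for them. The names `PagesToConvert` and `isPagesToConvert` are hypothetical and not taken from the lib's types. Rejecting empty arrays and non-positive page numbers is a choice made for this sketch, not documented `pdf2pic` behavior.

```typescript
// Hypothetical type for the option: -1, a page number, or a list of page numbers.
type PagesToConvert = number | number[];

function isPositiveInteger(value: unknown): value is number {
  return typeof value === "number" && Number.isInteger(value) && value >= 1;
}

/** Checks that a value has one of the three shapes described above. */
function isPagesToConvert(value: unknown): value is PagesToConvert {
  if (value === -1) {
    return true; // convert all pages
  }
  if (Array.isArray(value)) {
    // Multiple pages: a non-empty list of 1-based page numbers.
    return value.length > 0 && value.every((item) => isPositiveInteger(item));
  }
  // Single page: a 1-based page number.
  return isPositiveInteger(value);
}
```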
Testing notes
Here's a sample script if you want to see how it would be used:
Open questions
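Some questions before moving forward:
1. Are you open to adding this parameter to the lib?
2. I didn't have a lot of inspiration for the parameter name, so any suggestion is welcome.
3. Currently, if you don't convert all the pages, the result page number in the `formattedPages` output won't be correct (because we use the array index). Is this a problem?
4. Currently, if the provided page numbers are out of range, it will crash when calling OpenAI. Should it be handled by the lib, and if so, how should the error handling be done (if you have any examples)? Or is it reasonable to expect that the user will enter a valid page range?
5. Should this be added to the Python lib? I'm not a Python dev, but I can try to add it.
6. It uses the `pdf2pic` format to provide the pages. Is this OK, or should it be abstracted? It might also depend on what format would be used by the Python lib.

Feel free to let me know if I missed anything (code-wise or documentation-wise).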