getomni-ai / zerox

PDF to Markdown with vision models
https://getomni.ai/ocr-demo
MIT License
6.68k stars, 363 forks

Feat. Postprocessing control - custom page separator, postprocess function etc #40

Open pradhyumna85 opened 2 months ago

pradhyumna85 commented 2 months ago

To accommodate and resolve #37

Changes

Note: This PR builds on PR #39. If this PR is merged, it will bring in the changes from PR #39 as well, so PR #39 will not need to be merged separately.

Edit: Fixes #42

tylermaran commented 2 months ago

I'll take a look and test this one as well. I like the page_separator optional param. I was thinking of adding a <=== Page {x} ===> to our own use of zerox the other day, so this makes sense!

pradhyumna85 commented 2 months ago

@tylermaran, do you want to provide an option to pass a string with a fixed placeholder like {page_no}, which we would populate with the page number when writing the output Markdown file (if the user has chosen to write output)? If the placeholder is not found, we would leave the string unchanged.
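To make the proposal concrete, here is a minimal sketch of the placeholder substitution. The helper name `renderPageSeparator` is hypothetical and not part of zerox; only the `{page_no}` placeholder comes from the suggestion above.

```typescript
// Hypothetical helper (not part of zerox): fills the {page_no} placeholder
// in a user-supplied separator string.
function renderPageSeparator(template: string, pageNo: number): string {
  // split/join replaces every occurrence of the placeholder and returns the
  // template unchanged when the placeholder is absent.
  return template.split("{page_no}").join(String(pageNo));
}
```

A separator without the placeholder, such as a plain blank line, passes through untouched, so existing behaviour would not change for users who do not opt in.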

tylermaran commented 2 months ago

Hey @pradhyumna85, thinking through this a bit more. Right now we return an array of objects (including page number, content, etc.). So for our day-to-day use I was doing:

const result = await zerox(...)
const aggregatedText = result.pages.map((el) => el.content).join('\n\n');

But if we're writing to the output directory, it makes sense to have a configurable page delimiter built in. Something like === Page {X} === might be a better default than \n\n, though.
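A rough sketch of what a built-in delimiter could look like when aggregating pages. The `Page` shape (a `page` number plus `content`) and the `joinPages` helper are assumptions for illustration, not the actual zerox API; the `{X}` placeholder and default string follow the suggestion above.

```typescript
// Assumed shape of one page in the result; only the fields used here.
interface Page {
  page: number;
  content: string;
}

// Hypothetical helper (not part of zerox): writes a delimiter line before
// each page, replacing {X} with that page's number.
function joinPages(pages: Page[], separator = "=== Page {X} ==="): string {
  return pages
    .map((p) => {
      const header = separator.split("{X}").join(String(p.page));
      return `${header}\n\n${p.content}`;
    })
    .join("\n\n");
}
```

Called as `joinPages(result.pages)`, this would replace the manual `map(...).join('\n\n')` above while keeping page boundaries visible in the output.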

pradhyumna85 commented 2 months ago

@tylermaran, I've added the functionality. Please have a look.

pradhyumna85 commented 1 month ago

@tylermaran, could you review this PR for merging?

Also, should I incorporate the new system prompt from #48?