getomni-ai / zerox

PDF to Markdown with vision models
https://getomni.ai/ocr-demo
MIT License

Research: Add bounding boxes to response #7

Open · tylermaran opened this issue 3 months ago

tylermaran commented 3 months ago

Generally I would love to have some bounding boxes come back with the text response. Primarily for highlighting locations in the original document where the text got pulled. Not sure exactly how I would proceed with this one, but would love to hear some thoughts.

I think the general flow would be:

  1. Parse the document with gpt mini
  2. Split the resulting markdown into semantic sections (e.g. headers, subheaders, tables, etc.)
  3. For each semantic section, use [insert ai tool] to find bounding boxes in the original image
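
For step 2, something as simple as the heuristic below might be enough to start with. This is only a sketch of one possible splitter, not how zerox currently handles pages:

```python
def split_markdown_sections(markdown: str) -> list[dict]:
    """Heuristically split page markdown into rough semantic sections."""
    sections: list[dict] = []
    current: list[str] = []

    def flush() -> None:
        if not current:
            return
        text = "\n".join(current).strip()
        current.clear()
        if not text:
            return
        if text.startswith("#"):
            kind = "heading"
        elif text.lstrip().startswith("|"):
            kind = "table"
        else:
            kind = "paragraph"
        sections.append({"type": kind, "markdown": text})

    for line in markdown.splitlines():
        if not line.strip():
            flush()                 # a blank line ends the current block
        elif line.startswith("#"):
            flush()                 # a heading becomes its own section
            current.append(line)
            flush()
        else:
            current.append(line)
    flush()
    return sections
```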
getwithashish commented 2 months ago

Hey @tylermaran

This seems exciting. Would love to work on it.

I think tweaking the system prompt would do the trick.

getwithashish commented 2 months ago

Hey @tylermaran

I played around with the prompts for some time. I was able to get the bounding boxes back, but they are not 100% accurate. Some boxes are off by 10-20 pixels. Maybe it's due to the image scaling done by GPT. I'm looking into whether that can be solved.

getwithashish commented 2 months ago

[image: image_with_bb]

It is able to identify the sections: heading, paragraph, paragraph, table. But the bounding boxes become less accurate when there is more data on the page.

getwithashish commented 2 months ago

Seems like we need to go with a different approach.

This is the flow that I have in mind:

  1. Get the different sections and the corresponding markdown, using GPT
  2. Use an OCR package to extract text and corresponding bounding boxes
  3. Compare it with the obtained markdown, to get the bounding box of a section
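
For step 2, pytesseract's word-level output would be one option (pytesseract is just an example of an OCR package here, not a decision):

```python
# Sketch of step 2: extract each recognized word with its pixel bounding box.
# pytesseract is an assumption; any OCR package that returns word boxes would do.
from PIL import Image
import pytesseract
from pytesseract import Output


def ocr_words_with_boxes(image_path: str) -> list[dict]:
    image = Image.open(image_path)
    data = pytesseract.image_to_data(image, output_type=Output.DICT)
    words = []
    for i, text in enumerate(data["text"]):
        if not text.strip():
            continue  # skip empty detections
        words.append({
            "text": text,
            # Pascal VOC style: [x_min, y_min, x_max, y_max]
            "bbox": [
                data["left"][i],
                data["top"][i],
                data["left"][i] + data["width"][i],
                data["top"][i] + data["height"][i],
            ],
        })
    return words
```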
tylermaran commented 2 months ago

Hey @getwithashish! This is really promising. Can you share the prompts you were using to get the bounding boxes returned?

pradhyumna85 commented 2 months ago

> Seems like we need to go with a different approach.
>
> This is the flow that I have in mind:
>
>   1. Get the different sections and the corresponding markdown, using GPT
>   2. Use an OCR package to extract text and corresponding bounding boxes
>   3. Compare it with the obtained markdown, to get the bounding box of a section

@getwithashish I think the most straightforward way to do this, with a slightly modified workflow, would be:

So, in this approach the biggest concern I have is cost: how economical would it be to call vision model APIs a couple of hundred times per page, on different bounding box crops?

In the approach you shared, especially the last step, we have to do a reverse matching, i.e., compare the text-only (markdown formatting removed) vision model output against the Tesseract OCR text with a fuzzy search to obtain the mapping for each bounding box. There will also be some hyperparameters here, like the fuzzy matching threshold, the text chunking logic, chunk size, etc.
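
Roughly something like this for that matching step, assuming word-level OCR boxes as input and rapidfuzz for the fuzzy search (both are assumptions, and the score threshold is one of the hyperparams mentioned):

```python
# Sketch of the reverse matching: slide a window of OCR words across the page
# and keep the window whose text best matches a section's plain text.
# rapidfuzz and the default threshold are assumptions, not zerox internals.
from rapidfuzz import fuzz


def match_section_to_ocr(section_text: str, ocr_words: list[dict],
                         threshold: float = 80.0) -> list[int] | None:
    n_words = len(section_text.split())
    best_score, best_span = 0.0, None
    for start in range(len(ocr_words)):
        window = ocr_words[start:start + n_words]
        if not window:
            break
        candidate = " ".join(w["text"] for w in window)
        score = fuzz.ratio(section_text, candidate)
        if score > best_score:
            best_score, best_span = score, window
    if best_span is None or best_score < threshold:
        return None  # no confident match; tune threshold / chunking
    # Merge the word boxes of the best window into one section box.
    return [
        min(w["bbox"][0] for w in best_span),
        min(w["bbox"][1] for w in best_span),
        max(w["bbox"][2] for w in best_span),
        max(w["bbox"][3] for w in best_span),
    ]
```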

It could be an interesting milestone for the future, though, to make the output comparable to and compatible with traditional bounding-box OCR methods.

@tylermaran

getwithashish commented 2 months ago

> Hey @getwithashish! This is really promising. Can you share the prompts you were using to get the bounding boxes returned?

@tylermaran

System Prompt 1: Convert the following PDF page to markdown. Return only the markdown with no explanation text. Do not exclude any content from the page.

System Prompt 2: Group each semantic sections like header, footer, body, headings, table and so on. Include the bounding box of the corresponding section in pascal voc format. Image width is 768px and Image height is 768px. The response format should be of the following format: """{"type": "semantic section type", "bbox": [x_min, y_min, x_max, y_max], "markdown": "markdown content of the corresponding section"}""". Make sure to replace semantic section type with the actual type, and [x_min, y_min, x_max, y_max] with the actual bounding box coordinates in Pascal VOC format. Ensure that the markdown content is accurate and includes all relevant data from the page. Only return the contents which are in the page.
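
For context, this is roughly how the image and the second prompt get sent (sketch only; the model name and the base64 single-image payload are placeholders, not the exact code used):

```python
# Sketch of the call for System Prompt 2; the model name and the base64
# single-image payload are placeholders for illustration.
import base64

from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT_2 = "Group each semantic sections like header, footer, body, ..."  # full prompt quoted above


def sections_with_bboxes(image_path: str) -> str:
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT_2},
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {
                            "url": f"data:image/png;base64,{b64}",
                            "detail": "high",
                        },
                    }
                ],
            },
        ],
    )
    return response.choices[0].message.content
```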

I also resized the image according to the docs before sending it:

Image inputs are metered and charged in tokens, just as text inputs are. The token cost of a given image is determined by two factors: its size, and the detail option on each image_url block. All images with detail: low cost 85 tokens each. detail: high images are first scaled to fit within a 2048 x 2048 square, maintaining their aspect ratio. Then, they are scaled such that the shortest side of the image is 768px long. Finally, we count how many 512px squares the image consists of. Each of those squares costs 170 tokens. Another 85 tokens are always added to the final total.

Here are some examples demonstrating the above.

  • A 1024 x 1024 square image in detail: high mode costs 765 tokens. 1024 is less than 2048, so there is no initial resize. The shortest side is 1024, so we scale the image down to 768 x 768. 4 512px square tiles are needed to represent the image, so the final token cost is 170 * 4 + 85 = 765.
  • A 2048 x 4096 image in detail: high mode costs 1105 tokens. We scale down the image to 1024 x 2048 to fit within the 2048 square. The shortest side is 1024, so we further scale down to 768 x 1536. 6 512px tiles are needed, so the final token cost is 170 * 6 + 85 = 1105.
  • A 4096 x 8192 image in detail: low mode costs 85 tokens. Regardless of input size, low detail images are a fixed cost.
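
Those rules boil down to a small calculation. A quick sketch of the arithmetic (only a reading of the quoted docs, not an official pricing function):

```python
# Token cost for a detail: high image, per the scaling rules quoted above.
# The quoted docs describe scaling down; very small images are not handled here.
import math


def high_detail_token_cost(width: int, height: int) -> int:
    # 1. Fit within a 2048 x 2048 square, keeping the aspect ratio.
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = int(width * scale), int(height * scale)
    # 2. Scale so the shortest side is 768px.
    scale = 768 / min(width, height)
    width, height = int(width * scale), int(height * scale)
    # 3. Count 512px tiles: 170 tokens each, plus a flat 85.
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 170 * tiles + 85


print(high_detail_token_cost(1024, 1024))  # 765
print(high_detail_token_cost(2048, 4096))  # 1105
```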
getwithashish commented 2 months ago

Hey @pradhyumna85,

You’re right—there’s no guarantee that the sections identified through OCR will align perfectly with those derived from markdown. Moreover, using vision models for each bounding box crop is not viable. 😥

The ideal solution would indeed be to leverage the bounding boxes directly from the model that generates the markdown. Since visual grounding is not supported by GPT, I suppose we have to go with a workaround of using OCR.

Instead of relying on visual models, I am working on an algorithm to effectively perform a similarity search between the text extracted via OCR and the model's output. I'm on it like a squirrel with a nut🐿️🤓🥜

pradhyumna85 commented 2 months ago

@getwithashish This would be something interesting. On the similarity part, have a look at this research paper, which interestingly uses DTW for the same: Measuring text similarity with dynamic time warping.
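
For reference, the DTW part itself is just a dynamic program over two sequences; a textbook sketch is below (the paper's keyword-to-time-series encoding is the interesting part and is not reproduced here):

```python
# Textbook DTW distance between two numeric sequences, for illustration only.
def dtw_distance(a: list[float], b: list[float]) -> float:
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = DTW distance between a[:i] and b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = abs(a[i - 1] - b[j - 1])
            cost[i][j] = step + min(
                cost[i - 1][j],      # insertion
                cost[i][j - 1],      # deletion
                cost[i - 1][j - 1],  # match
            )
    return cost[n][m]
```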

getwithashish commented 2 months ago

> @getwithashish This would be something interesting. On the similarity part, have a look at this research paper, which interestingly uses DTW for the same: Measuring text similarity with dynamic time warping.

The paper was intriguing, but I've got a few qualms. Converting text into a time series and then using DTW sounds pretty cool, but the tricky part is choosing the right keywords from the text. Since the documents can come from any random domain, picking out the right keywords gets a lot harder.

Sure, we could use TF-IDF to select keywords, but that works best for big datasets. When we're dealing with smaller sections, it’s like trying to pick the ripest fruit in a basket while wearing sunglasses indoors — you might grab something, but there's a good chance it’s not what you were looking for.

That said, it’s definitely an interesting approach.

getwithashish commented 2 months ago

[image: cs101_with_bb]

I am currently working on the bounding box for the table. With a little more fine-tuning, we should be good to go.

@tylermaran What are your thoughts?

pradhyumna85 commented 2 months ago

@getwithashish, just trying to understand here: how many types of element bounding boxes are you targeting exactly, and how?

getwithashish commented 2 months ago

@tylermaran @pradhyumna85

This is the current flow:

getwithashish commented 2 months ago

[image]

We'll now get section-wise normalized bounding boxes along with content.
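
Here, normalized just means the pixel coordinates divided by the rendered page width and height, so the values stay in [0, 1] regardless of resolution. The field names below are only illustrative; the actual schema is whatever lands in the PR:

```python
# Illustrative only: the real response schema is defined in the PR.
def normalize_bbox(bbox: list[int], width: int, height: int) -> list[float]:
    x_min, y_min, x_max, y_max = bbox
    return [x_min / width, y_min / height, x_max / width, y_max / height]


section = {
    "type": "table",
    "bbox": normalize_bbox([64, 410, 704, 630], width=768, height=768),
    "markdown": "...",  # the section's markdown content
}
```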

getwithashish commented 2 months ago

I will be kicking off the PR today! It’s been a hot minute since I started on this feature, but hey, better late than never. 😄

getwithashish commented 2 months ago

Hey @tylermaran,

PR’s up and ready for your review! 🧐 Let me know what you think!