Azure-Samples / azure-openai-gpt-4-vision-pdf-extraction-sample

This sample demonstrates how to use GPT-4o with Vision to extract structured JSON data from PDF documents, such as invoices, using the Azure OpenAI Service.
MIT License
57 stars 17 forks

Automating Infographics/Graphs Analysis and extracting to structured JSON #3

Open raziurtotha opened 7 months ago

raziurtotha commented 7 months ago

Firstly, I'd like to express my appreciation for the provided sample codes. They've been quite helpful.

I'm currently working with a large number of PDF files, each containing a mix of infographics, graphs, charts, and text. These elements appear in no systematic order within the documents. My objective is to use the Azure OpenAI GPT-4 Vision API to interpret the context and details of these visual elements, then extract and summarize this information into structured JSON data with specific key:value pairs. Some of these keys, such as document_title, author, publication_date, etc., are predefined in the prompt alongside a few-shot examples. At the moment, my process involves handling each PDF file individually with ChatGPT (GPT-4).

Could anyone offer guidance or insights on how to perform this analysis and data extraction efficiently using the GPT-4 Vision API for a large number of very unstructured PDF files?

An example PDF file is attached below: Gen Z (Global) report - GWI.pdf

Any suggestions or advice on streamlining this task would be immensely appreciated.

jamesmcroft commented 6 months ago

Hi @raziurtotha, thanks for taking the time to explore the sample.

To clarify, are you aiming to process multiple documents and consolidate the extracted data? Or do you want to scale out document extraction across many differing documents?

I would recommend exploring Durable Functions for either scenario if these are intended to be long-running processes whose results you will return later. The benefit of using Durable Functions is that the orchestration state is also persisted, so if something were to fail in your environment, the orchestration can recover and continue execution.

Durable Functions will give you the flexibility to create workflows, chaining activities together to perform the steps shown in the sample here: splitting documents, converting them, and prompting the GPT-4 Vision model. For batch processing, you can also adopt the fan-out pattern for Functions.
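As a rough illustration of that chaining, here is a minimal sketch of a Python orchestrator written as a generator. It relies only on `context.get_input()` and `context.call_activity()` from the Durable Functions Python library. The activity names (`split_pdf_into_pages`, `convert_pages_to_images`, `prompt_vision_model`) and the shape of their inputs and outputs are hypothetical placeholders, not names taken from the sample.

```python
def extract_document_orchestrator(context):
    """Chain three activities for a single PDF: split, convert, prompt.

    `context` is expected to be a Durable Functions orchestration context.
    Every activity name below is a hypothetical placeholder.
    """
    # Assumption: the orchestration input is a reference to one PDF,
    # for example a blob name.
    pdf_reference = context.get_input()

    # Step 1: split the document into pages.
    page_refs = yield context.call_activity("split_pdf_into_pages", pdf_reference)

    # Step 2: convert the pages into images the vision model can read.
    image_refs = yield context.call_activity("convert_pages_to_images", page_refs)

    # Step 3: prompt the vision model and get structured JSON back,
    # e.g. keys such as document_title, author, publication_date.
    extracted = yield context.call_activity("prompt_vision_model", image_refs)

    return extracted
```

Assuming the v1 Python programming model, a generator like this would typically be registered with `df.Orchestrator.create(extract_document_orchestrator)` after `import azure.durable_functions as df`. Each `yield` is a checkpoint, which is what allows the orchestration to resume from stored state after a failure.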

If your request relates to the first clarifying question, the fan-out/fan-in pattern described in the link above will be ideal. Essentially, a batch request fans out to perform extraction on each document in parallel, and the results are then consolidated (fan-in).
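To make the fan-out/fan-in shape concrete, here is a minimal sketch of a Python orchestrator, again written as a generator. It uses `context.call_activity()` to start one task per document and `context.task_all()` to wait for all of them. The activity names `extract_document` and `consolidate_results` are hypothetical, as is the assumption that the orchestration input is a list of PDF references.

```python
def batch_extraction_orchestrator(context):
    """Fan out extraction over many PDFs, then fan in the results.

    `context` is expected to be a Durable Functions orchestration context.
    The activity names below are hypothetical placeholders.
    """
    # Assumption: the input is a list of PDF references (e.g. blob names).
    pdf_references = context.get_input()

    # Fan out: schedule one extraction task per document without yielding,
    # so the tasks can run in parallel.
    extraction_tasks = [
        context.call_activity("extract_document", pdf_reference)
        for pdf_reference in pdf_references
    ]

    # Fan in: wait until every extraction task has completed.
    per_document_results = yield context.task_all(extraction_tasks)

    # Consolidate the per-document JSON into a single result.
    consolidated = yield context.call_activity(
        "consolidate_results", per_document_results
    )
    return consolidated
```

If each document needs the full split, convert, and prompt chain, the per-document step could be a sub-orchestration rather than a single activity. With a large number of PDFs it may also be worth fanning out in smaller batches to stay within the model deployment's rate limits.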