Azure / multimodal-ai-llm-processing-accelerator

Build multimodal data processing pipelines with Azure AI Services + LLMs
MIT License



Azure Multimodal AI + LLM Processing Accelerator

Table of Contents

Overview

This accelerator is a customizable code template for building and deploying production-grade data processing pipelines that incorporate Azure AI services and Azure OpenAI/AI Studio LLM models. It uses a variety of data pre-processing and enrichment components to make it easy to build complex, reliable and accurate pipelines that solve real-world use cases. If you'd like to use AI to summarize, classify, extract or enrich your data with structured and reliable outputs, this is the code repository for you.

Important Note: This accelerator is currently under development and may include regular breaking changes.

It is recommended to review the main repo before pulling new changes, as work is in progress to replace many of the third-party components (e.g. those imported from Haystack) with more complete, performant & fully-featured components. Once the core application is stable, a standard release pattern with semantic versioning will be used to manage releases.

Solution Design


Key features

Why use this accelerator?

Most organisations have a huge number of simple tasks and processes that consume large amounts of time and energy. These could be things like classifying and extracting information from documents, summarizing and triaging customer emails, or transcribing and running compliance tasks on contact centre call recordings. While some of these tasks can be automated with existing tools and services, they often require a lot of up-front investment to fully configure and customize in order to have a reliable, working solution. They can also perform poorly when dealing with input data that is slightly different than expected, and may never be the right fit for scenarios that require the solution to be flexible or adaptable.

On the other hand, Large Language Models have emerged as a powerful and general-purpose approach that is able to handle these complex and varied situations. And more recently, with the move from text-only models to multimodal models that can incorporate text, audio and video, they are a powerful tool that we can use to automate a wide variety of everyday tasks. But while LLMs are powerful and flexible, they have their own shortcomings when it comes to providing precise and reliable outputs, and they too can be sensitive to the quality of raw and unprocessed input data.

| Approach + Examples | Strengths | Weaknesses |
|---|---|---|
| Domain-specific AI models (OCR, Speech-to-text, Object detection) | Generally better performance on specialized tasks; consistent performance and output format; cost-efficient & scalable | Outputs may require translation into human-friendly format; larger up-front development cost; tuning & customization may be more time-consuming; customized models may be less reliable/flexible with unseen data |
| Large Language Models (Azure OpenAI, Open Source Models) | Define behaviour with natural language; shorter up-front development time; more flexible with wide-ranging input data; outputs are in human-friendly format | Non-deterministic & lower reliability of outputs; harder to constrain & test (a black box); no consistent, structured metadata by default; uncalibrated, no confidence score (when can the outputs be trusted?); expensive and slow, especially for transcription/translation tasks |

This accelerator provides the tools and patterns required to combine the best of both worlds in your production workloads, giving you the reasoning power, flexibility and development speed of Large Language Models, while using domain-specific AI during pre- and post-processing to increase the consistency, reliability and cost-efficiency of the overall system.

Example pipeline output

Here is an example of the pre-built Form Field Extraction pipeline. By combining the structured outputs from Azure Document Intelligence with GPT-4o, we can verify and enrich the values extracted by GPT-4o with confidence scores, bounding boxes, style and more. This allows us to make sure the LLM has not hallucinated, and allows us to automatically flag the document for human review if the confidence scores do not meet our minimum criteria (in this case, all values must have a Document Intelligence confidence score above 80% to avoid human review).
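The review rule described above can be sketched in a few lines. This is a minimal illustration only, not the accelerator's actual code: the `ExtractedField` class, its field names and the `requires_human_review` helper are hypothetical, and the only value taken from the text is the 80% threshold.

```python
from dataclasses import dataclass
from typing import List, Optional

# Threshold from the example above: every value must score above 80%.
MIN_CONFIDENCE = 0.8


@dataclass
class ExtractedField:
    """Hypothetical container for one LLM-extracted value and its OCR confidence."""
    name: str
    value: str
    # None means no matching Document Intelligence content was found for the value.
    confidence: Optional[float]


def requires_human_review(
    fields: List[ExtractedField], min_confidence: float = MIN_CONFIDENCE
) -> bool:
    """Flag the document if any field is unmatched or not above the threshold."""
    return any(
        field.confidence is None or field.confidence <= min_confidence
        for field in fields
    )


extracted = [
    ExtractedField("order_id", "A-1001", 0.97),
    ExtractedField("customer_name", "J. Smith", 0.62),
]
print(requires_human_review(extracted))
```

Treating an unmatched value (`confidence=None`) as a review trigger is a design assumption here: a value the OCR output cannot confirm is the case most likely to be a hallucination.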

Form Extraction Example

Real-World Case Study

In a recent customer project that involved extracting Order IDs from scanned PDFs and phone images, we used a number of these techniques to raise accuracy from ~60% (GPT-4o alone) to near-perfect.

At the conclusion of this project, our customer was able to deploy the solution and automate the majority of their processing workload with confidence, knowing that any cases that were too challenging for the LLM would automatically be escalated for review. Reviews can now be completed in a fraction of the time thanks to the additional metadata returned with each result.

Prebuilt pipelines

The accelerator comes with these pre-built pipeline examples to help you get started. Each pipeline is built in its own Python file as a function blueprint, and then imported and added to the main function app within function_app/function_app.py.

Example Description & Pipeline Steps
Form Field Extraction with Confidence Scores & bboxes
(HTTP)
Code
Extracts key information from a PDF form and returns field-level and overall confidence scores and whether human review is required.
- PyMuPDF (PDF -> Image)
- Document Intelligence (PDF -> text)
- GPT-4o (text + image input)
- Post-processing:
    - Match LLM field values with Document Intelligence extracted lines
    - Merge Confidence scores and bounding boxes
    - Determine whether human review is required
- Return structured JSON
Call Center Analysis with Confidence Scores & Timestamps
(HTTP)
Code
Processes a call center recording, classifying customer sentiment & satisfaction, summarizing the call and next best action, and extracting any keywords mentioned. Returns the response with timestamps, confidence scores and the full sentence text for the next best action and each of the keywords mentioned.
- Azure AI Speech (Speech -> Text)
- GPT-4o (text input)
- Post-processing:
    - Match LLM timestamps to transcribed phrases
    - Merge sentence info & confidence scores
- Return structured JSON
Form Field Extraction
(Blob -> CosmosDB)
Code: Func, Pipeline
Extracts key information from a PDF form uploaded to blob storage and writes the structured result to a CosmosDB container.
- Pipeline triggered by blob storage event
- PyMuPDF (PDF -> Image)
- Document Intelligence (PDF -> text)
- GPT-4o (text + image input)
- Write structured JSON result to CosmosDB container.
Summarize Text
(HTTP)
Code
Summarizes text input into a desired style and number of output sentences.
- GPT-4o (text input + style/length instructions)
- Return raw text
Multimodal Document Intelligence Processing
(HTTP)
Code
A pipeline showcasing the highly configurable Document Intelligence Processor that intelligently processes the raw Doc Intelligence API response to extract text, images and tables from a PDF/image into a more usable and flexible format.
- Document Intelligence (PDF/image -> text + images + tables)
- Return content as Markdown
City Names Extraction, Doc Intelligence
(HTTP)
Code
Uses GPT-4o to extract all city names from a given PDF (using text extracted by Document Intelligence).
- Document Intelligence (PDF/image -> text)
- GPT-4o (text input)
- Return JSON array of city names
City Names Extraction, PyMuPDF
(HTTP)
Code
Uses GPT-4o to extract all city names from a given PDF/image + text (extracted locally by PyMuPDF).
- PyMuPDF (PDF/image -> text & images)
- GPT-4o (text + image input)
- Return JSON array of city names

These pipelines can be duplicated and customized to your specific use case, and should be modified as required. The pipelines all return a large amount of additional information (such as intermediate outputs from each component, time taken for each step, and the raw source code) which will usually not be required in production use cases. Make sure to review the code thoroughly prior to deployment.
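As an illustration of the kind of post-processing listed for the Call Center Analysis pipeline ("Match LLM timestamps to transcribed phrases"), here is a minimal sketch. It assumes the LLM returns a timestamp in seconds and that the transcript is available as a list of phrases with start times and confidence scores. The `Phrase` class, the `find_phrase` helper and the tolerance value are hypothetical and are not taken from the repository.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Phrase:
    """Hypothetical transcribed phrase from a speech-to-text result."""
    start_secs: float
    text: str
    confidence: float


def find_phrase(
    phrases: List[Phrase], llm_timestamp_secs: float, tolerance_secs: float = 0.5
) -> Optional[Phrase]:
    """Return the phrase starting closest to the LLM's timestamp.

    Returns None when no phrase starts within the tolerance, so the caller
    can treat the LLM's answer as unverified.
    """
    if not phrases:
        return None
    closest = min(phrases, key=lambda p: abs(p.start_secs - llm_timestamp_secs))
    if abs(closest.start_secs - llm_timestamp_secs) > tolerance_secs:
        return None
    return closest


transcript = [
    Phrase(0.0, "Hello, thanks for calling.", 0.95),
    Phrase(4.2, "I want to cancel my order.", 0.88),
]
match = find_phrase(transcript, llm_timestamp_secs=4.0)
print(match.text if match else "no matching phrase")
```

Once a phrase is matched, its sentence text and confidence score can be merged into the response. A missing match is a natural signal for escalating the result to human review.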

Demo web app

The accelerator comes with an included web app for demo and testing purposes. This webapp is built with Gradio, a lightweight Python UI library, to enable interaction with the backend pipelines from within the browser. The app comes prebuilt with a tab for each of the prebuilt pipelines, along with a few example files for use with each pipeline. The demo app also

Common scenarios & use cases

Roadmap & upcoming features

This accelerator is in active development, with a list of upcoming features including:

To help prioritise these features or request new ones, please head to the Issues section of this repository.

FAQ

How can I get started with a solution for my own use case?

The demo pipelines are examples and require customization in order to work accurately in production. The best way to get started is to clone one of the existing demo pipelines and modify it for your own purpose. The following steps are recommended:

  1. Fork this repository into your own GitHub account/organization, then clone the repository to your local machine.
  2. Follow the instructions in the deployment section to set up and deploy the code, then test out some of the demo pipelines to understand how they work.
  3. Walk through the code for the pipelines that are the most similar to what you would like to build, or which have the different components that you want to use.
    • For example, if you want to build a document extraction pipeline, start with the pipelines that use Azure Document Intelligence.
    • If you want to then combine this with AI Speech or with a different kind of trigger, look through the other pipelines for examples of those.
    • Once familiar with the example pipelines, you should be able to see how you can plug different pipeline components together into an end-to-end solution.
  4. Clone the Python blueprint file (e.g. function_app/bp_<pipeline_name>.py) that is most similar to your ideal use case, renaming it and using it as a base to start with.
  5. Review and modify the different parts of the pipeline. The most common things to change are:
    1. The AI/LLM components that are used and their configurations.
    2. The Azure Function route and required input/output schemas and validation logic.
    3. The Pydantic classes and definitions that define the schema of the LLM's response and the response to be returned from the API.
      • The repo includes a useful Pydantic base model (LLMRawResponseModel) that makes it easy to print out the JSON schema in a prompt-friendly way, and it is suggested to use this model to define your schema so that you can easily provide it to your model and then validate the LLM's responses.
      • By default, these include a lot of additional information from each step of the pipeline, but you may want to remove, modify or add new fields.
    4. The LLM system prompt(s), which contain instructions on how the LLM should complete the task.
      • All prompt examples in this repo are very basic and it is recommended to spend time crafting detailed instructions for the LLM and including some few-shot examples.
      • These should be in addition to the JSON schema definition - if you use the JSON schema alone, expect that the model will make a number of mistakes (you can see this occur in some of the example pipelines).
    5. The post-processing validation logic. This is how you automatically determine when to trust the outputs and when to escalate to human review.
  6. Once you have started making progress on the core processing pipeline, you may want to modify the demo web app (demo_app/) so that you can easily test the endpoint end-to-end.
    • The Gradio app has a tab built for each of the Function app pipelines, and you should start with the code built for the base of your new function app pipeline.
    • If you need different data inputs or a different request schema (e.g. switching from sending a single file to a file with other JSON parameters), check out each of the other pipelines. These will help you determine how to build the front-end and API request logic so that things work end-to-end.
    • Once you have these working together, you can easily iterate and test your pipelines quickly with the demo web app via the UI.
  7. When your pipeline is working end-to-end, it's time to think about testing & evaluating the accuracy and reliability of your solution.
    • It is critical with any AI system to ensure that the pipeline is evaluated on a representative sample of validation data.
    • Without this, it is impossible to know how accurate the solution is, or whether the solution fails under specific circumstances. This is often the most time-consuming step of building and deploying an AI solution but is also the most important.
    • While more tools to help simplify this process are coming soon, you should take a look at the evaluation tools within Azure AI Studio.
  8. Finally, it's time to deploy your custom application to Azure.
    • Review and modify the infrastructure templates and parameters to ensure the solution is deployed to your requirements.
    • Set up automated CI/CD deployment pipelines using GitHub Actions or Azure DevOps (base templates are coming to the repo soon).
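The schema and validation steps in step 5 can be sketched as follows. The repository uses Pydantic and its own LLMRawResponseModel for this. The example below deliberately swaps in standard-library dataclasses so that it runs without dependencies, and the `OrderFields` class and both helper functions are hypothetical names used only for illustration.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class OrderFields:
    """Hypothetical schema for the values the LLM is asked to return."""
    order_id: str
    total_amount: float


def schema_for_prompt(model_cls) -> str:
    """Render field names and types as JSON text to embed in a system prompt."""
    return json.dumps({f.name: f.type.__name__ for f in fields(model_cls)}, indent=2)


def parse_llm_response(raw: str, model_cls):
    """Parse and validate the LLM's raw JSON reply, raising ValueError on any mismatch."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("Expected a JSON object")
    expected = {f.name: f.type for f in fields(model_cls)}
    if set(data) != set(expected):
        raise ValueError(f"Expected keys {sorted(expected)}, got {sorted(data)}")
    for name, expected_type in expected.items():
        value = data[name]
        # JSON has a single number type, so accept integers for float fields.
        if expected_type is float and type(value) is int:
            value = float(value)
        if type(value) is not expected_type:
            raise ValueError(f"Field '{name}' should be {expected_type.__name__}")
        data[name] = value
    return model_cls(**data)


print(schema_for_prompt(OrderFields))
print(parse_llm_response('{"order_id": "A-1001", "total_amount": 19.5}', OrderFields))
```

A reply that fails validation is a useful signal in its own right: it can be retried with the error message appended to the prompt, or routed straight to human review.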

Does this repo use or support LangChain/LlamaIndex/Framework X?

There are many different frameworks available for LLM/Generative AI applications, each offering different features, integrations, and production suitability. This accelerator uses some existing components from Haystack, but it is framework agnostic and you can use any or all frameworks for your pipelines. This allows you to take advantage of the solution architecture and many of the helper functions while still having full control over how you build your pipeline logic.

What about a custom UI?

The majority of applications built using this accelerator will be integrated into existing software platforms such as those used in call centres, customer support, case management, ERP platforms and more. Integrating with these platforms typically requires an API call or an event-driven database/blob trigger so that any processing done by this accelerator can seamlessly integrate with any existing workflows and processes (e.g. to trigger escalations, human reviews, automated emails and more).

While a demo application is included in this repository for testing your pipelines, the accelerator is built to prioritise integrations with other software platforms. If you would like a more advanced UI, you can either build your own and have it call the Azure Function that is deployed by this accelerator, or look at other accelerators that may offer more narrow and specialized solutions for specific use cases or types of data.

Can I use existing Azure resources?

Yes - you'll need to modify the Bicep templates to refer to existing resources instead of creating new ones. See here for more info.

How can I integrate with other triggers?

Deployment

Pricing considerations

This solution accelerator deploys multiple resources. Evaluate the cost of each component prior to deployment.

The following are links to the pricing details for some of the resources:

Deploying to Azure with azd

All instructions are written for Unix-based systems (Linux/macOS). While Windows instructions are coming soon, you can use Windows Subsystem for Linux (WSL) to execute the following commands from the Linux command line.

Prerequisites

To customize and develop the app locally, you will need to install the following:

Deploying for the first time

Follow these steps if you don't have any pre-existing Azure services and want to start from a fresh deployment.

  1. Run azd auth login
  2. Review the default parameters in infra/main.bicepparam and update as required.
  3. Run azd up - This will provision the Azure resources and deploy the services.
    • Note: When deploying for the first time, you may receive a ServiceUnavailable error when attempting to deploy the apps after provisioning. If this error occurs, simply rerun azd deploy after 1-2 minutes.
  4. After the application has been successfully deployed you will see the Function App and Web App URLs printed to the console. Open the Web App URL to interact with the demo pipelines from your browser. It will look like the following:

Deployed endpoints

Note that the Function app is deployed on a consumption plan under the default infrastructure configuration. This means the first request after deployment or periods of inactivity will take 20-30 seconds longer while the function warms up. All requests made once the function is warm should complete in a normal timeframe.
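When calling the deployed endpoint from your own client, the cold start described above is worth allowing for. The following is a minimal sketch, assuming a plain HTTP POST made with the Python standard library. The `call_with_warmup` helper, its defaults and the injectable `opener` argument are hypothetical and are not part of the accelerator.

```python
import time
import urllib.request


def call_with_warmup(
    url: str,
    data: bytes,
    timeout_secs: float = 60.0,
    retries: int = 2,
    delay_secs: float = 5.0,
    opener=urllib.request.urlopen,
) -> bytes:
    """POST to the function with a generous timeout, retrying on network errors.

    The 60 second default leaves room for the 20-30 second warm-up. Note that
    OSError also covers HTTP error statuses raised by urllib, so those are
    retried as well.
    """
    request = urllib.request.Request(url, data=data, method="POST")
    for attempt in range(retries + 1):
        try:
            with opener(request, timeout=timeout_secs) as response:
                return response.read()
        except OSError:
            if attempt == retries:
                raise
            time.sleep(delay_secs)
    raise RuntimeError("unreachable")  # the loop always returns or raises
```

The `opener` parameter exists only so the function can be exercised without a network connection. In normal use, leave it at its default.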

Deploying again

If you've only changed the function or web app code, then you don't need to re-provision the Azure resources. You can just run:

azd deploy --all or azd deploy api or azd deploy webapp

If you've changed the infrastructure files (infra folder or azure.yaml), then you'll need to re-provision the Azure resources and redeploy the services. You can do that by running:

azd up

Clean up

To clean up all the resources created by this sample:

  1. Remove any model deployments within the AOAI resource. If not removed, these may cause the resource cleanup to fail.
  2. Run azd down --purge. This will permanently delete the resource group and all resources.

Running the solution locally

Prerequisite - Deploy Azure Resources

To run the solution locally, you will need to create the necessary resources for all Azure AI service calls (e.g. Azure Document Intelligence, Azure OpenAI etc). Set these up within Azure before you start the next steps.

Function app local instructions

The function_app folder contains the backend Azure Functions App. By default, it includes a number of Azure Function Python Blueprints that showcase different types of processing pipelines.

Once the local web server is up and running, you can either run the demo app, or open a new terminal window and send a test request: sh send_req_summarize_text.sh

Demo app local instructions

The demo_app folder contains the code for the demo web application. This application is built with the Gradio Python web framework, and is meant for demos and testing (not production usage). The application connects to the Azure Functions server for all processing, automatically selecting the correct endpoint based on environment variables (which are set during deployment). If the server is run locally without any environment variables set, it will connect to the Function Server on http://localhost:7071/, otherwise it will use the FUNCTION_HOST and FUNCTION_KEY environment variables to connect to the Azure Function deployed within Azure.
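The endpoint selection described above can be sketched as follows. This is an illustration under stated assumptions, not the demo app's actual code: it assumes FUNCTION_HOST holds a full base URL, that routes sit under the default Azure Functions /api/ prefix, and that the key is sent in the x-functions-key header. The `resolve_function_endpoint` helper and the example route name are hypothetical.

```python
import os
from typing import Dict, Mapping, Tuple

# Local Functions host used when no environment variables are set.
LOCAL_FUNCTION_HOST = "http://localhost:7071/"


def resolve_function_endpoint(
    route: str, env: Mapping[str, str] = os.environ
) -> Tuple[str, Dict[str, str]]:
    """Return the URL and auth headers for a pipeline route."""
    host = env.get("FUNCTION_HOST") or LOCAL_FUNCTION_HOST
    key = env.get("FUNCTION_KEY")
    url = f"{host.rstrip('/')}/api/{route.lstrip('/')}"
    # A function key is only needed for the deployed app, not the local server.
    headers = {"x-functions-key": key} if key else {}
    return url, headers


# With no variables set, requests go to the local Functions server.
print(resolve_function_endpoint("summarize_text", env={}))
```

Passing the environment in as a mapping keeps the function easy to test and avoids reading os.environ at import time.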

Credits

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.