name: Azure Multimodal AI & LLM Processing Accelerator (Python)
description: Build data processing pipelines with Azure AI Services + LLMs
languages:
This accelerator is a customizable code template for building and deploying production-grade data processing pipelines that incorporate Azure AI services and Azure OpenAI/AI Studio LLM models. It uses a variety of data pre-processing and enrichment components to make it easy to build complex, reliable and accurate pipelines that solve real-world use cases. If you'd like to use AI to summarize, classify, extract or enrich your data with structured and reliable outputs, this is the code repository for you.
It is recommended to review the main repo before pulling new changes, as work is in progress to replace many of the third-party components (e.g. those imported from Haystack) with more complete, performant & fully-featured components. Once the core application is stable, a standard release pattern with semantic versioning will be used to manage releases.
Most organisations have a huge number of simple tasks and processes that consume large amounts of time and energy. These could be things like classifying and extracting information from documents, summarizing and triaging customer emails, or transcribing and running compliance tasks on contact centre call recordings. While some of these tasks can be automated with existing tools and services, they often require a lot of up-front investment to fully configure and customize in order to have a reliable, working solution. They can also perform poorly when dealing with input data that is slightly different than expected, and may never be the right fit for scenarios that require the solution to be flexible or adaptable.
On the other hand, Large Language Models have emerged as a powerful and general-purpose approach that is able to handle these complex and varied situations. And more recently, with the move from text-only models to multimodal models that can incorporate text, audio and video, they are a powerful tool that we can use to automate a wide variety of everyday tasks. But while LLMs are powerful and flexible, they have their own shortcomings when it comes to providing precise and reliable outputs, and they too can be sensitive to the quality of raw and unprocessed input data.
Approach + Examples | Strengths | Weaknesses |
---|---|---|
Domain-specific AI models (OCR, Speech-to-text, Object detection) | Generally better performance on specialized tasks; Consistent performance and output format; Cost-efficient & scalable | Outputs may require translation into human-friendly format; Larger up-front development cost; Tuning & customization may be more time-consuming; Customized models may be less reliable/flexible with unseen data |
Large Language Models (Azure OpenAI, Open Source Models) | Define behaviour with natural language; Shorter up-front development time; More flexible with wide-ranging input data; Outputs are in human-friendly format | Non-deterministic & lower reliability of outputs; Harder to constrain & test (a black box); No consistent, structured metadata by default; Uncalibrated, no confidence score (when can the outputs be trusted?); Expensive and slow, especially for transcription/translation tasks |
This accelerator provides the tools and patterns required to combine the best of both worlds in your production workloads, giving you the reasoning power, flexibility and development speed of Large Language Models, while using domain-specific AI during pre- and post-processing to increase the consistency, reliability and cost-efficiency of the overall system.
Here is an example of the pre-built Form Field Extraction pipeline. By combining the structured outputs from Azure Document Intelligence with GPT-4o, we can verify and enrich the values extracted by GPT-4o with confidence scores, bounding boxes, style and more. This allows us to make sure the LLM has not hallucinated, and allows us to automatically flag the document for human review if the confidence scores do not meet our minimum criteria (in this case, all values must have a Document Intelligence confidence score above 80% to avoid human review).
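The review rule described above can be sketched in a few lines of Python. This is a minimal illustration and not the accelerator's actual implementation: the `FieldResult` class, the `merge_confidence` and `needs_human_review` helpers, the exact-text matching and the sample values are all hypothetical. Only the 80% threshold comes from the description above.

```python
from dataclasses import dataclass
from typing import Optional

# Threshold taken from the text: values must score above 80% to skip review.
MIN_CONFIDENCE = 0.8


@dataclass
class FieldResult:
    """An LLM-extracted value enriched with OCR metadata (hypothetical shape)."""
    name: str
    value: str
    confidence: Optional[float]  # None when no matching OCR text was found


def merge_confidence(llm_fields: dict, ocr_words: dict) -> list:
    """Attach an OCR confidence to each LLM value using an exact text match."""
    return [
        FieldResult(name, value, ocr_words.get(value))
        for name, value in llm_fields.items()
    ]


def needs_human_review(fields: list, threshold: float = MIN_CONFIDENCE) -> bool:
    """Flag the document if any value is unmatched or not above the threshold."""
    return any(f.confidence is None or f.confidence <= threshold for f in fields)


# Hypothetical inputs: the LLM's extracted fields and an OCR text -> confidence lookup.
llm_fields = {"order_id": "A-10423", "customer": "Contoso"}
ocr_words = {"A-10423": 0.97, "Contoso": 0.72}

merged = merge_confidence(llm_fields, ocr_words)
review_required = needs_human_review(merged)
print(review_required)
```

A value that the OCR output does not contain at all is treated as needing review, since it may have been hallucinated by the LLM.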
In a recent customer project that involved extracting Order IDs from scanned PDFs and phone images, we used a number of these techniques to increase the performance of GPT-4o alone from ~60% to near-perfect accuracy:
At the conclusion of this project, our customer was able to deploy the solution and automate the majority of their processing workload with confidence, knowing that any cases that were too challenging for the LLM would automatically be escalated for review. Reviews can now be completed in a fraction of the time thanks to the additional metadata returned with each result.
The accelerator comes with these pre-built pipeline examples to help you get started. Each pipeline is built in its own Python file as a function blueprint, and then imported and added to the main function app within `function_app/function_app.py`.
Example | Description & Pipeline Steps |
---|---|
Form Field Extraction with Confidence Scores & bboxes (HTTP) Code | Extracts key information from a PDF form and returns field-level and overall confidence scores and whether human review is required. Steps: PyMuPDF (PDF -> Image); Document Intelligence (PDF -> text); GPT-4o (text + image input); Post-processing (merge confidence scores and bounding boxes, determine whether human review is required) |
Call Center Analysis with Confidence Scores & Timestamps (HTTP) Code | Processes a call center recording, classifying customer sentiment & satisfaction, summarizing the call and next best action, and extracting any keywords mentioned. Returns the response with timestamps, confidence scores and the full sentence text for the next best action and each of the keywords mentioned. Steps: Azure AI Speech (Speech -> Text); GPT-4o (text input); Post-processing (merge sentence info & confidence scores) |
Form Field Extraction (Blob -> CosmosDB) Code: Func, Pipeline | Extracts key information from a PDF form uploaded to blob storage and writes the result to CosmosDB. Steps: Pipeline triggered by blob storage event; PyMuPDF (PDF -> Image); Document Intelligence (PDF -> text); GPT-4o (text + image input); Write structured JSON result to CosmosDB container |
Summarize Text (HTTP) Code | Summarizes text input into a desired style and number of output sentences. Steps: GPT-4o (text input + style/length instructions); Return raw text |
Multimodal Document Intelligence Processing (HTTP) Code | A pipeline showcasing the highly configurable Document Intelligence Processor that intelligently processes the raw Doc Intelligence API response to extract text, images and tables from a PDF/image into a more usable and flexible format. Steps: Document Intelligence (PDF/image -> text + images + tables); Return content as Markdown |
City Names Extraction, Doc Intelligence (HTTP) Code | Uses GPT-4o to extract all city names from a given PDF (using text extracted by Document Intelligence). Steps: Document Intelligence (PDF/image -> text); GPT-4o (text input); Return JSON array of city names |
City Names Extraction, PyMuPDF (HTTP) Code | Uses GPT-4o to extract all city names from a given PDF/image + text (extracted locally by PyMuPDF). Steps: PyMuPDF (PDF/image -> text & images); GPT-4o (text + image input); Return JSON array of city names |
These pipelines can be duplicated and customized to your specific use case, and should be modified as required. The pipelines all return a large amount of additional information (such as intermediate outputs from each component, time taken for each step, and the raw source code) which will usually not be required in production use cases. Make sure to review the code thoroughly prior to deployment.
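As a rough illustration of trimming a pipeline response for production use, the sketch below removes debugging fields before the result is returned to a caller. The key names (`intermediate_outputs`, `timings`, `source_code`) and the sample response are hypothetical; inspect the actual response of your pipeline and adjust the set accordingly.

```python
import json

# Hypothetical key names: check your pipeline's real response and adjust.
DEBUG_KEYS = {"intermediate_outputs", "timings", "source_code"}


def trim_response(full_response: dict, debug_keys=DEBUG_KEYS) -> dict:
    """Return a copy of the response without top-level debugging fields."""
    return {k: v for k, v in full_response.items() if k not in debug_keys}


# Hypothetical pipeline response containing both results and debug details.
full_response = {
    "success": True,
    "result": {"order_id": "A-10423"},
    "intermediate_outputs": {"ocr_text": "..."},
    "timings": {"ocr": 1.2, "llm": 3.4},
}

print(json.dumps(trim_response(full_response)))
```

The original dictionary is left untouched, so the full response can still be logged internally while only the trimmed copy is returned.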
The accelerator comes with an included web app for demo and testing purposes. This webapp is built with Gradio, a lightweight Python UI library, to enable interaction with the backend pipelines from within the browser. The app comes prebuilt with a tab for each of the prebuilt pipelines, along with a few example files for use with each pipeline. The demo app also
Call centre analysis: Transcribe and diarize call centre audio with Azure AI Speech, then use Azure OpenAI to classify the call type, summarize the topics and themes in the call, analyse the sentiment of the customer, and ensure the customer service agent complied with standard procedures (e.g. following the appropriate script, outlining the privacy policy and sending the customer a Product Disclosure Statement).
Document processing: Ingest PDFs, Word documents and scanned images, extract the raw text content with Document Intelligence, then use Azure OpenAI to classify the document by type, extract key fields (e.g. contact information, document ID numbers), classify whether the document was stamped and signed, and return the result in a structured format.
Insurance claim processing: Process all emails and documents in long email chains. Use Azure Document Intelligence to extract information from the attachments, then use Azure OpenAI to generate a timeline of key events in the conversation, determine whether all required documents have been submitted, summarize the current state of the claim, and determine the next-best-action (e.g. auto-respond asking for more information, or escalate to human review for processing).
Customer email processing: Classify incoming emails into categories, summarize their content, determine the sender's sentiment, and triage them into a severity category for human processing.
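Use cases like these depend on the LLM returning a structured result that downstream systems can trust. The sketch below shows one way to validate a JSON reply for the email triage scenario before acting on it. The `EmailTriage` class, the label sets and the sample reply are hypothetical and not part of the accelerator.

```python
import json
from dataclasses import dataclass

# Hypothetical label sets; replace with the categories your process uses.
CATEGORIES = {"billing", "complaint", "general_enquiry"}
SEVERITIES = {"low", "medium", "high"}


@dataclass
class EmailTriage:
    category: str
    summary: str
    sentiment: str
    severity: str


def parse_triage(raw_llm_output: str) -> EmailTriage:
    """Parse the LLM's JSON reply, raising ValueError on anything unexpected."""
    try:
        data = json.loads(raw_llm_output)
        # Missing or extra keys raise TypeError when building the dataclass.
        triage = EmailTriage(**data)
    except (json.JSONDecodeError, TypeError) as exc:
        raise ValueError(f"Malformed LLM output: {exc}") from exc
    if triage.category not in CATEGORIES:
        raise ValueError(f"Unknown category: {triage.category!r}")
    if triage.severity not in SEVERITIES:
        raise ValueError(f"Unknown severity: {triage.severity!r}")
    return triage


# Hypothetical LLM reply.
reply = (
    '{"category": "billing", "summary": "Customer was charged twice.", '
    '"sentiment": "negative", "severity": "high"}'
)
triage = parse_triage(reply)
print(triage.category, triage.severity)
```

A reply that fails validation can be retried or escalated for human review rather than being passed on silently.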
This accelerator is in active development, with a list of upcoming features including:
To help prioritise these features or request new ones, please head to the Issues section of this repository.
How can I get started with a solution for my own use case?
The demo pipelines are examples and require customization in order to have them work accurately in production. The best strategy to get started is to clone one of the existing demo pipelines and modify them for your own purpose. The following steps are recommended:
- Choose the demo pipeline (`function_app/bp_<pipeline_name>.py`) that is most similar to your ideal use case, renaming it and using it as a base to start with.
- Add the new pipeline to the demo web app (`demo_app/`) so that you can easily test the endpoint end-to-end.
Does this repo use or support LangChain/LlamaIndex/Framework X?
There are many different frameworks available for LLM/Generative AI applications, each offering different features, integrations, and production suitability. This accelerator uses some existing components from Haystack, but it is framework agnostic and you can use any or all frameworks for your pipelines. This allows you to take advantage of the solution architecture and many of the helper functions while still having full control over how you build your pipeline logic.
What about a custom UI?
The majority of applications built using this accelerator will be integrated into existing software platforms such as those used in call centres, customer support, case management, ERP platforms and more. Integrating with these platforms typically requires an API call or an event-driven database/blob trigger so that any processing done by this accelerator can seamlessly integrate with any existing workflows and processes (e.g. to trigger escalations, human reviews, automated emails and more).
While a demo application is included in this repository for testing your pipelines, the accelerator is built to prioritise integrations with other software platforms. If you would like a more advanced UI, you can either build your own and have it call the Azure Function that is deployed by this accelerator, or look at other accelerators that may offer more narrow and specialized solutions for specific use cases or types of data.
Can I use existing Azure resources?
Yes - you'll need to modify the Bicep templates to refer to existing resources instead of creating new ones. See here for more info.
How can I integrate with other triggers?
This solution accelerator deploys multiple resources. Evaluate the cost of each component prior to deployment.
The following are links to the pricing details for some of the resources:
All instructions are written for Unix-based systems (Linux/macOS). While Windows instructions are coming soon, you can use Windows Subsystem for Linux (WSL) to execute the following commands from the Linux command line.
To customize and develop the app locally, you will need to install the following:
- A base Python environment. Use `conda`, `venv` or `virtualenv` to create the environment. Once installed, make sure to have the environment activated when you start running the steps below, as it will be used as the base for isolated environments for the demo and function app.

Then clone the repository: `git clone https://github.com/azure/multimodal-ai-llm-processing-accelerator.git`
Execute the following commands if you don't have any pre-existing Azure services and want to start from a fresh deployment.
- Log in to Azure: `azd auth login`
- Review the parameters in `infra/main.bicepparam` and update as required.
- Run `azd up` - This will provision the Azure resources and deploy the services.
- You may see a `ServiceUnavailable` error when attempting to deploy the apps after provisioning. If this error occurs, simply rerun `azd deploy` after 1-2 minutes.

Note that the Function app is deployed on a consumption plan under the default infrastructure configuration. This means the first request after deployment or periods of inactivity will take 20-30 seconds longer while the function warms up. All requests made once the function is warm should complete in a normal timeframe.
If you've only changed the function or web app code, then you don't need to re-provision the Azure resources. You can just run:
azd deploy --all
or azd deploy api
or azd deploy webapp
If you've changed the infrastructure files (`infra` folder or `azure.yaml`), then you'll need to re-provision the Azure resources and redeploy the services. You can do that by running:
azd up
To clean up all the resources created by this sample, run `azd down --purge`. This will permanently delete the resource group and all resources.

To run the solution locally, you will need to create the necessary resources for all Azure AI service calls (e.g. Azure Document Intelligence, Azure OpenAI etc). Set these up within Azure before you start the next steps.
The `function_app` folder contains the backend Azure Functions App. By default, it includes a number of Azure Function Python Blueprints that showcase different types of processing pipelines.
- Navigate to the function app folder: `cd function_app`
- Create a `local.settings.json` file from the provided sample, `sample_local.settings.json`. More info on local function development can be found here.
- Open the `local.settings.json` file and populate the values for all environment variables (these are usually capitalized, e.g. `DOC_INTEL_API_KEY` or `AOAI_ENDPOINT`). You may need to set up new resources within Azure.
- Set up the Python environment: `sh setup_env.sh`
- Activate the environment: `source .venv/bin/activate`
- Start the function server: `func start`
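Before starting the function server, it can help to confirm that no setting in `local.settings.json` was left blank. The sketch below is a hypothetical helper, not part of the accelerator. It assumes unpopulated settings are empty strings (the sample file may use different placeholder text) and relies on the standard Azure Functions layout, where settings live under a top-level `Values` object.

```python
import json


def find_empty_settings(settings_text: str) -> list:
    """Return the names of settings under "Values" that are empty strings."""
    values = json.loads(settings_text).get("Values", {})
    return sorted(name for name, value in values.items() if value == "")


# Hypothetical file content; in practice, read it from disk with
# pathlib.Path("local.settings.json").read_text().
sample = """
{
  "IsEncrypted": false,
  "Values": {
    "FUNCTIONS_WORKER_RUNTIME": "python",
    "DOC_INTEL_API_KEY": "",
    "AOAI_ENDPOINT": "https://example.openai.azure.com/"
  }
}
"""

missing = find_empty_settings(sample)
print(missing)
```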
Once the local web server is up and running, you can either run the demo app, or open a new terminal window and send a test request:
sh send_req_summarize_text.sh
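If you prefer to send the test request from Python, a minimal sketch using only the standard library might look like the following. The route (`/api/summarize_text`) and the payload fields are assumptions made for illustration; check the blueprint code and `send_req_summarize_text.sh` for the real route and request body. The base URL assumes the local Functions host is listening on its usual port, 7071.

```python
import json
import urllib.request

# Assumed local Functions host address.
BASE_URL = "http://localhost:7071"
# Hypothetical route; confirm it against the blueprint you are testing.
ROUTE = "/api/summarize_text"


def build_request(payload: dict, base_url: str = BASE_URL, route: str = ROUTE) -> urllib.request.Request:
    """Build a JSON POST request for the local function endpoint."""
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        url=base_url + route,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical payload fields.
request = build_request({"text": "Some long text to summarize.", "num_sentences": 2})

# Uncomment once `func start` is running:
# with urllib.request.urlopen(request, timeout=60) as response:
#     print(response.status, response.read().decode("utf-8"))
```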
The `demo_app` folder contains the code for the demo web application. This application is built with the `gradio` Python web framework, and is meant for demos and testing (not production usage). The application connects to the Azure Functions server for all processing, automatically selecting the correct endpoint based on environment variables (which are set during deployment). If the server is run locally without any environment variables set, it will connect to the Function Server on http://localhost:7071/, otherwise it will use the `FUNCTION_HOST` and `FUNCTION_KEY` environment variables to connect to the Azure Function deployed within Azure.
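The endpoint selection described above can be sketched as follows. This is an illustration of the behaviour rather than the demo app's actual code: the `resolve_function_endpoint` helper is hypothetical, and it assumes `FUNCTION_HOST` holds a full base URL. The `x-functions-key` header is the standard way to pass an Azure Functions access key.

```python
import os

# Local Functions host used when no deployment settings are present.
LOCAL_FUNCTION_URL = "http://localhost:7071/"


def resolve_function_endpoint(env=None):
    """Return (base_url, headers) for calling the Functions backend."""
    env = os.environ if env is None else env
    host = env.get("FUNCTION_HOST")
    key = env.get("FUNCTION_KEY")
    if not host:
        # No deployment settings found: fall back to the local server.
        return LOCAL_FUNCTION_URL, {}
    # Pass the function key as a header when one is configured.
    headers = {"x-functions-key": key} if key else {}
    return host, headers


base_url, headers = resolve_function_endpoint()
print(base_url)
```

Accepting the environment as an optional argument keeps the helper easy to test without modifying `os.environ`.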
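- Navigate to the demo app folder: `cd demo_app`
- Create and activate a Python environment, using either conda (`conda create -n mm_ai_llm_processing_demo_app python=3.11 --no-default-packages && conda activate mm_ai_llm_processing_demo_app`) or venv (`python -m venv .venv && source .venv/bin/activate`)
- Install the requirements: `pip install -r requirements.txt`
- Create the environment file: `cp .env.sample .env`
- Start the app: `gradio demo_app.py`. By default, the application will launch in auto-reload mode, automatically reloading whenever `demo_app.py` is changed.
- Open the app in your browser: https://localhost:8000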
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.
This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos are subject to those third-party's policies.