Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.

Design for secure user document upload feature #1393


pamelafox commented 4 months ago

This is the proposed design for an optional feature that would allow users to upload documents, and for those documents to be queryable alongside the general documents. Users should only be able to search on their own documents.

Architecture

We will create a new Blob container backed by Azure Data Lake Storage Gen2, so that we can associate ACLs with the content inside it. That requires a new storage account in the Bicep, with the isHnsEnabled property set to true, and corresponding environment variables for the account name and container name.
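With the hierarchical namespace enabled, each uploaded file can carry a POSIX-style ACL scoped to the user's object ID (oid). A minimal sketch of that, assuming the azure-storage-file-datalake package (the account, container, and oid values below are illustrative):

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Illustrative values; the real account and container names would come
# from the new environment variables.
service = DataLakeServiceClient(
    account_url="https://mystorageaccount.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
filesystem = service.get_file_system_client("user-content")

# One directory per user, readable and writable only by that user's oid.
directory = filesystem.get_directory_client("user-oid-123")
directory.create_directory()
directory.set_access_control(
    acl="user::rwx,user:user-oid-123:rwx,group::---,mask::rwx,other::---"
)
```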

Frontend

The /config route will respond with a boolean indicating whether the userUpload feature is enabled. If it's enabled and authentication is enabled, then the frontend will have an "Upload files" button.
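A minimal sketch of that flag, assuming a Quart-style backend like this app's (the USE_USER_UPLOAD variable name is illustrative):

```python
import os

from quart import Blueprint, jsonify

bp = Blueprint("routes", __name__)

@bp.route("/config", methods=["GET"])
async def config():
    # The frontend only renders the "Upload files" button when this flag
    # is true (and the user is authenticated).
    return jsonify({"showUserUpload": os.getenv("USE_USER_UPLOAD", "").lower() == "true"})
```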

The "upload files" button will upload a single file and sends a POST request with multi-part form data to the /upload route on the backend.

App backend

The setup_clients function will set up the necessary DataLakeServiceClient per the environment variables.

The /upload route will store the file in the new container, under a directory scoped to the authenticated user, and set an ACL so that only that user can access it (see the sketch below).
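A minimal sketch of that route, reusing the bp blueprint and filesystem client from the earlier sketches; get_authenticated_oid is a hypothetical helper standing in for whatever the auth flow provides:

```python
from quart import jsonify, request

@bp.route("/upload", methods=["POST"])
async def upload():
    files = await request.files
    file = files["file"]  # the single file from the multipart form

    # Hypothetical helper: validates the auth token and returns the
    # user's object ID (oid).
    oid = get_authenticated_oid(request.headers)

    # Store under a per-user directory and lock the ACL down to that user.
    directory = filesystem.get_directory_client(oid)
    directory.create_directory()
    file_client = directory.get_file_client(file.filename)
    file_client.upload_data(file.read(), overwrite=True)
    file_client.set_access_control(
        acl=f"user::rwx,user:{oid}:rwx,group::---,mask::rwx,other::---"
    )
    return jsonify({"uploaded": file.filename})
```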

Ingestion function

If the user upload feature is enabled, we will provision and deploy a new Python function with a blob trigger.

The function will use the prepdocs library to ingest the target blob, configured via the function's environment variables. The prepdocs library will either need to be copied in as part of the prepackage step, or the function code itself will live inside scripts, or we will reorganize the whole repo. :( We have to see what works best and what still enables local development. We could even publish prepdocs as a PyPI package, but that requires a long approval process and would make it harder for developers to quickly modify.
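A minimal sketch of the trigger, assuming the Python v2 programming model for Azure Functions (the container path and the ingest_blob wrapper around prepdocs are illustrative):

```python
import azure.functions as func

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob",
                  path="user-content/{name}",
                  connection="AzureWebJobsStorage")
def ingest_uploaded_blob(blob: func.InputStream):
    # Hypothetical wrapper around the prepdocs library: chunk, embed, and
    # index the document, tagging it with the owning user's oid.
    from prepdocs import ingest_blob
    ingest_blob(blob.name, blob.read())
```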

We have to decide whether to include the function as a commented-out service in azure.yaml, or to deploy it with a postprovision script, since azd does not yet support optional deployments. We can at least provision the function in main.bicep, however.

Why not use integrated vectorization for ingestion?

We would need a custom skill to associate the user oid with the document, which would itself be a new function (so we'd have to write a function regardless). In addition, the vectorization indexer runs on a schedule, at most once every 5 minutes, so an uploaded document would not be available for immediate querying.

The main drawback for a developer who has integrated vectorization enabled is that there may be slight differences in chunking between the prepdocs code and the AI Search skillsets, but we could make the prepdocs splitter available as a skillset in the future if needed.

ADLS2 vs Blob Storage

Advantages of using ADLS2 vs Blob storage:

Disadvantages:

Open questions

pamelafox commented 4 months ago

FYI @mattgotteiner, this is a write-up of what we discussed. Let me know if anything seems incorrect. I am tackling this now.

mattgotteiner commented 4 months ago

Yes, LGTM

ntabernacle commented 4 months ago

Thank you for looking at this, it's appreciated.

The use case in our scenario is much like ChatGPT, where the upload is only valid for that session. The engineering team in our enterprise will be uploading 99% of the data through the portal, but this upload is for one-off analysis by non-engineering staff users.

The ACL solution will be great for getting it signed off by the security team (I think this is how Copilot works, which we are also using), and we can always run a task to clean out the user uploads every so often.

jaspb commented 4 months ago

Have been looking forward to this feature for a while now. Happy to see it implemented by you @pamelafox. Thanks in advance! I am curious to see how you will reorganise prepdocs to make it work inside an Azure function. I considered publishing it as an artifact inside our project to make it pip-installable inside the function, so as to avoid having to refactor too much or copy code into the function folder.

pamelafox commented 4 months ago

@jaspb I ended up going with a simpler approach of moving prepdocs entirely inside app/backend, which avoids that issue as well as the complexity of polling for ingestion status. You can see that in the PR linked here: https://github.com/Azure-Samples/azure-search-openai-demo/pull/1395

That approach seems to work so far, though it may have limitations in terms of file size. If we do run into file size issues, we could also consider setting up Celery on App Service to handle ingestion jobs and retries.
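If we did go that route, a minimal sketch of a Celery task with retries, assuming a Redis broker (the broker URL and the ingest_file helper are illustrative):

```python
from celery import Celery

celery_app = Celery("ingestion", broker="redis://localhost:6379/0")

@celery_app.task(bind=True, max_retries=3, default_retry_delay=30)
def ingest_document(self, blob_path: str, oid: str):
    try:
        # Hypothetical wrapper around the prepdocs pipeline.
        ingest_file(blob_path, owner_oid=oid)
    except Exception as exc:
        # Re-queue with a delay instead of failing the upload outright.
        raise self.retry(exc=exc)
```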

charlie-hssrr commented 4 months ago

Here's an approach I'm using with Redis message queues for file uploads and processing with prepdocs.sh:

File upload to retrofit with this repo
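A minimal sketch of that pattern, assuming the redis package with one queue shared between the upload handler and a worker that shells out to prepdocs.sh (the queue name and script path are illustrative):

```python
import json
import subprocess

import redis

r = redis.Redis(host="localhost", port=6379)

def enqueue_upload(blob_path: str) -> None:
    # Producer: the upload handler pushes a job onto the queue.
    r.lpush("upload-jobs", json.dumps({"blob_path": blob_path}))

def worker_loop() -> None:
    # Consumer: block until a job arrives, then run prepdocs on the file.
    while True:
        _, payload = r.brpop("upload-jobs")
        job = json.loads(payload)
        subprocess.run(["./scripts/prepdocs.sh", job["blob_path"]], check=True)
```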

sdvies commented 3 months ago

@charlie-hssrr does it work in the App Service? If it does, can I use it simultaneously with another person?

charlie-hssrr commented 3 months ago

@sdvies You can run the project standalone and it will allow for uploads. For firing prepdocs.sh, you need to either retrofit it into the Azure demo app, or figure out how to access the demo project's scripts directory to fire prepdocs for blob processing.