Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.

Design for secure user document upload feature #1393


pamelafox commented 4 months ago

This is the proposed design for an optional feature that would allow users to upload documents, and for those documents to be queryable alongside the general documents. Users should only be able to search on their own documents.

Architecture

We will create a new Blob container backed by Azure Data Lake Storage Gen2, so that we can associate ACLs with the content inside it. That requires a new storage account in the Bicep, with the isHnsEnabled property set to true, and corresponding environment variables for the account name and container name.
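With the hierarchical namespace enabled, each uploaded file can carry a POSIX-style ACL scoped to the user's object ID (oid). A minimal sketch of that, assuming the azure-storage-file-datalake package (the account, container, and oid values below are illustrative):

```python
from azure.identity import DefaultAzureCredential
from azure.storage.filedatalake import DataLakeServiceClient

# Illustrative values; the real account and container names would come
# from the new environment variables.
service = DataLakeServiceClient(
    account_url="https://mystorageaccount.dfs.core.windows.net",
    credential=DefaultAzureCredential(),
)
filesystem = service.get_file_system_client("user-content")

# One directory per user, readable and writable only by that user's oid.
directory = filesystem.get_directory_client("user-oid-123")
directory.create_directory()
directory.set_access_control(
    acl="user::rwx,user:user-oid-123:rwx,group::---,mask::rwx,other::---"
)
```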

Frontend

The /config route will respond with a boolean indicating whether the userUpload feature is enabled. If it's enabled and authentication is enabled, then the frontend will have an "Upload files" button.
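A minimal sketch of that flag, assuming a Quart-style backend like this app's (the USE_USER_UPLOAD variable name is illustrative):

```python
import os

from quart import Blueprint, jsonify

bp = Blueprint("routes", __name__)

@bp.route("/config", methods=["GET"])
async def config():
    # The frontend only renders the "Upload files" button when this flag
    # is true (and the user is authenticated).
    return jsonify({"showUserUpload": os.getenv("USE_USER_UPLOAD", "").lower() == "true"})
```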

The "upload files" button will upload a single file and sends a POST request with multi-part form data to the /upload route on the backend.

App backend

The setup_clients function will set up the necessary DataLakeServiceClient per the environment variables.

The /upload route will store the file in the new container, under a directory scoped to the authenticated user, and set an ACL so that only that user can access it (see the sketch below).
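A minimal sketch of that route, reusing the bp blueprint and filesystem client from the earlier sketches; get_authenticated_oid is a hypothetical helper standing in for whatever the auth flow provides:

```python
from quart import jsonify, request

@bp.route("/upload", methods=["POST"])
async def upload():
    files = await request.files
    file = files["file"]  # the single file from the multipart form

    # Hypothetical helper: validates the auth token and returns the
    # user's object ID (oid).
    oid = get_authenticated_oid(request.headers)

    # Store under a per-user directory and lock the ACL down to that user.
    directory = filesystem.get_directory_client(oid)
    directory.create_directory()
    file_client = directory.get_file_client(file.filename)
    file_client.upload_data(file.read(), overwrite=True)
    file_client.set_access_control(
        acl=f"user::rwx,user:{oid}:rwx,group::---,mask::rwx,other::---"
    )
    return jsonify({"uploaded": file.filename})
```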

Ingestion function

If the user upload feature is enabled, we will provision and deploy a new Python function with a blob trigger.

The function will use the prepdocs library to ingest the target blob, configured via the function's environment variables. The prepdocs library will either need to be copied in as part of the prepackage step, or the function code itself will live inside scripts, or we will reorganize the whole repo. :( We have to see what works best and what still enables local development. We could even publish prepdocs as a PyPI package, but that requires a long approval process and would make it harder for developers to quickly modify.
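A minimal sketch of the trigger, assuming the Python v2 programming model for Azure Functions (the container path and the ingest_blob wrapper around prepdocs are illustrative):

```python
import azure.functions as func

app = func.FunctionApp()

@app.blob_trigger(arg_name="blob",
                  path="user-content/{name}",
                  connection="AzureWebJobsStorage")
def ingest_uploaded_blob(blob: func.InputStream):
    # Hypothetical wrapper around the prepdocs library: chunk, embed, and
    # index the document, tagging it with the owning user's oid.
    from prepdocs import ingest_blob
    ingest_blob(blob.name, blob.read())
```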

We have to decide whether to include the function as a commented-out service in azure.yaml, or to deploy it with a postprovision script, since azd does not yet support optional deployments. We can at least provision the function in main.bicep, however.

Why not use integrated vectorization for ingestion?

We would need a custom skill to associate the user oid with the document, which would itself be a new function (so we'd have to write a function regardless). In addition, the vectorization indexer runs on a schedule, at most once every 5 minutes, so an uploaded document would not be available for immediate querying.

The main drawback for a developer who has integrated vectorization enabled is that there may be slight differences in chunking between the prepdocs code and the AI Search skillsets, but we could make the prepdocs splitter available as a skillset in the future if needed.

ADLS2 vs Blob Storage

Advantages of using ADLS2 vs Blob storage:

Disadvantages:

Open questions

pamelafox commented 4 months ago

FYI @mattgotteiner, this is a write-up of what we discussed. Let me know if anything seems incorrect. I am tackling this now.

mattgotteiner commented 4 months ago

Yes, LGTM

ntabernacle commented 4 months ago

Thank you for looking at this, it's appreciated.

The use case in our scenario is much like ChatGPT, where the upload is only valid for that session. The engineering team in our enterprise will be uploading 99% of the data through the portal, but this upload is for one-off analysis by non-engineering staff users.

The ACL solution will be great for getting it signed off by the security team (I think this is how Copilot works, which we are also using), and we can always run a task to clean out the user uploads every so often.

jaspb commented 4 months ago

Have been looking forward to this feature for a while now. Happy to see it implemented by you @pamelafox. Thanks in advance! I am curious to see how you will reorganise prepdocs to make it work inside an Azure function. I considered publishing it as an artifact inside our project to make it pip-installable inside the function, so as to avoid having to refactor too much or copy code into the function folder.

pamelafox commented 4 months ago

@jaspb I ended up going with a simpler approach of moving prepdocs entirely inside app/backend, which avoids that issue as well as the complexity of polling for ingestion status. You can see that in the PR linked here: https://github.com/Azure-Samples/azure-search-openai-demo/pull/1395

That approach seems to work so far, though it may have limitations in terms of file size. If we do run into file size issues, we could also consider setting up Celery on App Service to handle ingestion jobs and retries.
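If we did go that route, a minimal sketch of a Celery task with retries, assuming a Redis broker (the broker URL and the ingest_file helper are illustrative):

```python
from celery import Celery

celery_app = Celery("ingestion", broker="redis://localhost:6379/0")

@celery_app.task(bind=True, max_retries=3, default_retry_delay=30)
def ingest_document(self, blob_path: str, oid: str):
    try:
        # Hypothetical wrapper around the prepdocs pipeline.
        ingest_file(blob_path, owner_oid=oid)
    except Exception as exc:
        # Re-queue with a delay instead of failing the upload outright.
        raise self.retry(exc=exc)
```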

charlie-hssrr commented 4 months ago

Here's an approach I'm using with Redis message queues for file uploads and processing with prepdocs.sh:

File upload to retrofit with this repo
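A minimal sketch of that pattern, assuming the redis package with one queue shared between the upload handler and a worker that shells out to prepdocs.sh (the queue name and script path are illustrative):

```python
import json
import subprocess

import redis

r = redis.Redis(host="localhost", port=6379)

def enqueue_upload(blob_path: str) -> None:
    # Producer: the upload handler pushes a job onto the queue.
    r.lpush("upload-jobs", json.dumps({"blob_path": blob_path}))

def worker_loop() -> None:
    # Consumer: block until a job arrives, then run prepdocs on the file.
    while True:
        _, payload = r.brpop("upload-jobs")
        job = json.loads(payload)
        subprocess.run(["./scripts/prepdocs.sh", job["blob_path"]], check=True)
```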

sdvies commented 3 months ago

@charlie-hssrr does it work in the App Service? If it does, can I use it simultaneously with another person?

charlie-hssrr commented 3 months ago

@sdvies You can run the project standalone and it will allow for uploads. For firing prepdocs.sh, you need to either retrofit it into the Azure demo app, or figure out how to access the demo project's scripts directory to fire prepdocs for blob processing.