Azure-Samples / azure-search-openai-demo

A sample app for the Retrieval-Augmented Generation pattern running in Azure, using Azure AI Search for retrieval and Azure OpenAI large language models to power ChatGPT-style and Q&A experiences.
https://azure.microsoft.com/products/search
MIT License

Differentiating content searched per sub-company without using authentication #1693

Closed EMjetrot closed 2 weeks ago

EMjetrot commented 2 weeks ago

This issue is for a: (mark with an x)

- [ ] bug report -> please search issues before submitting
- [X] feature request
- [ ] documentation issue or request
- [ ] regression (a behavior that used to work and stopped in a new release)

Expected/desired behavior

I'm working on an HR chatbot using this excellent repo, but I'm looking for a way to differentiate the content (PDFs) the chatbot searches based on the sub-company the user is from. I've read the document "Setting up optional login and document level access control", but I want to avoid authenticating users in order to keep the users and their conversations anonymous (a GDPR requirement). I'll rely on Azure IP address filtering to keep non-employees out.

I've changed the routing in the app (see below) so that I can pass an abbreviation for the organization (sub-company) in the URL and use it as a parameter in my query to the search index, but I'm hoping someone could tell me which of the two strategies would be the best and fastest to go for?

  1. Adding the company abbreviation to the category field in the search index (using document-api) and changing the category filter in this line of approach.py to filter on the company abbreviation (organizationAbbreviation).
  2. Adding the company abbreviation to the groups field in the search index, which is already used for security filtering, but only for authenticated users, and then changing the above-mentioned function to send the company abbreviation in the search query.

I'm unsure whether option 1 will be a strong/reliable enough filter and whether option 2 will be impossible without authentication.
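
For option 1, a minimal sketch of how the filter string could be built, written in TypeScript to match the routing snippet in this issue (the repo's own filter lives in approach.py, and this is not its code). The allow-list values and the function name `buildCategoryFilter` are hypothetical, and the sketch assumes the category field stores the abbreviation exactly as it appears in the URL. Because the abbreviation comes from an unauthenticated URL, it is checked against a fixed list before it goes anywhere near the query:

```typescript
// Hypothetical allow-list of sub-company abbreviations; replace with your own.
const ALLOWED_ORGANIZATIONS: readonly string[] = ["abc", "xyz"];

// Builds an OData filter restricting results to one sub-company's documents.
// Returns null when the abbreviation is not on the allow-list, so the caller
// can reject the request instead of searching without a filter.
function buildCategoryFilter(organizationAbbreviation: string): string | null {
    const org = organizationAbbreviation.trim();
    if (ALLOWED_ORGANIZATIONS.indexOf(org) === -1) {
        return null;
    }
    // OData string literals escape a single quote by doubling it.
    const escaped = org.replace(/'/g, "''");
    return `category eq '${escaped}'`;
}
```

Returning null rather than an empty filter is a deliberate choice here: an unknown abbreviation should fail closed, not fall back to searching every sub-company's documents.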

If this use case seems reasonable, i.e. not authenticating users but differentiating the content based on a URL parameter or drop-down in the front end, then I would be thankful for it to be considered as a future feature in this repo.

Here is the rewrite of the routing:

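// Every route is nested under a sub-company abbreviation, e.g. "#/abc".
const router = createHashRouter([
    {
        // React Router exposes this segment as the organizationAbbreviation param.
        path: "/:organizationAbbreviation",
        element: layout,
        children: [
            {
                index: true,
                element: <Chat />
            },
            {
                // Any other sub-path falls through to the not-found page.
                path: "*",
                lazy: () => import("./pages/NoPage")
            }
        ]
    }
]);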
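
Inside a component rendered by this router, the abbreviation would normally be read with React Router's `useParams` hook. Where it is needed outside a component, a rough sketch of pulling the first path segment out of the hash fragment could look like the following. The function name `organizationFromHash` is hypothetical, and the sketch assumes the abbreviation is always the first segment, as in the route above:

```typescript
// Extracts the first path segment from a hash-router fragment,
// e.g. "#/abc" or "#/abc/anything" gives "abc".
// Returns undefined when the fragment has no segment at all.
function organizationFromHash(hash: string): string | undefined {
    // Drop the leading "#" and the optional "/" that follows it.
    const path = hash.replace(/^#\/?/, "");
    // Keep only what precedes the next "/" or "?".
    const segment = path.split(/[/?]/)[0] ?? "";
    return segment.length > 0 ? segment : undefined;
}
```

In a browser this would be called as `organizationFromHash(window.location.hash)`. The value is still untrusted input at this point and should be validated before it is used in a search filter.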

pamelafox commented 2 weeks ago

This sounds like a good use case for category, given that you're not authenticating the users. I would suggest adding unit tests or smoke tests to ensure that you're always correctly filtering, given that you're dealing with cross-company information. cc @mattgotteiner if he has other suggestions.
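
As one possible shape for such a smoke test, here is a small sketch of the check itself, in TypeScript to match the routing snippet above. The `SearchHit` interface and the function name `findLeakedHits` are hypothetical. The sketch assumes each hit carries its `category` value, and how the hits are fetched from the index is left out:

```typescript
// Hypothetical shape of a search hit; only the category field matters here.
interface SearchHit {
    sourcepage: string;
    category: string | null;
}

// Returns the hits whose category differs from the expected sub-company.
// An empty array means the filter held for this result set.
function findLeakedHits(hits: readonly SearchHit[], expectedCategory: string): SearchHit[] {
    return hits.filter(hit => hit.category !== expectedCategory);
}
```

A test could run a handful of representative questions for each sub-company and fail if `findLeakedHits` ever returns anything. Hits with a null category are reported as leaks too, which catches documents that were indexed without a sub-company assigned.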

mattgotteiner commented 2 weeks ago

I agree that category-based filtering would work. However, without some kind of authentication it's possible that someone will see content they aren't supposed to. So it's up to you to determine if this risk is acceptable for your use case.

EMjetrot commented 2 weeks ago

Thank you for your answers. I'll go for the category-based filtering then ;)

And yes, in this use case, it doesn't matter if an employee accidentally gets access to a chatbot belonging to another sub-company (e.g. by figuring out the URL routing), because the material is not confidential between sub-companies. It's just outsiders who should not see it.

Have a nice day and thanks again :)