MicrosoftDocs / azure-docs

Open source documentation of Microsoft Azure
https://docs.microsoft.com/azure
Creative Commons Attribution 4.0 International

Vector Search w/ Filter #114474

Closed cticevans closed 1 year ago

cticevans commented 1 year ago

The document says: "Filtered vector search. A query request can include a vector query and a filter expression. Filters apply to text and numeric fields, and are useful for including or excluding search documents based on filter criteria. Although a vector field isn't filterable itself, you can set up a filterable text or numeric field. The search engine processes the filter first, reducing the surface area of the search corpus before running the vector query."

That said, I don't believe this works the way the statement suggests. In my tests, the filter affects the vector search such that the top k is not respected. I have a populated index with 5 docs that match the filter, but with k=5 I sometimes get only 1 or 2 documents back (there are many docs that don't match the filter). If I increase k to 500+, I then get back all 5 documents. This isn't what one would expect.


ManoharLakkoju-MSFT commented 1 year ago

@cticevans Thanks for your feedback! We will investigate and update as appropriate.

robertklee commented 1 year ago

@cticevans this is a great question that I think is worth posting on Stack Overflow to help the community. Could you please repost your question there and tag me with the SO question's URL? https://stackoverflow.com/questions/tagged/azure-cognitive-search

This behavior stems from when the filter is applied. The current behavior for filtering with HNSW is "post-filtering", which means the set of k approximate nearest neighbors is retrieved first and then intersected with the set of filtered results. For highly selective filters, this can leave too few results that satisfy the filter. For low values of k, the candidate set on which filtering is applied is small to begin with. In both scenarios, filtering can return fewer than k results.

You can either increase the requested k value to have a higher set of candidate results before applying the filter, or wait for support of "pre-filtering" natively.

We'll improve the docs to explain this behavior since the current wording of that paragraph is inaccurate. Specifically, this is incorrect for post-filtering: The search engine processes the filter first, reducing the surface area of the search corpus before running the vector query.
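The difference between the two filter orderings can be sketched with a toy corpus. This is only an illustration under stated assumptions: it uses an exact brute-force nearest-neighbor scan as a stand-in for HNSW, and the corpus, the `category` field, and all function names are hypothetical, not part of the service.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def nearest(docs, query, k):
    """Exact top-k by similarity (a stand-in for an approximate HNSW lookup)."""
    ranked = sorted(docs, key=lambda d: cosine_similarity(d["vector"], query), reverse=True)
    return ranked[:k]

def post_filter_search(docs, query, k, predicate):
    """Retrieve k neighbors first, then drop those that fail the filter."""
    return [d for d in nearest(docs, query, k) if predicate(d)]

def pre_filter_search(docs, query, k, predicate):
    """Apply the filter first, then retrieve k neighbors from what remains."""
    return nearest([d for d in docs if predicate(d)], query, k)

# Hypothetical corpus: 100 documents on the unit circle, 5 of which match the filter.
docs = [
    {
        "id": i,
        "category": "match" if i % 20 == 0 else "other",
        "vector": [math.cos(i * 0.01), math.sin(i * 0.01)],
    }
    for i in range(100)
]
query = [1.0, 0.0]
is_match = lambda d: d["category"] == "match"

post_small_k = post_filter_search(docs, query, 5, is_match)
post_large_k = post_filter_search(docs, query, 100, is_match)
pre_small_k = pre_filter_search(docs, query, 5, is_match)

print(len(post_small_k), len(post_large_k), len(pre_small_k))  # prints "1 5 5"
```

With k=5, post-filtering returns a single document because only one of the five nearest neighbors passes the filter, while pre-filtering returns all five matching documents.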

cticevans commented 1 year ago

Appreciate the confirmation – I was 99% sure what I was seeing but knowing for sure is helpful. My workaround is a much larger K and then using Top – it works for now but certainly not optimal.
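A variant of this over-fetch workaround can be sketched as follows. Rather than one fixed large k, it grows k step by step until enough filtered results come back, then trims to the requested number. `run_query` is a hypothetical callable standing in for whatever client call issues the filtered vector query, and `fake_run_query` only simulates post-filtering for demonstration.

```python
def search_with_overfetch(run_query, top, max_k=1000):
    """Grow k until the filtered result set has at least `top` items.

    run_query(k) is a hypothetical callable that returns the results
    the service hands back for a filtered vector query with that k.
    """
    k = top
    results = run_query(k)
    while len(results) < top and k < max_k:
        k = min(k * 2, max_k)      # double k, capped at max_k
        results = run_query(k)
    return results[:top]           # trim, like `top` on the request

# Simulated post-filtering service: 100 ranked docs, every 20th one matches.
calls = []

def fake_run_query(k):
    calls.append(k)
    return [i for i in range(min(k, 100)) if i % 20 == 0]

found = search_with_overfetch(fake_run_query, top=5)
print(found)  # prints "[0, 20, 40, 60, 80]"
print(calls)  # prints "[5, 10, 20, 40, 80, 160]"
```

Each retry is another round trip, so a single generous k may be cheaper when the filter's selectivity is known in advance.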

It might be worth clarifying in the docs. Some docs sound like vector search will always return K (even with filter).

cticevans commented 1 year ago

URL attached: Problem w/ Azure Cognitive Search Vector Search + Filter - Stack Overflow: https://stackoverflow.com/questions/77068991/problem-w-azure-cognitive-search-vector-search-filter

robertklee commented 1 year ago

Thanks for bringing this to our attention. We'll improve our docs to clarify this case. :)

robertklee commented 1 year ago

We have a PR to fix the docs and it should be updated in the public docs over the next few days.

cticevans commented 1 year ago

Are there any plans to implement this sort of pre-filter? I am essentially abandoning Azure Search and instead building local, transient ChromaDB instances based on the filters. This works for my application, but it certainly wouldn't be an option at scale or with more complicated filtering.

HeidiSteen commented 1 year ago

@cticevans, I noticed that Robert mentioned this earlier in the thread: "You can either increase the requested k value to have a higher set of candidate results before applying the filter, or wait for support of "pre-filtering" natively." I don't know what your timeline is. @robertklee is in a better position to comment.

robertklee commented 1 year ago

Thanks @HeidiSteen for the tag. @cticevans let me check with our PM on guidance on announcing timelines in a public forum. :)

robertklee commented 1 year ago

@cticevans as a heads up, due to engineering limitations, pre-filtering will only be supported on new indexes, meaning you'll need to re-create your index to use the pre-filtering feature after it is rolled out. Our PM team can share more details about our upcoming release timelines.

cticevans commented 1 year ago

Thanks @robertklee. I'd happily rebuild indexes for the capability.

robertklee commented 1 year ago

@cticevans this pre-filtering capability is coming soon. 😄 Unfortunately, since this is a public forum, that's all the details we can share right now.

HeidiSteen commented 1 year ago

I'm going to close this issue now since the doc issues have been addressed, but @cticevans, the What's New page will carry the announcement once it's ready.

please-close

HeidiSteen commented 1 year ago

@cticevans Prefiltering is now available per the 2023-10-01-Preview REST APIs. The beta Azure SDK libraries should also have been updated, but I'm still checking the versions for that. Prefiltering requires a new-ish index, so if the index is older than Oct 1, you would need a new one: https://learn.microsoft.com/en-us/azure/search/vector-search-how-to-query?tabs=query-2023-10-01-Preview%2Cfilter-2023-10-01-Preview#vector-query-with-filter
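As a rough sketch of what a prefiltered request might look like, the snippet below only builds the URL and JSON body and does not send anything. The `vectorFilterMode` property and the API version come from the service documentation and this thread; the shape of the `vectorQueries` entry is an assumption based on my recollection of the preview API and should be checked against the page linked above. The service name, index name, `contentVector` field, and `category` filter are hypothetical.

```python
import json

API_VERSION = "2023-10-01-Preview"  # version named in this thread

def build_search_request(service, index, vector, k, filter_expr, mode="preFilter"):
    """Return (url, json_body) for a filtered vector query; nothing is sent."""
    url = (
        f"https://{service}.search.windows.net/indexes/{index}"
        f"/docs/search?api-version={API_VERSION}"
    )
    body = {
        "filter": filter_expr,           # OData filter on a filterable field
        "vectorFilterMode": mode,        # "preFilter" or "postFilter"
        # Assumed shape of the vector query entry; verify against the docs.
        "vectorQueries": [
            {"kind": "vector", "vector": vector, "fields": "contentVector", "k": k}
        ],
    }
    return url, json.dumps(body)

url, payload = build_search_request(
    service="my-service",                # hypothetical service name
    index="my-index",                    # hypothetical index name
    vector=[0.1, 0.2, 0.3],
    k=5,
    filter_expr="category eq 'match'",   # hypothetical filterable field
)
print(url)
print(payload)
```

Sending the body would still require an HTTP client and an `api-key` header, which are omitted here.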

cticevans commented 1 year ago

Thanks @HeidiSteen. I'm a little confused by the documentation. It says "In contrast with full text search, a filter in a pure vector query is effectively processed as a post-query operation. The set of "k" nearest neighbors is retrieved, and then combined with the set of filtered results. As such, the value of "k" predetermines the surface over which the filter is applied. For "k": 10, the filter is applied to 10 most similar documents. For "k": 100, the filter iterates over 100 documents (assuming the index contains 100 documents that are sufficiently similar to the query)."

Is this the old behavior, or does it remain the behavior even with the new pre-filter?

HeidiSteen commented 1 year ago

@cticevans, I missed that sentence when updating the docs :( Thanks for pointing it out -- I'll fix it now. Prefiltering is supported, and it's the new default, but you have to switch to the new REST API version, 2023-10-01-Preview, to get those behaviors.