deepset-ai / haystack

AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Filters in K-NN similarity search query not working (Open Distro Elastic Search) #1139

Closed. dfhssilva closed this issue 3 years ago.

dfhssilva commented 3 years ago

Describe the bug

Specifying a filter in a dense-vector query while using OpenDistroElasticsearchDocumentStore with EmbeddingRetriever raises KeyError: 'script_score'.

Error message

Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.7/dist-packages/starlette/middleware/cors.py", line 86, in __call__
    await self.simple_response(scope, receive, send, request_headers=headers)
  File "/usr/local/lib/python3.7/dist-packages/starlette/middleware/cors.py", line 142, in simple_response
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.7/dist-packages/starlette/exceptions.py", line 82, in __call__
    raise exc from None
  File "/usr/local/lib/python3.7/dist-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.7/dist-packages/starlette/routing.py", line 580, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.7/dist-packages/starlette/routing.py", line 241, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.7/dist-packages/starlette/routing.py", line 52, in app
    response = await func(request)
  File "/usr/local/lib/python3.7/dist-packages/fastapi/routing.py", line 202, in app
    dependant=dependant, values=values, is_coroutine=is_coroutine
  File "/usr/local/lib/python3.7/dist-packages/fastapi/routing.py", line 150, in run_endpoint_function
    return await run_in_threadpool(dependant.call, **values)
  File "/usr/local/lib/python3.7/dist-packages/starlette/concurrency.py", line 40, in run_in_threadpool
    return await loop.run_in_executor(None, func, *args)
  File "/usr/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
  File "/home/user/api/controller/search.py", line 46, in query
    result = _process_request(PIPELINE, request)
  File "/home/user/api/controller/search.py", line 63, in _process_request
    result = pipeline.run(query=request.query, filters=filters)
  File "/usr/local/lib/python3.7/dist-packages/haystack/pipeline.py", line 125, in run
    raise Exception(f"Exception while running node `{node_id}` with input `{node_input}`: {e}, full stack trace: {tb}")
Exception: Exception while running node `Retriever` with input `{'pipeline_type': 'Query', 'query': 'Finance', 'filters': {'additionalProp1': ['string'], 'additionalProp2': ['string'], 'additionalProp3': ['string']}}`: 'script_score', full stack trace: Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/haystack/pipeline.py", line 122, in run
    node_output, stream_id = self.graph.nodes[node_id]["component"].run(**node_input)
  File "/usr/local/lib/python3.7/dist-packages/haystack/retriever/base.py", line 180, in run
    output, stream = run_query_timed(**kwargs)
  File "/usr/local/lib/python3.7/dist-packages/haystack/retriever/base.py", line 43, in wrapper
    ret = fn(*args, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/haystack/retriever/base.py", line 197, in run_query
    documents = self.retrieve(query=query, filters=filters, top_k=top_k_retriever, index=index)
  File "/usr/local/lib/python3.7/dist-packages/haystack/retriever/dense.py", line 524, in retrieve
    top_k=top_k, index=index)
  File "/usr/local/lib/python3.7/dist-packages/haystack/document_store/elasticsearch.py", line 705, in query_by_embedding
    body["query"]["script_score"]["query"] = {"bool": {"filter": filter_clause}}
KeyError: 'script_score'
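
The failure can be reproduced without a running cluster. The following is a minimal sketch, assuming the Open Distro store builds a top-level `knn` clause for the vector query while the inherited filter code expects a `script_score` wrapper. The helper names `knn_query` and `attach_filter_like_parent` and the field name `embedding` are hypothetical and only serve the illustration:

```python
def knn_query(vector, k, field="embedding"):
    # Assumed shape of the Open Distro k-NN clause; "embedding" is a placeholder field name.
    return {"knn": {field: {"vector": vector, "k": k}}}


def attach_filter_like_parent(body, filter_clause):
    # Same assignment as the failing line in the traceback:
    # it only works if the query has a top-level "script_score" key.
    body["query"]["script_score"]["query"] = {"bool": {"filter": filter_clause}}
    return body


body = {"size": 5, "query": knn_query([3, 4], 5)}

try:
    attach_filter_like_parent(body, [{"terms": {"author": ["Alan Silva"]}}])
    error = None
except KeyError as exc:
    # The k-NN query has no "script_score" key, so the lookup fails.
    error = exc

print(repr(error))
```

If the assumption holds, any non-empty filter triggers the error, regardless of its contents.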

Expected behavior

I was expecting the behavior described in this article: https://opendistro.github.io/for-elasticsearch/blog/odfe-updates/2020/04/Building-k-Nearest-Neighbor-(k-NN)-Similarity-Search-Engine-with-Elasticsearch/

You can also combine the knn query clause with other query clauses as you would normally do with compound queries. In the example provided, the user first runs the knn query to find the closest five neighbors (k=5) to the vector [3,4] and then applies post filter to the results using the boolean query to focus on items that are priced less than 15 units.

POST /myindex/_search
{
  "size": 5,
  "query": {
    "bool": {
      "must": {
        "knn": {
          "my_vector": {
            "vector": [3, 4],
            "k": 5
          }
        }
      },
      "filter": {
        "range": {
          "price": {
            "lt": 15
          }
        }
      }
    }
  }
}
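
The same request body can be assembled in Python before it is sent to the cluster. This is a rough sketch rather than Haystack's implementation; the function name `build_knn_body` is hypothetical, and the field name and values are taken from the article's example:

```python
from typing import Any, Dict, List, Optional, Sequence


def build_knn_body(field: str,
                   vector: Sequence[float],
                   k: int,
                   filters: Optional[List[Dict[str, Any]]] = None) -> Dict[str, Any]:
    """Build a search body that combines a k-NN clause with optional filter clauses."""
    body: Dict[str, Any] = {
        "size": k,
        "query": {
            "bool": {
                # The k-NN clause finds the k nearest neighbors of `vector`.
                "must": {"knn": {field: {"vector": list(vector), "k": k}}}
            }
        },
    }
    if filters:
        # Filter clauses sit next to the k-NN clause inside the same bool query.
        body["query"]["bool"]["filter"] = filters
    return body


body = build_knn_body("my_vector", [3, 4], 5, filters=[{"range": {"price": {"lt": 15}}}])
```

Here the filter is passed as a list of clauses, which a `bool` query accepts as well as the single clause used in the article.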

Additional context

I am running Haystack inside a Docker Compose + FastAPI setup very similar to the example in the Haystack repo. The query pipeline I am using is:

components: 
  - name: DocumentStore
    type: OpenDistroElasticsearchDocumentStore
    params:
      host: odfe-node1
      port: 9200
      username: admin
      password: admin
      scheme: https
      verify_certs: False
      similarity: cosine
      return_embedding: True
  - name: Retriever
    type: EmbeddingRetriever
    params:
      document_store: DocumentStore
      embedding_model: deepset/sentence_bert
      top_k: 100
  - name: Reader
    type: CrossEncoderReRanker
    params:
      cross_encoder: cross-encoder/ms-marco-TinyBERT-L-6
      top_k: 10

pipelines:
  - name: query
    type: Query
    nodes:
      - name: Retriever
        inputs: [Query]
      - name: Reader
        inputs: [Retriever]

To Reproduce

Run the query pipeline specified above with any kind of filter; the error is raised on every such query.

System:

dfhssilva commented 3 years ago

A quick workaround I found is to override the query_by_embedding method in a subclass of OpenDistroElasticsearchDocumentStore, since the problem originates in the way the method is defined on the parent class, ElasticsearchDocumentStore.

import logging
from copy import deepcopy
from typing import List, Optional

import numpy as np
# Import paths as in the Haystack version shown in the traceback above
from haystack import Document
from haystack.document_store.elasticsearch import OpenDistroElasticsearchDocumentStore

logger = logging.getLogger(__name__)


class OpenDistroElasticsearchDocumentStore2(OpenDistroElasticsearchDocumentStore):
    def query_by_embedding(self,
                           query_emb: np.ndarray,
                           filters: Optional[List[dict]] = None,
                           top_k: int = 10,
                           index: Optional[str] = None,
                           return_embedding: Optional[bool] = None) -> List[Document]:
        """
        Find the documents that are most similar to the provided `query_emb` by using a vector similarity metric.

        :param query_emb: Embedding of the query (e.g. gathered from DPR)
        :param filters: Optional filters to narrow down the search space. Follows Open Distro for
            Elasticsearch syntax: https://opendistro.github.io/for-elasticsearch-docs/docs/elasticsearch/bool/. Example:
            [
                {
                    "terms": {
                        "author": [
                            "Alan Silva",
                            "Mark Costa",
                        ]
                    }
                },
                {
                    "range": {
                        "timestamp": {
                            "gte": "01-01-2021",
                            "lt": "01-06-2021"
                        }
                    }
                }
            ]
        :param top_k: How many documents to return
        :param index: Name of the index to query
        :param return_embedding: Whether to return the document embeddings
        :return: The documents most similar to `query_emb`
        """
        if index is None:
            index = self.index

        if return_embedding is None:
            return_embedding = self.return_embedding

        if not self.embedding_field:
            raise RuntimeError("Please specify arg `embedding_field` in ElasticsearchDocumentStore()")

        # Wrap the k-NN clause in a bool query so that filters can be attached next to it
        body = {
            "size": top_k,
            "query": {
                "bool": {
                    "must": self._get_vector_similarity_query(query_emb, top_k)
                }
            }
        }
        if filters:
            body["query"]["bool"]["filter"] = filters

        # Decide which fields to drop from the returned _source
        excluded_meta_data: Optional[list] = None

        if self.excluded_meta_data:
            excluded_meta_data = deepcopy(self.excluded_meta_data)

            if return_embedding is True and self.embedding_field in excluded_meta_data:
                excluded_meta_data.remove(self.embedding_field)
            elif return_embedding is False and self.embedding_field not in excluded_meta_data:
                excluded_meta_data.append(self.embedding_field)
        elif return_embedding is False:
            excluded_meta_data = [self.embedding_field]

        if excluded_meta_data:
            body["_source"] = {"excludes": excluded_meta_data}

        logger.debug(f"Retriever query: {body}")
        result = self.client.search(index=index, body=body, request_timeout=300)["hits"]["hits"]

        documents = [
            self._convert_es_hit_to_document(hit, adapt_score_for_embedding=True, return_embedding=return_embedding)
            for hit in result
        ]
        return documents

Also, note that I changed the input type of the filters parameter as previously it only accepted "terms" filters. This is unrelated to this issue and I only did it because I wanted to pass "range" filters. Is this something you might consider adding in the future?
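
Callers that still pass filters in the earlier format can translate them into the clause list. The following is a minimal sketch, assuming the earlier format is a mapping from field name to a list of allowed values (as in the `filters` payload shown in the traceback) and that each entry corresponds to one "terms" clause. The function name `terms_filters_to_clauses` is hypothetical:

```python
from typing import Any, Dict, List


def terms_filters_to_clauses(filters: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Convert {"field": [values]} filters into a list of "terms" clauses."""
    clauses: List[Dict[str, Any]] = []
    for field, values in filters.items():
        if not isinstance(values, list):
            raise ValueError(f"Expected a list of values for filter field '{field}'")
        clauses.append({"terms": {field: values}})
    return clauses


clauses = terms_filters_to_clauses({"author": ["Alan Silva", "Mark Costa"]})
# Additional clause types, such as "range", can then be appended to the same list.
clauses.append({"range": {"timestamp": {"gte": "01-01-2021", "lt": "01-06-2021"}}})
```

The resulting list can be passed as the `filters` argument of the overridden method above.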

srevinsaju commented 3 years ago

Can reproduce on OpenDistro.

@DavidSilva98 do you want to create a PR for the same? I think it would be useful

dfhssilva commented 3 years ago

I will try to do it later tonight @srevinsaju. I will just include the ability to add filters together with the knn search and not the change related to the filters argument.

srevinsaju commented 3 years ago

Awesome! Thanks a lot ✨