Closed dfhssilva closed 3 years ago
A quick solution that I found was to override the `query_by_embedding` method in the `OpenDistroElasticsearchDocumentStore` class, as the problem originates from the way the method is defined on the parent class `ElasticsearchDocumentStore`.
import logging
from copy import deepcopy
from typing import List, Optional

import numpy as np

# `OpenDistroElasticsearchDocumentStore` and `Document` come from haystack;
# the import path depends on the installed version.

logger = logging.getLogger(__name__)


class OpenDistroElasticsearchDocumentStore2(OpenDistroElasticsearchDocumentStore):
    def query_by_embedding(self,
                           query_emb: np.ndarray,
                           filters: Optional[List[dict]] = None,
                           top_k: int = 10,
                           index: Optional[str] = None,
                           return_embedding: Optional[bool] = None) -> List[Document]:
        """
        Find the document that is most similar to the provided `query_emb` by using a vector similarity metric.

        :param query_emb: Embedding of the query (e.g. gathered from DPR)
        :param filters: Optional filters to narrow down the search space. Follows Open Distro for
                        Elasticsearch syntax: https://opendistro.github.io/for-elasticsearch-docs/docs/elasticsearch/bool/. Example:
                            [
                                {
                                    "terms": {
                                        "author": [
                                            "Alan Silva",
                                            "Mark Costa",
                                        ]
                                    }
                                },
                                {
                                    "range": {
                                        "timestamp": {
                                            "gte": "01-01-2021",
                                            "lt": "01-06-2021"
                                        }
                                    }
                                }
                            ]
        :param top_k: How many documents to return
        :param index: Index name for storing the docs and metadata
        :param return_embedding: To return document embedding
        :return:
        """
        if index is None:
            index = self.index

        if return_embedding is None:
            return_embedding = self.return_embedding

        if not self.embedding_field:
            raise RuntimeError("Please specify arg `embedding_field` in ElasticsearchDocumentStore()")
        else:
            # +1 in similarity to avoid negative numbers (for cosine sim)
            # Wrap the vector query in a bool query so that filters can be attached next to it.
            body = {
                "size": top_k,
                "query": {
                    "bool": {
                        "must": self._get_vector_similarity_query(query_emb, top_k)
                    }
                }
            }
            if filters:
                body["query"]["bool"]["filter"] = filters

            excluded_meta_data: Optional[list] = None

            if self.excluded_meta_data:
                excluded_meta_data = deepcopy(self.excluded_meta_data)

                if return_embedding is True and self.embedding_field in excluded_meta_data:
                    excluded_meta_data.remove(self.embedding_field)
                elif return_embedding is False and self.embedding_field not in excluded_meta_data:
                    excluded_meta_data.append(self.embedding_field)
            elif return_embedding is False:
                excluded_meta_data = [self.embedding_field]

            if excluded_meta_data:
                body["_source"] = {"excludes": excluded_meta_data}

            logger.debug(f"Retriever query: {body}")
            result = self.client.search(index=index, body=body, request_timeout=300)["hits"]["hits"]

            documents = [
                self._convert_es_hit_to_document(hit, adapt_score_for_embedding=True, return_embedding=return_embedding)
                for hit in result
            ]
            return documents
Also, note that I changed the input type of the `filters` parameter, as previously it only accepted "terms" filters. This is unrelated to this issue; I only did it because I wanted to pass "range" filters. Is this something you might consider adding in the future?
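For illustration, the request body that the override above builds can be sketched with plain dictionaries. `build_knn_body` is a hypothetical helper, not part of haystack, and the shape of the `knn` clause and the field name `embedding` are assumptions made for the example.

```python
from typing import Any, Dict, List, Optional


def build_knn_body(knn_query: Dict[str, Any],
                   top_k: int,
                   filters: Optional[List[dict]] = None) -> Dict[str, Any]:
    """Hypothetical helper: wrap a k-NN clause in a bool query and attach optional filters."""
    body: Dict[str, Any] = {
        "size": top_k,
        "query": {"bool": {"must": knn_query}},
    }
    if filters:
        # Clauses in the "filter" context restrict the hits without contributing to the score.
        body["query"]["bool"]["filter"] = filters
    return body


# Assumed k-NN clause shape; "embedding" is a placeholder field name.
knn_clause = {"knn": {"embedding": {"vector": [0.1, 0.2, 0.3], "k": 10}}}
filters = [
    {"terms": {"author": ["Alan Silva", "Mark Costa"]}},
    {"range": {"timestamp": {"gte": "01-01-2021", "lt": "01-06-2021"}}},
]
body = build_knn_body(knn_clause, top_k=10, filters=filters)
```

Because `filters` is passed through unchanged as a list of clauses, "terms" and "range" filters can be combined freely in one query.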
Can reproduce on OpenDistro.
@DavidSilva98 do you want to create a PR for the same? I think it would be useful
I will try to do it later tonight @srevinsaju. I will just include the ability to add filters together with the knn search and not the change related to the filters argument.
Awesome! Thanks a lot ✨
Describe the bug
Specifying a filter in a query made with dense vectors while working with `OpenDistroElasticsearchDocumentStore` and `EmbeddingRetriever` throws a `KeyError: 'script_score'`.
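A minimal sketch of how such an error can arise, assuming the parent class writes the filter into a `script_score` clause while the Open Distro store produces a `knn` clause instead. The dictionaries below are illustrative stand-ins, not the library's actual code, and `embedding` is a placeholder field name.

```python
# Assumed shape of the query produced for Open Distro.
knn_query = {"knn": {"embedding": {"vector": [0.1, 0.2, 0.3], "k": 10}}}
body = {"size": 10, "query": knn_query}

filters = {"author": ["Alan Silva"]}
try:
    # Assumed parent-class behavior: the filter is attached under "script_score",
    # a key that is absent when the query is a "knn" clause.
    body["query"]["script_score"]["query"] = {"terms": filters}
    error = None
except KeyError as exc:
    error = exc

print(repr(error))
```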
Error message
Expected behavior
I was expecting the behavior described in this article: https://opendistro.github.io/for-elasticsearch/blog/odfe-updates/2020/04/Building-k-Nearest-Neighbor-(k-NN)-Similarity-Search-Engine-with-Elasticsearch/
Additional context
I am running haystack inside a Docker Compose + FastAPI setup very similar to the example in the haystack repo. The query pipeline I am using is:
To Reproduce
Running the query pipeline specified above with any kind of filter raises the error.
System: