deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Cannot use EmbeddingRetriever (Indexing of embeddings) #116

Closed ierezell closed 4 years ago

ierezell commented 4 years ago

I want to use an EmbeddingRetriever instead of BM25.

I have a large handcrafted corpus of text files, each corresponding to a paragraph.

The goal is to ask a question and get the k best paragraphs based on embedding similarity between the text files and the question. I've written my own method, but I would like to compare it with Haystack.

I'm creating the DocumentStore as in the docs:

document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document", text_field="answer", embedding_field="question_emb", embedding_dim=768, excluded_meta_data=["question_emb"])
write_documents_to_db(document_store=document_store, only_empty_db=True, document_dir="./datas/txt")

Then the retriever (a multilingual sentence-transformers model):

retriever = EmbeddingRetriever(document_store=document_store, embedding_model="distiluse-base-multilingual-cased", model_format="sentence_transformers")

And finally, when I try to retrieve the top-k documents:

top_docs = retriever.retrieve(query=q, top_k=10)

I get this log output:

05/19/2020 10:22:49 - INFO - elasticsearch -   PUT http://localhost:9200/document [status:400 request:0.008s]
05/19/2020 10:22:49 - INFO - elasticsearch -   POST http://localhost:9200/_count [status:200 request:0.003s]
05/19/2020 10:22:49 - INFO - haystack.indexing.io -   Skip writing documents since DB already contains 2141 docs ...  (Disable `only_empty_db`, if you want to add docs anyway.)
05/19/2020 10:22:49 - INFO - haystack.retriever.elasticsearch -   Init retriever using embeddings of model distiluse-base-multilingual-cased
05/19/2020 10:22:49 - INFO - root -   Load pretrained SentenceTransformer: distiluse-base-multilingual-cased
05/19/2020 10:22:49 - INFO - root -   Did not find a '/' or '\' in the name. Assume to download model from server.
05/19/2020 10:22:49 - INFO - root -   Load SentenceTransformer from folder: /home/pedro/.cache/torch/sentence_transformers/public.ukp.informatik.tu-darmstadt.de_reimers_sentence-transformers_v0.2_distiluse-base-multilingual-cased.zip
05/19/2020 10:22:51 - INFO - root -   Use pytorch device: cuda
Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 148.86it/s]

Then a traceback:

Traceback (most recent call last):
  File "/home/pedro/.local/lib/python3.8/site-packages/elasticsearch/serializer.py", line 50, in dumps
    return json.dumps(
  File "/home/pedro/.local/lib/python3.8/site-packages/simplejson/__init__.py", line 398, in dumps
    return cls(
  File "/home/pedro/.local/lib/python3.8/site-packages/simplejson/encoder.py", line 296, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/home/pedro/.local/lib/python3.8/site-packages/simplejson/encoder.py", line 378, in iterencode
    return _iterencode(o, 0)
  File "/home/pedro/.local/lib/python3.8/site-packages/elasticsearch/serializer.py", line 36, in default
    raise TypeError("Unable to serialize %r (type: %s)" % (data, type(data)))
TypeError: Unable to serialize -0.065520875 (type: <class 'numpy.float32'>)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "sdk/accuracy_retriever.py", line 75, in <module>
    top_docs = haystack_retriever(q, doc_store)
  File "/mnt/Documents/Projets/BotPress/R_D/R_D_q_a/sdk/retrievers.py", line 113, in haystack_retriever
    top_docs = retriever.retrieve(query=q, top_k=10)
  File "/mnt/Documents/Projets/git_clones/haystack/haystack/retriever/elasticsearch.py", line 92, in retrieve
    documents = self.document_store.query_by_embedding(query_emb[0], top_k, candidate_doc_ids)
  File "/mnt/Documents/Projets/git_clones/haystack/haystack/database/elasticsearch.py", line 184, in query_by_embedding
    result = self.client.search(index=self.index, body=body)["hits"]["hits"]
  File "/home/pedro/.local/lib/python3.8/site-packages/elasticsearch/client/utils.py", line 92, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/home/pedro/.local/lib/python3.8/site-packages/elasticsearch/client/__init__.py", line 1622, in search
    return self.transport.perform_request(
  File "/home/pedro/.local/lib/python3.8/site-packages/elasticsearch/transport.py", line 321, in perform_request
    body = self.serializer.dumps(body)
  File "/home/pedro/.local/lib/python3.8/site-packages/elasticsearch/serializer.py", line 54, in dumps
    raise SerializationError(data, e)
elasticsearch.exceptions.SerializationError: ({'size': 10, 'query': {'script_score': {'query': {'match_all': {}}, 'script': {'source': "cosineSimilarity(params.query_vector,doc['question_emb']) + 1.0", 'params': {'query_vector': [-0.065520875, 0.023728848, ... lot of numbers ..., 0.047961414]}}}}, '_source': {'excludes': ['question_emb']}}, TypeError("Unable to serialize -0.065520875 (type: <class 'numpy.float32'>)"))
tholor commented 4 years ago

Hey @Ierezell ,

To use the EmbeddingRetriever, you first need to add embeddings for your documents to the Elasticsearch index. At query time we then embed the question on the fly and calculate its similarity to the indexed embeddings.

Please have a look at Tutorial 4, where we do this for "FAQ-style QA". If you want to apply this to regular extractive QA, you would need to adjust the "write_documents_to_db()" utility function to also create the embeddings.

We'll probably simplify this and add an example once we have implemented the DPR encoders (#63). Be aware that using sentence-transformers / USE embeddings for both questions and long passages might not yield great performance; dual-encoder approaches usually work better here.
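A minimal sketch of the "index embeddings first" idea described above. The embed() function here is a deterministic stand-in so the snippet is self-contained; in practice you would call SentenceTransformer("distiluse-base-multilingual-cased").encode(text) instead. The field name "question_emb" follows the DocumentStore config from the issue; build_indexable_docs is a hypothetical helper, not a Haystack API.

```python
import hashlib
import json

def embed(text, dim=8):
    # Stand-in for the real sentence-transformers model: a deterministic
    # pseudo-embedding derived from a hash, for illustration only.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]

def build_indexable_docs(paragraphs):
    """Attach an embedding to each paragraph before writing it to Elasticsearch."""
    docs = []
    for i, text in enumerate(paragraphs):
        docs.append({
            "_id": str(i),
            "text": text,
            # Field name matches the embedding_field configured on the store.
            "question_emb": embed(text),
        })
    return docs

docs = build_indexable_docs(["First paragraph.", "Second paragraph."])
# Every value is a native Python type, so the documents serialize cleanly.
serialized = json.dumps(docs)
```

Once documents carry their embedding at write time, the retriever only needs to embed the query and let Elasticsearch score the similarity.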

ierezell commented 4 years ago

Oh sorry, I thought write_documents_to_db would embed each sentence when initializing the DB. I will embed them myself first, thanks.

Yes indeed; to get around that I split my dataset into chunks. Since I'm looking for similarity on the content, chunking is fine and just leads to more documents. Furthermore, I'm doing unsupervised lookup, so I don't have question/passage pairs to train a dual encoder, but thanks a lot for the suggestion, it's a really good one :smiley:
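The chunking approach mentioned above can be sketched as a simple overlapping word window. The chunk size and overlap here are arbitrary illustrative values, not numbers from the issue:

```python
def chunk_text(text, max_words=100, overlap=20):
    """Split a long passage into overlapping word-window chunks.

    The overlap keeps sentences that straddle a chunk boundary
    retrievable from at least one chunk.
    """
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

# A 250-word passage becomes three overlapping chunks of <= 100 words each.
chunks = chunk_text("word " * 250, max_words=100, overlap=20)
```

Each chunk then becomes its own document in the store, so retrieval granularity matches what the embedding model can represent well.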

tholor commented 4 years ago

Besides the need to index the embeddings first, there was also a small issue with serializing the numpy float values returned by sentence-transformers. Fixed this in #121.
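For reference, the TypeError in the traceback occurs because JSON encoders don't handle numpy scalar types. A minimal sketch of the kind of cast that avoids it, using plain numpy and json (this is an illustration, not Haystack's actual patch in #121):

```python
import json
import numpy as np

# Embedding models typically return float32 numpy arrays.
query_emb = np.random.rand(4).astype(np.float32)

# .tolist() converts numpy scalars to native Python floats; sending the
# raw numpy values to Elasticsearch triggers the serialization error
# seen in the traceback above.
body = {"query_vector": query_emb.tolist()}
serialized = json.dumps(body)
```

The same cast works for any numpy dtype, which is why converting to native types at the document-store boundary is the usual fix.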

tholor commented 4 years ago

Closing this as the original issue seems to be fixed. @Ierezell feel free to re-open in case there's still a problem on your side.