deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Cannot use EmbeddingRetriever (Indexing of embeddings) #116

Closed ierezell closed 4 years ago

ierezell commented 4 years ago

I want to use an EmbeddingRetriever instead of BM25.

I have a large handcrafted corpus of text files, each corresponding to a paragraph.

The goal is to ask a question and get the k best paragraphs based on embedding similarity between the text files and the question. I've written my own method, but I would like to compare it with Haystack.

I'm creating the DocumentStore as in the docs:

document_store = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document", text_field="answer", embedding_field="question_emb", embedding_dim=768, excluded_meta_data=["question_emb"])
write_documents_to_db(document_store=document_store, only_empty_db=True, document_dir="./datas/txt")

Then the retriever (a multilingual sentence-transformers model):

retriever = EmbeddingRetriever(document_store=document_store, embedding_model="distiluse-base-multilingual-cased", model_format="sentence_transformers")

And finally, when I try to retrieve the top-k documents:

top_docs = retriever.retrieve(query=q, top_k=10)

I get this log output:

05/19/2020 10:22:49 - INFO - elasticsearch -   PUT http://localhost:9200/document [status:400 request:0.008s]
05/19/2020 10:22:49 - INFO - elasticsearch -   POST http://localhost:9200/_count [status:200 request:0.003s]
05/19/2020 10:22:49 - INFO - haystack.indexing.io -   Skip writing documents since DB already contains 2141 docs ...  (Disable `only_empty_db`, if you want to add docs anyway.)
05/19/2020 10:22:49 - INFO - haystack.retriever.elasticsearch -   Init retriever using embeddings of model distiluse-base-multilingual-cased
05/19/2020 10:22:49 - INFO - root -   Load pretrained SentenceTransformer: distiluse-base-multilingual-cased
05/19/2020 10:22:49 - INFO - root -   Did not find a '/' or '\' in the name. Assume to download model from server.
05/19/2020 10:22:49 - INFO - root -   Load SentenceTransformer from folder: /home/pedro/.cache/torch/sentence_transformers/public.ukp.informatik.tu-darmstadt.de_reimers_sentence-transformers_v0.2_distiluse-base-multilingual-cased.zip
05/19/2020 10:22:51 - INFO - root -   Use pytorch device: cuda
Batches: 100%|█████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 148.86it/s]

Then a traceback:

Traceback (most recent call last):
  File "/home/pedro/.local/lib/python3.8/site-packages/elasticsearch/serializer.py", line 50, in dumps
    return json.dumps(
  File "/home/pedro/.local/lib/python3.8/site-packages/simplejson/__init__.py", line 398, in dumps
    return cls(
  File "/home/pedro/.local/lib/python3.8/site-packages/simplejson/encoder.py", line 296, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "/home/pedro/.local/lib/python3.8/site-packages/simplejson/encoder.py", line 378, in iterencode
    return _iterencode(o, 0)
  File "/home/pedro/.local/lib/python3.8/site-packages/elasticsearch/serializer.py", line 36, in default
    raise TypeError("Unable to serialize %r (type: %s)" % (data, type(data)))
TypeError: Unable to serialize -0.065520875 (type: <class 'numpy.float32'>)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "sdk/accuracy_retriever.py", line 75, in <module>
    top_docs = haystack_retriever(q, doc_store)
  File "/mnt/Documents/Projets/BotPress/R_D/R_D_q_a/sdk/retrievers.py", line 113, in haystack_retriever
    top_docs = retriever.retrieve(query=q, top_k=10)
  File "/mnt/Documents/Projets/git_clones/haystack/haystack/retriever/elasticsearch.py", line 92, in retrieve
    documents = self.document_store.query_by_embedding(query_emb[0], top_k, candidate_doc_ids)
  File "/mnt/Documents/Projets/git_clones/haystack/haystack/database/elasticsearch.py", line 184, in query_by_embedding
    result = self.client.search(index=self.index, body=body)["hits"]["hits"]
  File "/home/pedro/.local/lib/python3.8/site-packages/elasticsearch/client/utils.py", line 92, in _wrapped
    return func(*args, params=params, headers=headers, **kwargs)
  File "/home/pedro/.local/lib/python3.8/site-packages/elasticsearch/client/__init__.py", line 1622, in search
    return self.transport.perform_request(
  File "/home/pedro/.local/lib/python3.8/site-packages/elasticsearch/transport.py", line 321, in perform_request
    body = self.serializer.dumps(body)
  File "/home/pedro/.local/lib/python3.8/site-packages/elasticsearch/serializer.py", line 54, in dumps
    raise SerializationError(data, e)
elasticsearch.exceptions.SerializationError: ({'size': 10, 'query': {'script_score': {'query': {'match_all': {}}, 'script': {'source': "cosineSimilarity(params.query_vector,doc['question_emb']) + 1.0", 'params': {'query_vector': [-0.065520875, 0.023728848, ... lot of numbers ..., 0.047961414]}}}}, '_source': {'excludes': ['question_emb']}}, TypeError("Unable to serialize -0.065520875 (type: <class 'numpy.float32'>)"))
tholor commented 4 years ago

Hey @Ierezell ,

To use the EmbeddingRetriever, you first need to add embeddings for your documents to the Elasticsearch index. At query time we then embed the question on the fly and calculate its similarity to the indexed embeddings.

Please have a look at Tutorial 4, where we do this for "FAQ-style QA". If you want to apply this to regular extractive QA, you would need to adjust the "write_documents_to_db()" utility function to also create the embeddings.

We'll probably simplify this and add an example once we have implemented the DPR encoders (#63). Be aware that using sentence-transformers / USE embeddings for both questions and long passages might not yield great performance; dual-encoder approaches usually work better here.
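A minimal sketch of the "index embeddings first" idea described above. The embed() function here is a deterministic stand-in so the snippet is self-contained; in practice you would call SentenceTransformer("distiluse-base-multilingual-cased").encode(text) instead. The field name "question_emb" follows the DocumentStore config from the issue; build_indexable_docs is a hypothetical helper, not a Haystack API.

```python
import hashlib
import json

def embed(text, dim=8):
    # Stand-in for the real sentence-transformers model: a deterministic
    # pseudo-embedding derived from a hash, for illustration only.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255.0 for b in digest[:dim]]

def build_indexable_docs(paragraphs):
    """Attach an embedding to each paragraph before writing it to Elasticsearch."""
    docs = []
    for i, text in enumerate(paragraphs):
        docs.append({
            "_id": str(i),
            "text": text,
            # Field name matches the embedding_field configured on the store.
            "question_emb": embed(text),
        })
    return docs

docs = build_indexable_docs(["First paragraph.", "Second paragraph."])
# Every value is a native Python type, so the documents serialize cleanly.
serialized = json.dumps(docs)
```

Once documents carry their embedding at write time, the retriever only needs to embed the query and let Elasticsearch score the similarity.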

ierezell commented 4 years ago

Oh sorry, I thought write_documents_to_db would embed each sentence when initializing the DB. I will embed them myself first, thanks.

Yes indeed; to get around that I split my dataset into chunks. Since I'm looking for similarity on the content, chunking is fine and just leads to more documents. Furthermore, I'm doing unsupervised lookup, so I don't have question/passage pairs to train a dual encoder, but thanks a lot for the suggestion, it's a really good one :smiley:
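The chunking approach mentioned above can be sketched as a simple overlapping word window. The chunk size and overlap here are arbitrary illustrative values, not numbers from the issue:

```python
def chunk_text(text, max_words=100, overlap=20):
    """Split a long passage into overlapping word-window chunks.

    The overlap keeps sentences that straddle a chunk boundary
    retrievable from at least one chunk.
    """
    words = text.split()
    if len(words) <= max_words:
        return [text]
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break
    return chunks

# A 250-word passage becomes three overlapping chunks of <= 100 words each.
chunks = chunk_text("word " * 250, max_words=100, overlap=20)
```

Each chunk then becomes its own document in the store, so retrieval granularity matches what the embedding model can represent well.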

tholor commented 4 years ago

Besides the need to index the embeddings first, there was also a small issue with serializing the numpy float values returned by sentence-transformers. Fixed this in #121.
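For reference, the TypeError in the traceback occurs because JSON encoders don't handle numpy scalar types. A minimal sketch of the kind of cast that avoids it, using plain numpy and json (this is an illustration, not Haystack's actual patch in #121):

```python
import json
import numpy as np

# Embedding models typically return float32 numpy arrays.
query_emb = np.random.rand(4).astype(np.float32)

# .tolist() converts numpy scalars to native Python floats; sending the
# raw numpy values to Elasticsearch triggers the serialization error
# seen in the traceback above.
body = {"query_vector": query_emb.tolist()}
serialized = json.dumps(body)
```

The same cast works for any numpy dtype, which is why converting to native types at the document-store boundary is the usual fix.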

tholor commented 4 years ago

Closing this as the original issue seems to be fixed. @Ierezell feel free to re-open in case there's still a problem on your side.