Closed ierezell closed 4 years ago
Hey @Ierezell ,
To use the EmbeddingRetriever, you'll need to add the embeddings for your documents to the Elasticsearch index first. At query time we can then embed the question on the fly and calculate its similarity to the indexed embeddings.
Please have a look at Tutorial 4, where we do this for "FAQ style QA". If you wanted to apply this to regular extractive QA, you would need to adjust the "write_documents_to_db()" utility function to also create the embeddings.
We'll probably simplify this and add an example once we have implemented the DPR encoders (#63). Be aware that using sentence-transformers / USE embeddings for both (question and long passages) might not yield great performance. Dual encoder approaches usually work better here.
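To make the index-then-query flow described above concrete, here is a minimal, library-agnostic sketch. It does not use Haystack or Elasticsearch: `toy_embed`, `build_index`, and `retrieve` are hypothetical names, and the hashed bag-of-words embedding is only a stand-in for a real encoder such as a sentence-transformers model.

```python
import math
import re
import zlib
from typing import List, Tuple

DIM = 256  # size of the toy embedding; an arbitrary choice for this sketch


def toy_embed(text: str) -> List[float]:
    """Hypothetical stand-in for a real sentence encoder.

    Hashes each word into one of DIM buckets and L2-normalises the result,
    so the dot product of two embeddings equals their cosine similarity.
    """
    vec = [0.0] * DIM
    for token in re.findall(r"\w+", text.lower()):
        vec[zlib.crc32(token.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec


def build_index(docs: List[str]) -> List[Tuple[str, List[float]]]:
    """Index time: compute and store one embedding per document."""
    return [(doc, toy_embed(doc)) for doc in docs]


def retrieve(
    question: str, index: List[Tuple[str, List[float]]], top_k: int = 3
) -> List[Tuple[str, float]]:
    """Query time: embed the question and rank documents by cosine similarity."""
    q = toy_embed(question)
    scored = [(doc, sum(a * b for a, b in zip(q, emb))) for doc, emb in index]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]


docs = [
    "Elasticsearch stores the document embeddings",
    "Bananas are yellow fruit",
    "The retriever embeds the question at query time",
]
index = build_index(docs)
for doc, score in retrieve("Where are the document embeddings stored?", index, top_k=2):
    print(f"{score:.3f}  {doc}")
```

The point of the sketch is the split of responsibilities: document embeddings are computed once and stored, and only the question is embedded per query. In a real setup the stored vectors would live in the document store rather than in a Python list.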
Oh sorry, I thought the write_to_db would get each sentence and embed it when initialising the db. I will embed them myself first, thanks.
Yes indeed, to overcome that I split my dataset into chunks. Because I'm looking for similarity on the content, chunking is okay and just leads to more documents. Furthermore, I'm doing unsupervised lookup, so I don't have pairs to train a dual encoder, but thanks a lot for the suggestion, it's a really good one :smiley:
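For anyone reading along, the chunking mentioned above could look roughly like the sketch below. It is not the code used in this thread: `chunk_words` is a hypothetical helper, and the word-based splitting, the `max_words` limit, and the optional `overlap` are all assumptions made for illustration.

```python
from typing import List


def chunk_words(text: str, max_words: int = 100, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most max_words words.

    Consecutive chunks share `overlap` words, so a sentence cut at a chunk
    boundary still appears with some context in the next chunk.
    """
    if max_words <= 0:
        raise ValueError("max_words must be positive")
    if not 0 <= overlap < max_words:
        raise ValueError("overlap must satisfy 0 <= overlap < max_words")

    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
        if start + max_words >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks
```

Each chunk then becomes its own document with its own embedding, which is why chunking simply increases the document count.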
Besides your issue of indexing the embeddings first, there was still a small issue with the serialization of float64 returned by sentence-transformers. Fixed this in #121.
Closing this as the original issue seems to be fixed. @Ierezell feel free to re-open in case there's still a problem on your side.
I want to use an EmbeddingRetriever instead of a BM25. I have a huge handcrafted corpus of text files, each corresponding to a paragraph.
The goal is to ask a question and get the k best paragraphs by embedding similarity (between the text files and the question). I've written my own method, but I would like to compare it with Haystack.
I'm creating the DocumentStore as in the docs:
Then the retriever (multilingual sentence transformer):
And finally, when I try to retrieve the top k documents:
I got this info:
Then a traceback