deepset-ai / haystack-core-integrations

Additional packages (components, document stores and the likes) to extend the capabilities of Haystack version 2.0 and onwards
https://haystack.deepset.ai
Apache License 2.0

ElasticSearch Retriever is not performing well #598

Closed Asma-droid closed 2 weeks ago

Asma-droid commented 5 months ago

Hello,

I am using Elasticsearch as my DocumentStore, with the Elasticsearch embedding retriever configured as follows:

 embedding_retriever:
    init_parameters:
      document_store:
        embedding_similarity_function: l2_norm
        init_parameters:
          hosts: http://elasticsearch:9200
        type: haystack_integrations.document_stores.elasticsearch.document_store.ElasticsearchDocumentStore
      num_candidates: 10
      top_k: 10
    type: haystack_integrations.components.retrievers.elasticsearch.embedding_retriever.ElasticsearchEmbeddingRetriever

Even when the question is outside the context of the indexed documents, the retriever still returns documents with a high score. Below is an example:

{ "AnswerBuilder": { "answers": [ { "data": " The context provided does not contain information about Langchain.", "query": "WHat is langchain ?", "documents": [ { "id": "b0b39b5c34c63991019b566e34b1ccfb784cf96a461cebc3711611fd5d9b8b38", "content": "general-purpose speech toolkit. arXiv preprint\narXiv:2106.04624 .\nRebai, I., Benhamiche, S., Thompson, K., Sellami, Z.,\nLaine, D., and Lorr ´e, J.-P. (2020). Linto platform: A\nsmart open voice assistant for business environments.\nInProceedings of the 1st International Workshop on\nLanguage Technology Platforms , pages 89–95.\nRNNoise (2023). Github RNNoise. https://github.com/\nxiph/rnnoise.\nSpiller, T. R., Ben-Zion, Z., Korem, N., Harpaz-Rotem, I.,\nand Duek, O. (2023). Efficient and accurate transcrip-\ntion in mental health research-a tutorial on using whis-\nper ai for sound file transcription.Suznjevic, M. and Saldana, J. (2016). Delay limits for real-\ntime services. IETF draft .\nTrabelsi, A., Warichet, S., Aajaoun, Y ., and Soussilane, S.\n(2022). Evaluation of the efficiency of state-of-the-\nart speech recognition engines. Procedia Computer\nScience , 207:2242–2252.\nUnion, I. T. (2016). Mean opinion score interpretation and\nreporting. Standard, International Telecommunication\nUnion, Geneva, CH.\nValin, J.-M. (2018). A hybrid dsp/deep learning approach\nto real-time full-band speech enhancement. In 2018\nIEEE 20th international workshop on multimedia sig-\nnal processing (MMSP) , pages 1–5. IEEE.\nVaseghi, S. V . (2008). Advanced digital ", "dataframe": null, "blob": null, "meta": { "source": "default/ICAART24.pdf", "page": 7, "source_id": "74d29100e8daffb446d9d6e1c7185e096e3a51cf9332fc6c421cd9ca467648d6" }, "score": 0.67131597,

Best regards

DemirTonchev commented 5 months ago

Elasticsearch uses the BM25 algorithm; why do you think a score of 0.67 is high?

Asma-droid commented 5 months ago

@DemirTonchev I am using the ES embedding retriever. When the query does match the retrieved documents, I also get scores between 0.60 and 0.82. So, in my view, if the query does not match the retrieved documents, the scores should be very small.
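For reference, here is a minimal sketch of how such a score can be read, assuming the retriever passes through the Elasticsearch kNN `_score` and that the index really uses `l2_norm` similarity. Elasticsearch documents that score as `1 / (1 + l2_norm(query, vector)^2)`, so it always lies in (0, 1] and can be converted back to a Euclidean distance. The helper names below are hypothetical and not part of Haystack or Elasticsearch.

```python
import math


def l2_distance_to_score(distance: float) -> float:
    """Score documented by Elasticsearch for l2_norm kNN: 1 / (1 + d^2)."""
    return 1.0 / (1.0 + distance ** 2)


def l2_score_to_distance(score: float) -> float:
    """Invert the formula above to recover the Euclidean distance d."""
    if not 0.0 < score <= 1.0:
        raise ValueError("an l2_norm score must lie in (0, 1]")
    return math.sqrt(1.0 / score - 1.0)


# Scores quoted in this thread, converted back to distances.
for score in (0.60, 0.67131597, 0.82):
    print(f"score={score:.4f} -> distance={l2_score_to_distance(score):.4f}")
```

Under these assumptions, a score near 0 would require a very large distance between the query and the document embedding, which may explain why even unrelated text does not score close to 0.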

DemirTonchev commented 5 months ago

So, in my view, if the query does not match the retrieved documents, the scores should be very small.

A score of 0.6 to 0.82 is usually (in my experience) negligibly small. What is the size of your corpus and the average IDF? Looking at the query "WHat is langchain ?" and the output document, I would expect the score to be small, since there is no "langchain" in the returned text. How many documents in the corpus contain at least one occurrence of "langchain"? Also, I suspect that " " (whitespace) is in your ES document store, which is not ideal.
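To make the BM25 expectation concrete, here is a minimal textbook-style sketch. It is not Lucene's exact implementation, and the two-document corpus is made up for illustration. It shows that a lexical score depends on corpus size, document length and IDF, and that a query term absent from a document contributes nothing.

```python
import math
from collections import Counter


def bm25_score(query_terms, doc_terms, corpus, k1=1.2, b=0.75):
    """Textbook BM25 score of one tokenized document against a tokenized query."""
    n_docs = len(corpus)
    avgdl = sum(len(doc) for doc in corpus) / n_docs  # average document length
    tf = Counter(doc_terms)
    score = 0.0
    for term in query_terms:
        if tf[term] == 0:
            continue  # a term missing from the document adds nothing
        n_t = sum(1 for doc in corpus if term in doc)  # document frequency
        idf = math.log(1 + (n_docs - n_t + 0.5) / (n_t + 0.5))
        norm = tf[term] + k1 * (1 - b + b * len(doc_terms) / avgdl)
        score += idf * tf[term] * (k1 + 1) / norm
    return score


# Hypothetical toy corpus, already tokenized.
corpus = [
    "vosk is an offline speech recognition toolkit".split(),
    "kaldi is a toolkit for speech recognition research".split(),
]

print(bm25_score(["langchain"], corpus[0], corpus))  # 0.0, the term never occurs
print(bm25_score(["vosk"], corpus[0], corpus))       # positive, the term occurs once
```

Unlike an embedding similarity score, this value is unbounded above, so comparing it against a 0 to 1 range is not meaningful.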

Asma-droid commented 5 months ago

@DemirTonchev in my document store I have just one document, which talks about Vosk and Kaldi! There is no occurrence of "langchain". I did this on purpose to see how the model behaves.

When I ask a question about Vosk, I get the correct answer with a score of 0.67. Below is a screenshot:

image

I note that the score is between 0 and 1.

So my conclusion is that when we ask an out-of-context question, the retriever still returns results with a more or less high score.
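A minimal sketch of why this can happen, using made-up 2D vectors rather than real embeddings: a nearest-neighbour search ranks whatever is in the index and returns the `top_k` closest items, with no notion of "no match". The scoring function assumes the `l2_norm` formula documented by Elasticsearch, `1 / (1 + distance^2)`. The `knn` helper and the vectors are hypothetical.

```python
def l2_score(query, vector):
    """Score assumed for l2_norm similarity: 1 / (1 + squared Euclidean distance)."""
    squared_distance = sum((q - v) ** 2 for q, v in zip(query, vector))
    return 1.0 / (1.0 + squared_distance)


def knn(query, index, top_k):
    """Return the top_k (name, score) pairs, best first, however far away they are."""
    scored = [(name, l2_score(query, vector)) for name, vector in index.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_k]


# Hypothetical index: two chunks about speech toolkits.
index = {"vosk_chunk": [0.9, 0.1], "kaldi_chunk": [0.8, 0.3]}

on_topic_query = [0.85, 0.2]   # close to both chunks
off_topic_query = [0.1, 0.9]   # far from both chunks

print(knn(on_topic_query, index, top_k=2))
print(knn(off_topic_query, index, top_k=2))
```

Both queries return two results. The off-topic scores are lower than the on-topic ones in this toy setup, but they stay well above 0, so the absolute score alone does not say whether a result is relevant.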

Can you please explain the whitespace problem in more detail? I did not understand it.

anakin87 commented 5 months ago

Should be investigated.