deepset-ai / haystack

:mag: AI orchestration framework to build customizable, production-ready LLM applications. Connect components (models, vector DBs, file converters) to pipelines or agents that can interact with your data. With advanced retrieval methods, it's best suited for building RAG, question answering, semantic search or conversational agent chatbots.
https://haystack.deepset.ai
Apache License 2.0

Error with Pipeline #2074

Closed kingafy closed 2 years ago

kingafy commented 2 years ago

While trying a pipeline with the piece of code below, I am getting an error:

from haystack.pipelines import JoinDocuments

p_ensemble = Pipeline()
p_ensemble.add_node(component=es_retriever, name="ESRetriever", inputs=["Query"])
p_ensemble.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["Query"])
p_ensemble.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults", inputs=["ESRetriever", "DPRRetriever"])
p_ensemble.add_node(component=reader, name="Reader", inputs=["JoinResults"])

p_ensemble.draw("pipeline_ensemble.png")

# Run pipeline
res = p_ensemble.run(
    query="Who is the father of Arya Stark?",
    params={"DPRRetriever": {"top_k": 2}, "ESRetriever": {"top_k": 2}}
)

Error:

Traceback (most recent call last):
  File "c:\my_projects\haystackrnr\haystack-new\haystack\haystack\pipelines\base.py", line 335, in run
    node_output, stream_id = self.graph.nodes[node_id]["component"]._dispatch_run(node_input)
  File "c:\my_projects\haystackrnr\haystack-new\haystack\haystack\nodes\base.py", line 135, in _dispatch_run
    output, stream = self.run(**run_inputs, **run_params)
  File "c:\my_projects\haystackrnr\haystack-new\haystack\haystack\nodes\retriever\base.py", line 233, in run
    output, stream = run_query_timed(query=query, filters=filters, top_k=top_k, index=index, headers=headers)
  File "c:\my_projects\haystackrnr\haystack-new\haystack\haystack\nodes\retriever\base.py", line 77, in wrapper
    ret = fn(*args, **kwargs)
  File "c:\my_projects\haystackrnr\haystack-new\haystack\haystack\nodes\retriever\base.py", line 250, in run_query
    documents = self.retrieve(query=query, filters=filters, top_k=top_k, index=index, headers=headers)
  File "c:\my_projects\haystackrnr\haystack-new\haystack\haystack\nodes\retriever\dense.py", line 215, in retrieve
    documents = self.document_store.query_by_embedding(query_emb=query_emb[0], top_k=top_k, filters=filters, index=index, headers=headers)
  File "c:\my_projects\haystackrnr\haystack-new\haystack\haystack\document_stores\faiss.py", line 533, in query_by_embedding
    score_matrix, vector_id_matrix = self.faiss_indexes[index].search(query_emb, top_k)
  File "C:\Users\anshuman.a.mahapatra\Anaconda3\envs\haystack_new\lib\site-packages\faiss\__init__.py", line 341, in replacement_search
    assert d == self.d
AssertionError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "pipeline.py", line 58, in <module>
    params={"DPRRetriever": {"top_k": 2}, "ESRetriever": {"top_k": 2}}
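For context, the failing assertion comes from FAISS itself: the dimensionality d of the query embedding must equal the dimensionality self.d of the index being searched. A minimal standalone sketch (dimensions purely illustrative) that triggers the same error:

import numpy as np
import faiss

# Build a 768-dimensional flat index (768 is the dimensionality of standard DPR embeddings)
index = faiss.IndexFlatIP(768)
index.add(np.random.rand(10, 768).astype("float32"))

# Searching with query vectors of a different dimensionality (512 here)
# trips the same `assert d == self.d` shown in the traceback above
query = np.random.rand(1, 512).astype("float32")
index.search(query, 2)  # raises AssertionError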

ZanSara commented 2 years ago

Hello @kingafy, unfortunately I can't reproduce your error with the information you provided. For me, the following snippet runs fine from start to finish:

from haystack.utils import clean_wiki_text, print_answers, convert_files_to_dicts, launch_es, fetch_archive_from_http
from haystack import Pipeline
from haystack.document_stores import ElasticsearchDocumentStore
from haystack.nodes import ElasticsearchRetriever, DensePassageRetriever, FARMReader, JoinDocuments

doc_dir = "data/article_txt_got"
s3_url = "https://s3.eu-central-1.amazonaws.com/deepset.ai-farm-qa/datasets/documents/wiki_gameofthrones_txt.zip"
fetch_archive_from_http(url=s3_url, output_dir=doc_dir)

got_dicts = convert_files_to_dicts(
    dir_path=doc_dir,
    clean_func=clean_wiki_text,
    split_paragraphs=True
)

launch_es()
document_store = ElasticsearchDocumentStore()
document_store.delete_documents()
document_store.write_documents(got_dicts)

es_retriever = ElasticsearchRetriever(document_store=document_store)
reader = FARMReader(model_name_or_path="deepset/roberta-base-squad2")
dpr_retriever = DensePassageRetriever(document_store)
document_store.update_embeddings(dpr_retriever, update_existing_embeddings=False)

p_ensemble = Pipeline()
p_ensemble.add_node(component=es_retriever, name="ESRetriever", inputs=["Query"])
p_ensemble.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["Query"])
p_ensemble.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults", inputs=["ESRetriever", "DPRRetriever"])
p_ensemble.add_node(component=reader, name="Reader", inputs=["JoinResults"])
p_ensemble.draw("pipeline_ensemble.png")

query="Who is the father of Arya Stark?"
res = p_ensemble.run(
    query="Who is the father of Arya Stark?",
    params={"ESRetriever": {"top_k": 2}, "DPRRetriever": {"top_k": 2}},

)
print("\nQuery: ", query)
print("Answers:")
print_answers(res, details="minimum")

Do you have the same issue executing this script? If so, please share the details of what goes wrong.

kingafy commented 2 years ago

Can't I refer to two document stores this way?

document_store_es = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")
es_retriever = ElasticsearchRetriever(document_store=document_store_es)

# Initialize dense retriever
document_store = FAISSDocumentStore.load("my_faiss_index.faiss")
document_store_dpr = FAISSDocumentStore.load("my_faiss_index_eli5.faiss")

ZanSara commented 2 years ago

I'm sorry @kingafy, but I don't get your question. Sure, you can have two docstores if your pipeline is set up correctly. What error are you getting? You need to provide me with a complete code snippet, or I can't help you :confused:
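For illustration, a two-docstore ensemble along those lines could look like this (a minimal sketch based on the snippets above; hosts and index paths are illustrative, and the FAISS index must have been built with the same model the dense retriever uses at query time):

from haystack import Pipeline
from haystack.document_stores import ElasticsearchDocumentStore, FAISSDocumentStore
from haystack.nodes import ElasticsearchRetriever, DensePassageRetriever, JoinDocuments

# Sparse retrieval backed by Elasticsearch
document_store_es = ElasticsearchDocumentStore(host="localhost", username="", password="", index="document")
es_retriever = ElasticsearchRetriever(document_store=document_store_es)

# Dense retrieval backed by a previously saved FAISS index
document_store_faiss = FAISSDocumentStore.load("my_faiss_index.faiss")
dpr_retriever = DensePassageRetriever(document_store=document_store_faiss)

# Each retriever queries its own docstore; JoinDocuments merges the results
p_ensemble = Pipeline()
p_ensemble.add_node(component=es_retriever, name="ESRetriever", inputs=["Query"])
p_ensemble.add_node(component=dpr_retriever, name="DPRRetriever", inputs=["Query"])
p_ensemble.add_node(component=JoinDocuments(join_mode="concatenate"), name="JoinResults", inputs=["ESRetriever", "DPRRetriever"])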

bogdankostic commented 2 years ago

Hey @kingafy, did you manage to solve your problem? Looking at the error trace from your initial comment, it seems there is a dimensionality mismatch (the assertion assert d == self.d fails). What model are you using to embed your documents, and what value did you set for embedding_dim when initializing your FAISSDocumentStore?
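That assertion is exactly what fails when the two dimensionalities disagree. A quick way to compare both sides (a sketch, reusing the index path and retriever name from the earlier comments):

import faiss

# Dimensionality of the stored FAISS index
index = faiss.read_index("my_faiss_index.faiss")
print("index dimension:", index.d)

# Dimensionality of the query embeddings the dense retriever produces
embs = dpr_retriever.embed_queries(["Who is the father of Arya Stark?"])
print("query embedding dimension:", len(embs[0]))

The two numbers must match; for example, standard DPR models produce 768-dimensional vectors, so the store would need embedding_dim=768 (the default) when it was created.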

tstadel commented 2 years ago

@kingafy, as you haven't answered in a while, I assume you found a solution, so I'm closing this issue. Feel free to reopen it at any time if the problem still exists.