deepset-ai / haystack

Passing a document output from an EmbeddingRetriever to Summarizer in a pipeline does not generate expected results #2992

Closed predoctech closed 1 year ago

predoctech commented 1 year ago

Describe the bug: I have a 2-node pipeline that starts with an ESRetriever taking Query as input and passes its output to a Summarizer, hoping to get a summarized version of the retrieved documents as output. That didn't happen. Please suggest how this can be achieved if the above is not the right approach.

Error message: It depends on whether the param "generate_single_summary" is set to True or False. If set to True, the end point of the pipeline has no "answer" key-value pair in the resulting dict. If set to False, the documents extracted by ESRetriever are in the resulting dict, but there is no summarization of those texts of any kind.

Expected behavior: We hope to use an FAQ-type retriever to extract the top_k matching Answer documents, then pass them to a summarizer model and get back a summarized version of those answers from the FAQ database.

Additional context: Code used:

p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=summarizer, name="Summarizer", inputs=["ESRetriever"])
results = p.run(query=qn, params={"ESRetriever": {"top_k": 3}, "Summarizer": {"generate_single_summary": True}})

To Reproduce: Per the code above

FAQ Check

[X] Have you had a look at our new FAQ page?

System:

OS:
GPU/CPU:
Haystack version (commit or version number): latest
DocumentStore: ElasticSearchDocumentStore
Reader:
Retriever: EmbeddingRetriever

TuanaCelik commented 1 year ago

Hey @predoctech I see that here there's an issue in documentation. I've tried the pipeline and summarizer and they seem to work alright, however the summary doesn't come in an "answer". The results of a summarizer are returned as a list of Documents as indicated here in our documentation: https://haystack.deepset.ai/reference/summarizer

However, the summary isn't in the text field as indicated in documentation, it is in the content field. Below you can see the pipeline I created and the results I got:

from haystack.nodes import TransformersSummarizer
from haystack.pipelines import Pipeline

summarizer = TransformersSummarizer(model_name_or_path="google/pegasus-xsum")

p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=summarizer, name="Summarizer", inputs=["ESRetriever"])

Then I get the results and print them. I get 1 document in a list of documents when I set generate_single_summary = True and the 3 you see here if it's False:

from haystack.utils import print_documents

results = p.run(query="Who is Arya?", params={"ESRetriever": {"top_k": 3},"Summarizer": {"generate_single_summary": False}})

print_documents(results)

Output:

Query: Who is Arya?

{   'content': 'Arya Stark (Maisie Williams) and Jon Snow (Kit Harington) '
               'return to Westeros in the latest episode of Game of Thrones.',
    'name': '43_Arya_Stark.txt'}

{   'content': "Arya Stark (Maisie Williams) and Jaqen H'ghar (Iwan Rheon) are "
               'sent to Braavos to join the Faceless Men.',
    'name': '43_Arya_Stark.txt'}

{   'content': 'Arya Stark is a character in the TV series Game of Thrones.',
    'name': '43_Arya_Stark.txt'}

Let me know if there are any further issues and if not I will close this one 😊 - I'll create an issue on documentation about the wrong info there

predoctech commented 1 year ago

Hi @TuanaCelik, thanks for coming back. Which retriever component did you use in your example? As mentioned, we need to use EmbeddingRetriever because we are dealing with an FAQ-type data set:

retriever = EmbeddingRetriever(document_store=document_store, embedding_model='sentence-transformers/multi-qa-MiniLM-L6-cos-v1', use_gpu=True, scale_score=False)

I noticed that your examples typically use BM25Retriever, and I don't know if that is what caused the difference.

When we pass this as the retriever to the pipeline in the same way as in your last example and run with "generate_single_summary=False", we get this:

Query: Does a Family Office serving multiple clients need to be registered?

{   'content': 'When will a single family office setup not be as carrying on a '
               'business?',
    'name': None}

{   'content': 'What constitutes a multi-family office for the purposes of the '
               'SFC Rules and is a multi.-family office required to be '
               'licensed?',
    'name': None}

{   'content': 'Is a single family office required to be licensed under the '
               'Ordinance?',
    'name': None}

And when "generate_single_summary=True", we got this:

Query: Does a Family Office serving multiple clients need to be registered?

{   'content': 'When will a single family office setup not be as carrying on a '
               'business? What constitutes a multi-family office for the '
               'purposes of the SFC Rules and is it a Single-Family office '
               'required to be licensed? Is a single Family office required '
               'under the Ordinance?',
    'name': None}

which is a simple concatenation of the 3 retrieved questions from above.

As you can tell, these are the questions retrieved from the FAQ data that matched our query. What we expect is for the answers to these matched FAQs to be summarized and returned.

Hope this helps to clarify our question.

predoctech commented 1 year ago

@TuanaCelik just wish to know whether this issue is still under investigation and whether there will be further advice from support. If not, please share with us what mistakes we've made such that the summarizer doesn't return a summary of the retrieved FAQ answers. Thanks.

TuanaCelik commented 1 year ago

Hi @predoctech - yes I have been trying this out myself to understand what's going on and so far here's what I see.

FAQ pipelines return 2 things in 'results':

results['documents'] which normally contains Document type each with its content, but in the case of an FAQ style dataset it's the FAQ itself, so the types are still Document but the content is simply the FAQ question (I think this is what you're trying to summarize)
Then there's also results['answers'] which in the case of FAQ style data has the actual content (answer) to the FAQ question. So I think this is the one that you should be trying to summarize.

The difficulty here is that results['answers'] contains Answer objects, which unfortunately can't be passed into the Summarizer as is because the component expects Document objects.

I forgot to ask whether you were indeed using the FAQPipeline but it looks like if you're not it's still quite similar as the FAQPipeline is a wrapper around a retriever. But nonetheless I've tried to come up with an example that might help you out:

I've copied and modified our FAQ tutorial here

Here, at the bottom you will see some modifications. Most important is the docs being filled with the 'answers'. Then, I am able to pass docs in to a summarizer which then produces summaries that make a lot more sense. You might have to make some adjustments to the docs object to include any extra metadata/names you want to have.
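A minimal sketch of that conversion step, not the exact code from the modified tutorial. It uses small stand-in classes for Haystack's Answer and Document so it runs without Haystack installed; the function name answers_to_documents and the meta key "answer_score" are hypothetical, and the sample answers are placeholders.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


# Stand-ins for haystack.schema.Answer and haystack.schema.Document so the
# sketch runs without Haystack installed; only the fields used here are modeled.
@dataclass
class Answer:
    answer: str
    score: Optional[float] = None
    meta: Dict[str, Any] = field(default_factory=dict)


@dataclass
class Document:
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def answers_to_documents(answers: List[Answer]) -> List[Document]:
    """Wrap each non-empty answer text in a Document, copying its metadata."""
    docs = []
    for ans in answers:
        if not ans.answer:
            continue  # nothing to summarize for an empty answer
        meta = dict(ans.meta or {})
        if ans.score is not None:
            # Hypothetical key, kept only so the retrieval score stays traceable.
            meta["answer_score"] = ans.score
        docs.append(Document(content=ans.answer, meta=meta))
    return docs


# Placeholder FAQ answers (made-up text, not real pipeline output).
faq_answers = [
    Answer(answer="Placeholder answer one.", score=0.9,
           meta={"query": "Placeholder question one?"}),
    Answer(answer="", score=0.1),
    Answer(answer="Placeholder answer two."),
]
docs = answers_to_documents(faq_answers)
print([d.content for d in docs])
```

With Haystack 1.x installed, the same function should work on results['answers'] from the FAQ pipeline, and the resulting list can then be handed to the summarizer (for example via its predict method with documents=docs), assuming the real Answer and Document classes are imported in place of the stand-ins.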

I hope this helps. In the meantime I am also asking our team whether we should make any changes to the summarizer or FAQ pipeline to make this a bit easier.

predoctech commented 1 year ago

Hi @TuanaCelik thanks for coming back. Yes, we have been testing this with FAQPipeline. Yes, we were hoping to summarize the "Answer" text from the retrieved FAQ. And yes, we are aware that the returned results are going to be Answer objects from the FAQPipeline node. What we missed, I think, is that we hoped the pipeline would take care of these inconsistencies in a node's output and automatically transform it into the format required by the next node. Apparently that isn't the case. Given your comments, should we tag this with at least a "feature request" or "improvement" label?

predoctech commented 1 year ago

Hi @TuanaCelik based on your suggested changes to the FAQ tutorial can I confirm that we now can't run the said 2 nodes as a custom pipeline like below:

from haystack.nodes import TransformersSummarizer
from haystack.pipelines import Pipeline

summarizer = TransformersSummarizer(model_name_or_path="google/pegasus-xsum")

p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=summarizer, name="Summarizer", inputs=["ESRetriever"])

since we need to modify the output from retriever before passing as input to summarizer like you have suggested in the modified FAQ tutorial?

bogdankostic commented 1 year ago

Hi @predoctech! To have the whole process in a single Pipeline, you could add a custom node that takes a list of Answers as input and transforms it into a list of Documents. This node would then need to be placed between the retriever and the summarizer nodes.

Please have a look here in our documentation about how to add a custom node and let me know if you need further help.
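As a rough sketch of what such a node could look like in Haystack 1.x, under a few assumptions: the class name AnswersToDocuments is hypothetical, the imports assume the 1.x layout (haystack.nodes and haystack.schema), and minimal stand-ins are used when those imports are unavailable so the snippet still runs on its own.

```python
from typing import Any, Dict, List, Tuple

try:
    # Haystack 1.x import paths (assumed to be installed).
    from haystack.nodes import BaseComponent
    from haystack.schema import Answer, Document
except ImportError:
    # Minimal stand-ins so the sketch runs without Haystack 1.x.
    from dataclasses import dataclass, field

    class BaseComponent:
        pass

    @dataclass
    class Answer:
        answer: str
        meta: Dict[str, Any] = field(default_factory=dict)

    @dataclass
    class Document:
        content: str
        meta: Dict[str, Any] = field(default_factory=dict)


class AnswersToDocuments(BaseComponent):
    """Hypothetical custom node: turns a list of Answers into a list of Documents."""

    outgoing_edges = 1

    def run(self, answers: List[Answer]) -> Tuple[Dict[str, Any], str]:
        # Keep only non-empty answers and copy their metadata onto the Document.
        documents = [
            Document(content=a.answer, meta=dict(a.meta or {}))
            for a in answers
            if a.answer
        ]
        return {"documents": documents}, "output_1"

    def run_batch(self, answers: List[List[Answer]]) -> Tuple[Dict[str, Any], str]:
        # One list of Documents per list of Answers.
        documents = [self.run(answers=batch)[0]["documents"] for batch in answers]
        return {"documents": documents}, "output_1"


# Standalone check with a placeholder answer (not real pipeline output).
node = AnswersToDocuments()
output, edge = node.run(
    answers=[Answer(answer="Placeholder answer.", meta={"query": "Placeholder question?"})]
)
print(edge, [d.content for d in output["documents"]])

# Possible wiring in a Haystack 1.x pipeline (untested here; Docs2Answers is the
# node FAQPipeline uses to turn retrieved FAQ documents into Answers):
# p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
# p.add_node(component=Docs2Answers(), name="Docs2Answers", inputs=["ESRetriever"])
# p.add_node(component=AnswersToDocuments(), name="AnswersToDocuments", inputs=["Docs2Answers"])
# p.add_node(component=summarizer, name="Summarizer", inputs=["AnswersToDocuments"])
```

The node returns its result under the "documents" key because that is the input the Summarizer reads, so the summarizer should then see the FAQ answer texts instead of the retrieved questions.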

masci commented 1 year ago

I'm closing this as there is a solution listed, @predoctech feel free to reopen should you have any additional follow up.