Closed predoctech closed 1 year ago
Hey @predoctech
I see that there's an issue in the documentation here. I've tried the pipeline and summarizer and they seem to work alright, however the summary doesn't come in an "answer". The results of a summarizer are returned as a list of Documents, as indicated here in our documentation: https://haystack.deepset.ai/reference/summarizer
However, the summary isn't in the text field as indicated in the documentation, it is in the content field. Below you can see the pipeline I created and the results I got:
from haystack.nodes import TransformersSummarizer
from haystack.pipelines import Pipeline
summarizer = TransformersSummarizer(model_name_or_path="google/pegasus-xsum")
p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=summarizer, name="Summarizer", inputs=["ESRetriever"])
Then I get the results and print them. I get 1 document in a list of documents when I set generate_single_summary = True, and the 3 you see here if it's False:
from haystack.utils import print_documents
results = p.run(query="Who is Arya?", params={"ESRetriever": {"top_k": 3},"Summarizer": {"generate_single_summary": False}})
print_documents(results)
Output:
Query: Who is Arya?
{ 'content': 'Arya Stark (Maisie Williams) and Jon Snow (Kit Harington) '
'return to Westeros in the latest episode of Game of Thrones.',
'name': '43_Arya_Stark.txt'}
{ 'content': "Arya Stark (Maisie Williams) and Jaqen H'ghar (Iwan Rheon) are "
'sent to Braavos to join the Faceless Men.',
'name': '43_Arya_Stark.txt'}
{ 'content': 'Arya Stark is a character in the TV series Game of Thrones.',
'name': '43_Arya_Stark.txt'}
Let me know if there are any further issues and if not I will close this one 😊 - I'll create an issue on documentation about the wrong info there
Hi @TuanaCelik ,
Thanks for coming back. In your example which retriever component did you use? As mentioned we need to use EmbeddingRetriever as we are dealing with FAQ type data set:
retriever = EmbeddingRetriever(document_store=document_store, embedding_model='sentence-transformers/multi-qa-MiniLM-L6-cos-v1', use_gpu=True, scale_score=False)
I noticed that you typically make use of BM25Retriever in your examples, and I don't know if that is what caused the difference.
When we pass this as the retriever to the pipeline the same way as in your last example, and run with "generate_single_summary=False", we get this:
Query: Does a Family Office serving multiple clients need to be registered?
{ 'content': 'When will a single family office setup not be as carrying on a '
'business?',
'name': None}
{ 'content': 'What constitutes a multi-family office for the purposes of the '
'SFC Rules and is a multi.-family office required to be '
'licensed?',
'name': None}
{ 'content': 'Is a single family office required to be licensed under the '
'Ordinance?',
'name': None}
And when "generate_single_summary=True", we got this:
Query: Does a Family Office serving multiple clients need to be registered?
{ 'content': 'When will a single family office setup not be as carrying on a '
'business? What constitutes a multi-family office for the '
'purposes of the SFC Rules and is it a Single-Family office '
'required to be licensed? Is a single Family office required '
'under the Ordinance?',
'name': None}
which is a simple concatenation of the 3 retrieved questions from above.
As you can tell these are the retrieved questions from FAQ data matched to our query. What is expected are the answers of these matched FAQ to be summarized and returned.
Hope this helps to clarify our question.
@TuanaCelik just wish to know if this issue is still under investigation, and whether there will be further advice from support? If not, please share with us what kind of mistakes we've made such that the summarizer doesn't return a summary of the retrieved FAQ answers. Thanks.
Hi @predoctech - yes I have been trying this out myself to understand what's going on and so far here's what I see.
FAQ pipelines return 2 things in 'results'. The first is a list of Document types, each with its content. In the case of an FAQ style dataset it's the FAQ itself, so the types are still Document but the content is simply the FAQ question (I think this is what you're trying to summarize). The difficulty here is that results['answers'] contains Answer types, which unfortunately can't be passed into the Summarizer as is because the component expects Document.
I forgot to ask whether you were indeed using the FAQPipeline, but it looks like if you're not it's still quite similar, as the FAQPipeline is a wrapper around a retriever. But nonetheless I've tried to come up with an example that might help you out:
I've copied and modified our FAQ tutorial here
Here, at the bottom you will see some modifications. Most important is the docs being filled with the 'answers'. Then, I am able to pass docs in to a summarizer, which then produces summaries that make a lot more sense. You might have to make some adjustments to the docs object to include any extra metadata/names you want to have.
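The conversion described above can be sketched roughly as follows. This is a library-free illustration rather than the code from the modified tutorial: `FaqAnswer` and `SummarizerDoc` are hypothetical stand-ins for Haystack's `Answer` and `Document` classes, and `answers_to_docs` is an illustrative helper name. In a real run you would build actual `Document` objects from `results['answers']` and pass them to the summarizer.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class FaqAnswer:
    """Hypothetical stand-in for Haystack's Answer (only the fields used here)."""
    answer: str
    meta: Dict[str, Any] = field(default_factory=dict)


@dataclass
class SummarizerDoc:
    """Hypothetical stand-in for Haystack's Document (only the fields used here)."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


def answers_to_docs(answers: List[FaqAnswer]) -> List[SummarizerDoc]:
    """Wrap each answer text in a document so a summarizer can consume it."""
    docs = []
    for ans in answers:
        if not ans.answer:
            # Nothing to summarize for an empty answer, so skip it.
            continue
        # Copy the metadata so names and other fields travel with the text.
        docs.append(SummarizerDoc(content=ans.answer, meta=dict(ans.meta)))
    return docs


# Placeholder data, not taken from the FAQ dataset discussed in this thread.
answers = [
    FaqAnswer(answer="First FAQ answer text.", meta={"name": "faq_1"}),
    FaqAnswer(answer="", meta={"name": "faq_empty"}),
    FaqAnswer(answer="Second FAQ answer text.", meta={"name": "faq_2"}),
]
docs = answers_to_docs(answers)
```

The key point is that the summarizer reads the `content` field, so the answer text has to end up there; copying `meta` is optional and only keeps names and similar fields attached.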
I hope this helps. And in the meantime I am also asking our team whether we should make any changes to the summarizer or FAQ pipeline to make this a bit easier.
Hi @TuanaCelik thanks for coming back.
Yes we have been testing this with FAQPipeline. Yes we were hoping to summarize the "Answer" text from the retrieved FAQ.
And yes we are aware that the returned results are gonna be an Answer object from the FAQPipeline node.
What we have missed here, I think, is that we hoped the pipeline could take care of these inconsistencies in the output of a node and automatically transform it into the format required by the next node. Apparently that isn't the case.
Given your comments, should we tag this with at least a "feature request" or "improvement" label?
Hi @TuanaCelik based on your suggested changes to the FAQ tutorial can I confirm that we now can't run the said 2 nodes as a custom pipeline like below:
from haystack.nodes import TransformersSummarizer
from haystack.pipelines import Pipeline
summarizer = TransformersSummarizer(model_name_or_path="google/pegasus-xsum")
p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=summarizer, name="Summarizer", inputs=["ESRetriever"])
since we need to modify the output from retriever before passing it as input to summarizer, like you have suggested in the modified FAQ tutorial?
Hi @predoctech! To be able to have the whole process in a single Pipeline, you could add a custom node that takes as input a list of Answers and transforms it into a list of Documents. This node would then need to be placed between the retriever and the summarizer nodes.
Please have a look here in our documentation about how to add a custom node and let me know if you need further help.
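As a rough sketch of what such a node's logic might look like: the class below is plain Python so that it runs without Haystack installed. `StubAnswer`, `StubDocument` and `AnswersToDocuments` are hypothetical names. The assumption, to be checked against the custom node documentation, is that a real node would subclass Haystack's `BaseComponent`, declare `outgoing_edges`, and return an output dict together with an edge name from `run`.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Tuple


@dataclass
class StubAnswer:
    """Hypothetical stand-in for Haystack's Answer."""
    answer: str
    meta: Dict[str, Any] = field(default_factory=dict)


@dataclass
class StubDocument:
    """Hypothetical stand-in for Haystack's Document."""
    content: str
    meta: Dict[str, Any] = field(default_factory=dict)


class AnswersToDocuments:
    """Turns a list of answers into a list of documents.

    Assumption: in Haystack this class would inherit from BaseComponent;
    here it is a plain class so the sketch stays self-contained.
    """

    outgoing_edges = 1  # a single output edge to the next node

    def run(self, answers: List[StubAnswer]) -> Tuple[Dict[str, Any], str]:
        # Put each non-empty answer text into the field the summarizer reads.
        documents = [
            StubDocument(content=a.answer, meta=dict(a.meta))
            for a in answers
            if a.answer
        ]
        return {"documents": documents}, "output_1"


# Placeholder input, not real FAQ data.
node = AnswersToDocuments()
output, edge = node.run(
    answers=[StubAnswer(answer="Placeholder FAQ answer.", meta={"name": "faq_1"})]
)
```

The node emits its result under a `documents` key on the assumption that this is the key the summarizer picks up as input.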
I'm closing this as there is a solution listed, @predoctech feel free to reopen should you have any additional follow up.
Describe the bug Have a 2-node pipeline starting with an ESRetriever with Query as input, and passing the output to a Summarizer hoping to get a summarized version of the documents retrieved as output. That didn't happen. Please help to suggest how that can be achieved if the above is not the right approach.
Error message Depends on whether the param "generate_single_summary" is set to True or False. If set to True: the end-point of the pipeline has no "answer" value pair in the resulting dict. If set to False: the documents extracted by ESRetriever are in the resulting dict, but there is no summarization of those texts of any kind.
Expected behavior Hoping to make use of a FAQ type retriever and extract the top_k matching Answer documents. Then pass them to a summarizer model and have a summarized version of those answers coming from the FAQ database.
Additional context Code used:
p = Pipeline()
p.add_node(component=retriever, name="ESRetriever", inputs=["Query"])
p.add_node(component=summarizer, name="Summarizer", inputs=["ESRetriever"])
results = p.run(query=qn, params={"ESRetriever": {"top_k": 3}, "Summarizer": {"generate_single_summary": True}})
To Reproduce Per the code above