Using metadata to boost the performance of ExtractiveReader

sjrl commented 10 months ago

Is your feature request related to a problem? Please describe.

I would like to be able to use meta information to provide context to the TransformersReader or the FARMReader to improve question answering, in a similar way to how we can use embed_meta_fields to boost the performance of EmbeddingRetrievers. Sometimes meta information is needed to distinguish between similar documents.

We have had multiple clients face this exact problem because they are retrieving info from lots of legal PDF files which have a lot of boilerplate text and often define things like company name once at the beginning of a 60-page PDF.

Describe the solution you'd like

As motivation, I'd like to walk through an example where being able to add meta information from a document to the Reader at query time would be beneficial. Pretend I have two docs that have a similar structure and contain similar information, but about two different companies:

Document 1 (comes from pear_llc_contract.pdf)

# meta info
meta = {"additional_context": "This passage is about the company Pear, from the year 2020."}

# content of Document
Company ID: 312521124141
Deal amount: 100k
Two leading organizations have joined forces in a groundbreaking partnership that promises to revolutionize their respective industries. The agreement, which was finalized after months of negotiations, will see the companies collaborate on a range of exciting initiatives that will benefit both parties and their customers.

Document 2 (comes from rainforest_contract.pdf)

# meta info
meta = {"additional_context": "This passage is about the company Rainforest, from the year 2019."}

# content of Document
Company ID: 847584923
Deal amount: 60k
The deal is expected to generate significant benefits for both companies, including increased revenue, improved operational efficiency, and enhanced customer experience. It is also expected to create new jobs and stimulate economic growth in the regions where the companies operate.

I would like to ask the question "What is the company ID of Pear LLC?" However, nowhere in the content of either document are the names of the companies involved in the deal specified. So if I provide these two documents to a FARMReader, I should have about a 50/50 chance of getting the correct answer.

However, if I could specify a new variable (e.g. embed_meta_fields, as we can for EmbeddingRetrievers),

reader = ExtractiveReader(model="deepset/deberta-v3-large-squad2", embed_meta_fields=["additional_context"])

then the reader would have the necessary context to answer the question.
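To make the idea concrete, here is a minimal sketch of what such a parameter might do before the text reaches the model: prepend the selected meta values to the document content and remember how long the prepended prefix is. The `Doc` class and the `build_reader_input` helper are hypothetical stand-ins for illustration only; they are not part of Haystack, and the actual feature could be implemented quite differently.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Hypothetical stand-in for a Haystack Document (not the real class)."""
    content: str
    meta: dict = field(default_factory=dict)


def build_reader_input(doc: Doc, embed_meta_fields: list[str], sep: str = "\n") -> tuple[str, int]:
    """Prepend the selected meta values to the document content.

    Returns the text that would be passed to the model and the length of the
    prepended prefix, so answer offsets can be mapped back to doc.content.
    """
    # Skip fields that are missing or empty in this document's meta.
    parts = [str(doc.meta[name]) for name in embed_meta_fields if doc.meta.get(name)]
    prefix = sep.join(parts) + sep if parts else ""
    return prefix + doc.content, len(prefix)


doc = Doc(
    content="Company ID: 312521124141\nDeal amount: 100k",
    meta={"additional_context": "This passage is about the company Pear, from the year 2020."},
)
text, prefix_len = build_reader_input(doc, ["additional_context"])
```

Keeping `prefix_len` around means the original content can always be recovered as `text[prefix_len:]`, which matters if the prepended context should not be returned as an answer.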

Additional context

This is a similar idea to how we can use PromptTemplates to provide context to the PromptNode. PromptTemplates already let us add meta information from the Document into the prompt using special variables. I think extending this to an extractive reader would still be very beneficial because Sol still sees quite a lot of interest in extractive models.
However, one difference is that we should consider whether to prevent the ExtractiveReader from returning the additional_context as an answer, since the additional_context will not be present in the Document returned to the user.

ZanSara commented 10 months ago

Hey @sjrl, could this be a feature of ExtractiveReader, rather than FARMReader? We're trying to bring feature parity between them, so new features should be added to ExtractiveReader directly.

If so, let's change the title and mark this as a Haystack 2.x feature request. If not, let's figure out why :slightly_smiling_face:

sjrl commented 10 months ago

Yes definitely. This could be a feature for ExtractiveReader.

Timoeller commented 9 months ago

I don't understand why it should be a meta field. Can't this info be added to documents during preprocessing? In any case, if it is urgent for any of the clients, feel free to open a lightweight PR. I would prefer, though, to handle it outside of the Reader.

sjrl commented 9 months ago

> I don't understand why it should be a meta field.

I think we will often not want the reader to be able to return this additional information as an answer. This is the point from my original description:

> However, one difference is that we should consider whether to prevent the ExtractiveReader from returning the additional_context as an answer, since the additional_context will not be present in the Document returned to the user.

That's why just directly adding it to the preprocessed document would not work.

> I would prefer though to handle it outside of the Reader.

Since we would preferably not allow this additional text to be returned as an answer, I think it would be better to integrate it within the ExtractiveReader.

What do you think?

Timoeller commented 9 months ago

Mh, still not sure about this. In the prompt, users can check what was passed to the model. With Extractive QA we want to ensure even more that the user can check the predictions properly. Without the additional_context this might not be possible. I think having additional_context inside the document would be fine (with a clear indication that it was added?).

What I like about this idea is that it is designed similarly to embed_meta_fields of embedders.

Feel free to open a lightweight PR for this feature.

sjrl commented 9 months ago

> What I like about this idea is that it is designed similarly to embed_meta_fields of embedders.

I would say that embed_meta_fields obscures the addition of the metadata to the text. The embed_meta_fields feature only adds the text at indexing time, and when searching, the end user doesn't see that this meta info was prepended to the document.

> In the prompt, users can check what was passed to the model. With Extractive QA we want to ensure even more that the user can check the predictions properly.

However, this is a really good point. Maybe a compromise could be that we add the additional_context to the document in the returned Haystack Answer so the user can see it, but we still restrict the model from returning the additional_context as part of the answer?
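One way this restriction could work, sketched under the assumption that the context is prepended to the content as a prefix of known length: drop any candidate answer span that starts inside the prefix, then shift the surviving offsets so they point into the original document content. The `Span` type, the `best_span_outside_prefix` helper, and the scores below are invented for illustration; this is not how ExtractiveReader is actually implemented.

```python
from typing import NamedTuple, Optional


class Span(NamedTuple):
    start: int    # character offsets into the text the model saw
    end: int
    score: float  # made-up confidence score for illustration


def best_span_outside_prefix(spans: list[Span], prefix_len: int) -> Optional[Span]:
    """Return the best-scoring span that lies outside the prepended prefix.

    Offsets of the returned span are shifted to be relative to the original
    document content. Returns None if every candidate starts in the prefix.
    """
    allowed = [s for s in spans if s.start >= prefix_len]
    if not allowed:
        return None
    best = max(allowed, key=lambda s: s.score)
    return Span(best.start - prefix_len, best.end - prefix_len, best.score)


prefix = "This passage is about the company Pear, from the year 2020.\n"
content = "Company ID: 312521124141\nDeal amount: 100k"
text = prefix + content


def span_of(needle: str, score: float) -> Span:
    """Build a candidate span for the first occurrence of needle in text."""
    start = text.index(needle)
    return Span(start, start + len(needle), score)


# "Pear" scores highest but sits in the prepended context, so it is excluded.
candidates = [span_of("Pear", 0.9), span_of("312521124141", 0.8)]
best = best_span_outside_prefix(candidates, len(prefix))
answer = content[best.start:best.end]
```

The prefix can still be shown to the user alongside the answer (for example as a clearly marked part of the returned document text), so the prediction remains checkable while answers are only ever drawn from the original content.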

deepset-ai / haystack

Using metadata to boost the performance of ExtractiveReader #5640