Open sjrl opened 10 months ago
Hey @sjrl, could this be a feature of `ExtractiveReader`, rather than `FARMReader`? We're trying to bring feature parity between them, so new features should be added to `ExtractiveReader` directly.
If so, let's change the title and mark this as a Haystack 2.x feature request. If not, let's figure out why :slightly_smiling_face:
Yes definitely. This could be a feature for ExtractiveReader.
I don't understand why it should be a meta field. Can't this info be added to documents during preprocessing? In any case, if it is urgent for any of the clients, feel free to open a lightweight PR. I would prefer though to handle it outside of the Reader.
> I don't understand why it should be a meta field.
I think we will often not want the reader to be able to return this additional information as an answer. So this point from my original description:
- However, one difference is that we should consider if we prevent the ExtractiveReader from returning the additional_context as an answer, since the additional_context will not be present in the returned Document to the user.
That's why just directly adding it to the preprocessed document would not work.
> I would prefer though to handle it outside of the Reader.

Given that we would preferably not allow this additional text to be returned as an answer, I think it would be better to integrate it within the ExtractiveReader.
What do you think?
Mh, still not sure about this. In the prompt, users can check what was passed to the model. With Extractive QA we want to ensure even more that the user can check the predictions properly. Without the additional_context this might not be possible. I think having additional_context inside the document would be fine (with a clear indication that it is added?).
What I like about this idea is that it is designed similarly to embed_meta_fields of embedders.
Feel free to open a lightweight PR for this feature.
> What I like about this idea is that it is designed similarly to embed_meta_fields of embedders.

I would say that `embed_meta_fields` obscures the addition of the metadata to the text file. The `embed_meta_fields` feature only adds the text at indexing time, but when searching the end-user doesn't see that this meta info was prepended to the document.
> In the prompt, users can check what was passed to the model. With Extractive QA we want to ensure even more that the user can check the predictions properly.
However, this is a really good point. Maybe a compromise could be that we add the additional_context to the document in the returned Haystack Answer so the user can see it, but we still restrict the model from returning the additional_context as part of the answer?
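The compromise above could be sketched roughly as follows. This is only an illustration in plain Python, not the ExtractiveReader implementation: `Span`, `restrict_to_content`, the separator, and the example strings are all hypothetical. It assumes the additional context is prepended to the document content and that candidate answers are character spans over the combined text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """A candidate answer as character offsets into the reader input (hypothetical type)."""
    start: int
    end: int  # exclusive
    score: float


def restrict_to_content(spans: List[Span], prefix_len: int) -> List[Span]:
    """Drop spans that start inside the prepended context and shift the
    remaining spans so their offsets refer to the original document content."""
    kept = []
    for span in spans:
        if span.start < prefix_len:
            # The span begins in the additional context, so it is not a valid answer.
            continue
        kept.append(Span(span.start - prefix_len, span.end - prefix_len, span.score))
    return kept


# Hypothetical values for illustration only.
additional_context = "Company: Pear LLC"
content = "The company ID is 12345."
separator = "\n"

# The model sees the context and the content; the user can be shown both as well.
reader_input = additional_context + separator + content
prefix_len = len(additional_context) + len(separator)

# Two made-up candidates: one inside the context, one inside the content.
ctx_start = reader_input.index("Pear LLC")
doc_start = reader_input.index("12345")
candidates = [
    Span(ctx_start, ctx_start + len("Pear LLC"), 0.40),
    Span(doc_start, doc_start + len("12345"), 0.90),
]

answers = restrict_to_content(candidates, prefix_len)
answer_text = content[answers[0].start:answers[0].end]
```

Because the surviving offsets are shifted by the prefix length, they can be applied directly to the original document content, while the additional context can still be displayed separately so the user can check the prediction.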
Is your feature request related to a problem? Please describe.
I would like to be able to use meta information to provide context to the TransformerReader or the FARMReader to boost the performance of answering questions, in a similar way to how we can use `embed_meta_fields` to boost the performance of EmbeddingRetrievers. Sometimes meta information is needed to distinguish between similar documents. We have had multiple clients face this exact problem because they are retrieving info from lots of legal PDF files, which have a lot of boilerplate text and often define things like the company name once at the beginning of a 60-page PDF.
Describe the solution you'd like
As motivation I'd like to walk through an example where being able to add meta information from a document to the Reader at query time would be beneficial. Pretend I have two docs that have a similar structure and contain similar information, but about two different companies:
Document 1 (comes from pear_llc_contract.pdf)
Document 2 (comes from rainforest_contract.pdf)
I would like to ask the question "What is the company ID of Pear LLC?" However, nowhere in the content of the documents are the names of the companies involved in the deal specified. So if I provide these two documents to a FARMReader, I should have about a 50/50 chance of getting the correct answer.
However, if I could specify a new variable (e.g. `embed_meta_fields`, like we can for EmbeddingRetrievers), then the FARMReader would have the necessary context to answer the question.
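As a rough sketch of the idea, the selected meta fields could be joined with the document content before the text is handed to the reader. The helper `build_reader_input`, the `file_name` meta key, and the document contents below are hypothetical and only serve as an illustration; this is plain Python, not an existing FARMReader or ExtractiveReader parameter.

```python
from typing import Any, Dict, List


def build_reader_input(
    content: str,
    meta: Dict[str, Any],
    meta_fields: List[str],
    separator: str = "\n",
) -> str:
    """Prepend the values of the selected meta fields to the document content.
    Fields that are missing or None are skipped."""
    values = [str(meta[field]) for field in meta_fields if meta.get(field) is not None]
    return separator.join(values + [content])


# Two similar documents; the contents are invented for illustration.
docs = [
    {"content": "The company ID is 12345.", "meta": {"file_name": "pear_llc_contract.pdf"}},
    {"content": "The company ID is 67890.", "meta": {"file_name": "rainforest_contract.pdf"}},
]

reader_inputs = [
    build_reader_input(doc["content"], doc["meta"], meta_fields=["file_name"])
    for doc in docs
]
```

With the file name in front of each text, the two inputs are no longer interchangeable from the reader's point of view, which is the context the question about Pear LLC needs.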
Additional context
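- However, one difference is that we should consider if we prevent the ExtractiveReader from returning the `additional_context` as an answer, since the additional_context will not be present in the returned Document to the user.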