There's a very nice implementation by Yacine Jernite at 🤗 that could be used as a foundation to work from: https://yjernite.github.io/lfqa.html
He uses raw Elasticsearch for the retriever, so Haystack would certainly simplify a lot of that analysis!
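For a rough idea of what that could look like (a sketch only, assuming Haystack 0.x import paths and the `EmbeddingRetriever` API; the index name and checkpoint are placeholders):

```python
from haystack.document_store.elasticsearch import ElasticsearchDocumentStore
from haystack.retriever.dense import EmbeddingRetriever

# Point Haystack at the same Elasticsearch index instead of writing raw ES queries.
document_store = ElasticsearchDocumentStore(host="localhost", index="wiki40b_snippets")

# Single-encoder dense retriever; proper Retribert support may need extra plumbing,
# so the checkpoint below is just a placeholder.
retriever = EmbeddingRetriever(
    document_store=document_store,
    embedding_model="yjernite/retribert-base-uncased",
)

docs = retriever.retrieve(query="Why can't humans digest cellulose?", top_k=5)
for doc in docs:
    print(doc.text[:100])
```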
@lewtun would you like to work on it? I think @Timoeller would be happy :)
hey @lalitpagaria, i would love to tackle this but unfortunately have no bandwidth for it right now 😢 if that changes and the issue is still open, i'll have a stab at it!
@lalitpagaria @lewtun @tholor  I'd like to take this one. I implemented a quick-and-dirty prototype using Yacine's models and it seems to be working ok. Although I don't have any metrics yet, I can see that the seq2seq model is indeed generating answers conditioned on documents given by the retriever.
What would be the ideal set of deliverables for LFQA? Perhaps we can implement LFQA in a few stages. In the first stage, we can add an initial implementation based on Yacine Jernite's existing models, including demos, but without model training. In the next stage, we can add model training if needed. I am not sure how useful the training part would be, as ELI5 seems to be the only dataset targeting LFQA, but I could be overlooking something here as I am relatively new to this particular task.
Perhaps we can add all of these in one PR? LMK your preferences.
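For reference, the core generation step in my prototype looks roughly like this (a minimal sketch with the `yjernite/bart_eli5` checkpoint; the `question: ... context: ...` input format is my assumption of the fine-tuning format, and the generation parameters are illustrative):

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("yjernite/bart_eli5")
model = AutoModelForSeq2SeqLM.from_pretrained("yjernite/bart_eli5")

question = "Why does the sky appear blue during the day?"
retrieved_docs = [
    "Rayleigh scattering affects shorter (blue) wavelengths more strongly ...",
    "The color of the sky changes at sunset because ...",
]

# Condition the seq2seq model on the retrieved documents by concatenating
# them with the question into one input string (format is an assumption).
model_input = "question: {} context: {}".format(question, " ".join(retrieved_docs))
inputs = tokenizer(model_input, return_tensors="pt", truncation=True, max_length=1024)

output_ids = model.generate(
    **inputs,
    min_length=64,
    max_length=256,
    num_beams=4,             # beam search; sampling also works for ELI5-style answers
    no_repeat_ngram_size=3,  # avoid the repetition typical of long generations
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```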
Hey guys, here is the LFQA implementation preview: https://github.com/vblagoje/haystack/tree/lfqa_h. You can also check out the notebook. LMK what the best way to proceed from here is; I'd love to hear your feedback.
Awesome, thanks for working on it @vblagoje! A few thoughts on how to slice this work into meaningful stages / pull requests:
I haven't found the time yet to check your branch (and Yacine's notebook) in detail, but in all the above steps let's make sure to use meaningful abstractions for the retriever and generator classes that fit well with the rest of Haystack. What do I mean by that? We already have an `EmbeddingRetriever` (single encoder) and a `DensePassageRetriever` (dual encoder). If there's a big overlap with a `RetribertRetriever`, let's rather integrate it there. If not, let's create a new "generic" retriever class that captures the essence of Retribert but would also work with other base models (e.g. RoBERTa). The same goes for the Generator (RAG is quite specific here at the moment, but maybe there's potential for generalization).
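To make the abstraction concrete, such a generic retriever could look roughly like this (a pure sketch; `GenericDenseRetriever`, `embed`, and the constructor arguments are hypothetical, while `BaseRetriever` and `query_by_embedding` exist in Haystack today):

```python
from typing import List, Optional

from haystack.retriever.base import BaseRetriever  # path as of Haystack 0.x

class GenericDenseRetriever(BaseRetriever):
    """Hypothetical single-encoder retriever: one model embeds both queries
    and passages, so it covers Retribert-style models as well as e.g. RoBERTa."""

    def __init__(self, document_store, embedding_model: str, use_gpu: bool = True):
        self.document_store = document_store
        self.embedding_model = embedding_model  # any HF encoder checkpoint
        # ... load tokenizer/model here ...

    def embed(self, texts: List[str]) -> List[List[float]]:
        # ... run the encoder and pool to one vector per text ...
        raise NotImplementedError

    def retrieve(self, query: str, top_k: int = 10, index: Optional[str] = None):
        query_emb = self.embed([query])[0]
        # query_by_embedding is the existing DocumentStore API
        return self.document_store.query_by_embedding(query_emb, top_k=top_k, index=index)
```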
Happy to review an early PR and give more detailed feedback!
Awesome work @vblagoje.
I looked into your code and also checked https://yjernite.github.io/lfqa.html. I have the same comments as @tholor:

- Use `yjernite/retribert-base-uncased` instead of creating a new `RetribertRetriever` class.
- `Seq2SeqGenerator` and `RAGGenerator` overlap, so it would be good to move the common code to a `BaseGenerator` class (see the sketch below).
- The `Seq2SeqGenerator` class can be moved to the transformers.py file, as in both cases we are using a Hugging Face hosted model.

Implemented in #1086
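To illustrate the refactoring idea from the list above (a hypothetical skeleton only; the interface that actually landed in #1086 may differ):

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional

class BaseGenerator(ABC):
    """Shared contract for all generators (RAG, seq2seq, ...)."""

    @abstractmethod
    def predict(self, question: str, documents: List, top_k: Optional[int] = None) -> Dict:
        pass

class Seq2SeqGenerator(BaseGenerator):
    """Would live in transformers.py next to the other HF-hosted models."""

    def __init__(self, model_name_or_path: str = "yjernite/bart_eli5"):
        # load any Hugging Face seq2seq checkpoint
        ...

    def predict(self, question: str, documents: List, top_k: Optional[int] = None) -> Dict:
        # 1. build one input string from question + retrieved documents
        # 2. generate the long-form answer with the seq2seq model
        ...
```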
Creating a placeholder issue to integrate open-domain long-form question answering (LFQA) with Haystack. I feel it is very relevant to Haystack.
Hopefully we will soon see a good implementation in this regard. If someone is excited about experimenting with it, refer to the following paper, which suggests two ways to achieve this:
Article: https://ai.googleblog.com/2021/03/progress-and-challenges-in-long-form.html
Paper: https://arxiv.org/abs/2103.06332
Dataset and Info: https://ai.facebook.com/blog/longform-qa/