Sherlock-coder / CorpusBrainPlusPlus

CorpusBrain++: A Continual Generative Pre-Training Framework for Knowledge-Intensive Language Tasks

Question about the pseudo-queries and pre-training on a continual single task. #1

Open LouisDo2108 opened 3 months ago

LouisDo2108 commented 3 months ago

Dear @Sherlock-coder ,

Thank you for sharing your implementation and the awesome paper. I have some questions regarding the single-task scenario (considering only open-domain QA), where the setup is similar to IncDSI's.

Best regards, Louis.

Sherlock-coder commented 3 months ago

For the first question, I think the answer actually depends on the experimental setup in your work. I reviewed our experimental setup and IncDSI's, and while they look similar, they are indeed different: IncDSI assumes that query-docid pairs corresponding to the new document sets $D_{t>1}$ are available, whereas we assume they are not. We argue that in a realistic scenario, when a search engine indexes new documents, there is no user feedback yet and thus no labeled data. In fact, for the KILT dataset, query-docid pairs for $D_{t>1}$ are available as well, but we chose not to use them. So the answer to this question depends on your experimental setup: if it matches IncDSI's, you can use the labeled data directly; if it matches ours, you need to construct pseudo query-docid pairs.
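To make the distinction concrete, here is a minimal sketch of the two data regimes when indexing a new document set; the function and parameter names are illustrative only and are not part of this repository.

```python
from typing import Callable, Iterable, List, Optional, Tuple


def indexing_pairs(
    new_docs: Iterable[Tuple[str, str]],                    # (docid, text) for D_{t>1}
    labeled_queries: Optional[Callable[[str], List[str]]],  # docid -> real user queries, if any
    pseudo_queries: Callable[[str], List[str]],             # document text -> pseudo-queries
) -> List[Tuple[str, str]]:
    """Build (query, docid) pairs for a newly arrived document set."""
    pairs: List[Tuple[str, str]] = []
    for docid, text in new_docs:
        if labeled_queries is not None:
            # IncDSI-style setup: annotated query-docid pairs for D_{t>1} are available.
            queries = labeled_queries(docid)
        else:
            # CorpusBrain++-style setup: no user feedback yet, so pseudo-queries
            # are constructed from the document itself (see the next comment).
            queries = pseudo_queries(text)
        pairs.extend((q, docid) for q in queries)
    return pairs
```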

For the second question, I think pseudo-queries can still be constructed by leveraging ISS followed by sampling an n-gram span; please refer to tasks/qa/generate.py. Furthermore, since we focus on downstream multi-task scenarios, we also took efficiency into consideration when constructing pseudo-queries. If efficiency is less of a concern, constructing pseudo-queries with docTTTTTquery is also an option (refer to DSI++ or IncDSI); we actually tested it and achieved good results.
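For illustration, a simplified sketch of ISS-style pseudo-query construction by n-gram span sampling could look like the following. This is not the actual tasks/qa/generate.py; the function name, span lengths, and number of queries per document are assumptions.

```python
import random
from typing import List, Tuple


def sample_pseudo_queries(
    docid: str,
    text: str,
    num_queries: int = 5,   # assumed: pseudo-queries per document
    min_len: int = 5,       # assumed: minimum span length in tokens
    max_len: int = 15,      # assumed: maximum span length in tokens
    seed: int = 42,
) -> List[Tuple[str, str]]:
    """Sample random n-gram spans from a document as ISS-style pseudo-queries.

    Returns (pseudo_query, docid) pairs for the new document, simplified
    from the idea behind tasks/qa/generate.py.
    """
    rng = random.Random(seed)
    tokens = text.split()
    if not tokens:
        return []
    pairs: List[Tuple[str, str]] = []
    for _ in range(num_queries):
        span_len = min(rng.randint(min_len, max_len), len(tokens))
        start = rng.randint(0, len(tokens) - span_len)
        pseudo_query = " ".join(tokens[start:start + span_len])
        pairs.append((pseudo_query, docid))
    return pairs
```

If efficiency matters less, the random span sampling above could be swapped for queries generated by a docTTTTTquery-style model (for example, a publicly released doc2query-T5 checkpoint), along the lines of what DSI++ and IncDSI describe.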