Yes @Hannibal046, you are right. I haven't tried N > 2 due to the computational budget.
@Ravoxsg Thanks for your reply! I am wondering whether you have ever conducted experiments to verify the distribution shift from the training set to the test set? I am quite confused by this phenomenon: https://github.com/yixinL7/SimCLS/issues/16
@Hannibal046 this is actually what we see in the paper when comparing the base setup and the transfer setup. On CNNDM with PEGASUS, we get around a 9% relative improvement in the base setup (Table 4, bottom block), but only around a 5% relative improvement in the transfer setup (Table 5). Still, 5% on top of SOTA is good :) which motivates using the transfer setup.
@Ravoxsg Hi, thanks for the reply. In my opinion, the premise of the base setup and the transfer setup is the existence of a train-test distribution shift. Using the transfer setup does improve summarization quality compared with the base setup, even with plain beam search, because of the extra data. But these are results on the test set; my question is about the training set. I sampled 2,000 examples from the CNNDM training set and used a fine-tuned bart-large to generate candidates, and was surprised to find that the training set gives almost identical results to the test set, even though the model was fine-tuned on it.
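For reference, here is roughly the check I ran, as a minimal sketch (the checkpoint path is a placeholder for my fine-tuned bart-large; it scores beam-search outputs with ROUGE-1 on 2,000 sampled examples per split):

```python
# Sketch: compare candidate ROUGE on CNNDM train vs. test subsets.
# MODEL_PATH is a placeholder for a bart-large checkpoint fine-tuned on CNNDM.
import torch
from datasets import load_dataset
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from rouge_score import rouge_scorer

MODEL_PATH = "path/to/finetuned-bart-large-cnndm"  # placeholder checkpoint
device = "cuda" if torch.cuda.is_available() else "cpu"

tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForSeq2SeqLM.from_pretrained(MODEL_PATH).to(device).eval()
scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeLsum"], use_stemmer=True)

def mean_rouge1(split, n_samples=2000):
    """Generate beam-search candidates for a split and return the mean ROUGE-1 F1."""
    data = load_dataset("cnn_dailymail", "3.0.0", split=split).shuffle(seed=42)
    data = data.select(range(n_samples))
    scores = []
    for ex in data:
        inputs = tokenizer(ex["article"], truncation=True, max_length=1024,
                           return_tensors="pt").to(device)
        with torch.no_grad():
            out = model.generate(**inputs, num_beams=4, max_length=128)
        pred = tokenizer.decode(out[0], skip_special_tokens=True)
        scores.append(scorer.score(ex["highlights"], pred)["rouge1"].fmeasure)
    return sum(scores) / len(scores)

print("train ROUGE-1:", mean_rouge1("train"))
print("test  ROUGE-1:", mean_rouge1("test"))
```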
@Hannibal046 interesting finding. This suggests that the model is still quite under-fitting the data, at least with regard to ROUGE. I am not sure I fully understand the sentence "Using transfer setup did improve the summarization quality even using beam search compared with base setup because of more data." The transfer setup means using, at inference, a base model fine-tuned on 100% of the training set, while the base setup means using a base model fine-tuned on 50% of the training set. One could also train the re-ranker in a transfer-setup fashion, i.e., train it on the outputs of the model fine-tuned on 100% of the training set. Training again on the outputs of a model trained on that same data is generally very bad ML practice, but your findings suggest that this could be explored in this summarization setup.
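To make the two setups concrete, a rough sketch; `finetune`, `generate`, and `train_set` are hypothetical placeholders, not this repo's API:

```python
# Hypothetical sketch of the base vs. transfer setups discussed above.
train_set: list = []  # assume: the list of CNNDM training examples

def finetune(data):
    """Stand-in for fine-tuning the base summarization model on `data`."""
    raise NotImplementedError

def generate(model, data):
    """Stand-in for producing beam-search candidates with `model` on `data`."""
    raise NotImplementedError

half = train_set[: len(train_set) // 2]

base_model = finetune(half)           # base setup: inference candidates from a model trained on 50% of the data
transfer_model = finetune(train_set)  # transfer setup: inference candidates from a model trained on 100% of the data

# The re-ranker is normally trained on candidates from the half-trained models.
# The risky transfer-style alternative would train it on generate(transfer_model, train_set),
# i.e. on outputs of a model that has already seen every training article.
```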
Yes, I totally agree that this is bad ML practice, and I am also quite confused by this finding.
Hi, thanks for your great work. I am curious about Section 3.3 (Tackling the Training and Inference Gap): you split the training data into 2 folds and cross-generate candidates, with each half summarized by a model fine-tuned on the other half. In theory, if you split the training data into more parts (i.e., an N-fold split with a large N), the distribution of the re-ranker's training set would be closer to that of the test set. Have you ever tried such experiments? Why choose only a 2-fold split?
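To make the question concrete, here is a rough sketch of what an N-fold generalization of the 2-fold cross-generation could look like; `fine_tune` and `generate_candidates` are hypothetical placeholders for the usual fine-tuning and candidate-generation steps, and only the fold bookkeeping is concrete:

```python
# Sketch: generalizing 2-fold cross-generation of re-ranker training data to N folds.
from datasets import load_dataset

N_FOLDS = 4  # the paper uses 2; larger N narrows the train/test gap but costs N fine-tuning runs

train = load_dataset("cnn_dailymail", "3.0.0", split="train")
fold_of = [i % N_FOLDS for i in range(len(train))]  # assign each example to a fold

def fine_tune(examples):
    """Hypothetical stand-in: fine-tune the base summarizer on these examples."""
    raise NotImplementedError

def generate_candidates(model, examples):
    """Hypothetical stand-in: beam-search candidates for these examples."""
    raise NotImplementedError

reranker_training_data = []
for k in range(N_FOLDS):
    held_out = train.select([i for i, f in enumerate(fold_of) if f == k])
    seen = train.select([i for i, f in enumerate(fold_of) if f != k])
    model_k = fine_tune(seen)  # this model never sees fold k
    reranker_training_data += generate_candidates(model_k, held_out)

# Each candidate set now comes from a model that did not train on its source article,
# which should mimic the test-time distribution more closely as N grows.
```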