
L-Zhe / BTmPG

Code for the paper "Pushing Paraphrase Away from Original Sentence: A Multi-Round Paraphrase Generation Approach" by Zhe Lin and Xiaojun Wan. The paper was accepted to Findings of ACL'21.

MIT License


Data Split #5

Closed MrShininnnnn closed 1 year ago

MrShininnnnn commented 1 year ago

For Quora, there are actually 149,263 samples in total, which does not match the data split reported in the paper (129,263/3k/3k). Is there a reason not to use the full dataset? Thanks.

gouqi666 commented 1 year ago

@MrShininnnnn Hello, I ran into this question today. Did you solve it?

L-Zhe commented 1 year ago

We are sorry for the clerical error in our paper. After a detailed inspection, we found that we sampled 10k question pairs each for the valid and test sets. The actual sizes of the training, valid, and test sets are therefore 129,263/10k/10k, which sums to the 149,263 total. We also release the outputs for the whole valid and test sets in the result folder, so there are 20k samples for each round of paraphrase generation.

In fact, the valid and test sets each contain 10k samples.
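
For anyone who wants to reproduce split sizes like these, here is a minimal sketch that carves 10k valid and 10k test pairs out of 149,263 samples and checks that the remainder is 129,263. The thread does not say how the pairs were sampled, so the seeded shuffle, the `split_indices` helper, and the seed value are assumptions for illustration only, not the authors' procedure.

```python
import random

TOTAL = 149_263    # total Quora pairs reported in this issue
N_VALID = 10_000   # corrected valid size from the author's reply
N_TEST = 10_000    # corrected test size from the author's reply


def split_indices(total, n_valid, n_test, seed=0):
    """Hypothetical helper: shuffle indices, then slice off valid and test."""
    indices = list(range(total))
    # A dedicated Random instance keeps the split reproducible for a given seed.
    random.Random(seed).shuffle(indices)
    valid = indices[:n_valid]
    test = indices[n_valid:n_valid + n_test]
    train = indices[n_valid + n_test:]
    return train, valid, test


train, valid, test = split_indices(TOTAL, N_VALID, N_TEST)
print(len(train), len(valid), len(test))  # 129263 10000 10000
```

The resulting indices will not match the released split; the sketch only confirms that the corrected numbers are consistent with the full dataset size.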