e-bug / pascal

[ACL 2020] Code and data for our paper "Enhancing Machine Translation with Dependency-Aware Self-Attention"
https://www.aclweb.org/anthology/2020.acl-main.147/
MIT License

recreate your experiments #7

Closed · lovodkin93 closed this issue 3 years ago

lovodkin93 commented 3 years ago

Hello, I am trying to recreate your experiment on the en-de NC11 corpus, and was wondering: did you split the NC11 corpus into train, dev, and test sets, and then clean only the train set? After downloading the NC11 corpus, it appears to contain 242,770 sentences, whereas 238,843 + 2,169 + 2,999 (the train, dev, and test set sizes from Table 4 in the appendix) adds up to 244,011, i.e. 1,241 more sentences than the downloaded corpus. I was wondering what could account for this discrepancy. Thanks!
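For reference, the arithmetic behind the question can be checked directly (the counts below are the ones quoted from Table 4 and the downloaded corpus):

```python
# Sanity check on the sentence counts discussed above.
train, dev, test = 238_843, 2_169, 2_999  # split sizes from Table 4
raw_nc11 = 242_770                        # sentences in the downloaded NC11 corpus

split_total = train + dev + test
print(split_total)              # 244011
print(split_total - raw_nc11)   # 1241 -> the splits exceed the raw corpus
```

So the reported splits sum to more sentences than the raw download, not fewer, which is why a train-only cleaning step alone cannot explain the gap.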

e-bug commented 3 years ago

Yes, the training set was cleaned for Pascal. You cannot clean the evaluation sets (dev & test) as, for one thing, results wouldn't be comparable.
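A minimal sketch of what a train-only cleaning step typically looks like; the filter thresholds and the `clean_parallel` helper here are illustrative assumptions, not necessarily the exact filters used for PASCAL:

```python
def clean_parallel(src_lines, tgt_lines, min_len=1, max_len=80, max_ratio=1.5):
    """Keep sentence pairs whose lengths are within bounds and whose
    source/target length ratio is not too skewed (hypothetical thresholds)."""
    kept = []
    for src, tgt in zip(src_lines, tgt_lines):
        s, t = src.split(), tgt.split()
        # drop empty, too-short, or too-long sentences on either side
        if not (min_len <= len(s) <= max_len and min_len <= len(t) <= max_len):
            continue
        # drop pairs with a large length mismatch (likely misalignments)
        if max(len(s), len(t)) / min(len(s), len(t)) > max_ratio:
            continue
        kept.append((src, tgt))
    return kept

# Applied to the training split only; dev and test are left untouched
# so that evaluation results remain comparable across systems.
```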