Hello,
I am trying to recreate your experiment on the en-de NC11 corpus, and was wondering: did you just split the NC11 corpus into train, dev and test sets, and then clean only the train set?
I ask because, after downloading the NC11 corpus, it appears to have 242,770 sentences, whereas 238,843 + 2,169 + 2,999 (the number of sentences in the train, dev and test sets respectively, according to Table 4 in the appendix) adds up to 244,011. What could account for the difference of 1,241 sentences?
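For reference, here is a quick sanity check of the arithmetic, as a minimal sketch. The variable names are only for illustration, and the figures are the ones quoted above (split sizes from the appendix table, corpus size from my own download).

```python
# Split sizes as quoted above from the appendix table.
split_sizes = {"train": 238_843, "dev": 2_169, "test": 2_999}

# Sentence count in my downloaded copy of the en-de NC11 corpus.
downloaded = 242_770

# Total implied by the reported splits, and how far it is from my download.
total = sum(split_sizes.values())
difference = total - downloaded

print(f"sum of splits: {total:,}")
print(f"downloaded:    {downloaded:,}")
print(f"difference:    {difference:,}")
```

A positive difference means the reported splits contain more sentences than my copy of the corpus does.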
Thanks!