cocoxu / simplification

Text Simplification System and Dataset
GNU General Public License v3.0

Reason behind number of samples in validation and test for Turkcorpus #12

Closed: louismartin closed this issue 4 years ago

louismartin commented 5 years ago

Hello @cocoxu, just out of curiosity, is there a reason that TurkCorpus is split into 2000 (valid) / 359 (test) samples and not 50%/50%, for example? Thank you!

cocoxu commented 5 years ago

We used the 2000 (valid) samples for training the SMT model. We weren't sure how much training data would be enough, as there was little previous work at the time and SARI was new. Each experiment took 2-3 days to complete, so we didn't experiment much with different sizes of training data. Looking back, it is probably okay to use only half of the data for training.
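For anyone who wants to try the "half of the data" idea, here is a minimal sketch of drawing a reproducible half-size subset from the 2000 valid samples while leaving the 359 test samples untouched. The `subsample` helper and the placeholder IDs are hypothetical and not part of this repository; real code would load the actual TurkCorpus sentences instead.

```python
import random


def subsample(items, fraction, seed=0):
    """Return a reproducible random subset holding `fraction` of `items`."""
    rng = random.Random(seed)  # fixed seed so the subset is repeatable
    k = int(len(items) * fraction)
    return rng.sample(items, k)


# Placeholder IDs standing in for the real sentences (assumption: the
# actual data would be read from the corpus files instead).
valid_ids = [f"valid-{i}" for i in range(2000)]
test_ids = [f"test-{i}" for i in range(359)]

total = len(valid_ids) + len(test_ids)
print(f"{total} samples: {len(valid_ids) / total:.1%} valid, "
      f"{len(test_ids) / total:.1%} test")

# Use only half of the valid samples for training/tuning.
tuning_subset = subsample(valid_ids, 0.5)
print(len(tuning_subset), "samples selected;", len(test_ids), "test samples unchanged")
```

Keeping the test set fixed means results stay comparable with numbers reported on the original 359 test samples; only the amount of data used for training changes.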

louismartin commented 5 years ago

Ok thanks!