Details on pre-processing

cocoxu / simplification

Text Simplification System and Dataset

GNU General Public License v3.0


Closed by feralvam 4 years ago

feralvam commented 5 years ago

Hello,

Could you provide details on the pre-processing applied to the dataset? For example, which tokenizer was used (with which options)? Thank you.

cocoxu commented 5 years ago

I added the scripts I used for preprocessing into the repository (most were adapted from Moses, I think).
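For readers unfamiliar with this kind of preprocessing, here is a minimal sketch of what punctuation-splitting tokenization looks like. It is not the code from the repository's scripts and not the Moses tokenizer itself. The function name `simple_tokenize` and the regular expression are assumptions made purely for illustration. The real Moses tokenizer handles many more cases, such as abbreviations and language-specific rules.

```python
import re
from typing import List


def simple_tokenize(sentence: str) -> List[str]:
    """Hypothetical, simplified tokenizer (illustration only).

    Separates punctuation from adjacent words so that each punctuation
    mark becomes its own token. Apostrophes and hyphens are left attached
    to the word they appear in.
    """
    # Surround every character that is not a word character, whitespace,
    # apostrophe, or hyphen with spaces.
    spaced = re.sub(r"([^\w\s'-])", r" \1 ", sentence)
    # Splitting on whitespace also collapses the extra spaces added above.
    return spaced.split()


if __name__ == "__main__":
    print(simple_tokenize("Hello, world!"))
```

For exact reproduction of the dataset's tokenization, the scripts in the repository remain the authoritative source, since the options they use are not stated in this thread.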

feralvam commented 5 years ago

Thanks

feralvam commented 5 years ago

Hi, what was used to truecase the files? Or are those the original sentences, before any pre-processing with the scripts?

cocoxu commented 5 years ago

I believe they are the original sentences.