facebookresearch / ELI5

Scripts and links to recreate the ELI5 dataset.

Did you clean the processed dataset for experiments? #12

Closed dengyang17 closed 4 years ago

dengyang17 commented 5 years ago

After downloading and processing the dataset, I got three files (eli5_train/valid/test.json). However, I found that all of the provided documents are very messy. Did you do any data cleaning for the experiments? Also, what are the final sizes of the train/valid/test splits?

yjernite commented 5 years ago

Hello,

Some of the documents are indeed messy: remember they are created with a simple extraction heuristic. However, we found that most contained some relevant information and that a properly trained model is able to zero in on that information.

After running the scripts, we ended up with final split sizes of 234063/9919/24804 (train/valid/test).
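
To check a local copy against these numbers, something like the following rough sketch could help. It assumes each split file is named `eli5_<split>.json` and holds a single top-level JSON array; neither assumption is confirmed in this thread, and the helper names (`count_examples`, `compare_splits`, `split_fractions`) are hypothetical, so adjust them to match what the scripts actually produce.

```python
import json
from pathlib import Path

# Split sizes reported in this thread, in train/valid/test order.
REPORTED_SIZES = {"train": 234063, "valid": 9919, "test": 24804}


def count_examples(path):
    """Count top-level records, ASSUMING the file is one JSON array."""
    with open(path, encoding="utf-8") as f:
        return len(json.load(f))


def compare_splits(data_dir, reported=REPORTED_SIZES):
    """Return {split: (found, expected)} for each split file that exists."""
    results = {}
    for split, expected in reported.items():
        # ASSUMED file naming, inferred from "eli5_train/valid/test.json".
        path = Path(data_dir) / f"eli5_{split}.json"
        if path.exists():
            results[split] = (count_examples(path), expected)
    return results


def split_fractions(sizes=REPORTED_SIZES):
    """Share of the whole dataset that each split represents."""
    total = sum(sizes.values())
    return {split: n / total for split, n in sizes.items()}


if __name__ == "__main__":
    for split, (found, expected) in compare_splits(".").items():
        status = "matches" if found == expected else "differs"
        print(f"{split}: found {found}, reported {expected} ({status})")
    for split, frac in split_fractions().items():
        print(f"{split}: {frac:.1%} of all examples")
```

Note that `json.load` reads a whole file into memory, which may be heavy for the training split; a streaming parser would be the gentler choice if that becomes a problem.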