Closed dengyang17 closed 4 years ago
After downloading and processing the dataset, I got three files (eli5_train/valid/test.json). However, I found that the given documents are very messy. I wonder whether you did any data cleaning for the experiment? Also, what are the final train/valid/test sizes of the split dataset?

Hello,

Some of the documents are indeed messy: remember that they are created with a simple extraction heuristic. However, we found that most of them contain some relevant information, and that a properly trained model is able to zero in on that information.

After running the scripts, the final train/valid/test split sizes were 234063/9919/24804.
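To sanity-check the split sizes on your own copy, a small counting helper is enough. This is only a sketch: the file names (eli5_train/valid/test.json) come from the question above, and the two layouts it handles (a single JSON array, or JSON-lines with one example per line) are assumptions, since the exact output format of the processing scripts may differ.

```python
import json
import os
import tempfile

def count_examples(path):
    """Count examples in a split file.

    Assumes the file is either a single JSON array or JSON-lines
    (one object per line); adjust if your processed files differ.
    """
    with open(path) as f:
        first = f.read(1)
        f.seek(0)
        if first == "[":
            return len(json.load(f))
        return sum(1 for line in f if line.strip())

# Demo on a tiny synthetic JSON-lines file; in practice you would
# point this at eli5_train.json, eli5_valid.json, eli5_test.json.
with tempfile.TemporaryDirectory() as d:
    demo = os.path.join(d, "eli5_demo.json")
    with open(demo, "w") as f:
        for i in range(3):
            f.write(json.dumps({"question_id": i}) + "\n")
    print(count_examples(demo))  # -> 3
```

If the counts you get differ substantially from 234063/9919/24804, the download or processing step likely terminated early or skipped shards.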