Closed slei109 closed 2 years ago
You can find the processed text and labels from this link: https://people.csail.mit.edu/yujia/files/distributional-signatures/data.zip
For the raw text and the labels, you can download them from the Kaggle website: https://www.kaggle.com/datasets/rmisra/news-category-dataset
Thanks for your quick reply. I found that there are 200,863 records in the dataset downloaded from Kaggle, while your processed dataset contains 36,900 records. I want to reproduce the BERT-based results on HuffPost. I noticed in the previous issue that "huffpost_bert_uncase.json" is not compatible with the current BERT model, which is why I asked for the raw text, to see if I can produce new tokens.
I see. I think when we created the processed data, we randomly sampled 900 examples for each class (with 41 classes there are 36900 examples in total).
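A minimal sketch of that per-class sampling, assuming the Kaggle records are dicts with a `category` field (the field name and the exact sampling procedure are assumptions, not taken from the repository's preprocessing code):

```python
import random


def sample_per_class(records, per_class=900, seed=0):
    """Randomly sample up to `per_class` records for each label.

    `records` is a list of dicts; the label is assumed to live under the
    "category" key, as in the Kaggle News Category dataset.
    """
    by_label = {}
    for rec in records:
        by_label.setdefault(rec["category"], []).append(rec)

    rng = random.Random(seed)
    sampled = []
    for label, recs in sorted(by_label.items()):
        rng.shuffle(recs)
        sampled.extend(recs[:per_class])
    return sampled
```

With 41 classes and `per_class=900`, this yields the 36,900 examples mentioned above.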
To trace back to the original raw text, look at huffpost.json instead of huffpost_bert_uncase.json. The huffpost.json file contains the tokenized input, which you can use to match against the original dataset.
Hi, could you please share the raw text and the labels of the 41 classes you chose to work with for the HuffPost dataset? Thanks!