Raw Text and Class Labels of the Huffpost dataset

YujiaBao / Distributional-Signatures

"Few-shot Text Classification with Distributional Signatures" ICLR 2020

https://arxiv.org/abs/1908.06039

MIT License


Raw Text and Class Labels of the Huffpost dataset #35

Closed: slei109 closed this issue 2 years ago

slei109 commented 2 years ago

Hi, could you please share the raw text and the labels of the 41 classes you chose to work with for the Huffpost dataset? Thanks!

YujiaBao commented 2 years ago

You can find the processed text and labels at this link: https://people.csail.mit.edu/yujia/files/distributional-signatures/data.zip

For the raw text and the labels, you can download them from the Kaggle website: https://www.kaggle.com/datasets/rmisra/news-category-dataset

slei109 commented 2 years ago

Thanks for your quick reply. I found that the dataset downloaded from Kaggle has 200863 records, while your processed dataset has 36900. I would like to reproduce the BERT-based results on HuffPost. I noticed in a previous issue that "huffpost_bert_uncase.json" is not compatible with the current BERT model, so I need the raw text to see whether I can produce new tokens.

YujiaBao commented 2 years ago

I see. I think that when we created the processed data, we randomly sampled 900 examples for each class (41 classes × 900 examples = 36900 examples in total).
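For illustration, here is a minimal sketch of that kind of per-class sampling. It is not the script that produced the released data: the function name `sample_per_class`, the `category` field name, and the seed are assumptions made for the example, so it will not reproduce the exact 36900 examples in the processed file.

```python
import random
from collections import defaultdict


def sample_per_class(records, per_class=900, label_key="category", seed=0):
    """Randomly draw `per_class` records from every class.

    `records` is a list of dicts. `label_key` and `seed` are assumptions
    for this sketch, not values taken from the original preprocessing.
    """
    # Group the records by their class label.
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec[label_key]].append(rec)

    rng = random.Random(seed)
    sampled = []
    for label in sorted(by_label):
        group = by_label[label]
        if len(group) < per_class:
            raise ValueError(f"class {label!r} has only {len(group)} records")
        # Sample without replacement within the class.
        sampled.extend(rng.sample(group, per_class))
    return sampled


# Sanity check on the totals quoted in this thread.
assert 41 * 900 == 36900
```

Because the original random seed is not known, matching against huffpost.json is the only way to recover exactly which raw records were kept.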

To trace back to the original raw text, you can look at huffpost.json instead of huffpost_bert_uncase.json. The huffpost.json file contains the tokenized input. You can use it to match each example back to the original dataset.
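One possible way to do that matching is sketched below. It makes several assumptions that should be checked against the actual files: that each line of huffpost.json is a JSON object with a `text` list of tokens, that each line of the Kaggle file is a JSON object with a `headline` field, and that the tokenized text was derived from the headline. All function names are hypothetical. Both sides are reduced to lowercase letters and digits so that tokenization differences (spaces, split punctuation) do not block a match.

```python
import json
import re
from collections import defaultdict


def normalize(text):
    """Lowercase and keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]+", "", text.lower())


def read_json_lines(path):
    """Read a file containing one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def build_index(raw_records, field="headline"):
    """Map normalized raw text to the raw records that produce it.

    `field` is an assumption about which Kaggle column was used.
    """
    index = defaultdict(list)
    for rec in raw_records:
        index[normalize(rec[field])].append(rec)
    return index


def match_processed(processed_rows, index):
    """Pair each processed row with the first raw record whose key agrees."""
    matched, unmatched = [], []
    for row in processed_rows:
        key = normalize(" ".join(row["text"]))
        hits = index.get(key, [])
        if hits:
            matched.append((row, hits[0]))
        else:
            unmatched.append(row)
    return matched, unmatched
```

A typical use would be to load both files with `read_json_lines`, call `build_index` on the Kaggle records, then inspect the `unmatched` list returned by `match_processed` to see whether the normalization needs adjusting. Duplicate headlines map to several raw records; this sketch simply takes the first.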