bazingagin / npc_gzip

Code for Paper: “Low-Resource” Text Classification: A Parameter-Free Classification Method with Compressors
MIT License

slice of ohsumed dataset? #17

Closed by cyrilou242 1 year ago

cyrilou242 commented 1 year ago

Hey,

In the paper, the ohsumed dataset has 3.4k train and 4k test observations. From what I understand from Hugging Face (https://huggingface.co/datasets/ohsumed) and from http://disi.unitn.it/moschitti/corpora.htm, the original dataset has far more observations.

Could you give more detail on how the dataset was obtained and where I can find it?

bazingagin commented 1 year ago

Sure. I used the data split from previous work: Yao, Liang, Chengsheng Mao, and Yuan Luo. "Graph convolutional networks for text classification." Proceedings of the AAAI conference on artificial intelligence. Vol. 33. No. 01. 2019.


You can download it here: https://github.com/yao8839836/text_gcn/tree/master/data/ohsumed_single_23
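For anyone checking a downloaded copy against the split sizes quoted in the paper, here is a minimal sketch. It assumes the folder is laid out as `<root>/<split>/<label>/<document>`, with split folders named `training` and `test`. That layout, those folder names, and the helper names `load_split` and `summarize` are assumptions for illustration, not something confirmed in this thread, so adjust them to match what you actually download.

```python
from collections import Counter
from pathlib import Path


def load_split(root, split):
    """Return (text, label) pairs for one split.

    Assumes (hypothetically) a layout of root/split/label/document,
    where each document is a plain-text file.
    """
    samples = []
    split_dir = Path(root) / split
    for label_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        for doc in sorted(p for p in label_dir.iterdir() if p.is_file()):
            # errors="replace" keeps going if a file is not valid UTF-8.
            text = doc.read_text(encoding="utf-8", errors="replace")
            samples.append((text, label_dir.name))
    return samples


def summarize(root, splits=("training", "test")):
    """Count documents and distinct labels in each split."""
    summary = {}
    for split in splits:
        samples = load_split(root, split)
        labels = Counter(label for _, label in samples)
        summary[split] = {"docs": len(samples), "classes": len(labels)}
    return summary
```

Calling `summarize("path/to/ohsumed_single_23")` on a local copy should report document counts close to the 3.4k train and 4k test figures mentioned above, provided the layout assumption holds.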

cyrilou242 commented 1 year ago

Thanks @bazingagin!