AI4Bharat / indicnlp_catalog

A collaborative catalog of NLP resources for Indic languages
https://ai4bharat.github.io/indicnlp_catalog

KHASI north east #237

Open dame-cell opened 6 months ago

dame-cell commented 6 months ago

https://paperswithcode.com/paper/enkhcorp1-0-an-english-khasi-corpus

The paper describes where the authors found the data and how they collected the dataset.

anoopkunchukuttan commented 6 months ago

Thanks for sharing

dame-cell commented 6 months ago

Hey, to make your life easier, I have already uploaded the dataset to Hugging Face:

https://huggingface.co/datasets/damerajee/khasi-datasets - this is the one split into sentences

https://huggingface.co/datasets/damerajee/khasi-raw-data - the raw data, in large unsplit paragraphs