AI4Bharat / indicnlp_corpus

Description: Describes the IndicNLP corpus and associated datasets

Availability of text corpora #1

Closed (AshishSardana closed 4 years ago)

AshishSardana commented 4 years ago

What is the approximate timeline for the release of the raw text corpora? Also, can you point me to other resources where I can get free text corpora for Gujarati, Tamil, Telugu, and Marathi?

anoopkunchukuttan commented 4 years ago

Hi Ashish, our paper based on this work is under review. We will release the corpus on acceptance, which we expect to be around September. In the meantime, you can try the OSCAR corpus for the languages you mentioned (https://traces1.inria.fr/oscar). Regards, Anoop.

AshishSardana commented 4 years ago

Thank you, Anoop. I wish you the best! This dataset isn't mentioned in the indicnlp_catalog GitHub repo; you might want to link to it there as well.

anoopkunchukuttan commented 4 years ago

Thanks for pointing out this oversight. I will add it to the repo.