allmalab / problems

Challenges to solve in Azerbaijani NLP

Text corpus #1

Open ceferisbarov opened 6 months ago

ceferisbarov commented 6 months ago

Several public text datasets are hosted on GitHub and Hugging Face, and we are aware of private datasets collected by various organisations across the country. However, no open-source text corpus is large enough to train a large language model for Azerbaijani.

For reference, the OpenWebText dataset, an open replication of the WebText corpus used to train GPT-2, has roughly 9B tokens, while our Azerbaijani Wikipedia dataset contains only ~110M tokens.

ceferisbarov commented 4 months ago

We are working on this. Our initial goal is a 2B-token dataset.
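For a sense of scale, the figures quoted in this thread can be compared with a quick calculation. This is only a rough sketch: it assumes the "2B" goal is measured in tokens, and it ignores the fact that token counts depend on the tokenizer, so the ratios are approximate. The constant and variable names are illustrative.

```python
# Approximate token counts quoted in this thread.
OPENWEBTEXT_TOKENS = 9_000_000_000    # "roughly 9B"
AZ_WIKIPEDIA_TOKENS = 110_000_000     # "~110M"
TARGET_TOKENS = 2_000_000_000         # assumption: the "2B" goal is counted in tokens

# How many times larger each corpus is than the Azerbaijani Wikipedia dataset.
gap_to_openwebtext = OPENWEBTEXT_TOKENS / AZ_WIKIPEDIA_TOKENS
gap_to_target = TARGET_TOKENS / AZ_WIKIPEDIA_TOKENS

print(f"OpenWebText is about {gap_to_openwebtext:.0f}x the Wikipedia dataset")
print(f"The initial goal is about {gap_to_target:.0f}x the Wikipedia dataset")
```

Under these assumptions, the initial goal needs roughly 18 times the text currently in the Wikipedia dataset, and matching OpenWebText would need roughly 82 times.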