Text corpus

Several public Azerbaijani text datasets are hosted on GitHub and Hugging Face, and we are also aware of private datasets collected by various organisations across the country. However, no open-source text corpus is large enough to train a large language model for Azerbaijani.

For reference, the OpenWebText dataset, an open reproduction of the WebText corpus used to train GPT-2, has roughly 9B tokens, while our Azerbaijani Wikipedia dataset contains only ~110M tokens, about 80 times fewer.

allmalab / problems

Text corpus #1