RyokoAI / BigKnow2022

BigKnow2022: Bringing Language Models Up to Speed

Add the Wikipedias of all languages and other Wikimedia resources #1

Open michaelbogdan opened 1 year ago

michaelbogdan commented 1 year ago

The English Wikipedia is more than 21 GB of compressed text without media. At the same time, it makes up just a little more than 10% of all articles across all Wikipedias. Adding the other Wikipedias would therefore increase the size of the corpus to roughly 210 GB of compressed text, spanning a truly encyclopedic amount of knowledge written mostly by humans.

Wikimedia provides regular dumps, which can replace scraping. The pages would of course need to be converted to this project's target data format.
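
To make the conversion step concrete, here is a minimal sketch of streaming a `pages-articles` style dump and writing one record per article. It assumes the standard MediaWiki XML export layout (`page` > `title`, `ns`, `revision` > `text`) and uses JSON Lines as a stand-in output, since the project's actual target format is not specified in this thread. The function names `iter_pages` and `dump_to_jsonl` are made up for illustration.

```python
import bz2
import json
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield {'title', 'text'} dicts for main-namespace (ns == 0) pages."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "page":
            continue
        fields = {"title": "", "ns": "", "text": ""}
        for child in elem.iter():
            name = local_name(child.tag)
            # Keep the first non-empty value seen for each field.
            if name in fields and not fields[name]:
                fields[name] = child.text or ""
        if fields["ns"] == "0":
            yield {"title": fields["title"], "text": fields["text"]}
        # Release this page's children so memory use stays modest.
        elem.clear()


def dump_to_jsonl(dump_path, out_path):
    """Assumed output format: one JSON object per line (not the project's own)."""
    with bz2.open(dump_path, "rb") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for page in iter_pages(src):
            dst.write(json.dumps(page, ensure_ascii=False) + "\n")
```

Note that the `text` field here is still raw wikitext; stripping templates and markup would be a separate step.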

Adding other Wikimedia projects such as Wikivoyage, Wiktionary, Wikisource and Wikibooks can also be discussed. Wiktionary in particular could be valuable as a translation reference.

Ronsor commented 1 year ago

Noted! I already have Japanese and Korean Wikipedia.

I'll soon take a look at the other Wikimedia projects.