The English Wikipedia is more than 21 GB of compressed text without media. At the same time, it makes up just a little more than 10% of all articles across all Wikipedias. Adding the other Wikipedias would therefore increase the size of the corpus to roughly 210 GB of compressed text, spanning a truly encyclopedic amount of knowledge written mostly by humans.
Wikimedia provides regular dumps, which can replace scraping. The pages will of course need to be converted to this project's target data format.
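As a rough illustration of that conversion step, the sketch below streams pages out of a bz2-compressed MediaWiki XML dump and serializes each one as a JSON line. The record layout (`title`, `text`), the function names and the example file name are assumptions made for illustration; the project's actual target format is not specified here, and the page text is left as raw wikitext.

```python
import bz2
import io
import json
import xml.etree.ElementTree as ET


def local_name(tag):
    """Drop the XML namespace prefix, e.g. '{ns}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(stream):
    """Yield one dict per <page> element without loading the whole dump."""
    title, text = None, None
    for _event, elem in ET.iterparse(stream, events=("end",)):
        name = local_name(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            # An empty <text/> element has no text node.
            text = elem.text or ""
        elif name == "page":
            # Hypothetical record layout, not the project's real format.
            yield {"title": title, "text": text}
            title, text = None, None
            elem.clear()  # release the parsed subtree of this page


def to_jsonl(record):
    """Serialize one record as a single JSON line, keeping non-ASCII text."""
    return json.dumps(record, ensure_ascii=False)


def convert_dump(path, out):
    """Read a .xml.bz2 dump at `path` and write JSON lines to `out`."""
    with bz2.open(path, "rb") as stream:
        for record in iter_pages(stream):
            out.write(to_jsonl(record) + "\n")
```

A call such as `convert_dump("enwiki-latest-pages-articles.xml.bz2", sys.stdout)` (with `import sys`) would print one JSON object per page, assuming a dump file of that name has been downloaded. Stripping wikitext markup, filtering redirects and skipping non-article namespaces are separate steps not shown here.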
The addition of other Wikimedia projects such as Wikivoyage, Wiktionary, Wikisource and Wikibooks can be discussed. Wiktionary especially can be valuable as a translation reference.