upintheairsheep opened 1 year ago
For reference, a related (though less recent) dataset: https://www.kaggle.com/datasets/abeserra/wikia-census
And just to note on ordering: dump every Fandom wiki starting with the 1,000 most popular, then work down through all the others, and finally integrate previous Fandom dumps to compensate for deleted pages and deleted wikis.
Fandom text is licensed under an open licence (CC BY-SA).
Use it in conjunction with https://github.com/WikiTeam/wikiteam to dump the textual contents of every wiki there (I don't know whether we should include the page histories or not), or maybe only the 10,000 most popular wikis. We should also integrate https://archive.org/details/wikia_dump_20200214 , https://archive.org/details/wikia_dump_20180602 , and the Fandom wikis from https://archive.org/details/wikiteam so that the contents of deleted pages and wikis make it into the Pile. Having every Fandom wiki in the Pile would be really beneficial for AI: the site holds a vast amount of knowledge about fiction, so every future AI would gain accurate knowledge of fictional subjects as well as real-world ones.
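To make the dump step concrete, here is a minimal sketch of driving wikiteam's `dumpgenerator.py` over a list of Fandom wikis. It assumes the standard Fandom layout (each wiki exposes its MediaWiki API at `https://<subdomain>.fandom.com/api.php`) and uses the real `--api`, `--xml`, and `--curonly` flags; the wiki list here is a made-up sample, since the real run would start from the census/popularity ranking above. `--curonly` dumps only the latest revision of each page, which is one way to resolve the "include the histories or not" question.

```python
# Sketch: build dumpgenerator.py invocations for a batch of Fandom wikis.
# The subdomain list is hypothetical; a real run would read the census data.
import subprocess

def api_url(subdomain: str) -> str:
    # Fandom exposes the MediaWiki API at the wiki root.
    return f"https://{subdomain}.fandom.com/api.php"

def dump_command(subdomain: str, current_only: bool = True) -> list[str]:
    # --xml dumps page text; --curonly limits the dump to current revisions
    # (drop it to capture full page histories instead).
    cmd = ["python", "dumpgenerator.py", f"--api={api_url(subdomain)}", "--xml"]
    if current_only:
        cmd.append("--curonly")
    return cmd

if __name__ == "__main__":
    # Hypothetical sample of popular wikis.
    for wiki in ["harrypotter", "starwars", "minecraft"]:
        print(" ".join(dump_command(wiki)))
        # subprocess.run(dump_command(wiki), check=True)  # uncomment to actually dump
```

Each dump lands in its own directory, so a batch like this could feed the integration step with the Internet Archive dumps afterwards.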