EleutherAI / pilev2

MIT License
13 stars 9 forks

Fandom #12

Open upintheairsheep opened 1 year ago

upintheairsheep commented 1 year ago

Use this in conjunction with https://github.com/WikiTeam/wikiteam to dump the textual contents of every wiki on Fandom (I don't know whether we should include the page histories or not), or maybe only the 10,000 most popular wikis. We should also integrate https://archive.org/details/wikia_dump_20200214 , https://archive.org/details/wikia_dump_20180602 , and the Fandom wikis in https://archive.org/details/wikiteam so that the Pile also has the contents of deleted pages and wikis. Having every Fandom wiki in the Pile would be really beneficial for AI: Fandom holds vast knowledge of fiction, so it would help future models trained on the Pile have more accurate knowledge of fictional works, as well as of real-world topics.
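As a rough sketch of the "histories or not" choice: WikiTeam dumps are MediaWiki XML exports, so the text can be pulled out with the standard library alone. The function names (`local`, `iter_pages`) and the sample data are made up for illustration, and the sketch assumes revisions are listed oldest first inside each `<page>`, so the last one is treated as the current text.

```python
import io
import xml.etree.ElementTree as ET


def local(tag):
    """Strip the XML namespace, which differs between export versions."""
    return tag.rsplit("}", 1)[-1]


def iter_pages(xml_source, include_history=False):
    """Yield (title, [revision texts]) from a MediaWiki XML export.

    With include_history=False only the last <revision> of each <page>
    is kept (assumed here to be the most recent one).
    """
    title, revisions = None, []
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        name = local(elem.tag)
        if name == "title":
            title = elem.text
        elif name == "text":
            revisions.append(elem.text or "")
        elif name == "page":
            if revisions:
                yield title, (revisions if include_history else revisions[-1:])
            title, revisions = None, []
            elem.clear()  # release the finished page to keep memory flat


# Hypothetical two-revision page, for illustration only.
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page>
    <title>Example</title>
    <revision><text>old text</text></revision>
    <revision><text>new text</text></revision>
  </page>
</mediawiki>"""

latest_only = list(iter_pages(io.BytesIO(SAMPLE)))
with_history = list(iter_pages(io.BytesIO(SAMPLE), include_history=True))
```

Keeping only the latest revision keeps the dataset much smaller; turning `include_history` on is the option if old revisions turn out to be worth having.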

upintheairsheep commented 1 year ago

A less recent dataset is also available: https://www.kaggle.com/datasets/abeserra/wikia-census

upintheairsheep commented 1 year ago

And just to note: remember to dump every Fandom wiki starting with the 1,000 most popular ones and then working down through all of the other wikis, then integrate the previous Fandom dumps to compensate for deleted pages and deleted wikis.
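The order described here could be sketched roughly as below. Everything in it is an assumption for illustration: the wiki names, the popularity numbers, and the idea that each dump has already been reduced to a `{(wiki, page_title): text}` mapping. A fresh dump wins wherever a page still exists, and older dumps only fill in pages or wikis that are gone.

```python
def crawl_order(popularity, top_n=1000):
    """Split wiki names into (most popular top_n, everything else),
    each sorted from most to least popular."""
    ranked = sorted(popularity, key=popularity.get, reverse=True)
    return ranked[:top_n], ranked[top_n:]


def merge_dumps(dumps_newest_first):
    """Merge {(wiki, title): text} dumps given newest first.

    The newest copy of a page is kept; older dumps only contribute
    pages (or whole wikis) missing from every newer dump.
    """
    merged = {}
    for dump in dumps_newest_first:
        for key, text in dump.items():
            merged.setdefault(key, text)
    return merged


# Hypothetical data: names and numbers are placeholders, not real stats.
popularity = {"wiki_a": 5, "wiki_b": 9, "wiki_c": 1}
first_batch, remainder = crawl_order(popularity, top_n=2)

fresh_dump = {("wiki_b", "Main Page"): "current text"}
old_dump = {
    ("wiki_b", "Main Page"): "outdated text",
    ("deleted_wiki", "Main Page"): "text recovered from an old dump",
}
merged = merge_dumps([fresh_dump, old_dump])
```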

upintheairsheep commented 8 months ago

Fandom text is licensed under an open licence.