WikiTeam / wikiteam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2024, WikiTeam has preserved more than 600,000 wikis.
https://github.com/WikiTeam
GNU General Public License v3.0

Is there a way to create a MediaWiki XML dump from HTML pages on web.archive.org? #482

Closed by trenkert 5 months ago

trenkert commented 5 months ago

There are wikis preserved on archive.org which are no longer accessible on their original servers. Is there any way to download those wikis (mainly MediaWiki installations) in full, import them into a fresh MediaWiki installation, and run them again locally?

nemobis commented 5 months ago

On 29/06/24 21:03, Thomas Renkert wrote:

> There are wikis preserved on archive.org which are no longer accessible on their original servers.

Yes, thousands of them.

> Is there any way to download those wikis (mainly MediaWiki installations) in full, import them into a fresh MediaWiki installation, and run them again locally?

Yes, just click the relevant download button on the sidebar or click "show all" and then copy the download URL for use with your preferred download manager (like wget).

Then see https://www.mediawiki.org/wiki/Manual:Importing_XML_dumps
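As a rough illustration of those two steps, here is a minimal Python sketch. It assumes the usual `https://archive.org/download/<identifier>/<filename>` URL pattern for item files and the `maintenance/importDump.php` script described in the manual linked above. The item identifier, file name, and install path are hypothetical placeholders, and the exact invocation may differ between MediaWiki versions, so check the manual before running anything.

```python
from urllib.parse import quote


def archive_download_url(identifier, filename):
    """Build the direct download URL for one file of an archive.org item.

    Both arguments are placeholders here; copy the real values from the
    item's "show all" file listing.
    """
    return "https://archive.org/download/{}/{}".format(
        quote(identifier), quote(filename)
    )


def import_command(dump_path, mediawiki_root="."):
    """Return the argument list for importing an XML dump.

    Assumption: the dump has already been decompressed to plain XML and
    mediawiki_root points at the top of a MediaWiki installation.
    """
    return ["php", f"{mediawiki_root}/maintenance/importDump.php", dump_path]


if __name__ == "__main__":
    # Hypothetical identifier and file name, for illustration only.
    print(archive_download_url("wiki-example_org", "example-history.xml.7z"))
    print(" ".join(import_command("example-history.xml", "/var/www/mediawiki")))
```

The command is only built, not executed, so the sketch can be inspected before anything touches a live wiki; it could be passed to `subprocess.run` once the paths are confirmed.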

trenkert commented 5 months ago

Thank you. I did not mean archive.org as in archived wiki XML dumps, but the Wayback Machine with its captured pages of a wiki. The XML dump does not exist on archive.org, but the Wayback Machine has the pages captured. Is it possible to reconstruct an XML dump from pages captured by the Wayback Machine?

nemobis commented 5 months ago

Not really. You would need an HTML crawler customised for MediaWiki and then a script to convert the HTML back to wikitext. The category https://www.mediawiki.org/wiki/Category:Import/Export lists some partial solutions. Page history cannot realistically be reconstructed.
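To give an idea of what the crawling half might involve, here is a minimal Python sketch of its first step: listing the captured pages of a wiki through the Wayback Machine CDX API and building URLs for the raw captures. It is not a WikiTeam tool. The `example.org/wiki/` prefix is a hypothetical placeholder, the query parameters are used as I understand the CDX server to accept them, and the conversion from HTML back to wikitext is not covered at all.

```python
import json
from typing import List, Tuple
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def cdx_query_url(wiki_prefix: str) -> str:
    """Build a CDX query for successful HTML captures under a URL prefix."""
    params = [
        ("url", wiki_prefix),
        ("matchType", "prefix"),           # everything below the prefix
        ("output", "json"),
        ("fl", "timestamp,original"),      # only the fields needed here
        ("filter", "statuscode:200"),
        ("filter", "mimetype:text/html"),
        ("collapse", "urlkey"),            # one capture per distinct URL
    ]
    return CDX_ENDPOINT + "?" + urlencode(params)


def parse_cdx_json(body: str) -> List[Tuple[str, str]]:
    """Turn a CDX JSON response into (timestamp, original_url) pairs.

    With output=json the first row is a header row, so it is skipped.
    An empty body is treated as "no captures".
    """
    rows = json.loads(body) if body.strip() else []
    return [(timestamp, original) for timestamp, original in rows[1:]]


def raw_capture_url(timestamp: str, original: str) -> str:
    """URL of the capture as archived; the id_ suffix asks the Wayback
    Machine not to add its toolbar or rewrite links."""
    return f"https://web.archive.org/web/{timestamp}id_/{original}"


if __name__ == "__main__":
    # Hypothetical wiki prefix; nothing is fetched in this sketch.
    print(cdx_query_url("example.org/wiki/"))
```

Fetching the query URL (for example with `urllib.request.urlopen`) and passing the response text to `parse_cdx_json` would give the list of pages to download. Each capture contains rendered HTML only, which is why the wikitext and the revision history are not directly recoverable.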

If the wiki has fewer than a thousand pages, it is probably easier to copy and paste the pages one by one with the VisualEditor.

trenkert commented 5 months ago

thanks!