WikiTeam / wikiteam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2024, WikiTeam has preserved more than 600,000 wikis.
https://github.com/WikiTeam
GNU General Public License v3.0

Update miraheze.org list without dead wikis #465

Closed (nemobis closed this issue 1 year ago)

nemobis commented 1 year ago

I've used the current version of miraheze-spider.py to update the list of wikis: https://github.com/WikiTeam/wikiteam/commit/40a1f35daeb130b68f1382b8109f76a695881547

There were thousands of deletions and additions. Were there really so many wikis deleted and created?

We also need a stricter mode that iterates through all the results and removes those which respond with an HTTP 404, like https://crystalsmp.miraheze.org/.
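As a rough sketch (not part of miraheze-spider.py), such a pass could read the list on stdin, one wiki URL per line, and drop the entries that answer 404; the one-second sleep and the use of requests are arbitrary choices:

# Hypothetical stricter pass: keep only wikis that do not answer HTTP 404.
# Reads one wiki URL per line on stdin and prints the survivors.
import sys
import time

import requests

def responds_404(url, timeout=30):
    try:
        r = requests.get(url, timeout=timeout, allow_redirects=True)
    except requests.RequestException:
        return False  # a network error is not proof that the wiki was deleted
    return r.status_code == 404

for line in sys.stdin:
    url = line.strip()
    if url and not responds_404(url):
        print(url)
    time.sleep(1)  # stay gentle with the wiki farm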

nemobis commented 1 year ago

https://github.com/miraheze/MirahezeMagic/blob/master/py/generateSitemapIndex.py

nemobis commented 1 year ago

The list also contains closed wikis which were made private (?):

Analysing https://trimirdi.miraheze.org/w/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
Error: could not get namespaces from the API request.
HTTP 200
{"error":{"code":"readapidenied","info":"You need read permission to use this module.","*":"See https://trimirdi.miraheze.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.org/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes."},"servedby":"mw131"}

https://trimirdi.miraheze.org/wiki/Main_Page

This wiki has been automatically closed because there have been no edits or log actions made within the last 60 days. Since this wiki is private, it cannot be reopened by any user through the normal reopening request process. If this wiki is not reopened within 6 months, it may be deleted. Note: If you are a bureaucrat on this wiki, you can go to Special:ManageWiki and uncheck the "Closed" box to reopen it.
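Wikis in this state could be flagged before dumping by looking for that readapidenied error code. A minimal sketch, assuming the standard /w/api.php path and the requests library (the function name is mine):

# Flag wikis whose API refuses anonymous reads, as trimirdi.miraheze.org does above.
import requests

def is_private(domain, timeout=30):
    api = "https://%s/w/api.php" % domain
    params = {"action": "query", "meta": "siteinfo", "format": "json"}
    try:
        data = requests.get(api, params=params, timeout=timeout).json()
    except (requests.RequestException, ValueError):
        return False  # unreachable or non-JSON responses are a different problem
    return data.get("error", {}).get("code") == "readapidenied"

print(is_private("trimirdi.miraheze.org"))  # True while the wiki stays private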

nemobis commented 1 year ago

https://mario.miraheze.org redirects to https://mariopedia.org, which causes some confusion.

Titles saved at... mariomirahezeorg_w-20230616-titles.txt
15091 page titles loaded
https://mario.miraheze.org/w/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
46 namespaces found          
Trying to export all revisions from namespace 0
Trying to get wikitext from the allrevisions API and to build the XML
Did not get a valid JSON response from the server. Check that you used the correct hostname. If you did, the server might be wrongly configured or experiencing temporary problems.
Warning. Could not use allrevisions. Wiki too old?
Getting titles to export all the revisions of each

It's better to have only the final domain in the list.
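One way to normalise entries before they go into the list is to follow redirects on the main page and keep only the final host; a rough sketch with requests (keeping the original entry on network errors is my own choice):

# Resolve redirects so only the final domain ends up in the list,
# e.g. mario.miraheze.org -> mariopedia.org.
from urllib.parse import urlparse

import requests

def final_domain(url, timeout=30):
    try:
        r = requests.get(url, timeout=timeout, allow_redirects=True)
    except requests.RequestException:
        return urlparse(url).netloc  # keep the original entry if unreachable
    return urlparse(r.url).netloc

print(final_domain("https://mario.miraheze.org"))  # mariopedia.org at the time of writing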

nemobis commented 1 year ago

Some of the domain names don't even resolve.

Checking API... https://it.famepedia.org/w/api.php
Connection error: HTTPSConnectionPool(host='it.famepedia.org', port=443): Max retries exceeded with url: /w/api.php?action=query&meta=siteinfo&format=json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f9fde82fb50>: Failed to establish a new connection: [Errno -2] Name or service not known',))
Start retry attempt 2 in 20 seconds.
Checking API... https://it.famepedia.org/w/api.php
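Those could be weeded out cheaply with a plain DNS lookup before any HTTP check, for example:

# Drop list entries whose hostname does not resolve at all.
import socket

def resolves(hostname):
    try:
        socket.getaddrinfo(hostname, 443)
        return True
    except socket.gaierror:
        return False

print(resolves("it.famepedia.org"))  # False at the time of this comment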
bkil commented 1 year ago

According to this comment, they lost 25% of their wikis to a hard drive failure in November 2022: https://news.ycombinator.com/item?id=36363433

bkil commented 1 year ago

And indeed, they may have implemented automatic purging of wikis that have seen no recent activity, so setting up archiving for the site should have been a priority. I'm not sure whether they will provide backup dumps for the wikis that were removed this way in the past.

bkil commented 1 year ago

@nemobis That domain was probably squatted recently:

nemobis commented 1 year ago

On 17/06/23 00:00, Thomas Nagy wrote:

And indeed, they may have implemented automatic purging of wikis that have seen no recent activity,

Please see https://wiki.archiveteam.org/index.php/Miraheze

We tried to coordinate with Miraheze so that we'd have dumps of most of their wikis before any major deletions, but we don't have an exact timeline of all past wikis created and deleted and we'll probably never have one.

Now we need to focus on the wikis which are still online. Later, if/when Miraheze goes down completely, we can look for any hidden archives for missing wikis.

nemobis commented 1 year ago

I'm now running the venerable checkalive.pl with a 5-second sleep. Someone with more patience could start running it (or checkalive.py) with a longer sleep, for example 10 seconds, so we'd have a better list within 24 hours or so.
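For anyone who prefers Python, a rough equivalent would look something like this; it is not checkalive.py, just an illustration of the same idea with a configurable sleep (the siteinfo heuristic and the /w/api.php path are assumptions):

# Query action=query&meta=siteinfo for each wiki and sleep between requests.
import sys
import time

import requests

SLEEP = 10  # seconds between requests; raise it to be kinder to the servers

for line in sys.stdin:
    url = line.strip()
    if not url:
        continue
    api = url.rstrip("/") + "/w/api.php"
    try:
        r = requests.get(api, params={"action": "query", "meta": "siteinfo", "format": "json"}, timeout=60)
        alive = r.status_code == 200 and "sitename" in r.text
    except requests.RequestException:
        alive = False
    print("%s\t%s" % ("alive" if alive else "dead", url))
    time.sleep(SLEEP)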

I've updated the docs: https://github.com/WikiTeam/wikiteam/commit/c09db669c9a57a6778fb4107934c996eddcc4815

nemobis commented 1 year ago

The checkalive run is still ongoing, because some requests take over 10 seconds (which suggests some Miraheze servers are very overloaded at the moment). So far it has found over 2000 seemingly alive wikis.

At the moment we have WikiTeam items for about 2240 distinct miraheze domains. (This search doesn't find Miraheze-hosted wikis outside the miraheze.org domain.)

ia search "collection:wikiteam originalurl:*miraheze*" -f originalurl | jq -r .originalurl | cut -f3 -d/ | sort -u > /tmp/2023-06-18_wikiteam_miraheze_originalurl.org

2023-06-18_wikiteam_miraheze_originalurl.org.gz

nemobis commented 1 year ago

XML history dumps for about 1388 wikis are being uploaded. All the archives are also temporarily available at http://federico.kapsi.fi/tmp/mirahezeorg_202306_history.xml.7z.zip for those who need a faster download than IA permits.

Help is appreciated with verifying that the dumps for each included wiki are complete and valid. The most comprehensive way to test a dump is to actually test importing it into a recent MediaWiki installation.
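The real test remains an import with MediaWiki's maintenance/importDump.php, but a cheap first pass is to confirm that each extracted XML file parses to the very end and to count pages and revisions; a sketch along those lines (file name passed on the command line):

# Sanity-check an extracted *-history.xml dump: well-formedness plus page/revision counts.
# A truncated archive makes iterparse raise a ParseError before the counts are printed.
import sys
import xml.etree.ElementTree as ET

pages = revisions = 0
for _, elem in ET.iterparse(sys.argv[1]):
    tag = elem.tag.rsplit("}", 1)[-1]  # strip the MediaWiki export namespace
    if tag == "revision":
        revisions += 1
    elif tag == "page":
        pages += 1
        elem.clear()  # free the page subtree to keep memory bounded on multi-GB dumps
print("%d pages, %d revisions, XML well formed" % (pages, revisions))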

I'm also attaching the logs from the dump. There are about 75k lines in the errors.log files, mostly about empty revisions. These could be legitimate deletions or some error on our side.

mirahezeorg_202306_logs.zip

nemobis commented 1 year ago

One of the biggest wikis by XML size is now https://chakuwiki.miraheze.org, at over 30 GB (the dump hasn't finished yet), a significant increase from the roughly 300 MB in the chakuwikimirahezeorg_w-20220626-history.xml.7z dump previously uploaded by Kevin.

nemobis commented 1 year ago

https://wiki.3805.co.uk/ fails with a certificate error, but the host only serves a Miraheze placeholder, so it looks like the wiki was deleted.

Connection error: HTTPSConnectionPool(host='wiki.3805.co.uk', port=443): Max retries exceeded with url: /w/api.php?action=query&meta=siteinfo&format=json (Caused by SSLError(CertificateError("hostname 'wiki.3805.co.uk' doesn't match either of '*.miraheze.org', 'miraheze.org'",),))

nemobis commented 1 year ago

Another case of a private wiki, with "all rights reserved" in the footer. O_o https://s.miraheze.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8

RhinosF1 commented 1 year ago

And indeed, they may have implemented automatic purging of wikis that have seen no recent activity, so setting up archiving for the site should have been a priority. I'm not sure whether they will provide backup dumps for the wikis that were removed this way in the past.

It is usual practice for SRE to run our backup script and upload to Internet Archive before actual database drops.

They should all be on archive.org

nemobis commented 1 year ago

https://www.sekaipedia.org/wiki/Special:MediaStatistics shows it's among the biggest wikis by image size, with 10 GB of FLAC files.

nemobis commented 1 year ago

Finished! I found 6168 live wikis and 1536 dead or non-MediaWiki wikis.

nemobis commented 1 year ago

We're very close to the figure of 6400 wikis recently mentioned by Miraheze people, so the current list seems good enough to me.