The list also contains closed wikis which were made private (?):
Analysing https://trimirdi.miraheze.org/w/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
Error: could not get namespaces from the API request.
HTTP 200
{"error":{"code":"readapidenied","info":"You need read permission to use this module.","*":"See https://trimirdi.miraheze.org/w/api.php for API usage. Subscribe to the mediawiki-api-announce mailing list at <https://lists.wikimedia.o
rg/postorius/lists/mediawiki-api-announce.lists.wikimedia.org/> for notice of API deprecations and breaking changes."},"servedby":"mw131"}
https://trimirdi.miraheze.org/wiki/Main_Page
This wiki has been automatically closed because there have been no edits or log actions made within the last 60 days. Since this wiki is private, it cannot be reopened by any user through the normal reopening request process. If this wiki is not reopened within 6 months, it may be deleted. Note: If you are a bureaucrat on this wiki, you can go to Special:ManageWiki and uncheck the "Closed" box to reopen it.
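For anyone scripting around this, such wikis can be recognised from the readapidenied error above. A minimal sketch in Python (the helper name is mine, not something from the wikiteam tools):

```python
import requests

def is_private(api_url):
    """Return True if the wiki answers API queries with readapidenied,
    as closed-and-private Miraheze wikis do (see the log above)."""
    r = requests.get(
        api_url,
        params={"action": "query", "meta": "siteinfo", "format": "json"},
        timeout=30,
    )
    return r.json().get("error", {}).get("code") == "readapidenied"

print(is_private("https://trimirdi.miraheze.org/w/api.php"))
```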
https://mario.miraheze.org redirects to https://mariopedia.org, which causes some confusion:
Titles saved at... mariomirahezeorg_w-20230616-titles.txt
15091 page titles loaded
https://mario.miraheze.org/w/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
46 namespaces found
Trying to export all revisions from namespace 0
Trying to get wikitext from the allrevisions API and to build the XML
Did not get a valid JSON response from the server. Check that you used the correct hostname. If you did, the server might be wrongly configured or experiencing temporary problems.
Warning. Could not use allrevisions. Wiki too old?
Getting titles to export all the revisions of each
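For context, this is the fallback dumpgenerator takes: it probes the allrevisions API first and, when that fails, exports the revisions of each title separately. A rough sketch of such a probe, not the actual dumpgenerator code (the parameters are the standard MediaWiki allrevisions ones):

```python
import requests

def allrevisions_available(api_url, namespace=0):
    """Probe list=allrevisions (MediaWiki 1.27+). Older or misconfigured
    wikis answer with an error or with non-JSON, and then the dump has to
    be built by exporting the revisions of each title separately."""
    try:
        r = requests.get(
            api_url,
            params={
                "action": "query",
                "list": "allrevisions",
                "arvnamespace": namespace,
                "arvlimit": 1,
                "arvprop": "ids",
                "format": "json",
            },
            timeout=60,
        )
        data = r.json()
    except ValueError:  # "Did not get a valid JSON response from the server"
        return False
    return "error" not in data and "query" in data
```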
Better to have only the final domain in the list. Some of the domain names don't even resolve:
Checking API... https://it.famepedia.org/w/api.php
Connection error: HTTPSConnectionPool(host='it.famepedia.org', port=443): Max retries exceeded with url: /w/api.php?action=query&meta=siteinfo&format=json (Caused by NewConnectionError('<urllib3.connection.HTTPSConnection object at 0x7f9fde82fb50>: Failed to establish a new connection: [Errno -2] Name or service not known',))
Start retry attempt 2 in 20 seconds.
Checking API... https://it.famepedia.org/w/api.php
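Something along these lines could normalise the list to final domains and drop names which no longer resolve (a sketch; the function name and the 30-second timeout are mine):

```python
from urllib.parse import urlparse
import requests

def final_domain(domain):
    """Follow redirects (e.g. mario.miraheze.org -> mariopedia.org) and
    return the domain that actually serves the wiki, or None when the
    name does not resolve or the connection fails."""
    try:
        r = requests.get("https://%s/w/api.php" % domain, timeout=30)
    except requests.exceptions.ConnectionError:
        return None  # e.g. "Name or service not known", as for it.famepedia.org
    return urlparse(r.url).netloc
```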
According to this comment, they have lost 25% of their wikis due to a hard drive failure in November 2022: https://news.ycombinator.com/item?id=36363433
And indeed, they may have implemented automatic purging of a wiki if it had seen no recent activity, so setting up archiving for the site should have been a priority. I'm not sure whether they will provide backup dumps for wikis that were removed this way in the past.
@nemobis That domain was probably squatted recently:
On 17/06/23 00:00, Thomas Nagy wrote:
And indeed, they may have implemented automatic purging of a wiki if it had seen no recent activity,
Please see https://wiki.archiveteam.org/index.php/Miraheze
We tried to coordinate with Miraheze so that we'd have dumps of most of their wikis before any major deletions, but we don't have an exact timeline of all past wikis created and deleted and we'll probably never have one.
Now we need to focus on the wikis which are still online. Later, if/when Miraheze goes down completely, we can look for any hidden archives for missing wikis.
I'm now running the venerable checkalive.pl with a 5-second sleep. Someone with more patience could start running it (or checkalive.py) with a higher sleep time, for example 10 seconds, so we'd have a better list within 24 hours or so.
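For reference, the equivalent loop is tiny in Python, so anyone can run it with whatever sleep they prefer. This is only a sketch of what checkalive does, not its actual code:

```python
import time
import requests

def check_alive(domains, sleep=10):
    """Yield (domain, alive) pairs, pausing between requests so as not to
    hammer the already overloaded servers."""
    for domain in domains:
        try:
            r = requests.get(
                "https://%s/w/api.php" % domain,
                params={"action": "query", "meta": "siteinfo", "format": "json"},
                timeout=30,
            )
            # A MediaWiki API answers with JSON even when it only denies read
            # access, so private wikis still count as alive here.
            alive = r.status_code == 200 and isinstance(r.json(), dict)
        except (requests.exceptions.RequestException, ValueError):
            alive = False
        yield domain, alive
        time.sleep(sleep)
```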
I've updated the docs https://github.com/WikiTeam/wikiteam/commit/c09db669c9a57a6778fb4107934c996eddcc4815
The checkalive run is still ongoing, because some requests take over 10 seconds (which suggests some Miraheze servers are very overloaded at the moment). So far it has found over 2000 seemingly alive wikis.
At the moment we have wikiteam items for about 2240 distinct Miraheze domains. (This search doesn't find Miraheze-hosted wikis outside the miraheze.org domain.)
ia search "collection:wikiteam originalurl:*miraheze*" -f originalurl | jq -r .originalurl | cut -f3 -d/ | sort -u > /tmp/2023-06-18_wikiteam_miraheze_originalurl.org
XML history dumps for about 1388 wikis are being uploaded. All the archives are also temporarily available at http://federico.kapsi.fi/tmp/mirahezeorg_202306_history.xml.7z.zip for those who need a faster download than IA permits.
Help is appreciated with verifying that the dumps for each included wiki are complete and valid. The most comprehensive way to test a dump is to actually test importing it into a recent MediaWiki installation.
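Short of a full import, a cheaper sanity check is to stream-parse each XML file and count pages and revisions, since a truncated or malformed dump will fail to parse. A sketch with the standard library (importing into MediaWiki with maintenance/importDump.php remains the authoritative test):

```python
import sys
import xml.etree.ElementTree as ET

def count_pages_revisions(path):
    """Stream-parse a MediaWiki XML dump and count <page> and <revision>
    elements; a truncated or malformed file raises a ParseError."""
    pages = revisions = 0
    for _, elem in ET.iterparse(path, events=("end",)):
        tag = elem.tag.rsplit("}", 1)[-1]  # drop the export-schema namespace
        if tag == "revision":
            revisions += 1
        elif tag == "page":
            pages += 1
            elem.clear()  # keep memory bounded on multi-GB dumps
    return pages, revisions

if __name__ == "__main__":
    print(count_pages_revisions(sys.argv[1]))
```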
I also attach the logs from the dump. There are about 75k lines in the errors.log files, mostly about empty revisions. These could be legitimate deletions or some error on our side.
One of the biggest wikis by XML size is now https://chakuwiki.miraheze.org, with over 30 GB (the dump hasn't finished yet), a significant increase from the roughly 300 MB of the chakuwikimirahezeorg_w-20220626-history.xml.7z dump previously uploaded by Kevin.
https://wiki.3805.co.uk/ fails with a certificate error, but the host only serves a Miraheze placeholder, so it looks like the wiki was deleted.
Connection error: HTTPSConnectionPool(host='wiki.3805.co.uk', port=443): Max retries exceeded with url: /w/api.php?action=query&meta=siteinfo&format=json (Caused by SSLError(CertificateError("hostname 'wiki.3805.co.uk' doesn't match either of '*.miraheze.org', 'miraheze.org'",),))
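The checker could probably classify this pattern automatically: an SSL error on a custom domain whose certificate only matches *.miraheze.org means the DNS still points at Miraheze while the wiki itself is likely gone. A sketch (matching on the exception text is a crude guess, not a documented interface):

```python
import requests

def probe(domain):
    """Classify a custom domain as 'alive', 'placeholder' (certificate only
    matches *.miraheze.org, like wiki.3805.co.uk above) or 'dead'."""
    try:
        requests.get("https://%s/w/api.php" % domain, timeout=30)
    except requests.exceptions.SSLError as e:
        return "placeholder" if "miraheze.org" in str(e) else "dead"
    except requests.exceptions.RequestException:
        return "dead"
    return "alive"
```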
Another case of a private wiki, with "all rights reserved" in the footer. O_o https://s.miraheze.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8
And indeed, they may have implemented automatic purging of a wiki if it had seen no recent activity, so setting up archiving for the site should have been a priority. I'm not sure whether they will provide backup dumps for wikis that were removed this way in the past.
It is usual practice for SRE to run our backup script and upload to Internet Archive before actual database drops.
They should all be on archive.org
https://www.sekaipedia.org/wiki/Special:MediaStatistics is among the biggest by image size, with 10 GB of FLAC files.
Finished! I found 6168 live wikis and 1536 dead or non-MediaWiki wikis.
We're very close to the figure of 6400 wikis recently mentioned by Miraheze people, so the current list seems good enough to me.
I've used the current version of miraheze-spider.py to update the list of wikis: https://github.com/WikiTeam/wikiteam/commit/40a1f35daeb130b68f1382b8109f76a695881547
There were thousands of deletions and additions. Were there really so many wikis deleted and created?
We also need a stricter mode which would iterate through all the results and remove those which respond with an HTTP 404, like https://crystalsmp.miraheze.org/.
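A sketch of such a stricter pass (the function name and the 5-second sleep are mine):

```python
import time
import requests

def drop_404(domains, sleep=5):
    """Keep only the domains whose main page does not answer with HTTP 404,
    so entries like crystalsmp.miraheze.org get removed from the list."""
    kept = []
    for domain in domains:
        try:
            r = requests.get("https://%s/" % domain, timeout=30)
            if r.status_code != 404:
                kept.append(domain)
        except requests.exceptions.RequestException:
            pass  # unreachable domains are dropped too
        time.sleep(sleep)
    return kept
```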