nemobis opened this issue 6 years ago
Does not yet work for Wikia, partly because they return a blank page for exportnowrap used in getXMLHeader(). Have to use wikitools there as well?
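Since Wikia still answers the JSON-wrapped export (see the 00eggsontoast00 example further down), the header could be extracted from that instead of exportnowrap. A minimal sketch, assuming the standard action=query&export JSON response shape; extract_header is a hypothetical helper, not existing code:

```python
def extract_header(api_json):
    """Pull the <mediawiki>...</siteinfo> header out of a JSON-wrapped
    action=query&export response, as a fallback when exportnowrap
    returns a blank page."""
    try:
        xml = api_json['query']['export']['*']
    except (KeyError, TypeError):
        return None  # no export field: caller has to try another method
    end = xml.find('</siteinfo>')
    if end == -1:
        return None
    # Keep everything up to and including the closing </siteinfo> tag
    return xml[:end + len('</siteinfo>')]
```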
File "./dumpgenerator.py", line 2195, in <module>
main()
File "./dumpgenerator.py", line 2187, in main
createNewDump(config=config, other=other)
File "./dumpgenerator.py", line 1756, in createNewDump
generateXMLDump(config=config, titles=titles, session=other['session'])
File "./dumpgenerator.py", line 717, in generateXMLDump
for xml in getXMLRevisions(config=config, session=session):
File "./dumpgenerator.py", line 792, in getXMLRevisions
for page in result['query']['allrevisions']:
KeyError: 'query'
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.
Before even downloading the first revisions, there are some wikis where the export gets stuck in an endless loop of "Invalid JSON response. Trying the request again" or similar messages:
Analysing http://www.haplozone.net/wiki/index.php
Trying generating a new dump into a new directory...
Retrieving the XML for every page from the beginning
Invalid JSON, trying request again
Invalid JSON, trying request again
Now tested with a 1.12 wiki, http://meritbadge.org/wiki/index.php/Main_Page, courtesy of https://lists.wikimedia.org/pipermail/wikitech-l/2018-May/090004.html: 27cbdfd30241a723b44a6d1f84b69ef432ae8db4 680145e6a566c696ba31902ea3cc6c6939ce1d1a
For Wikia, the API export works without exportnowrap: http://00eggsontoast00.wikia.com/api.php?action=query&prop=revisions&meta=siteinfo&titles=Main%20Page&export&format=json
But, facepalm: where the API help says "Export the current revisions of all given or generated pages", it really means that any revision other than the current one is ignored: http://00eggsontoast00.wikia.com/api.php?action=query&revids=3|80|85&export is the same as http://00eggsontoast00.wikia.com/api.php?action=query&revids=85&export
Here we go: https://github.com/WikiTeam/wikiteam/commit/7143f7efb1ba08cc328606bd0c7c81246f5b0ffa
It's very fast on most wikis, because it makes far fewer requests when the average number of revisions per page is below 50.
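A back-of-the-envelope version of that claim, with allrevisions fetching 50 revisions per request versus at least one request per page (continuation overhead and oversized revisions ignored; requests_estimate is an illustrative helper, not part of dumpgenerator):

```python
def requests_estimate(pages, revisions, batch=50):
    """Rough lower bounds on the number of API requests: allrevisions
    paginates over all revisions at once, while the per-page method
    needs at least one request per title."""
    with_allrevisions = (revisions + batch - 1) // batch  # ceiling division
    per_page = pages
    return with_allrevisions, per_page
```

With the finalfantasy numbers further down (311259 pages, 1638424 revisions) that is roughly 33k requests instead of 311k.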
The first dump produced with this method is: https://archive.org/download/wiki-ferstaberindecom_f2_en/ferstaberindecom_f2_en-20180519-history.xml.7z
And now also Wikia, without the allrevisions module: https://archive.org/details/wiki-00eggsontoast00wikiacom
The XML built "manually" with --xmlrevisions is almost the same as usual (at the cost of making at least one request per page), but it's missing parentid and, at the moment, minoredit.
Analysing http://nimiarkisto.fi/w/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
29 namespaces found
Retrieving titles in the namespace 0
.Traceback (most recent call last):
File "./dumpgenerator.py", line 2288, in <module>
main()
File "./dumpgenerator.py", line 2280, in main
createNewDump(config=config, other=other)
File "./dumpgenerator.py", line 1844, in createNewDump
getPageTitles(config=config, session=other['session'])
File "./dumpgenerator.py", line 416, in getPageTitles
for title in titles:
File "./dumpgenerator.py", line 292, in getPageTitlesAPI
allpages = jsontitles['query']['allpages']
KeyError: 'query'
In testing this for Wikia, remember that the number of edits on Special:Statistics isn't always truthful (this is normal on MediaWiki). For instance http://themodifyers.wikia.com/wiki/Special:Statistics says 2333 edits, but dumpgenerator.py exports 1864, and that's the right amount: entering all the titles on themodifyers.wikia.com/wiki/Special:Export and exporting all revisions gives the same amount.
Also, a page with 53 revisions on that wiki was correctly exported, which means that API continuation works; that's something!
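The continuation logic that made the 53-revision page work is, in simplified form, the standard MediaWiki loop below. This is a sketch: the api callable and the flat 'revisions' key stand in for the real nested query structure.

```python
def iter_revisions(api, params):
    """Standard MediaWiki continuation loop: merge the 'continue' block
    back into the request until the API stops sending one."""
    request = dict(params)
    while True:
        data = api(request)
        for rev in data.get('revisions', []):
            yield rev
        cont = data.get('continue')
        if not cont:
            return
        request.update(cont)
```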
Not sure what's going on at http://zh.asoiaf.wikia.com/api.php
Traceback (most recent call last):
File "./dumpgenerator.py", line 2308, in <module>
main()
File "./dumpgenerator.py", line 2300, in main
createNewDump(config=config, other=other)
File "./dumpgenerator.py", line 1864, in createNewDump
getPageTitles(config=config, session=other['session'])
File "./dumpgenerator.py", line 429, in getPageTitles
for title in titles:
File "./dumpgenerator.py", line 252, in getPageTitlesAPI
config=config, session=session)
TypeError: 'NoneType' object is not iterable
tail: cannot open 'zhasoiafwikiacom-20180521-wikidump/zhasoiafwikiacom-20180521-history.xml' for reading: No such file or directory
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.
http://zhpad.wikia.com/api.php seems to eventually fail as well
Next step: implementing resuming. I'll probably take the readTitles() part out of getXMLRevisions() to make things clearer.
I think it would be the occasion to make sure that we log something to errors.log when we catch an exception or call sys.exit(1), so that it's easier to inspect failed dumps and see what happened when they stopped. I have almost 4k interrupted Wikia dumps.
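Something along these lines would do. The helper names here are hypothetical; the only assumption is that each dump directory keeps its own errors.log, as the archived dumps already show:

```python
import sys
import time

def log_error(dumpdir, text):
    """Append a timestamped line to the dump's errors.log."""
    with open(dumpdir + '/errors.log', 'a') as f:
        f.write('%s: %s\n' % (time.strftime('%Y-%m-%d %H:%M:%S'), text))

def bye(dumpdir, message):
    """Log the reason before exiting, so interrupted dumps can be
    inspected afterwards instead of dying silently."""
    log_error(dumpdir, message)
    sys.exit(1)
```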
Later I'll post a series of errors.log from failed dumps.
For now I tend to believe that, when the dump runs to the end, the XML really is as complete as possible. For instance, on a biggish wiki like http://finalfantasy.wikia.com/wiki/Special:Statistics :
$ grep -c "<revision>" finalfantasywikiacom-20180523-history.xml
1638424
$ grep -c "<page>" finalfantasywikiacom-20180523-history.xml
311259
That's over a million "missing" revisions compared to what Special:Statistics says, which however cannot really be trusted. The number of pages is pretty close.
On the other hand, it could be that the continuation is not working in some cases... In clubpenguinwikiacom-20180523-history.xml, I'm not sure I see the 3200 revisions that the main page ought to have.
Some wiki might be in a loop...
1062 more revisions exported
1060 more revisions exported
1061 more revisions exported
1061 more revisions exported
1062 more revisions exported
1061 more revisions exported
1062 more revisions exported
1060 more revisions exported
1061 more revisions exported
1062 more revisions exported
Or not: it seems legit, some bot is editing a series of pages every day. http://runescape.wikia.com/wiki/Module:Exchange/Dragon_crossbow_(u)/Data?limit=1000&action=history
Does not work in http://wiki.openkm.com/api.php (normal --xml --api works)
Getting the XML header from the API
Retrieving the XML for every page from the beginning
Invalid JSON, trying request again
Invalid JSON, trying request again
Invalid JSON, trying request again
Invalid JSON, trying request again
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 5 seconds
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 10 seconds
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 15 seconds
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 20 seconds
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 25 seconds
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 30 seconds
Sometimes allpages works until it doesn't:
Analysing http://xn--b1amah.xn--d1ad.xn--p1ai/w/api.php
Warning!: "./xn__b1amahxn__d1adxn__p1ai_w-20200209-wikidump" path exists
There is a dump in "./xn__b1amahxn__d1adxn__p1ai_w-20200209-wikidump", probably incomplete.
If you choose resume, to avoid conflicts, the parameters you have chosen in the current session will be ignored
and the parameters available in "./xn__b1amahxn__d1adxn__p1ai_w-20200209-wikidump/config.txt" will be loaded.
Do you want to resume ([yes, y], [no, n])? n
You have selected: NO
Trying to use path "./xn__b1amahxn__d1adxn__p1ai_w-20200209-wikidump-2"...
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
16 namespaces found
Retrieving titles in the namespace 0
.. 602 titles retrieved in the namespace 0
Retrieving titles in the namespace 1
. 1 titles retrieved in the namespace 1
Retrieving titles in the namespace 2
. 3 titles retrieved in the namespace 2
Retrieving titles in the namespace 3
. 3 titles retrieved in the namespace 3
Retrieving titles in the namespace 4
.The allpages API returned nothing. Exit.
How nice some webservers are:
Titles saved at... halachipediacom-20200209-titles.txt
2364 page titles loaded
http://www.halachipedia.com/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
Invalid JSON, trying request again
Invalid JSON, trying request again
HTTPError: HTTP Error 503: Service Unavailable trying request again in 5 seconds
HTTPError: HTTP Error 503: Service Unavailable trying request again in 10 seconds
HTTPError: HTTP Error 503: Service Unavailable trying request again in 15 seconds
Invalid JSON, trying request again
Invalid JSON, trying request again
HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found trying request again in 20 seconds
HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found trying request again in 25 seconds
Gotta check for the actual presence of the export field in the response:
Titles saved at... aroundisleofwightinfo-20200209-titles.txt
3230 page titles loaded
http://www.aroundisleofwight.info/api.php
Getting the XML header from the API
Traceback (most recent call last):
File "./dumpgenerator.py", line 2323, in <module>
main()
File "./dumpgenerator.py", line 2315, in main
createNewDump(config=config, other=other)
File "./dumpgenerator.py", line 1882, in createNewDump
generateXMLDump(config=config, titles=titles, session=other['session'])
File "./dumpgenerator.py", line 731, in generateXMLDump
header, config = getXMLHeader(config=config, session=session)
File "./dumpgenerator.py", line 471, in getXMLHeader
xml = r.json()['query']['export']['*']
KeyError: 'export'
tail: cannot open ‘aroundisleofwightinfo-20200208-wikidump/aroundisleofwightinfo-20200208-history.xml’ for reading: No such file or directory
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.
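A guard like this (a sketch, not the current code) would turn both failure modes — a missing 'query' and a missing 'export' field — into a clean None that the caller can use to fall back to another method:

```python
def get_export_xml(response_json):
    """Return the export XML from an action=query&export response, or
    None when the wiki omits the 'query' or 'export' field."""
    query = response_json.get('query')
    if not isinstance(query, dict):
        return None
    export = query.get('export')
    if not isinstance(export, dict):
        return None
    return export.get('*')
```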
HTTP 405:
Titles saved at... wikiainigmaeu-20200209-titles.txt
139 page titles loaded
http://wiki.ainigma.eu/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
HTTPError: HTTP Error 405: Method Not Allowed trying request again in 5 seconds
HTTPError: HTTP Error 405: Method Not Allowed trying request again in 10 seconds
HTTPError: HTTP Error 405: Method Not Allowed trying request again in 15 seconds
Or even the query:
Titles saved at... masu6fsk-20200209-titles.txt
247 page titles loaded
http://masu.6f.sk/api.php
Getting the XML header from the API
Traceback (most recent call last):
File "./dumpgenerator.py", line 2323, in <module>
main()
File "./dumpgenerator.py", line 2315, in main
createNewDump(config=config, other=other)
File "./dumpgenerator.py", line 1882, in createNewDump
generateXMLDump(config=config, titles=titles, session=other['session'])
File "./dumpgenerator.py", line 731, in generateXMLDump
header, config = getXMLHeader(config=config, session=session)
File "./dumpgenerator.py", line 471, in getXMLHeader
xml = r.json()['query']['export']['*']
KeyError: 'query'
HTTP Error 493 :o
Titles saved at... opendiagnostixorg-20200210-titles.txt
28095 page titles loaded
http://opendiagnostix.org/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
16 namespaces found
Trying to export all revisions from namespace 0
Warning. Could not use allrevisions, wiki too old.
/home/federico/.local/lib/python2.7/site-packages/wikitools/api.py:155: FutureWarning: The querycontinue option is deprecated and will be removed
in a future release, use the new queryGen function instead
for queries requring multiple requests
for queries requring multiple requests""", FutureWarning)
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
3 more revisions exported
1 more revisions exported
1 more revisions exported
4 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
5 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
HTTPError: HTTP Error 493: Forbidden WAF trying request again in 5 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 10 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 15 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 20 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 25 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 30 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 35 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 40 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 45 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 50 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 55 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 60 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 65 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 70 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 75 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 80 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 85 seconds
Examples of wikis where --xmlrevisions didn't work and dumpgenerator had to be killed manually:
http://encyclopaedia.herdereditorial.com/w/api.php http://gobblerpedia.org/w/api.php http://opendiagnostix.org/api.php http://semantic.wiki/wiki-de/api.php http://wiki.elitesoft.com.br/api.php http://wiki.rabenthal.net/api.php http://www.insult.wiki/w/api.php http://nichework.com/w/api.php http://mediawiki.xn--klarmachen-ndert-5nb.de/api.php http://www.archi-wiki.org/api.php http://www.gremiopedia.com/api.php http://whythisway.org/w/api.php http://wiki.delia-derbyshire.net/api.php http://www.harmfrielink.nl/wiki/api.php http://en.wiki.spotwizard.org/api.php http://fgo.wiki/api.php http://nordicnames.de/w/api.php http://nichework.com/w/api.php http://roksao.com/api.php http://secret-wiki.de/mediawiki/api.php https://evilbabes.fandom.com/api.php http://overwiki.ru/api.php http://wiki.ainigma.eu/api.php http://wiki.debianforum.de/wiki/api.php http://wiki.dcinside.com/api.php http://www.halachipedia.com/api.php http://www.icp.uni-stuttgart.de/~icp/mediawiki/api.php http://www.tinymicros.com/mediawiki/api.php
I'm not quite sure why this happens in my latest local code, will need to check:
<page>
<title>Main Page</title>
<ns>0</ns>
<id>1</id>
<redirect title="Main page" />
<revision>
<id>3677</id>
<parentid>1</parentid>
<timestamp>2018-12-19T22:15:31Z</timestamp>
<contributor>
<username>Wiki-admin</username>
<id>45</id>
</contributor>
<comment>Redirected page to [[Main page]]</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve" bytes="23">#REDIRECT [[Main_page]]</text>
<sha1>o2jw5c565achwt31azfnu9zc2zxgqpr</sha1>
</revision>
</page>
<page>
<title>Main Page</title>
<ns>0</ns>
<id>1</id>
<revision>
<id>3677</id>
<parentid>1</parentid>
<timestamp>2018-12-19T22:15:31Z</timestamp>
<contributor>
<id>45</id>
<username>Wiki-admin</username>
</contributor>
<comment>Redirected page to [[Main page]]</comment>
<text bytes="23" space="preserve">#REDIRECT [[Main_page]]</text>
<model>wikitext</model>
<sha1>ce111c28c158bacd1ad89fbacb33e48d0e2e383f</sha1>
</revision>
<revision>
<id>1</id>
<parentid>0</parentid>
<timestamp>2018-12-13T21:14:03Z</timestamp>
<contributor>
<id>0</id>
<username>MediaWiki default</username>
</contributor>
<comment></comment>
<text bytes="735" space="preserve"><strong>MediaWiki has been installed.</strong>
Consult the [https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents User's Guide] for information on using the wiki software.
== Getting started ==
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Configuration_settings Configuration settings list]
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:FAQ MediaWiki FAQ]
* [https://lists.wikimedia.org/mailman/listinfo/mediawiki-announce MediaWiki release mailing list]
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Localisation#Translation_resources Localise MediaWiki for your language]
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Combating_spam Learn how to combat spam on your wiki]</text>
<model>wikitext</model>
<sha1>5702e4d5fd9173246331a889294caf01a3ad3706</sha1>
</revision>
</page>
28095 page titles loaded
http://opendiagnostix.org/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
Traceback (most recent call last):
File "./dumpgenerator.py", line 2363, in <module>
main()
File "./dumpgenerator.py", line 2355, in main
createNewDump(config=config, other=other)
File "./dumpgenerator.py", line 1922, in createNewDump
generateXMLDump(config=config, titles=titles, session=other['session'])
File "./dumpgenerator.py", line 755, in generateXMLDump
for xml in getXMLRevisions(config=config, session=session):
File "./dumpgenerator.py", line 814, in getXMLRevisions
site = mwclient.Site(apiurl.netloc, apiurl.path.replace("api.php", ""))
File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 131, in __init__
self.site_init()
File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 153, in site_init
retry_on_error=False)
File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 235, in get
return self.api(action, 'GET', *args, **kwargs)
File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 286, in api
info = self.raw_api(action, http_method, **kwargs)
File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 434, in raw_api
http_method=http_method)
File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 395, in raw_call
stream = self.connection.request(http_method, url, **args)
File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 486, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 598, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 370, in send
timeout=timeout
File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 544, in urlopen
body=body, headers=headers)
File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 344, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 314, in _raise_timeout
if 'timed out' in str(err) or 'did not complete (read)' in str(err): # Python 2.6
TypeError: __str__ returned non-string (type SysCallError)
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.
mwclient doesn't seem to handle retries very well, need to check:
Traceback (most recent call last):
File "dumpgenerator.py", line 2375, in <module>
File "dumpgenerator.py", line 2367, in main
resumePreviousDump(config=config, other=other)
File "dumpgenerator.py", line 1934, in createNewDump
getPageTitles(config=config, session=other['session'])
File "dumpgenerator.py", line 755, in generateXMLDump
for xml in getXMLRevisions(config=config, session=session):
File "dumpgenerator.py", line 875, in getXMLRevisions
exportrequest = site.api(**exportparams)
File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 286, in api
info = self.raw_api(action, http_method, **kwargs)
File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 434, in raw_api
http_method=http_method)
File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 395, in raw_call
stream = self.connection.request(http_method, url, **args)
File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 529, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='gobblerpedia.org', port=443): Read timed out. (read timeout=30)
Seems fine now on a MediaWiki 1.16 wiki. There are some differences in what we get for some optional fields like parentid, userid, size of a revision; and our XML made by etree is less eager to escape Unicode characters. Hopefully doesn't matter, although we should ideally test an import on a recent MediaWiki. wikirabenthalnet-20200210-history-test.zip
HTTP Error 493 :o
This comes and goes; we could try adding it to status_forcelist together with the 406 seen on other wikis.
Here we can do little: the index.php and api.php responses confuse the script, but indeed there isn't much we can do, as even the most basic request gets a DB error:
internal_api_error_DBQueryError http://masu.6f.sk/api.php?action=query&meta=siteinfo&siprop=general
HTTPError: HTTP Error 405: Method Not Allowed trying request again in 5 seconds
This is not helped by setting http_method="GET" (https://mwclient.readthedocs.io/en/latest/reference/site.html#mwclient.client.Site.api). It's a MediaWiki 1.21.1 wiki, so allrevisions is not available, but the HTTPError prevented the exception from making us switch to the next strategy. Once we catch that, it works via GET: 49017e3f209db2e6a897ac19fc6ade92431fcab8. Ideally we'd need to check this only once at the beginning, but it seems that the webservers do not want to afford us this luxury.
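The shape of that fix, reduced to a sketch: mwclient's Site.api does take an http_method keyword per the docs linked above, but the call shape and the error matching below are simplified for illustration (the real Site.api takes the action as a positional argument and raises mwclient-specific exceptions).

```python
def api_with_method_fallback(site, params):
    """Try POST first; when the webserver rejects it with 405 Method
    Not Allowed, retry the same call via GET."""
    try:
        return site.api(http_method='POST', **params)
    except Exception as err:
        if '405' not in str(err):
            raise  # a different error: don't mask it
        return site.api(http_method='GET', **params)
```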
This is a misconfigured wiki, see https://github.com/WikiTeam/wikiteam/issues/355#issuecomment-584203712
http://www.halachipedia.com/api.php Getting the XML header from the API Retrieving the XML for every page from the beginning Invalid JSON, trying request again
This one now (MediaWiki 1.31.1) gives:
http://www.halachipedia.com/api.php Getting the XML header from the API Retrieving the XML for every page from the beginning 20 namespaces found Trying to export all revisions from namespace 0 Trying to get wikitext from the allrevisions API and to build the XML This mwclient version seems not to work for us. Exiting.
Sometimes allpages works until it doesn't:
Analysing http://xn--b1amah.xn--d1ad.xn--p1ai/w/api.php
Still broken (MediaWiki 1.23)
Does not work in http://wiki.openkm.com/api.php (normal --xml --api works)
Still broken (MediaWiki 1.27).
Analysing http://nimiarkisto.fi/w/api.php
Still broken (MediaWiki 1.31)
The number of revisions cannot always be a multiple of 50 (example from https://villainsrpg.fandom.com/ ):
Eve Man 4 more revisions exported
Event Horizon
50 more revisions exported
50 more revisions exported
Evil
50 more revisions exported
50 more revisions exported
Existence (Secret)
9 more revisions exported
Downloaded 400 pages
Extinction
10 more revisions exported
It should be 51 in https://villainsrpg.fandom.com/wiki/Evil?offset=20111224190533&action=history. We're getting 49 revisions again and then the 1 we were missing. Not a big deal, but not ideal either.
Ouch no, we were not using the new batch at all. Ahem.
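For the record, the batching itself is trivial; the bug was in not consuming the next batch. A chunker like this (illustrative, not the actual code) puts every revision ID in exactly one batch, so nothing is dropped between requests:

```python
def batches(revids, size=50):
    """Split revision IDs into API-sized chunks; each ID lands in
    exactly one batch, in order."""
    for i in range(0, len(revids), size):
        yield revids[i:i + size]
```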
The XML doesn't validate against the respective schema:
$ xmllint --schema ../export-0.10.xsd --noout girlfriend_karifandomcom-20200213-history.xml
...
girlfriend_karifandomcom-20200213-history.xml:76504: element text: Schemas validity error : Element '{http://www.mediawiki.org/xml/export-0.10/}text': This element is not expected. Expected is one of ( {http://www.mediawiki.org/xml/export-0.10/}minor, {http://www.mediawiki.org/xml/export-0.10/}comment, {http://www.mediawiki.org/xml/export-0.10/}model ).
girlfriend_karifandomcom-20200213-history.xml fails to validate
But then even the vanilla Special:Export output doesn't. Makes me sad.
$ xmllint --schema export-0.10.xsd --noout /tmp/Girlfriend+Kari+Wiki-20200213070422.xml
/tmp/Girlfriend+Kari+Wiki-20200213070422.xml:52: element text: Schemas validity error : Element '{http://www.mediawiki.org/xml/export-0.10/}text': This element is not expected. Expected is one of ( {http://www.mediawiki.org/xml/export-0.10/}minor, {http://www.mediawiki.org/xml/export-0.10/}comment, {http://www.mediawiki.org/xml/export-0.10/}model ).
/tmp/Girlfriend+Kari+Wiki-20200213070422.xml fails to validate
$ xmllint --version
xmllint: using libxml version 20909
compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib Lzma
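The validator is complaining about element order inside <revision>: in the second XML above, <text> comes before <model>. Since we build the XML ourselves with etree, re-sorting the children into the sequence the schema expects would fix our half of it. A sketch, with the order inferred from the export-0.10 schema and the validator message (namespaces ignored for brevity):

```python
import xml.etree.ElementTree as ET

# Assumed child order of <revision> in export-0.10; elements a revision
# doesn't have are simply absent and skipped.
REVISION_ORDER = ['id', 'parentid', 'timestamp', 'contributor',
                  'minor', 'comment', 'model', 'format', 'text', 'sha1']

def reorder_revision(rev):
    """Sort a <revision> element's children into schema order in place."""
    rev[:] = sorted(rev, key=lambda el: REVISION_ORDER.index(el.tag))
    return rev
```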
Fine now
Does not work in http://wiki.openkm.com/api.php (normal --xml --api works)
Fixed with API limit 50 at b162e7b14f7e2e067039891dd2614e2c3d3105ad
Analysing http://nimiarkisto.fi/w/api.php
Fixed with automatic switch to HTTPS at d543f7d4ddeaf01d690d9d66e2913cdf26222ec8
Still have to implement resume:
Analysing https://gundam.fandom.com/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
Resuming XML dump from "File:G Saviour Bugu2 rear view.JPG"
https://gundam.fandom.com/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
40 namespaces found
Trying to export all revisions from namespace 0
Trying to get wikitext from the allrevisions API and to build the XML
Warning. Could not use allrevisions. Wiki too old?
Getting titles to export all the revisions of each
"Kurenai Musha" Red Warrior Amazing
1 more revisions exported
...So We Meet Again
5 more revisions exported
0-Riser
3 more revisions exported
It should just be a matter of passing start to getXMLRevisions() in generateXMLDump().
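Skipping already-dumped titles is the easy part. A sketch (titles_from is a hypothetical helper; start would come from the last <title> found in the existing XML, as the "Resuming XML dump from ..." message above suggests):

```python
def titles_from(titles, start=None):
    """Yield titles beginning with `start` (inclusive), or all of them
    when there is nothing to resume from."""
    resumed = start is None
    for title in titles:
        if title == start:
            resumed = True
        if resumed:
            yield title
```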
I'm happy to see that we sometimes receive less than the requested 50 revisions and nothing bad happens:
"This result was truncated because it would otherwise be larger than the limit of 8388608 bytes"
Except that they didn't check whether they had revisions bigger than that: https://pvx.fandom.com/wiki/User_talk:PVX-Misfate?offset=20071116000000&limit=20&action=history
Hm, I wonder why so many errors on this MediaWiki 1.25 wiki (the XML became half of the previous round) https://archive.org/download/wiki-wikimarionorg/wikimarionorg-20200224-history.xml.7z/errors.log
2 more revisions exported
'*'
Traceback (most recent call last):
File "dumpgenerator.py", line 2528, in <module>
main()
File "dumpgenerator.py", line 2518, in main
resumePreviousDump(config=config, other=other)
File "dumpgenerator.py", line 2165, in resumePreviousDump
session=other['session'])
File "dumpgenerator.py", line 727, in generateXMLDump
for xml in getXMLRevisions(config=config, session=session, start=start):
File "dumpgenerator.py", line 829, in getXMLRevisions
yield makeXmlFromPage(page)
File "dumpgenerator.py", line 1083, in makeXmlFromPage
raise PageMissingError(page['title'], e)
__main__.PageMissingError: page 'DevStack' not found
http://www.veikkos-archiv.com/api.php fails completely
Simple command with which I found some XML files which were actually empty (only the header):
find -maxdepth 1 -type f -name "*7z" -size -500k -print0 | xargs -0 -P32 -n1 7z l | grep xml | grep -E " [0-9]{4} " | grep -Ev " [0-9]{5,} " | grep -Eo "[^ ]+$" | sed 's,.xml$,.xml.7z,g'
Wanted for various reasons. Current implementation: --xmlrevisions, false by default. If the default method to download wikis doesn't work for you, please try using the flag --xmlrevisions and let us know how it went. https://groups.google.com/forum/#!topic/wikiteam-discuss/ba2K-WeRJ-0
Previous takes: https://github.com/WikiTeam/wikiteam/issues/195 https://github.com/WikiTeam/wikiteam/pull/280