nemobis opened this issue 6 years ago
Does not yet work for Wikia, partly because they return a blank page for exportnowrap used in getXMLHeader(). Have to use wikitools there as well?
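Since Wikia still answers the JSON-wrapped export (see the 00eggsontoast00 example further down), the header could be extracted from that instead of exportnowrap. A minimal sketch, assuming the standard action=query&export JSON response shape; extract_header is a hypothetical helper, not existing code:

```python
def extract_header(api_json):
    """Pull the <mediawiki>...</siteinfo> header out of a JSON-wrapped
    action=query&export response, as a fallback when exportnowrap
    returns a blank page."""
    try:
        xml = api_json['query']['export']['*']
    except (KeyError, TypeError):
        return None  # no export field: caller has to try another method
    end = xml.find('</siteinfo>')
    if end == -1:
        return None
    # Keep everything up to and including the closing </siteinfo> tag
    return xml[:end + len('</siteinfo>')]
```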
File "./dumpgenerator.py", line 2195, in <module>
main()
File "./dumpgenerator.py", line 2187, in main
createNewDump(config=config, other=other)
File "./dumpgenerator.py", line 1756, in createNewDump
generateXMLDump(config=config, titles=titles, session=other['session'])
File "./dumpgenerator.py", line 717, in generateXMLDump
for xml in getXMLRevisions(config=config, session=session):
File "./dumpgenerator.py", line 792, in getXMLRevisions
for page in result['query']['allrevisions']:
KeyError: 'query'
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.
Before even downloading the first revisions, there are some wikis where the export gets stuck in an endless loop of "Invalid JSON response. Trying the request again" or similar messages:
Analysing http://www.haplozone.net/wiki/index.php
Trying generating a new dump into a new directory...
Retrieving the XML for every page from the beginning
Invalid JSON, trying request again
Invalid JSON, trying request again
Now tested with a 1.12 wiki, http://meritbadge.org/wiki/index.php/Main_Page, courtesy of https://lists.wikimedia.org/pipermail/wikitech-l/2018-May/090004.html: 27cbdfd30241a723b44a6d1f84b69ef432ae8db4 680145e6a566c696ba31902ea3cc6c6939ce1d1a
For Wikia, the API export works without exportnowrap: http://00eggsontoast00.wikia.com/api.php?action=query&prop=revisions&meta=siteinfo&titles=Main%20Page&export&format=json
But, facepalm: where the API help says "Export the current revisions of all given or generated pages", it really means that any revision other than the current one is ignored: http://00eggsontoast00.wikia.com/api.php?action=query&revids=3|80|85&export is the same as http://00eggsontoast00.wikia.com/api.php?action=query&revids=85&export
Here we go: https://github.com/WikiTeam/wikiteam/commit/7143f7efb1ba08cc328606bd0c7c81246f5b0ffa
It's very fast on most wikis, because it makes far fewer requests when the average number of revisions per page is below 50.
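A back-of-the-envelope version of that claim, with allrevisions fetching 50 revisions per request versus at least one request per page (continuation overhead and oversized revisions ignored; requests_estimate is an illustrative helper, not part of dumpgenerator):

```python
def requests_estimate(pages, revisions, batch=50):
    """Rough lower bounds on the number of API requests: allrevisions
    paginates over all revisions at once, while the per-page method
    needs at least one request per title."""
    with_allrevisions = (revisions + batch - 1) // batch  # ceiling division
    per_page = pages
    return with_allrevisions, per_page
```

With the finalfantasy numbers further down (311259 pages, 1638424 revisions) that is roughly 33k requests instead of 311k.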
The first dump produced with this method is: https://archive.org/download/wiki-ferstaberindecom_f2_en/ferstaberindecom_f2_en-20180519-history.xml.7z
And now also Wikia, without the allrevisions module: https://archive.org/details/wiki-00eggsontoast00wikiacom
The XML built "manually" with --xmlrevisions is almost the same as usual (at the cost of making at least one request per page), but it's missing parentid and, at the moment, minoredit.
Analysing http://nimiarkisto.fi/w/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
29 namespaces found
Retrieving titles in the namespace 0
.Traceback (most recent call last):
File "./dumpgenerator.py", line 2288, in <module>
main()
File "./dumpgenerator.py", line 2280, in main
createNewDump(config=config, other=other)
File "./dumpgenerator.py", line 1844, in createNewDump
getPageTitles(config=config, session=other['session'])
File "./dumpgenerator.py", line 416, in getPageTitles
for title in titles:
File "./dumpgenerator.py", line 292, in getPageTitlesAPI
allpages = jsontitles['query']['allpages']
KeyError: 'query'
In testing this for Wikia, remember that the number of edits on Special:Statistics isn't always truthful (this is normal on MediaWiki). For instance http://themodifyers.wikia.com/wiki/Special:Statistics says 2333 edits, but dumpgenerator.py exports 1864, and that's the right amount: entering all the titles on themodifyers.wikia.com/wiki/Special:Export and exporting all revisions gives the same amount.
Also, a page with 53 revisions on that wiki was correctly exported, which means that API continuation works; that's something!
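The continuation logic that made the 53-revision page work is, in simplified form, the standard MediaWiki loop below. This is a sketch: the api callable and the flat 'revisions' key stand in for the real nested query structure.

```python
def iter_revisions(api, params):
    """Standard MediaWiki continuation loop: merge the 'continue' block
    back into the request until the API stops sending one."""
    request = dict(params)
    while True:
        data = api(request)
        for rev in data.get('revisions', []):
            yield rev
        cont = data.get('continue')
        if not cont:
            return
        request.update(cont)
```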
Not sure what's going on at http://zh.asoiaf.wikia.com/api.php
Traceback (most recent call last):
File "./dumpgenerator.py", line 2308, in <module>
main()
File "./dumpgenerator.py", line 2300, in main
createNewDump(config=config, other=other)
File "./dumpgenerator.py", line 1864, in createNewDump
getPageTitles(config=config, session=other['session'])
File "./dumpgenerator.py", line 429, in getPageTitles
for title in titles:
File "./dumpgenerator.py", line 252, in getPageTitlesAPI
config=config, session=session)
TypeError: 'NoneType' object is not iterable
tail: cannot open 'zhasoiafwikiacom-20180521-wikidump/zhasoiafwikiacom-20180521-history.xml' for reading: No such file or directory
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.
http://zhpad.wikia.com/api.php seems to eventually fail as well
Next step: implementing resuming. I'll probably take the readTitles() part out of getXMLRevisions() to make things clearer.
I think it would be the occasion to make sure that we log something to errors.log when we catch an exception or call sys.exit(1), so that it's easier to inspect failed dumps and see what happened when they stopped. I have almost 4k interrupted Wikia dumps.
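Something along these lines would do. The helper names here are hypothetical; the only assumption is that each dump directory keeps its own errors.log, as the archived dumps already show:

```python
import sys
import time

def log_error(dumpdir, text):
    """Append a timestamped line to the dump's errors.log."""
    with open(dumpdir + '/errors.log', 'a') as f:
        f.write('%s: %s\n' % (time.strftime('%Y-%m-%d %H:%M:%S'), text))

def bye(dumpdir, message):
    """Log the reason before exiting, so interrupted dumps can be
    inspected afterwards instead of dying silently."""
    log_error(dumpdir, message)
    sys.exit(1)
```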
Later I'll post a series of errors.log from failed dumps.
For now I tend to believe that, when the dump runs to the end, the XML really is as complete as possible. For instance, on a biggish wiki like http://finalfantasy.wikia.com/wiki/Special:Statistics :
$ grep -c "<revision>" finalfantasywikiacom-20180523-history.xml
1638424
$ grep -c "<page>" finalfantasywikiacom-20180523-history.xml
311259
That's over a million "missing" revisions compared to what Special:Statistics says, which however cannot really be trusted. The number of pages is pretty close.
On the other hand, it could be that the continuation is not working in some cases... In clubpenguinwikiacom-20180523-history.xml, I'm not sure I see the 3200 revisions that the main page ought to have.
Some wiki might be in a loop...
1062 more revisions exported
1060 more revisions exported
1061 more revisions exported
1061 more revisions exported
1062 more revisions exported
1061 more revisions exported
1062 more revisions exported
1060 more revisions exported
1061 more revisions exported
1062 more revisions exported
Or not: it seems legit, some bot is editing a series of pages every day. http://runescape.wikia.com/wiki/Module:Exchange/Dragon_crossbow_(u)/Data?limit=1000&action=history
Does not work in http://wiki.openkm.com/api.php (normal --xml --api works)
Getting the XML header from the API
Retrieving the XML for every page from the beginning
Invalid JSON, trying request again
Invalid JSON, trying request again
Invalid JSON, trying request again
Invalid JSON, trying request again
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 5 seconds
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 10 seconds
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 15 seconds
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 20 seconds
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 25 seconds
HTTPError: HTTP Error 301: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Moved Permanently trying request again in 30 seconds
Sometimes allpages works until it doesn't:
Analysing http://xn--b1amah.xn--d1ad.xn--p1ai/w/api.php
Warning!: "./xn__b1amahxn__d1adxn__p1ai_w-20200209-wikidump" path exists
There is a dump in "./xn__b1amahxn__d1adxn__p1ai_w-20200209-wikidump", probably incomplete.
If you choose resume, to avoid conflicts, the parameters you have chosen in the current session will be ignored
and the parameters available in "./xn__b1amahxn__d1adxn__p1ai_w-20200209-wikidump/config.txt" will be loaded.
Do you want to resume ([yes, y], [no, n])? n
You have selected: NO
Trying to use path "./xn__b1amahxn__d1adxn__p1ai_w-20200209-wikidump-2"...
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
16 namespaces found
Retrieving titles in the namespace 0
.. 602 titles retrieved in the namespace 0
Retrieving titles in the namespace 1
. 1 titles retrieved in the namespace 1
Retrieving titles in the namespace 2
. 3 titles retrieved in the namespace 2
Retrieving titles in the namespace 3
. 3 titles retrieved in the namespace 3
Retrieving titles in the namespace 4
.The allpages API returned nothing. Exit.
How nice some webservers are:
Titles saved at... halachipediacom-20200209-titles.txt
2364 page titles loaded
http://www.halachipedia.com/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
Invalid JSON, trying request again
Invalid JSON, trying request again
HTTPError: HTTP Error 503: Service Unavailable trying request again in 5 seconds
HTTPError: HTTP Error 503: Service Unavailable trying request again in 10 seconds
HTTPError: HTTP Error 503: Service Unavailable trying request again in 15 seconds
Invalid JSON, trying request again
Invalid JSON, trying request again
HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found trying request again in 20 seconds
HTTPError: HTTP Error 302: The HTTP server returned a redirect error that would lead to an infinite loop.
The last 30x error message was:
Found trying request again in 25 seconds
Gotta check for the actual presence of the export field in the response:
Titles saved at... aroundisleofwightinfo-20200209-titles.txt
3230 page titles loaded
http://www.aroundisleofwight.info/api.php
Getting the XML header from the API
Traceback (most recent call last):
File "./dumpgenerator.py", line 2323, in <module>
main()
File "./dumpgenerator.py", line 2315, in main
createNewDump(config=config, other=other)
File "./dumpgenerator.py", line 1882, in createNewDump
generateXMLDump(config=config, titles=titles, session=other['session'])
File "./dumpgenerator.py", line 731, in generateXMLDump
header, config = getXMLHeader(config=config, session=session)
File "./dumpgenerator.py", line 471, in getXMLHeader
xml = r.json()['query']['export']['*']
KeyError: 'export'
tail: cannot open ‘aroundisleofwightinfo-20200208-wikidump/aroundisleofwightinfo-20200208-history.xml’ for reading: No such file or directory
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.
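A guard like this (a sketch, not the current code) would turn both failure modes — a missing 'query' and a missing 'export' field — into a clean None that the caller can use to fall back to another method:

```python
def get_export_xml(response_json):
    """Return the export XML from an action=query&export response, or
    None when the wiki omits the 'query' or 'export' field."""
    query = response_json.get('query')
    if not isinstance(query, dict):
        return None
    export = query.get('export')
    if not isinstance(export, dict):
        return None
    return export.get('*')
```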
HTTP 405:
Titles saved at... wikiainigmaeu-20200209-titles.txt
139 page titles loaded
http://wiki.ainigma.eu/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
HTTPError: HTTP Error 405: Method Not Allowed trying request again in 5 seconds
HTTPError: HTTP Error 405: Method Not Allowed trying request again in 10 seconds
HTTPError: HTTP Error 405: Method Not Allowed trying request again in 15 seconds
Or even the query:
Titles saved at... masu6fsk-20200209-titles.txt
247 page titles loaded
http://masu.6f.sk/api.php
Getting the XML header from the API
Traceback (most recent call last):
File "./dumpgenerator.py", line 2323, in <module>
main()
File "./dumpgenerator.py", line 2315, in main
createNewDump(config=config, other=other)
File "./dumpgenerator.py", line 1882, in createNewDump
generateXMLDump(config=config, titles=titles, session=other['session'])
File "./dumpgenerator.py", line 731, in generateXMLDump
header, config = getXMLHeader(config=config, session=session)
File "./dumpgenerator.py", line 471, in getXMLHeader
xml = r.json()['query']['export']['*']
KeyError: 'query'
HTTP Error 493 :o
Titles saved at... opendiagnostixorg-20200210-titles.txt
28095 page titles loaded
http://opendiagnostix.org/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
16 namespaces found
Trying to export all revisions from namespace 0
Warning. Could not use allrevisions, wiki too old.
/home/federico/.local/lib/python2.7/site-packages/wikitools/api.py:155: FutureWarning: The querycontinue option is deprecated and will be removed
in a future release, use the new queryGen function instead
for queries requring multiple requests
for queries requring multiple requests""", FutureWarning)
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
3 more revisions exported
1 more revisions exported
1 more revisions exported
4 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
5 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
1 more revisions exported
HTTPError: HTTP Error 493: Forbidden WAF trying request again in 5 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 10 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 15 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 20 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 25 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 30 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 35 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 40 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 45 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 50 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 55 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 60 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 65 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 70 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 75 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 80 seconds
URLError: <urlopen error [Errno 110] Connection timed out> trying request again in 85 seconds
Examples of wikis where --xmlrevisions didn't work and dumpgenerator had to be killed manually:
http://encyclopaedia.herdereditorial.com/w/api.php http://gobblerpedia.org/w/api.php http://opendiagnostix.org/api.php http://semantic.wiki/wiki-de/api.php http://wiki.elitesoft.com.br/api.php http://wiki.rabenthal.net/api.php http://www.insult.wiki/w/api.php http://nichework.com/w/api.php http://mediawiki.xn--klarmachen-ndert-5nb.de/api.php http://www.archi-wiki.org/api.php http://www.gremiopedia.com/api.php http://whythisway.org/w/api.php http://wiki.delia-derbyshire.net/api.php http://www.harmfrielink.nl/wiki/api.php http://en.wiki.spotwizard.org/api.php http://fgo.wiki/api.php http://nordicnames.de/w/api.php http://nichework.com/w/api.php http://roksao.com/api.php http://secret-wiki.de/mediawiki/api.php https://evilbabes.fandom.com/api.php http://overwiki.ru/api.php http://wiki.ainigma.eu/api.php http://wiki.debianforum.de/wiki/api.php http://wiki.dcinside.com/api.php http://www.halachipedia.com/api.php http://www.icp.uni-stuttgart.de/~icp/mediawiki/api.php http://www.tinymicros.com/mediawiki/api.php
I'm not quite sure why this happens in my latest local code, will need to check:
<page>
<title>Main Page</title>
<ns>0</ns>
<id>1</id>
<redirect title="Main page" />
<revision>
<id>3677</id>
<parentid>1</parentid>
<timestamp>2018-12-19T22:15:31Z</timestamp>
<contributor>
<username>Wiki-admin</username>
<id>45</id>
</contributor>
<comment>Redirected page to [[Main page]]</comment>
<model>wikitext</model>
<format>text/x-wiki</format>
<text xml:space="preserve" bytes="23">#REDIRECT [[Main_page]]</text>
<sha1>o2jw5c565achwt31azfnu9zc2zxgqpr</sha1>
</revision>
</page>
<page>
<title>Main Page</title>
<ns>0</ns>
<id>1</id>
<revision>
<id>3677</id>
<parentid>1</parentid>
<timestamp>2018-12-19T22:15:31Z</timestamp>
<contributor>
<id>45</id>
<username>Wiki-admin</username>
</contributor>
<comment>Redirected page to [[Main page]]</comment>
<text bytes="23" space="preserve">#REDIRECT [[Main_page]]</text>
<model>wikitext</model>
<sha1>ce111c28c158bacd1ad89fbacb33e48d0e2e383f</sha1>
</revision>
<revision>
<id>1</id>
<parentid>0</parentid>
<timestamp>2018-12-13T21:14:03Z</timestamp>
<contributor>
<id>0</id>
<username>MediaWiki default</username>
</contributor>
<comment></comment>
<text bytes="735" space="preserve"><strong>MediaWiki has been installed.</strong>
Consult the [https://www.mediawiki.org/wiki/Special:MyLanguage/Help:Contents User's Guide] for information on using the wiki software.
== Getting started ==
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Configuration_settings Configuration settings list]
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:FAQ MediaWiki FAQ]
* [https://lists.wikimedia.org/mailman/listinfo/mediawiki-announce MediaWiki release mailing list]
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Localisation#Translation_resources Localise MediaWiki for your language]
* [https://www.mediawiki.org/wiki/Special:MyLanguage/Manual:Combating_spam Learn how to combat spam on your wiki]</text>
<model>wikitext</model>
<sha1>5702e4d5fd9173246331a889294caf01a3ad3706</sha1>
</revision>
</page>
28095 page titles loaded
http://opendiagnostix.org/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
Traceback (most recent call last):
File "./dumpgenerator.py", line 2363, in <module>
main()
File "./dumpgenerator.py", line 2355, in main
createNewDump(config=config, other=other)
File "./dumpgenerator.py", line 1922, in createNewDump
generateXMLDump(config=config, titles=titles, session=other['session'])
File "./dumpgenerator.py", line 755, in generateXMLDump
for xml in getXMLRevisions(config=config, session=session):
File "./dumpgenerator.py", line 814, in getXMLRevisions
site = mwclient.Site(apiurl.netloc, apiurl.path.replace("api.php", ""))
File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 131, in __init__
self.site_init()
File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 153, in site_init
retry_on_error=False)
File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 235, in get
return self.api(action, 'GET', *args, **kwargs)
File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 286, in api
info = self.raw_api(action, http_method, **kwargs)
File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 434, in raw_api
http_method=http_method)
File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 395, in raw_call
stream = self.connection.request(http_method, url, **args)
File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 486, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 598, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 370, in send
timeout=timeout
File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 544, in urlopen
body=body, headers=headers)
File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 344, in _make_request
self._raise_timeout(err=e, url=url, timeout_value=conn.timeout)
File "/usr/lib/python2.7/site-packages/urllib3/connectionpool.py", line 314, in _raise_timeout
if 'timed out' in str(err) or 'did not complete (read)' in str(err): # Python 2.6
TypeError: __str__ returned non-string (type SysCallError)
No </mediawiki> tag found: dump failed, needs fixing; resume didn't work. Exiting.
mwclient doesn't seem to handle retries very well, need to check:
Traceback (most recent call last):
File "dumpgenerator.py", line 2375, in <module>
File "dumpgenerator.py", line 2367, in main
resumePreviousDump(config=config, other=other)
File "dumpgenerator.py", line 1934, in createNewDump
getPageTitles(config=config, session=other['session'])
File "dumpgenerator.py", line 755, in generateXMLDump
for xml in getXMLRevisions(config=config, session=session):
File "dumpgenerator.py", line 875, in getXMLRevisions
exportrequest = site.api(**exportparams)
File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 286, in api
info = self.raw_api(action, http_method, **kwargs)
File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 434, in raw_api
http_method=http_method)
File "/home/federico/.local/lib/python2.7/site-packages/mwclient/client.py", line 395, in raw_call
stream = self.connection.request(http_method, url, **args)
File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 533, in request
resp = self.send(prep, **send_kwargs)
File "/usr/lib/python2.7/site-packages/requests/sessions.py", line 646, in send
r = adapter.send(request, **kwargs)
File "/usr/lib/python2.7/site-packages/requests/adapters.py", line 529, in send
raise ReadTimeout(e, request=request)
requests.exceptions.ReadTimeout: HTTPSConnectionPool(host='gobblerpedia.org', port=443): Read timed out. (read timeout=30)
Seems fine now on a MediaWiki 1.16 wiki. There are some differences in what we get for some optional fields like parentid, userid, size of a revision; and our XML made by etree is less eager to escape Unicode characters. Hopefully doesn't matter, although we should ideally test an import on a recent MediaWiki. wikirabenthalnet-20200210-history-test.zip
HTTP Error 493 :o
This comes and goes; we could try adding it to status_forcelist together with the 406 seen on other wikis.
Here we can do little: the index.php and api.php responses confuse the script, but indeed there isn't much we can do, as even the most basic request gets a DB error:
internal_api_error_DBQueryError http://masu.6f.sk/api.php?action=query&meta=siteinfo&siprop=general
HTTPError: HTTP Error 405: Method Not Allowed trying request again in 5 seconds
This is not helped by setting http_method="GET" (https://mwclient.readthedocs.io/en/latest/reference/site.html#mwclient.client.Site.api). It's a MediaWiki 1.21.1 wiki, so allrevisions is not available, but the HTTPError prevented the exception from making us switch to the next strategy. Once we catch that, it works via GET: 49017e3f209db2e6a897ac19fc6ade92431fcab8. Ideally we'd need to check this only once at the beginning, but it seems that the webservers do not want to afford us this luxury.
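The shape of that fix, reduced to a sketch: mwclient's Site.api does take an http_method keyword per the docs linked above, but the call shape and the error matching below are simplified for illustration (the real Site.api takes the action as a positional argument and raises mwclient-specific exceptions).

```python
def api_with_method_fallback(site, params):
    """Try POST first; when the webserver rejects it with 405 Method
    Not Allowed, retry the same call via GET."""
    try:
        return site.api(http_method='POST', **params)
    except Exception as err:
        if '405' not in str(err):
            raise  # a different error: don't mask it
        return site.api(http_method='GET', **params)
```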
This is a misconfigured wiki, see https://github.com/WikiTeam/wikiteam/issues/355#issuecomment-584203712
http://www.halachipedia.com/api.php Getting the XML header from the API Retrieving the XML for every page from the beginning Invalid JSON, trying request again
This one now (MediaWiki 1.31.1) gives:
http://www.halachipedia.com/api.php Getting the XML header from the API Retrieving the XML for every page from the beginning 20 namespaces found Trying to export all revisions from namespace 0 Trying to get wikitext from the allrevisions API and to build the XML This mwclient version seems not to work for us. Exiting.
Sometimes allpages works until it doesn't:
Analysing http://xn--b1amah.xn--d1ad.xn--p1ai/w/api.php
Still broken (MediaWiki 1.23)
Does not work in http://wiki.openkm.com/api.php (normal --xml --api works)
Still broken (MediaWiki 1.27).
Analysing http://nimiarkisto.fi/w/api.php
Still broken (MediaWiki 1.31)
The number of revisions cannot always be a multiple of 50 (example from https://villainsrpg.fandom.com/ ):
Eve Man 4 more revisions exported
Event Horizon
50 more revisions exported
50 more revisions exported
Evil
50 more revisions exported
50 more revisions exported
Existence (Secret)
9 more revisions exported
Downloaded 400 pages
Extinction
10 more revisions exported
It should be 51 in https://villainsrpg.fandom.com/wiki/Evil?offset=20111224190533&action=history. We're getting 49 revisions again and then the 1 we were missing. Not a big deal, but not ideal either.
Ouch no, we were not using the new batch at all. Ahem.
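For the record, the batching itself is trivial; the bug was in not consuming the next batch. A chunker like this (illustrative, not the actual code) puts every revision ID in exactly one batch, so nothing is dropped between requests:

```python
def batches(revids, size=50):
    """Split revision IDs into API-sized chunks; each ID lands in
    exactly one batch, in order."""
    for i in range(0, len(revids), size):
        yield revids[i:i + size]
```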
The XML doesn't validate against the respective schema:
$ xmllint --schema ../export-0.10.xsd --noout girlfriend_karifandomcom-20200213-history.xml
...
girlfriend_karifandomcom-20200213-history.xml:76504: element text: Schemas validity error : Element '{http://www.mediawiki.org/xml/export-0.10/}text': This element is not expected. Expected is one of ( {http://www.mediawiki.org/xml/export-0.10/}minor, {http://www.mediawiki.org/xml/export-0.10/}comment, {http://www.mediawiki.org/xml/export-0.10/}model ).
girlfriend_karifandomcom-20200213-history.xml fails to validate
But then even the vanilla Special:Export output doesn't. Makes me sad.
$ xmllint --schema export-0.10.xsd --noout /tmp/Girlfriend+Kari+Wiki-20200213070422.xml
/tmp/Girlfriend+Kari+Wiki-20200213070422.xml:52: element text: Schemas validity error : Element '{http://www.mediawiki.org/xml/export-0.10/}text': This element is not expected. Expected is one of ( {http://www.mediawiki.org/xml/export-0.10/}minor, {http://www.mediawiki.org/xml/export-0.10/}comment, {http://www.mediawiki.org/xml/export-0.10/}model ).
/tmp/Girlfriend+Kari+Wiki-20200213070422.xml fails to validate
$ xmllint --version
xmllint: using libxml version 20909
compiled with: Threads Tree Output Push Reader Patterns Writer SAXv1 FTP HTTP DTDValid HTML Legacy C14N Catalog XPath XPointer XInclude Iconv ISO8859X Unicode Regexps Automata Expr Schemas Schematron Modules Debug Zlib Lzma
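The validator is complaining about element order inside <revision>: in the second XML above, <text> comes before <model>. Since we build the XML ourselves with etree, re-sorting the children into the sequence the schema expects would fix our half of it. A sketch, with the order inferred from the export-0.10 schema and the validator message (namespaces ignored for brevity):

```python
import xml.etree.ElementTree as ET

# Assumed child order of <revision> in export-0.10; elements a revision
# doesn't have are simply absent and skipped.
REVISION_ORDER = ['id', 'parentid', 'timestamp', 'contributor',
                  'minor', 'comment', 'model', 'format', 'text', 'sha1']

def reorder_revision(rev):
    """Sort a <revision> element's children into schema order in place."""
    rev[:] = sorted(rev, key=lambda el: REVISION_ORDER.index(el.tag))
    return rev
```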
Fine now
Does not work in http://wiki.openkm.com/api.php (normal --xml --api works)
Fixed with API limit 50 at b162e7b14f7e2e067039891dd2614e2c3d3105ad
Analysing http://nimiarkisto.fi/w/api.php
Fixed with automatic switch to HTTPS at d543f7d4ddeaf01d690d9d66e2913cdf26222ec8
Still have to implement resume:
Analysing https://gundam.fandom.com/api.php
Loading config file...
Resuming previous dump process...
Title list was completed in the previous session
Resuming XML dump from "File:G Saviour Bugu2 rear view.JPG"
https://gundam.fandom.com/api.php
Getting the XML header from the API
Retrieving the XML for every page from the beginning
40 namespaces found
Trying to export all revisions from namespace 0
Trying to get wikitext from the allrevisions API and to build the XML
Warning. Could not use allrevisions. Wiki too old?
Getting titles to export all the revisions of each
"Kurenai Musha" Red Warrior Amazing
1 more revisions exported
...So We Meet Again
5 more revisions exported
0-Riser
3 more revisions exported
It should just be a matter of passing start to getXMLRevisions() in generateXMLDump().
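Skipping already-dumped titles is the easy part. A sketch (titles_from is a hypothetical helper; start would come from the last <title> found in the existing XML, as the "Resuming XML dump from ..." message above suggests):

```python
def titles_from(titles, start=None):
    """Yield titles beginning with `start` (inclusive), or all of them
    when there is nothing to resume from."""
    resumed = start is None
    for title in titles:
        if title == start:
            resumed = True
        if resumed:
            yield title
```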
I'm happy to see that we sometimes receive less than the requested 50 revisions and nothing bad happens:
"This result was truncated because it would otherwise be larger than the limit of 8388608 bytes"
Except that they didn't check whether they had revisions bigger than that: https://pvx.fandom.com/wiki/User_talk:PVX-Misfate?offset=20071116000000&limit=20&action=history
Hm, I wonder why so many errors on this MediaWiki 1.25 wiki (the XML became half of the previous round) https://archive.org/download/wiki-wikimarionorg/wikimarionorg-20200224-history.xml.7z/errors.log
2 more revisions exported
'*'
Traceback (most recent call last):
File "dumpgenerator.py", line 2528, in <module>
main()
File "dumpgenerator.py", line 2518, in main
resumePreviousDump(config=config, other=other)
File "dumpgenerator.py", line 2165, in resumePreviousDump
session=other['session'])
File "dumpgenerator.py", line 727, in generateXMLDump
for xml in getXMLRevisions(config=config, session=session, start=start):
File "dumpgenerator.py", line 829, in getXMLRevisions
yield makeXmlFromPage(page)
File "dumpgenerator.py", line 1083, in makeXmlFromPage
raise PageMissingError(page['title'], e)
__main__.PageMissingError: page 'DevStack' not found
http://www.veikkos-archiv.com/api.php fails completely
Simple command with which I found some XML files which were actually empty (only the header):
find -maxdepth 1 -type f -name "*7z" -size -500k -print0 | xargs -0 -P32 -n1 7z l | grep xml | grep -E " [0-9]{4} " | grep -Ev " [0-9]{5,} " | grep -Eo "[^ ]+$" | sed 's,.xml$,.xml.7z,g'
Wanted for various reasons. Current implementation: --xmlrevisions, false by default. If the default method to download wikis doesn't work for you, please try using the flag --xmlrevisions and let us know how it went. https://groups.google.com/forum/#!topic/wikiteam-discuss/ba2K-WeRJ-0
Previous takes: https://github.com/WikiTeam/wikiteam/issues/195 https://github.com/WikiTeam/wikiteam/pull/280