WikiTeam / wikiteam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2024, WikiTeam has preserved more than 600,000 wikis.
https://github.com/WikiTeam
GNU General Public License v3.0

dumpgenerator.py: false-positive missing pages #423


thiagocferr commented 2 years ago

While trying to generate an XML dump of the Touhou Wiki with dumpgenerator.py (master#d7b6924), I noticed that no XML was being generated except for the Main Page: every other page was marked as missing in the wiki (probably deleted) in the errors.log file.

For example, executing:

$ python2 dumpgenerator.py --api=https://en.touhouwiki.net/api.php --path ./test --xml

would successfully find and load all page titles from all namespaces, and then start downloading pages:

[...]
Analysing https://en.touhouwiki.net/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
24 namespaces found
    Retrieving titles in the namespace 0
    28061 titles retrieved in the namespace 0
[...]

71698 page titles loaded
https://en.touhouwiki.net/api.php
Retrieving the XML for every page from "start"
Downloaded 10 pages
[...]

But pausing the script and checking the errors.log file revealed entries like:

2022-02-28 21:26:22: The page "!?" was missing in the wiki (probably deleted)
2022-02-28 21:26:22: The page ""Activity"Case:04 -Cosmic Horoscope-" was missing in the wiki (probably deleted)
2022-02-28 21:26:22: The page ""Activity" Case:01 -Graveyard Memory-" was missing in the wiki (probably deleted)
2022-02-28 21:26:22: The page ""Activity" Case:02 -Nightmare Counselor-" was missing in the wiki (probably deleted)
2022-02-28 21:26:23: The page ""Activity" Case:03 -Historical Vacation-" was missing in the wiki (probably deleted)
2022-02-28 21:26:23: The page ""Activity" Case:05 -Forgotten Paradise-" was missing in the wiki (probably deleted)
2022-02-28 21:26:23: The page ""Activity" Case:06 -Shining Future-" was missing in the wiki (probably deleted)
2022-02-28 21:26:23: The page ""Activity" Case:07 -Dominated Realism-" was missing in the wiki (probably deleted)
2022-02-28 21:26:24: The page ""Activity" Case:08 -Midnight Syndrome-" was missing in the wiki (probably deleted)
2022-02-28 21:26:24: The page ""Everflowering" Masterpieces of Hatsunetsumiko's 2011 - 2013" was missing in the wiki (probably deleted)
2022-02-28 21:26:24: The page ""Everything but the Girl" Hatsunetsumiko's Dance Vocal Collection Vol.2" was missing in the wiki (probably deleted)

even though these pages actually exist.
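
For what it's worth, their existence can be double-checked against the wiki's API directly. A minimal sketch (not part of dumpgenerator.py) using the standard MediaWiki action=query module:

import requests

# Sanity check: ask the MediaWiki API whether one of the titles that
# errors.log reported as missing really exists on the wiki.
api = 'https://en.touhouwiki.net/api.php'
params = {
    'action': 'query',
    'titles': '!?',  # first title from the errors.log excerpt above
    'format': 'json',
}
r = requests.get(api, params=params, timeout=10)
for pageid, page in r.json()['query']['pages'].items():
    # The API returns pageid "-1" plus a "missing" key for absent pages.
    print(page['title'] + ': ' + ('missing' if 'missing' in page else 'exists'))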


Looking into it further, I was able to generate an XML dump (albeit with just one revision per page, as the wiki's API does not seem to support more) by changing the script to make a GET request instead of a POST request during the XML extraction process. More precisely:

--- a/dumpgenerator.py
+++ b/dumpgenerator.py
@@ -579,7 +579,7 @@ def getXMLPageCore(headers={}, params={}, config={}, session=None):
                 return ''  # empty xml
         # FIXME HANDLE HTTP Errors HERE
         try:
-            r = session.post(url=config['index'], params=params, headers=headers, timeout=10)
+            r = session.get(url=config['index'], params=params, headers=headers, timeout=10)
             handleStatusCode(r)
             xml = fixBOM(r)
         except requests.exceptions.ConnectionError as e:

This seems to work because the POST request returned XML without a </page> tag for the page, which would result in a PageMissingError in this code section:

def getXMLPage(config={}, title='', verbose=True, session=None):
    [...]
    xml = getXMLPageCore(params=params, config=config, session=session)
    if xml == "":
        raise ExportAbortedError(config['index'])
    if not "</page>" in xml:
        raise PageMissingError(params['title'], xml)

while a GET request returned the page XML with the closing tag, so the page was saved to the main XML file.
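
The difference is easy to reproduce outside the script. A sketch (the parameters below mirror typical Special:Export usage and may not match the exact ones dumpgenerator.py builds internally):

import requests

# Issue the same Special:Export request with both HTTP methods and check
# for the closing </page> tag that dumpgenerator.py looks for.
index = 'https://en.touhouwiki.net/index.php'
params = {'title': 'Special:Export', 'pages': '!?', 'action': 'submit'}

session = requests.Session()
for method in ('POST', 'GET'):
    r = session.request(method, index, params=params, timeout=10)
    print(method + ': status=' + str(r.status_code)
          + ', has closing tag: ' + str('</page>' in r.text))

Note that, as in getXMLPageCore, the parameters stay in the query string even for the POST, so the only variable is the HTTP method itself.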


I'm not really sure why this is the case, or whether making this kind of change would break dump generation for other wikis (I also tested with the InstallGentoo Wiki, and its XML dump seemed to work just fine).

nemobis commented 2 years ago

On 01/03/22 03:13, Thiago Ferreira wrote:

I'm not really sure why this is the case, or even if making this kind of change would break the dump generation for other wikis

Yes, it does. We try to guess which method to use for each wiki, but in the end we can't account for the quirks of every webserver. Maybe we could add a command line option.

Federico
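
One possible shape for the fallback discussed above, as a hypothetical sketch (not committed code; it reuses the existing config and session objects passed to getXMLPageCore):

def requestPageXML(session, config, params, headers):
    # Hypothetical helper: try POST first, as dumpgenerator.py does today.
    r = session.post(url=config['index'], params=params,
                     headers=headers, timeout=10)
    if '</page>' not in r.text:
        # Some webservers (e.g. en.touhouwiki.net) truncate the export on
        # POST; retry with GET before treating the page as missing.
        r = session.get(url=config['index'], params=params,
                        headers=headers, timeout=10)
    return r

A command line option, as suggested above, would instead force one method for the whole run; the per-request fallback trades an extra request on failure for not needing any user input.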