index url computation fails on some wikis

vgambier commented 1 year ago

(I originally opened an issue on the wikiteam3 fork, so you can take a look here if you want, but I'll sum everything up here so you don't need to)

On some wikis, such as https://www.ssbwiki.com/ and https://www.mariowiki.com, wikiteam grabs the wrong index url and then the export fails with a misleading error.

$ ./dumpgenerator.py --images --xml --curonly --delay 2 https://www.ssbwiki.com
Please install the lxml module if you want to use --xmlrevisions.
Checking API... https://www.ssbwiki.com/api.php
API is OK: https://www.ssbwiki.com/api.php
Checking index.php... https://www.ssbwiki.com/Main_Page
index.php is OK
#########################################################################
# Welcome to DumpGenerator 0.4.0-alpha by WikiTeam (GPL v3)                   #
# More info at: https://github.com/WikiTeam/wikiteam                    #
#########################################################################

#########################################################################
# Copyright (C) 2011-2022 WikiTeam developers                           #

# This program is free software: you can redistribute it and/or modify  #
# it under the terms of the GNU General Public License as published by  #
# the Free Software Foundation, either version 3 of the License, or     #
# (at your option) any later version.                                   #
#                                                                       #
# This program is distributed in the hope that it will be useful,       #
# but WITHOUT ANY WARRANTY; without even the implied warranty of        #
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the         #
# GNU General Public License for more details.                          #
#                                                                       #
# You should have received a copy of the GNU General Public License     #
# along with this program.  If not, see <http://www.gnu.org/licenses/>. #
#########################################################################

Analysing https://www.ssbwiki.com/api.php
Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
Sleeping... 2 seconds...
32 namespaces found
    Retrieving titles in the namespace 0
    33255 titles retrieved in the namespace 0
    Retrieving titles in the namespace 1
    3432 titles retrieved in the namespace 1
    Retrieving titles in the namespace 2
    3648 titles retrieved in the namespace 2
    Retrieving titles in the namespace 3
    4947 titles retrieved in the namespace 3
    Retrieving titles in the namespace 4
    371 titles retrieved in the namespace 4
    Retrieving titles in the namespace 5
    158 titles retrieved in the namespace 5
    Retrieving titles in the namespace 6
    45828 titles retrieved in the namespace 6
    Retrieving titles in the namespace 7
    383 titles retrieved in the namespace 7
    Retrieving titles in the namespace 8
    258 titles retrieved in the namespace 8
    Retrieving titles in the namespace 9
    15 titles retrieved in the namespace 9
    Retrieving titles in the namespace 10
    1229 titles retrieved in the namespace 10
    Retrieving titles in the namespace 11
    270 titles retrieved in the namespace 11
    Retrieving titles in the namespace 12
    28 titles retrieved in the namespace 12
    Retrieving titles in the namespace 13
    15 titles retrieved in the namespace 13
    Retrieving titles in the namespace 14
    2556 titles retrieved in the namespace 14
    Retrieving titles in the namespace 15
    114 titles retrieved in the namespace 15
    Retrieving titles in the namespace 274
    11 titles retrieved in the namespace 274
    Retrieving titles in the namespace 275
    0 titles retrieved in the namespace 275
    Retrieving titles in the namespace 828
    0 titles retrieved in the namespace 828
    Retrieving titles in the namespace 829
    0 titles retrieved in the namespace 829
    Retrieving titles in the namespace 100
    7864 titles retrieved in the namespace 100
    Retrieving titles in the namespace 101
    505 titles retrieved in the namespace 101
    Retrieving titles in the namespace 102
    3927 titles retrieved in the namespace 102
    Retrieving titles in the namespace 103
    150 titles retrieved in the namespace 103
    Retrieving titles in the namespace 104
    219 titles retrieved in the namespace 104
    Retrieving titles in the namespace 105
    36 titles retrieved in the namespace 105
    Retrieving titles in the namespace 110
    1046 titles retrieved in the namespace 110
    Retrieving titles in the namespace 111
    30 titles retrieved in the namespace 111
    Retrieving titles in the namespace 2300
    0 titles retrieved in the namespace 2300
    Retrieving titles in the namespace 2301
    0 titles retrieved in the namespace 2301
    Retrieving titles in the namespace 2302
    0 titles retrieved in the namespace 2302
    Retrieving titles in the namespace 2303
    0 titles retrieved in the namespace 2303
Titles saved at... ssbwikicom-20221128-titles.txt
110295 page titles loaded
https://www.ssbwiki.com/api.php
    In attempt 1, XML for "Main_Page" is wrong. Waiting 20 seconds and reloading...
    In attempt 2, XML for "Main_Page" is wrong. Waiting 40 seconds and reloading...
    In attempt 3, XML for "Main_Page" is wrong. Waiting 60 seconds and reloading...
    In attempt 4, XML for "Main_Page" is wrong. Waiting 80 seconds and reloading...
    We have retried 5 times
    MediaWiki error for "Main_Page", network error or whatever...
    Saving in the errors log, and skipping...
Trying the local name for the Special namespace instead
    In attempt 1, XML for "Main_Page" is wrong. Waiting 20 seconds and reloading...
    In attempt 2, XML for "Main_Page" is wrong. Waiting 40 seconds and reloading...
    In attempt 3, XML for "Main_Page" is wrong. Waiting 60 seconds and reloading...
    In attempt 4, XML for "Main_Page" is wrong. Waiting 80 seconds and reloading...
    We have retried 5 times
    MediaWiki error for "Main_Page", network error or whatever...
    Saving in the errors log, and skipping...
XML export on this wiki is broken, quitting.

mwGetAPIAndIndex in api.py parses the HTML of the main page to get the index url, using the view source button as reference. For ssbwiki and mariowiki, the view source button sends you to https://www.ssbwiki.com/Main_Page?action=edit and https://www.mariowiki.com/Main_Page?action=edit. This is odd because the Main page button sends you to https://www.ssbwiki.com/ and https://www.mariowiki.com/. The /Main_Page urls automatically redirect you to this raw/shortened form.

For comparison, on the Archiveteam wiki the Main page button sends you to https://wiki.archiveteam.org/index.php/Main_Page while the view source button sends you to https://wiki.archiveteam.org/index.php?title=Main_Page&action=edit, so the index variable is set to https://wiki.archiveteam.org/index.php. There are no redirects, although the raw https://wiki.archiveteam.org/ url and the https://wiki.archiveteam.org/index.php url work just as well as https://wiki.archiveteam.org/index.php/Main_Page.
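
To illustrate why this kind of scraping goes wrong, here is a minimal sketch. It assumes the index is taken as everything before the first "?" in the view source link; the regex, the function name `scrape_index`, and the sample HTML are illustrative and not copied from wikiteam.

```python
import re
from urllib.parse import urljoin

# Hypothetical pattern: capture the href of the "view source" (or "edit") tab
# up to, but not including, the first "?".
EDIT_LINK = re.compile(
    r'<li id="ca-(?:viewsource|edit)"[^>]*>\s*(?:<span>)?\s*<a href="([^"?]+)\?'
)


def scrape_index(html, base):
    """Return the index URL guessed from the edit link, or None if not found."""
    match = EDIT_LINK.search(html)
    if match is None:
        return None
    # Resolve a relative href such as "/Main_Page" against the wiki's base URL.
    return urljoin(base, match.group(1))


# Illustrative snippets modelled on the two link styles described above.
short_url_html = '<li id="ca-viewsource"><a href="/Main_Page?action=edit">View source</a></li>'
classic_html = '<li id="ca-viewsource"><a href="/index.php?title=Main_Page&amp;action=edit">View source</a></li>'

print(scrape_index(short_url_html, "https://www.ssbwiki.com/"))
print(scrape_index(classic_html, "https://wiki.archiveteam.org/"))
```

With the short-URL style the guess ends in /Main_Page, while the classic style yields /index.php.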

For the rest of the program to work, the index variable should have been set to https://www.ssbwiki.com/index.php. Indeed, we then use the index variable to construct https://www.ssbwiki.com/Main_Page?title=Special%3AExport&pages=Main_Page&action=submit&curonly=1&limit=1 , which does not return XML, which in turn causes the export to fail (https://www.ssbwiki.com?title=Special%3AExport&pages=Main_Page&action=submit&curonly=1&limit=1 does not work either). The correct link is https://www.ssbwiki.com/index.php?title=Special%3AExport&pages=Main_Page&action=submit&curonly=1&limit=1.
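
A minimal sketch of how such an export URL is assembled from the index value, using only the parameters visible in the URLs above. The function name `export_url` is hypothetical, and the real program may send these parameters differently (for example as a POST body).

```python
from urllib.parse import urlencode


def export_url(index, title):
    """Build a Special:Export URL for one page on top of the given index URL."""
    params = {
        "title": "Special:Export",
        "pages": title,
        "action": "submit",
        "curonly": 1,
        "limit": 1,
    }
    return index + "?" + urlencode(params)


# Wrong index (scraped from the short URL) versus the correct index.php script.
print(export_url("https://www.ssbwiki.com/Main_Page", "Main_Page"))
print(export_url("https://www.ssbwiki.com/index.php", "Main_Page"))
```

Only the second URL points at the index.php script, which is why only that one returns XML.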

The correct index url can be found elsewhere on the parsed HTML. Unfortunately I don't know if there's a canonical location - the closest I can think of would be the log in button, but that might break other wikis if we changed this behavior.

I'm aware that the README states "If the script can't find itself the API and/or index.php paths, then you can provide them", but the error message does not make it obvious that this is the issue. In fact, wikiteam prints "Checking index.php... https://www.ssbwiki.com/Main_Page index.php is OK"

I don't know enough about wikis to understand if this is something that can or should be fixed. Perhaps we could at the very least handle the error more gracefully and suggest the user manually adds the --index argument instead of simply saying "XML export on this wiki is broken", which isn't exactly accurate. But then again, I'm not sure how to instruct them to find the proper index url.
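
As a rough idea of what a more helpful message could look like, here is a sketch. The function name `export_failure_hint` and the wording are hypothetical; only the --index option comes from the README.

```python
def export_failure_hint(index):
    """Return a hint to print when Special:Export does not return XML."""
    return "\n".join(
        [
            "Could not retrieve XML through the index URL: %s" % index,
            "The automatically detected index URL may be wrong for this wiki.",
            "If you know the location of index.php, pass it with --index and retry.",
        ]
    )


print(export_failure_hint("https://www.ssbwiki.com/Main_Page"))
```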

I am willing to attempt a PR if you could perhaps point me in the right direction!

vgambier commented 1 year ago

As an aside, shouldn't the Checking index.php... https://www.ssbwiki.com/Main_Page / index.php is OK messages say "Checking index"? (Same goes for other strings within checkIndex()) The rest of the program does not assume index is always index.php

nemobis commented 1 year ago

Thanks for your report. This is a bug but we have workarounds.

> The correct index url can be found elsewhere on the parsed HTML. Unfortunately I don't know if there's a canonical location

There are various ways of scraping it in different versions; the canonical URL is whatever https://www.ssbwiki.com/api.php?action=query&meta=siteinfo&siprop=general says. The API seems to be correct as it says /index.php and https://www.ssbwiki.com/index.php?title=Special%3AExport&pages=Main_Page ; so ideally we should pick that up, not sure why it got rewritten to a (wrong) scraped URL.
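
A minimal sketch of deriving the index URL from that siteinfo response, assuming the JSON output format and the `server` and `script` fields of the `general` section. The helper names are hypothetical, and the sample dictionary only mimics the shape of a real response.

```python
from urllib.parse import urlencode, urljoin


def siteinfo_url(api):
    """Build the siteinfo query URL for a given api.php location."""
    query = {"action": "query", "meta": "siteinfo", "siprop": "general", "format": "json"}
    return api + "?" + urlencode(query)


def index_from_siteinfo(api, result):
    """Combine 'server' and 'script' from a parsed siteinfo response."""
    general = result["query"]["general"]
    # 'server' can be protocol-relative ("//example.org"); urljoin borrows
    # the scheme from the API URL in that case.
    return urljoin(api, general["server"] + general["script"])


api = "https://www.ssbwiki.com/api.php"
# Hand-written sample shaped like a siteinfo response (not fetched live).
sample = {"query": {"general": {"server": "https://www.ssbwiki.com", "script": "/index.php"}}}

print(siteinfo_url(api))
print(index_from_siteinfo(api, sample))
```

Fetching and decoding the response is left out so the sketch runs without network access.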

In general, however, you cannot expect the URL guesser to know all potential non-standard short URL formats. Have you tried passing the --index parameter? This is a MediaWiki 1.35 wiki so you should probably use --xmlrevisions instead.

> As an aside, shouldn't the Checking index.php... https://www.ssbwiki.com/Main_Page / index.php is OK messages say "Checking index"?

No, because the index could be index.html (as in, the webserver's default page). What we need is the location of the index.php script (to which URL parameters for Special:Export can be appended). Depending on the webserver's rewrite rules, the URL could be anything, including https://www.ssbwiki.com/Main_Page .

nemobis commented 1 year ago

The reason dumpgenerator is tricked into believing that the Main Page is the index.php is that every page in this wiki behaves like it's index.php: you can append any URL parameter to any URL, except the title parameter.

For example the title parameter doesn't do anything here: https://www.ssbwiki.com/Special:Export?pages=Main+Page&curonly=1&templates=1&wpDownload=1&wpEditToken=%2B%5C&title=Special%3AListUsers

This is a very confusing and bad webserver configuration which I don't recommend and I'm not sure we should support. But, if the API gives us a good result we should use it, and if the user gives us a good index.php URL we must use it.
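
One way to avoid being tricked would be to judge a candidate index by whether an export request through it actually returns export XML, rather than by whether the page looks like MediaWiki. A minimal sketch of such a check on the response body follows; the function name `looks_like_export` is hypothetical and the heuristic is an assumption, not existing dumpgenerator behavior.

```python
def looks_like_export(body):
    """Heuristically tell Special:Export XML apart from a rendered HTML page."""
    head = body.lstrip()[:2000].lower()
    # Export dumps open with an optional XML declaration and a <mediawiki> root
    # element; a rendered wiki page has an <html> element instead.
    return (
        head.startswith(("<mediawiki", "<?xml"))
        and "<mediawiki" in head
        and "<html" not in head
    )


xml_body = '<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.11/" version="0.11">\n<page></page>\n</mediawiki>'
html_body = '<!DOCTYPE html>\n<html><head><meta name="generator" content="MediaWiki 1.35.8"/></head></html>'

print(looks_like_export(xml_body))
print(looks_like_export(html_body))
```

The HTML sample still mentions MediaWiki in its generator tag, but it fails the check because it has no `<mediawiki>` root element.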

vgambier commented 1 year ago

> No, because the index could be index.html (as in, the webserver's default page). What we need is the location of the index.php script (to which URL parameters for Special:Export can be appended). Depending on the webserver's rewrite rules, the URL could be anything, including https://www.ssbwiki.com/Main_Page .

Ah, I see, I misunderstood.

> Have you tried passing the --index parameter?

Yes, it does seem to work. ./dumpgenerator.py --images --xml --curonly --delay 2 https://www.ssbwiki.com --index https://www.ssbwiki.com/index.php works as intended, at least for the first few pages (I haven't tried to generate a full dump yet). My point was rather that if there is a way to detect that the webserver is configured in an odd manner, we should do so instead of simply stating that XML export is broken. As it stands, it's not immediately obvious that the issue lies with the index, so most users wouldn't realize that all they need to do is add the --index parameter.

> This is a very confusing and bad webserver configuration which I don't recommend and I'm not sure we should support.

That's what I figured but I wasn't sure, thanks for confirming. I'll try to see if we can use the correct value from the API (maybe run a full dump too) and get back to you. Thanks for the help!

vgambier commented 1 year ago

The two potential issues are:

1. dumpgenerator:checkIndex returns True for an index value of https://www.ssbwiki.com/Main_Page because the response contains the string <meta name="generator" content="MediaWiki 1.35.8"/>. Is this intended behavior?
2. If the value of index is set using the aforementioned HTML parsing, then because of the "if index" check (dumpgenerator.py:1842) we don't use the correct value assigned to index2 using the result['query']['general']['server'] + result['query']['general']['script'] method. Could the priority of these two methods be swapped without breaking other wikis? In this instance, the second method is the one that gives the correct value, so I would be tempted to call it the more canonical version, but I have no idea if my hunch is right.
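
To make the second question concrete, here is a sketch of what swapping the priority could look like. It is not the current dumpgenerator logic: `choose_index` and the `is_valid` callback are hypothetical, and whether this ordering is safe for other wikis is exactly the open question.

```python
def choose_index(scraped, api_derived, is_valid):
    """Pick an index URL, trying the API-derived value before the scraped one."""
    for candidate in (api_derived, scraped):
        # Skip missing values and anything the validity check rejects.
        if candidate and is_valid(candidate):
            return candidate
    return None


# Stand-in validity check for the example; a real one would request the URL.
def ends_with_index_php(url):
    return url.endswith("/index.php")


print(
    choose_index(
        scraped="https://www.ssbwiki.com/Main_Page",
        api_derived="https://www.ssbwiki.com/index.php",
        is_valid=ends_with_index_php,
    )
)
```

If the API gives no usable value, the scraped one is still tried, so wikis without a working API would keep the current behavior.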

WikiTeam / wikiteam

index url computation fails on some wikis #445