Open vgambier opened 1 year ago
As an aside, shouldn't the "Checking index.php... https://www.ssbwiki.com/Main_Page" / "index.php is OK" messages say "Checking index"? (Same goes for other strings within checkIndex().)
The rest of the program does not assume index is always index.php.
Thanks for your report. This is a bug but we have workarounds.
The correct index url can be found elsewhere on the parsed HTML. Unfortunately I don't know if there's a canonical location
There are various ways of scraping it in different versions; the canonical URL is whatever https://www.ssbwiki.com/api.php?action=query&meta=siteinfo&siprop=general says. The API seems to be correct as it says /index.php and https://www.ssbwiki.com/index.php?title=Special%3AExport&pages=Main_Page ; so ideally we should pick that up, not sure why it got rewritten to a (wrong) scraped URL.
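As a rough sketch of "picking up" the API value: the siteinfo response exposes `server` and `script` under `query.general`, and joining them gives the index.php URL. The helper name `index_from_siteinfo` is hypothetical (it is not a dumpgenerator function), and the sample dictionary is a trimmed illustration using the values mentioned in this thread.

```python
from urllib.parse import urljoin


def index_from_siteinfo(general: dict, api_url: str) -> str:
    """Combine siteinfo's 'server' and 'script' into an index.php URL.

    `general` is the parsed result['query']['general'] object.
    `api_url` is only used to resolve a protocol-relative 'server'
    value such as "//example.org".
    """
    server = general["server"]
    script = general["script"]
    # urljoin resolves "//host" against the scheme of api_url and
    # leaves an absolute "https://host" untouched.
    base = urljoin(api_url, server)
    return urljoin(base.rstrip("/") + "/", script)


# Illustrative, trimmed response (values taken from this thread).
sample = {"query": {"general": {"server": "https://www.ssbwiki.com",
                                "script": "/index.php"}}}
index = index_from_siteinfo(sample["query"]["general"],
                            "https://www.ssbwiki.com/api.php")
print(index)
```

Resolving against the API URL is a design choice for the case where a wiki reports a protocol-relative server; when `server` is already absolute it makes no difference.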
In general, however, you cannot expect the URL guesser to know all potential non-standard short URL formats. Have you tried passing the --index parameter? This is a MediaWiki 1.35 wiki so you should probably use --xmlrevisions instead.
As an aside, shouldn't the Checking index.php... https://www.ssbwiki.com/Main_Page / index.php is OK messages say "Checking index"?
No, because the index could be index.html (as in, the webserver's default page). What we need is the location of the index.php script (to which URL parameters for Special:Export can be appended). Depending on the webserver's rewrite rules, the URL could be anything, including https://www.ssbwiki.com/Main_Page .
The reason dumpgenerator is tricked into believing that the Main Page is the index.php is that every page in this wiki behaves like it's index.php: you can append any URL parameter to any URL, except the title parameter.
For example the title parameter doesn't do anything here: https://www.ssbwiki.com/Special:Export?pages=Main+Page&curonly=1&templates=1&wpDownload=1&wpEditToken=%2B%5C&title=Special%3AListUsers
This is a very confusing and bad webserver configuration which I don't recommend and I'm not sure we should support. But, if the API gives us a good result we should use it, and if the user gives us a good index.php URL we must use it.
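The priority described here (user-supplied value first, then the API, then whatever was scraped from the HTML) could be sketched roughly as follows. The function and parameter names are hypothetical and do not correspond to existing dumpgenerator code; this only illustrates the ordering.

```python
from typing import Optional


def choose_index(user_index: Optional[str],
                 api_index: Optional[str],
                 scraped_index: Optional[str]) -> Optional[str]:
    """Return the first available index URL, in order of trust.

    1. user_index: the value passed with --index, if any
    2. api_index: server + script from the siteinfo API, if available
    3. scraped_index: the URL guessed from the main page HTML
    """
    for candidate in (user_index, api_index, scraped_index):
        if candidate:
            return candidate
    return None


# With no --index given, the API value wins over the scraped one.
print(choose_index(None,
                   "https://www.ssbwiki.com/index.php",
                   "https://www.ssbwiki.com/Main_Page"))
```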
No, because the index could be index.html (as in, the webserver's default page). What we need is the location of the index.php script (to which URL parameters for Special:Export can be appended). Depending on the webserver's rewrite rules, the URL could be anything, including https://www.ssbwiki.com/Main_Page .
Ah, I see, I misunderstood.
Have you tried passing the --index parameter?
Yes, it does seem to work. ./dumpgenerator.py --images --xml --curonly --delay 2 https://www.ssbwiki.com --index https://www.ssbwiki.com/index.php works as intended, at least for the first few pages (I haven't tried to generate a full dump yet). I was more thinking that if there is a way to detect that the webserver is configured in an odd manner, we should do so rather than simply state that XML export is broken. As it stands, it's not immediately obvious that the issue lies with the index, so most users wouldn't realize all they need to do is add the --index parameter.
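One possible shape for such a check, as a sketch only: a Special:Export response has a `<mediawiki ...>` root element, so if the body we get back does not contain that near the top, the index URL is a likely suspect and the error could say so. Both function names and the wording of the hint are hypothetical, not existing dumpgenerator behavior.

```python
from typing import Optional


def looks_like_export_xml(body: str) -> bool:
    """True if the response body looks like a Special:Export dump."""
    # The export's root element is <mediawiki ...>; only inspect the
    # beginning of the body so large dumps stay cheap to check.
    return "<mediawiki" in body[:2000]


def export_failure_hint(index_url: str, body: str) -> Optional[str]:
    """Return a hint for the user, or None if the export looks fine."""
    if looks_like_export_xml(body):
        return None
    return (
        f"Special:Export via {index_url} did not return XML. "
        "The detected index URL may be wrong; try passing --index "
        "with the wiki's real index.php URL."
    )


# An HTML page served in place of the export triggers the hint.
print(export_failure_hint("https://www.ssbwiki.com/Main_Page",
                          "<!DOCTYPE html><html><head></head></html>"))
```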
This is a very confusing and bad webserver configuration which I don't recommend and I'm not sure we should support.
That's what I figured but I wasn't sure, thanks for confirming. I'll try to see if we can use the correct value from the API (maybe run a full dump too) and get back to you. Thanks for the help!
The two potential issues are:
1. The index check accepts an index value of https://www.ssbwiki.com/Main_Page because the response contains the string <meta name="generator" content="MediaWiki 1.35.8"/>. Is this intended behavior?
2. Because index is already set using the aforementioned HTML parsing (and because of the if index check in dumpgenerator.py:1842), we don't use the correct value assigned to index2 using the result['query']['general']['server'] + result['query']['general']['script'] method. Could the priority of these two methods be swapped without breaking other wikis? In this instance, the second method is the one that gives the correct value, so I would be tempted to call it the more canonical version, but I have no idea if my hunch is right.
(I originally opened an issue on the wikiteam3 fork, so you can take a look here if you want, but I'll sum everything up here so you don't need to)
On some wikis, such as https://www.ssbwiki.com/ and https://www.mariowiki.com, wikiteam grabs the wrong index url and then the export fails with a misleading error.
mwGetAPIAndIndex in api.py parses the HTML of the main page to get the index url, using the view source button as reference. For ssbwiki and mariowiki, the view source button sends you to https://www.ssbwiki.com/Main_Page?action=edit and https://www.mariowiki.com/Main_Page?action=edit. This is odd because the Main page button sends you to https://www.ssbwiki.com/ and https://www.mariowiki.com/. The /Main_Page urls automatically redirect you to this raw/shortened form. For comparison, the way the Archiveteam wiki works is that the Main page button sends you to https://wiki.archiveteam.org/index.php/Main_Page while the view source button sends you to https://wiki.archiveteam.org/index.php?title=Main_Page&action=edit, so the index variable is set to https://wiki.archiveteam.org/index.php. There are no redirects, although the raw https://wiki.archiveteam.org/ url and the https://wiki.archiveteam.org/index.php url work just as well as https://wiki.archiveteam.org/index.php/Main_Page.
In order for the rest of the program to work, the index variable should have actually been set to https://www.ssbwiki.com/index.php. Indeed, we then use the index variable to construct https://www.ssbwiki.com/Main_Page?title=Special%3AExport&pages=Main_Page&action=submit&curonly=1&limit=1 , which does not return XML, which causes the program to crash (https://www.ssbwiki.com?title=Special%3AExport&pages=Main_Page&action=submit&curonly=1&limit=1 does not work either). The correct link is https://www.ssbwiki.com/index.php?title=Special%3AExport&pages=Main_Page&action=submit&curonly=1&limit=1.
The correct index url can be found elsewhere on the parsed HTML. Unfortunately I don't know if there's a canonical location - the closest I can think of would be the log in button, but that might break other wikis if we changed this behavior.
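To make the export URLs in this report easy to reproduce, here is a small sketch that builds them from a candidate index value. `build_export_url` is a hypothetical helper, not the function wikiteam actually uses; the query parameters are copied from the URLs in this report.

```python
from urllib.parse import urlencode


def build_export_url(index_url: str, page: str) -> str:
    """Append the Special:Export query string to a candidate index URL."""
    params = {
        "title": "Special:Export",
        "pages": page,
        "action": "submit",
        "curonly": 1,
        "limit": 1,
    }
    # urlencode percent-encodes the colon, giving Special%3AExport.
    return index_url + "?" + urlencode(params)


# The scraped value, the bare domain, and the real index.php.
for candidate in ("https://www.ssbwiki.com/Main_Page",
                  "https://www.ssbwiki.com",
                  "https://www.ssbwiki.com/index.php"):
    print(build_export_url(candidate, "Main_Page"))
```

Only the last of the three printed URLs is the one reported to return XML on this wiki.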
I'm aware that the README states "If the script can't find itself the API and/or index.php paths, then you can provide them", but the error message does not make it obvious that this is the issue. In fact, wikiteam prints "Checking index.php... https://www.ssbwiki.com/Main_Page index.php is OK".
I don't know enough about wikis to understand if this is something that can or should be fixed. Perhaps we could at the very least handle the error more gracefully and suggest the user manually adds the --index argument instead of simply saying "XML export on this wiki is broken", which isn't exactly accurate. But then again, I'm not sure how to instruct them to find the proper index url.
I am willing to attempt a PR if you could perhaps point me in the right direction!