thiagocferr opened this issue 2 years ago
On 01/03/22 03:13, Thiago Ferreira wrote:

> I'm not really sure why this is the case, or even if making this kind of change would break the dump generation for other wikis

Yes, it would. We try to guess which method to use for each wiki, but in the end we can't account for the quirks of every webserver. Maybe we could add a command-line option.
Federico
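For what it's worth, here is a sketch of how such an option might look, assuming an argparse-based CLI; the `--export-method` flag is hypothetical and does not exist in dumpgenerator.py:

```python
import argparse

parser = argparse.ArgumentParser(prog='dumpgenerator.py')
# Hypothetical flag (not in the real script): let the user force the HTTP
# method used for Special:Export requests instead of relying on autodetection.
parser.add_argument(
    '--export-method',
    choices=['get', 'post'],
    default=None,  # None would keep the current guess-per-wiki behaviour
    help='force GET or POST when requesting page XML from Special:Export',
)

args = parser.parse_args(['--export-method', 'get'])
print(args.export_method)  # -> get
```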
While trying to generate an XML dump of the Touhou Wiki with dumpgenerator.py (master#d7b6924), I noticed that no XML besides the Main Page was being generated, with every other entry marked as "missing in the wiki (probably deleted)" in the errors.log file. For example, executing a command along these lines (the original invocation was lost from this copy of the issue; the wiki URL below is an assumption, while `--xml` and `--index` are real dumpgenerator.py options):
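```bash
# Illustrative reconstruction, not the reporter's exact command
python dumpgenerator.py --index=https://en.touhouwiki.net/index.php --xml
```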
would successfully find and load all page titles from all namespaces and then start "downloading pages".
But pausing the script and checking the errors.log file would show every one of those pages logged as "missing in the wiki (probably deleted)", even though these pages actually exist.
Looking more into it, I was able to generate an XML dump (albeit with just one revision per page, as the wiki API seems not to support full histories) by changing the script's code to make a GET request instead of a POST request during the XML extraction process. More precisely, the idea is sketched below (the original diff was not preserved; the URL and parameter set here are simplified, illustrative assumptions):
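```python
import requests

session = requests.Session()
# Hypothetical values for illustration; dumpgenerator.py builds these itself.
index_url = 'https://en.touhouwiki.net/index.php'
params = {'title': 'Special:Export', 'pages': 'Main Page', 'action': 'submit'}

# What master does in effect: request the export via POST. Per this report,
# the response from this wiki then lacks the closing </page> tag.
r_post = session.post(index_url, params=params)

# The change that worked for me: issue the same request via GET.
r_get = session.get(index_url, params=params)

print('</page>' in r_post.text)  # False on this wiki, per the report
print('</page>' in r_get.text)   # True
```

In dumpgenerator.py itself, the equivalent change is presumably just swapping `session.post` for `session.get` at the point where the export request is issued.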
This seems to work because a POST returned page XML without a closing `</page>` tag, which would raise a `PageMissingError` in the code section sketched below, while a GET returned the page XML with the closing tag, so the page was saved to the main XML file.
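A minimal sketch of that check, assuming it boils down to testing for the closing tag (the error name mirrors dumpgenerator.py, but the exact surrounding code is an assumption):

```python
import re

class PageMissingError(Exception):
    """Raised when Special:Export returns no complete <page> element."""

def check_export(xml, title):
    # Sketch: an export without a closing </page> tag is treated as a
    # missing page, which is what ends up logged to errors.log as
    # "missing in the wiki (probably deleted)".
    if not re.search(r'</page>', xml):
        raise PageMissingError(title)
    return xml
```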
I'm not really sure why this is the case, or even if making this kind of change would break the dump generation for other wikis (I tested with the InstallGentoo Wiki as well and the XML dump seemed to work just fine).