WikiTeam / wikiteam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2023, WikiTeam has preserved more than 350,000 wikis.
https://github.com/WikiTeam
GNU General Public License v3.0

Wrong XML when downloading a Fandom wiki #443

Closed · gandbg closed this issue 1 year ago

gandbg commented 1 year ago

I'm trying to dump a Fandom wiki (Fandom is the new name for Wikia), but I'm getting a "wrong XML" error for every page.

Trying generating a new dump into a new directory...
Loading page titles from namespaces = all
Excluding titles from namespaces = None
31 namespaces found
    Retrieving titles in the namespace 0
    161322 titles retrieved in the namespace 0
    Retrieving titles in the namespace 1
    1425 titles retrieved in the namespace 1
[...]
Titles saved at... logosfandomcom-20221104-titles.txt
883215 page titles loaded
https://logos.fandom.com/api.php
    In attempt 1, XML for "Main_Page" is wrong. Waiting 20 seconds and reloading...
    In attempt 2, XML for "Main_Page" is wrong. Waiting 40 seconds and reloading...
    In attempt 3, XML for "Main_Page" is wrong. Waiting 60 seconds and reloading...
    In attempt 4, XML for "Main_Page" is wrong. Waiting 80 seconds and reloading...
    We have retried 5 times
    MediaWiki error for "Main_Page", network error or whatever...
    Trying to save only the last revision for this page...
ATTENTION: This wiki does not allow some parameters in Special:Export, therefore pages with large histories may be truncated
Retrieving the XML for every page from "start"
    In attempt 1, XML for "!" is wrong. Waiting 20 seconds and reloading...
    In attempt 2, XML for "!" is wrong. Waiting 40 seconds and reloading...
    In attempt 3, XML for "!" is wrong. Waiting 60 seconds and reloading...
    In attempt 4, XML for "!" is wrong. Waiting 80 seconds and reloading...
    We have retried 5 times
    MediaWiki error for "!", network error or whatever...
    Trying to save only the last revision for this page...
ATTENTION: This wiki does not allow some parameters in Special:Export, therefore pages with large histories may be truncated
    !, 1 edit
    In attempt 1, XML for "!mpossible" is wrong. Waiting 20 seconds and reloading...
    Read timeout: HTTPSConnectionPool(host='logos.fandom.com', port=443): Read timed out. (read timeout=10)
    In attempt 2, XML for "!mpossible" is wrong. Waiting 40 seconds and reloading...
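For context, the escalating 20/40/60/80-second waits in the log correspond to a simple retry loop with linear backoff. A minimal sketch of that pattern in Python (the helper names are hypothetical, not dumpgenerator.py's actual code):

    import time

    def fetch_xml_with_backoff(fetch_page, max_attempts=5, base_wait=20):
        # Retry a page fetch, sleeping 20, 40, 60, ... seconds between
        # attempts, mirroring the escalating waits in the log above.
        # fetch_page is a hypothetical callable that returns the XML
        # string, or None when the response is malformed.
        for attempt in range(1, max_attempts + 1):
            xml = fetch_page()
            if xml is not None:
                return xml
            if attempt < max_attempts:
                time.sleep(base_wait * attempt)
        return None  # caller falls back to saving only the last revision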

I've tried using Firefox to call the parse module from api.php, and everything worked perfectly. I've also tried re-downloading the script (and the wiki) multiple times, but with no success.
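For reference, a browser check like the one described can be done with a plain api.php parse request; the exact URL below is an assumption using standard MediaWiki parameters, not taken from the report:

    https://logos.fandom.com/api.php?action=parse&page=Main_Page&format=json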

Could this issue be on Fandom's side, or is it just the script messing things up?

nemobis commented 1 year ago

On 06/11/22 at 17:51, GABG wrote:

I'm trying to dump a Fandom wiki (Fandom is the new name for Wikia), but I'm getting a "wrong XML" error for every page.

Did you use --xmlrevisions? Not much works without it.

gandbg commented 1 year ago

Sorry, I forgot to include the command I'm using. No, I didn't use --xmlrevisions; this is my command:

$ ./dumpgenerator.py --api=https://logos.fandom.com/api.php --xml --images

I've successfully dumped a few Fandom wikis in the past with the same command (using different API endpoints), and everything worked as it should.

nemobis commented 1 year ago

On 07/11/22 at 14:41, GABG wrote:

I've successfully dumped a few Fandom wikis in the past

How long ago? I also dumped many Wikia wikis with the default method a decade ago, but the failure rate was high and only got higher. That's why --xmlrevisions was introduced. It may not work for everything, but please give it a try.
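Concretely, that would be the earlier command with the flag added, e.g.:

$ ./dumpgenerator.py --api=https://logos.fandom.com/api.php --xml --xmlrevisions --images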

gandbg commented 1 year ago

I've tried --xmlrevisions and it's working properly, thanks for the help! The Fandom dumps are pretty recent; you can check my Internet Archive profile.

nemobis commented 1 year ago

Thank you for your uploads!