WikiTeam / wikiteam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2024, WikiTeam has preserved more than 600,000 wikis.
https://github.com/WikiTeam
GNU General Public License v3.0
730 stars, 151 forks

dumpgenerator.py with --resume does not recognize a previous incomplete <page> tag and writes the same page again without deleting the unfinished copy #340

Open Wikiteam-on-Windows opened 5 years ago

Wikiteam-on-Windows commented 5 years ago

Corruption would be handled better if dumpgenerator.py, when resuming after a connection interruption, checked whether the last <page> tag in the existing dump was finished correctly. It already partly recognizes that it is resuming: with --resume it prints "Removing the last chunk of past XML dump: it is probably incomplete." However, it does not seem to notice that the tag was left unfinished, and it adds a new copy of the page to the XML (starting from its opening <page> tag) before the earlier copy of the same page was properly closed. The result is a FrankenPage. Currently these FrankenPages can only be fixed by hand, using the method mentioned in #339.
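To illustrate the kind of cleanup being asked for, here is a minimal sketch, not the actual dumpgenerator.py code. The function name `truncate_to_last_complete_page` and its `chunk_size` parameter are hypothetical. It assumes the dump is a MediaWiki XML export in which a literal `</page>` only ever appears as a real closing tag (page text is XML-escaped), so everything after the last `</page>` can be treated as an unfinished page and cut off before new pages are appended.

```python
import os

PAGE_CLOSE = b"</page>"


def truncate_to_last_complete_page(path, chunk_size=1 << 20):
    """Cut the dump right after its last </page> tag.

    Returns the byte offset just past that tag, or None if the file
    contains no complete page (the file is then left untouched).
    Hypothetical helper, not part of dumpgenerator.py.
    """
    with open(path, "r+b") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()
        tail = b""  # first bytes of the region already scanned
        # Scan backwards in chunks so large dumps are not loaded whole.
        while pos > 0:
            read_from = max(0, pos - chunk_size)
            f.seek(read_from)
            buf = f.read(pos - read_from) + tail
            idx = buf.rfind(PAGE_CLOSE)
            if idx != -1:
                keep = read_from + idx + len(PAGE_CLOSE)
                f.truncate(keep)   # drop the unfinished page
                f.seek(keep)
                f.write(b"\n")     # leave a clean line ending
                return keep
            # Keep enough bytes to catch a tag split across chunks.
            tail = buf[:len(PAGE_CLOSE) - 1]
            pos = read_from
    return None
```

After such a truncation, a resumed run would still need to restart from the page that was cut off rather than from the one after it; how the tool tracks that is outside this sketch.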