WikiTeam / wikiteam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2024, WikiTeam has preserved more than 600,000 wikis.
https://github.com/WikiTeam
GNU General Public License v3.0

Validate the completeness of the dump before exit #214

Open nemobis opened 9 years ago

nemobis commented 9 years ago

Currently, we just check that the XML is well-formed and that it ends with </mediawiki>. We should also check that the dump wasn't interrupted prematurely, as often happens when a wiki is problematic.

We download the siteinfo now, so we can compare the number of revisions and pages to the "real" one, even when the dump is already compressed, like this:

$ 7z e -so lyricswikiacom-20141223-history.xml.7z lyricswikiacom-20141223-history.xml | grep -c "<revision>"

[...]

Size:       8555812407
Compressed: 338185741
4419005

We probably want to leave some margin before retrying, or just log the mismatch somewhere visible: otherwise, if the wiki's sitestats are out of date, or a page is deleted, the numbers will never coincide. (In the example above, 75% of the revisions are missing.)
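A rough sketch of that comparison in Python, not the actual dumpgenerator.py code: `count_revisions`, `dump_looks_complete` and the 5% default `margin` are hypothetical names and values chosen for illustration. It counts lines containing `<revision>` (what `grep -c` does above) and accepts the dump if the count is within the margin of the number reported by the wiki.

```python
import io


def count_revisions(stream):
    """Count lines that contain a <revision> tag, like `grep -c "<revision>"`."""
    return sum(1 for line in stream if "<revision>" in line)


def dump_looks_complete(found, expected, margin=0.05):
    """Return True if `found` is at most `margin` (a fraction) below `expected`.

    `expected` would come from the wiki's site statistics; the margin is an
    assumed tolerance for stale statistics and pages deleted during the dump.
    """
    if expected <= 0:
        # No usable statistics: nothing to compare against.
        return True
    return found >= expected * (1 - margin)


# Toy dump fragment, for illustration only.
sample = io.StringIO(
    "<page>\n<title>A</title>\n<revision>\n</revision>\n"
    "<revision>\n</revision>\n</page>\n"
)
found = count_revisions(sample)
```

For a compressed dump, the stream could be the standard output of the `7z e -so` command shown above, read line by line.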

nemobis commented 9 years ago

The example was from launcher.py: fixing that requires https://github.com/WikiTeam/wikiteam/issues/145

However, we could also improve checkXMLIntegrity() in a simple way: copy the list of titles to a new file, remove each title from the list as we find it in the dump, and ensure none is left. This should also fix https://github.com/WikiTeam/wikiteam/issues/199
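Something like the following could illustrate the idea; it is a sketch under assumptions, not the current checkXMLIntegrity(). `missing_titles` is a hypothetical helper, it keeps the titles in an in-memory set rather than a copied file, and it assumes each `<title>` element sits on a single line of the dump, with XML entities escaped.

```python
import re
from html import unescape

TITLE_RE = re.compile(r"<title>(.*?)</title>")


def missing_titles(titles, dump_lines):
    """Return the titles from `titles` that never appear in the dump."""
    remaining = set(titles)
    for line in dump_lines:
        match = TITLE_RE.search(line)
        if match:
            # Titles are entity-escaped in the XML (e.g. &amp;), so unescape
            # before comparing with the plain-text title list.
            remaining.discard(unescape(match.group(1)))
    return remaining


# Toy data, for illustration only.
wanted = ["Main Page", "Tom & Jerry", "Lost page"]
dump = [
    "  <page>",
    "    <title>Main Page</title>",
    "  </page>",
    "  <page>",
    "    <title>Tom &amp; Jerry</title>",
    "  </page>",
]
left = missing_titles(wanted, dump)
```

An empty result means every listed title was found; anything left over is a page to retry or log.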

Then there is actual validation, https://github.com/WikiTeam/wikiteam/issues/128