We need to verify the resulting data, for example:
Check how many pages (in the tagged corpora) are empty and compare that count with the number Wikipedia reports;
Generate relative numbers (page count, token count, etc.) for the tagged corpora and compare them with the official Wikipedia numbers;
Check what the dump parser deleted, to ensure we're not deleting important data (for example, maybe we should not delete lists, because they contain text).
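
The first two checks could be sketched roughly as below. This is only a minimal illustration: the corpus representation (an iterable of `(title, tokens)` pairs) and the helper names `corpus_stats` and `relative_difference` are assumptions made for the example, not the project's actual format or API.

```python
from typing import Dict, Iterable, List, Tuple


def corpus_stats(pages: Iterable[Tuple[str, List[str]]]) -> Dict[str, int]:
    """Count pages, empty pages, and tokens.

    Assumes (hypothetically) that each page is a (title, tokens) pair.
    """
    page_count = 0
    empty_pages = 0
    token_count = 0
    for _title, tokens in pages:
        page_count += 1
        token_count += len(tokens)
        if not tokens:  # a page with no tokens counts as empty
            empty_pages += 1
    return {"pages": page_count, "empty_pages": empty_pages, "tokens": token_count}


def relative_difference(ours: int, reference: int) -> float:
    """Signed difference relative to the reference figure (0.0 means equal)."""
    if reference == 0:
        raise ValueError("reference must be non-zero")
    return (ours - reference) / reference
```

The reference figures passed to `relative_difference` would have to come from Wikipedia's published statistics; a large relative difference in page or token count would be a hint to look more closely at what the dump parser removed.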