WikiTeam / wikiteam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2024, WikiTeam has preserved more than 600,000 wikis.
https://github.com/WikiTeam
GNU General Public License v3.0

dumpgenerator.py exits after error with a single image #410

Closed: sky-is-winning closed this issue 3 years ago

sky-is-winning commented 3 years ago

Trying to download over 100,000 images from a wiki. Dumpgenerator.py gets a few thousand images in, then hits a 404 error on a single image and exits. Should there not be an option to just skip that image and continue with the next images?

nemobis commented 3 years ago

On 26/07/21 at 15:15, floogal wrote:

Should there not be an option to just skip that image and continue with the next images?

There is: it's called retrying manually. There's no automatic handling because nobody has yet devised a way to automatically guess whether an error is real or not and what to do about it.
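
A minimal sketch of what "retrying manually" can look like: re-check the failing image URL a few times and see whether the 404 is persistent or transient (the URL below is a made-up placeholder; substitute the one that failed):

```python
# Re-check a single failing image URL a few times before deciding what to do.
# The URL is a hypothetical placeholder, not output from dumpgenerator.py.
import time

import requests

url = "https://wiki.example.org/images/a/ab/Example.png"

for attempt in range(1, 4):
    response = requests.head(url, allow_redirects=True, timeout=30)
    print(f"attempt {attempt}: HTTP {response.status_code}")
    if response.status_code == 200:
        break
    time.sleep(5)  # pause between retries to be polite to the server
```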

burner1024 commented 3 years ago

But if you do retry, what then? It'll likely return 404 again, and exit again. The only sane workaround I see is to allow some 404 count threshold in args...
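
A hypothetical sketch of what such a threshold could look like around a download loop (none of these names are actual dumpgenerator.py flags or functions; it only illustrates the idea):

```python
# Hypothetical sketch of a "--max-404" style threshold around an image download
# loop. The argument name and helper are illustrations, not dumpgenerator.py APIs.
import argparse

import requests


def download_images(image_urls, max_404):
    """Download each URL, tolerating up to max_404 missing images."""
    missing = []
    for url in image_urls:
        response = requests.get(url, timeout=60)
        if response.status_code == 404:
            missing.append(url)
            if len(missing) > max_404:
                raise RuntimeError(
                    f"more than {max_404} images returned 404; aborting so the "
                    "dump is not silently incomplete"
                )
            continue  # skip this image and keep going
        response.raise_for_status()
        # ... write response.content to disk here ...
    return missing


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--max-404", type=int, default=0, dest="max_404",
                        help="how many missing images to tolerate before giving up")
    args = parser.parse_args()
    skipped = download_images(
        ["https://wiki.example.org/images/a/ab/Example.png"], args.max_404)
    print("skipped (404):", skipped)
```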

nemobis commented 3 years ago

We err on the side of caution. It's dangerous to automatically skip failed images and call a dump done anyway, because the user may incorrectly think the wiki was archived when it was not. Therefore we force a manual fix.

burner1024 commented 3 years ago

Therefore we force a manual fix.

Try again, 404. Again, 404. What's the fix?

nemobis commented 3 years ago

On 16/11/21 at 01:50, burner1024 wrote:

Try again, 404. Again, 404. What's the fix?

If the image is truly missing, you need to manually remove it from the list of titles to download.
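
For example, assuming the dump's `*-images.txt` list is a tab-separated file with the image filename in the first column (check your dump's actual format before running anything like this), a small filter can drop the one entry that keeps failing:

```python
# Hypothetical sketch: drop one missing image from a dump's images list so a
# resumed run no longer tries to fetch it. Assumes a tab-separated list with
# the image filename in the first column; verify against your own dump first.
bad_name = "Example.png"                       # the image that keeps returning 404
list_path = "examplewiki-20211116-images.txt"  # placeholder file name

with open(list_path, encoding="utf-8") as handle:
    lines = handle.readlines()

kept = [line for line in lines if line.split("\t", 1)[0] != bad_name]
print(f"removed {len(lines) - len(kept)} entry/entries")

with open(list_path, "w", encoding="utf-8") as handle:
    handle.writelines(kept)
```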

burner1024 commented 3 years ago

Oh, one can do that. Well, I guess technically that works.

If the image is truly missing

How is this supposed to be determined, then?

nemobis commented 3 years ago

On 16/11/21 at 12:45, burner1024 wrote:

How is this supposed to be determined, then?

Probably one would start from the MediaWiki interface and compare what information is reported by the MediaWiki interface, the MediaWiki API, the webserver, and any other sources. If they disagree, debug the likely misconfiguration or software bug, and find out what data can still be pulled out.
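
A hypothetical sketch of one such comparison, querying the standard imageinfo API and then asking the webserver for the file it points to (the wiki URL and file title are placeholders):

```python
# Hypothetical sketch: compare what the MediaWiki API reports about a file
# with what the webserver actually serves. URL and title are placeholders.
import requests

API = "https://wiki.example.org/w/api.php"
TITLE = "File:Example.png"

data = requests.get(API, params={
    "action": "query",
    "format": "json",
    "titles": TITLE,
    "prop": "imageinfo",
    "iiprop": "url",
}, timeout=30).json()

page = next(iter(data["query"]["pages"].values()))
info = page.get("imageinfo")

if not info:
    print("API reports no imageinfo: the file page is missing or broken")
else:
    file_url = info[0]["url"]
    status = requests.head(file_url, allow_redirects=True, timeout=30).status_code
    print(f"API points at {file_url}, webserver answers HTTP {status}")
    # A 404 here, while the API still reports a URL, suggests a server-side
    # misconfiguration or a genuinely lost file.
```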

Archiving MediaWiki sites requires knowledge of MediaWiki; there's little to do about that. If you don't have intimate knowledge of MediaWiki, it's still useful to try: just make sure to note that when you archive your dumps on archive.org. If you actually need to transfer a wiki with 100k images, I'd probably recommend hiring an expert.