WikiTeam / wikiteam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2024, WikiTeam has preserved more than 600,000 wikis.
https://github.com/WikiTeam
GNU General Public License v3.0

commonschecker.py should check live status before complaining about empty/corrupt #291

Open nemobis opened 7 years ago

nemobis commented 7 years ago

Example recent output:

Checking Wikimedia Commons files from 2015-07-22 to 2015-07-22
== 2015-07-22 ==
Plymouth_Citybus_136_WA08LDF_(8975821135).jpg Plymouth_Citybus_136_WA08LDF_(8975821135).jpg corrupt (2690205 of 2758220 bytes)
Plymouth_Citybus_144_WA08LDZ_(7988771620).jpg Plymouth_Citybus_144_WA08LDZ_(7988771620).jpg corrupt (2028138 of 2076561 bytes)
Plymouth_Citybus_503_WF63LZC_(15359458628).jpg Plymouth_Citybus_503_WF63LZC_(15359458628).jpg corrupt (3836057 of 3846313 bytes)
Plymouth_Citybus_142_WA08LDV_(14676194620).jpg Plymouth_Citybus_142_WA08LDV_(14676194620).jpg corrupt (5589290 of 5581061 bytes)

These files are not actually corrupt: this is just a user cropping a series of images on that date, so their sizes changed.

Similarly, if a bulk upload gets deleted between the generation of commonssql.csv and the download, commonschecker will complain that the files are empty. We should instead query the Wikimedia Commons API for current data and check whether what we got is really the best we can have right now.
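A minimal sketch of what such a live check could look like, not the actual commonschecker.py code. It assumes Python 3 and the standard MediaWiki `action=query` / `prop=imageinfo` API with `iiprop=size|sha1`. The function names, the status labels and the User-Agent string are hypothetical, chosen only for illustration.

```python
import json
import urllib.parse
import urllib.request

API_URL = "https://commons.wikimedia.org/w/api.php"


def build_query(filename):
    """Build the query string asking for the current size and SHA-1 of a file."""
    return urllib.parse.urlencode({
        "action": "query",
        "titles": "File:" + filename,
        "prop": "imageinfo",
        "iiprop": "size|sha1",
        "format": "json",
    })


def parse_live_info(data):
    """Return {'size', 'sha1'} from an API response, or None if no file exists."""
    pages = data.get("query", {}).get("pages", {})
    for page in pages.values():
        info = page.get("imageinfo")
        if info:
            return {"size": info[0].get("size"), "sha1": info[0].get("sha1")}
    return None


def fetch_live_info(filename):
    """Ask Commons for the current state of a file (performs a network request)."""
    request = urllib.request.Request(
        API_URL + "?" + build_query(filename),
        headers={"User-Agent": "commonschecker-sketch/0.1"},  # hypothetical UA
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_live_info(json.load(response))


def classify(local_size, csv_size, live):
    """Decide whether a size mismatch is a real problem (hypothetical labels)."""
    if local_size == csv_size:
        return "ok"
    if live is None:
        return "deleted upstream"   # e.g. a bulk upload deleted after the dump
    if local_size == live["size"]:
        return "changed upstream"   # e.g. re-uploaded or cropped after the dump
    return "corrupt"
```

With this, a mismatch like `2690205 of 2758220 bytes` would only be reported as corrupt if the live size on Commons also differs from what was downloaded. Comparing the `sha1` value against a hash of the local file would be a stricter check than comparing sizes.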

nemobis commented 4 years ago

Counterargument: the user running the script should use a freshly generated DB query so that the mismatch is less likely. I'll probably decline this myself the next time I run the Commons archive (I really hope someone beats me to it though!).