fsrrt / wikiteam

Automatically exported from code.google.com/p/wikiteam

Task force corrupt dumps (empty .desc and image files) #55

Open GoogleCodeExporter opened 8 years ago

GoogleCodeExporter commented 8 years ago
Found some errors in task force dumps:

* http://archive.org/details/wiki-spottingworldcom_wiki (corrupt images, empty 
desc: http://www.spottingworld.com/File:Wikiquote-logo.svg)
* http://archive.org/details/wiki-stargate_sg1_solutionscom_w (images look 
good but the .desc files are empty: 
http://www.stargate-sg1-solutions.com/wiki/File:Banner-3teams.jpg)

Perhaps the issue is a bad split. Some MediaWikis include a bytes attribute in 
the <text> tag, for example:

<text xml:space="preserve" bytes="36">Banner featuring SG1, SGA, SGU teams</text>

Empty descriptions are not very important because the XML dump includes all 
pages in the wiki (including Image:... and File:... pages), but it is better 
if .desc files are not generated as 0-byte files.
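
To illustrate the suspected bad split, here is a minimal sketch in Python. It is not the project's actual code: `extract_description` and `TEXT_RE` are hypothetical names, and the naive split string is only an assumption about how such a split might look. It shows that splitting on an opening tag without the bytes attribute finds nothing, while a pattern that tolerates any attributes still recovers the description.

```python
import re
from html import unescape

# Matches <text ...>body</text> with any attributes (e.g. bytes="36").
# The (?<!/) lookbehind skips self-closing <text ... /> elements.
TEXT_RE = re.compile(r'<text\b[^>]*(?<!/)>(.*?)</text>', re.DOTALL)


def extract_description(xml):
    """Return the body of the first <text> element, or '' if there is none."""
    match = TEXT_RE.search(xml)
    return unescape(match.group(1)) if match else ''


sample = ('<text xml:space="preserve" bytes="36">'
          'Banner featuring SG1, SGA, SGU teams</text>')

# Hypothetical naive approach: split on the exact tag without bytes="...".
# The separator never occurs, so the result has a single element and any
# code taking the second element would end up with an empty description.
naive_parts = sample.split('<text xml:space="preserve">')

description = extract_description(sample)
```

With this sample, `naive_parts` contains only the unsplit input, whereas `description` holds the banner text, so a .desc file written from it would not be empty.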

Original issue reported on code.google.com by emi...@gmail.com on 14 Aug 2012 at 7:08

GoogleCodeExporter commented 8 years ago
Empty .desc files fixed in #795 and #796.

But images are downloaded as corrupt in some wikis.

Original comment by emi...@gmail.com on 14 Aug 2012 at 7:12

GoogleCodeExporter commented 8 years ago
It is probably better to use .xml instead of .desc as the description file extension.

Original comment by emi...@gmail.com on 14 Aug 2012 at 7:17

GoogleCodeExporter commented 8 years ago
All those wikis' images are actually HTML files with error 405. Trying to 
compile a list now.

Original comment by nemow...@gmail.com on 15 Aug 2012 at 10:49

GoogleCodeExporter commented 8 years ago
I ran this silly command:

   $ find . -mindepth 3 -maxdepth 3 -type f -exec awk '/<html>/ && NR < 2 {print FILENAME; nextfile}' {} \;

and I attach the output.
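
For readers less familiar with find and awk: the command visits regular files exactly three levels below the current directory and prints those whose first line contains `<html>`. A rough Python equivalent might look like the sketch below; the function name `find_html_masquerading_files` is hypothetical and the code is an illustration, not part of the project's tooling.

```python
import os


def find_html_masquerading_files(root, depth=3):
    """Yield regular files exactly `depth` levels below `root` whose
    first line contains '<html>' (mirrors -mindepth 3 -maxdepth 3)."""
    root = os.path.abspath(root)
    base_level = root.rstrip(os.sep).count(os.sep)
    for dirpath, dirnames, filenames in os.walk(root):
        # Level of this directory relative to root (root itself is 0).
        level = dirpath.count(os.sep) - base_level
        if level >= depth - 1:
            dirnames[:] = []  # do not descend past the wanted depth
        if level != depth - 1:
            continue  # files here are not at the wanted depth
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Like find's -type f: skip symlinks and special files.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            with open(path, 'rb') as handle:
                first_line = handle.readline()
            if b'<html>' in first_line:
                yield path
```

Calling `list(find_html_masquerading_files('.'))` from the directory that holds the dumps would return the suspect paths, assuming the same layout as in the command above.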

Original comment by nemow...@gmail.com on 15 Aug 2012 at 10:07

GoogleCodeExporter commented 8 years ago
So, if this is caused by error 405, it would be the same as issue 68.

Original comment by nemow...@gmail.com on 25 Oct 2013 at 12:49

GoogleCodeExporter commented 8 years ago
emijrp, can you please check if this happens in the latest dumps too?

Original comment by nemow...@gmail.com on 31 Jan 2014 at 2:59