WikiTeam / wikiteam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest wikis. As of 2024, WikiTeam has preserved more than 600,000 wikis.
https://github.com/WikiTeam
GNU General Public License v3.0
730 stars, 151 forks

missing media in articles due to truncated file names #162

Open Erkan-Yilmaz opened 10 years ago

Erkan-Yilmaz commented 10 years ago

During the download I got the message "Filename is too long, truncating" (1). A grep of the XML dump does not find the truncated filename (2), nor does it appear in the other files in the directory, but the original filename is still present in the XML (3).

I haven't tried an import yet, so I wonder whether problems will arise later on (e.g. missing pictures in wiki pages)?

(1) from runtime:
  Downloaded 380 images
  Filename is too long, truncating. Now it is: US Navy 081116-N-7544A-093 Lt. Carmen Harmon, a Navy nurse embarked aboard the amphibious assault shaed01100a4ca3c70dad521fc7616f8ac.jpg
  Downloaded 390 images

(2) no result for: grep "01100a4ca3c70dad521fc7616f8ac" skilledtestscom_wiki-20140713-history.xml

(3) grep "081116-N-7544A-093" skilledtestscom_wiki-20140713-history.xml

File:US Navy 081116-N-7544A-093 Lt. Carmen Harmon, a Navy nurse embarked aboard the amphibious assault ship USS Kearsarge (LHD 3), shares a computer game with children.jpg
  <comment>Picture taken from commons, see [http://commons.wikimedia.org/wiki/File:US_Navy_081116-N-7544A-093_Lt._Carmen_Harmon,_a_Navy_nurse_embarked_aboard_the_amphibious_assault_ship_USS_Kearsarge_(LHD_3),_shares_a_computer_game_with_children.jpg].
  <text xml:space="preserve" bytes="553">Picture taken from commons, see [http://commons.wikimedia.org/wiki/File:US_Navy_081116-N-7544A-093_Lt._Carmen_Harmon,_a_Navy_nurse_embarked_aboard_the_amphibious_assault_ship_USS_Kearsarge_(LHD_3),_shares_a_computer_game_with_children.jpg].

see [http://commons.wikimedia.org/w/index.php?title=File:US_Navy_081116-N-7544A-093_Lt._Carmen_Harmon,_a_Navy_nurse_embarked_aboard_the_amphibious_assault_ship_USS_Kearsarge_(LHD_3),_shares_a_computer_game_with_children.jpg&amp;action=history version history at commons] for list of authors
File:US_Navy_081116-N-7544A-093_Lt._Carmen_Harmon,_a_Navy_nurse_embarked_aboard_the_amphibious_assault_ship_USS_Kearsarge_(LHD_3),_shares_a_computer_game_with_children.jpg|
File:US_Navy_081116-N-7544A-093_Lt._Carmen_Harmon,_a_Navy_nurse_embarked_aboard_the_amphibious_assault_ship_USS_Kearsarge_(LHD_3),_shares_a_computer_game_with_children.jpg|
File:US_Navy_081116-N-7544A-093_Lt._Carmen_Harmon,_a_Navy_nurse_embarked_aboard_the_amphibious_assault_ship_USS_Kearsarge_(LHD_3),_shares_a_computer_game_with_children.jpg|

nemobis commented 10 years ago

Erkan Yilmaz, 13/07/2014 15:51:

grep "01100a4ca3c70dad521fc7616f8ac" skilledtestscom_wiki-20140713-history.xml

Files are in the images/ directory, not in the XML. Other than using tar archives or storing files by hash there isn't much we can do...
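To make the "storing files by hash" idea concrete, here is a minimal sketch. It is not something the dump script is stated to do: the function name `store_by_hash`, the use of SHA-1, and the JSON manifest are all assumptions for illustration. The point is that the on-disk name becomes short and fixed-length, while the original filename lives in a side file.

```python
import hashlib
import json
from pathlib import Path


def store_by_hash(original_name: str, data: bytes,
                  store_dir: Path, manifest_path: Path) -> str:
    """Write `data` under a content-hash name and record the original name.

    Hypothetical helper: SHA-1 and the JSON manifest are illustrative choices.
    """
    store_dir.mkdir(parents=True, exist_ok=True)
    # Fixed-length name, independent of how long the original filename is.
    stored_name = hashlib.sha1(data).hexdigest() + Path(original_name).suffix
    (store_dir / stored_name).write_bytes(data)

    # The manifest maps original filename -> stored filename.
    manifest = {}
    if manifest_path.exists():
        manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    manifest[original_name] = stored_name
    manifest_path.write_text(
        json.dumps(manifest, ensure_ascii=False, indent=2), encoding="utf-8")
    return stored_name
```

The trade-off is that the images/ directory is no longer browsable by name; anything that imports the files has to consult the manifest first.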

emijrp commented 10 years ago

Truncating filenames was a "solution" to the problem of running the script on different file systems (FAT, NTFS, ext4), which allow different maximum filename lengths.

Obviously, when you import the XML (which contains the original filenames in [[File:...]] calls) and later the images (with truncated filenames), the names won't match and you will see broken redlinks.

A solution is to run a script, after importing the XML and images, that moves the pages File:Filename_truncated.jpg to File:Filename_original.jpg (using the .desc XML files, which contain the original filename between <title> and </title>).
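The truncated name in the log above ends in a long hexadecimal string before the extension, which suggests a prefix-plus-hash scheme. The following is a minimal sketch of that kind of truncation, not necessarily what the script does: the function name, the 100-character limit, and the choice of MD5 are assumptions.

```python
import hashlib
import os


def truncate_filename(filename: str, limit: int = 100) -> str:
    """Shorten an over-long filename while keeping it unique.

    Keeps the first `limit` characters, appends a hash of the full original
    name, and re-attaches the extension. `limit` and MD5 are assumptions.
    """
    if len(filename) <= limit:
        return filename
    # Hash the complete original name so two long names sharing a prefix
    # still end up with different truncated names.
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    _, extension = os.path.splitext(filename)  # extension includes the dot
    return filename[:limit] + digest + extension
```

Note that the result is longer than `limit` (prefix, 32 hash characters, extension), so the limit has to leave headroom below the file system's real maximum.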

That script is easy to code. I leave it as a task for others who want to help.
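As a starting point for the script described above, here is a rough sketch of its first half: building a map from truncated filenames to original ones. It assumes each image in images/ has a sibling file named `<image filename>.desc` containing a `<title>File:...</title>` element; the function name `original_titles` is hypothetical. The actual page move inside the wiki is left out, since it depends on how the target wiki is administered.

```python
import html
import re
from pathlib import Path

# A regex is used instead of an XML parser because the .desc content may not
# be a complete, well-formed XML document (assumption).
TITLE_RE = re.compile(r"<title>(.*?)</title>", re.DOTALL)


def original_titles(images_dir: Path) -> dict[str, str]:
    """Map each truncated image filename to the original name in its .desc."""
    mapping = {}
    for desc in sorted(images_dir.glob("*.desc")):
        text = desc.read_text(encoding="utf-8", errors="replace")
        match = TITLE_RE.search(text)
        if match is None:
            continue  # no title recorded, nothing to recover
        title = html.unescape(match.group(1)).strip()
        # Drop the namespace prefix ("File:", "Image:", ...) if present.
        original = title.split(":", 1)[1] if ":" in title else title
        stored = desc.name[: -len(".desc")]
        # MediaWiki treats spaces and underscores in titles as equivalent,
        # so compare after normalising them.
        if stored.replace("_", " ") != original.replace("_", " "):
            mapping[stored] = original
    return mapping
```

Each entry in the returned dictionary is one File: page that would need to be moved (or one file that would need renaming before upload, if the target file system allows the full name).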