WikiTeam / wikiteam

Tools for downloading and preserving wikis. We archive wikis, from Wikipedia to the tiniest ones. As of 2024, WikiTeam has preserved more than 600,000 wikis.
https://github.com/WikiTeam
GNU General Public License v3.0

IA items without originalurl or other metadata #209

Open nemobis opened 9 years ago

nemobis commented 9 years ago

https://archive.org/details/wiki-jackmcbarninsomnia247nl_w I note that S3 is currently very weird, probably overloaded: it sometimes hangs connections for minutes or interrupts uploads before they complete. But this is more likely to be (again) some incorrect escaping of something on our end.

nemobis commented 9 years ago

This affected over 400 wikis out of 5800 uploaded so far. SketchCow moved them to the correct collection, but we still need to fix the metadata. So, either we discover how to override previous metadata via curl or we need to switch to ia-wrapper (https://pypi.python.org/pypi/internetarchive ) anyway. Hopefully that will make things more robust, although there is clearly a limit given we're just scraping that info from random HTML.
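For the record, once we switch to the `internetarchive` package the override can be scripted from Python; a minimal untested sketch (the `metadata_patch`/`fix_item` names and the example identifier and URL in the comment are made up, and credentials must already be configured with `ia configure`):

```python
# Sketch, not WikiTeam's actual tooling: patch missing metadata on an
# IA item with the internetarchive library (pip install internetarchive).
def metadata_patch(current, desired):
    """Return only the desired fields that are missing or different,
    so metadata that is already correct is never rewritten."""
    return {k: v for k, v in desired.items() if current.get(k) != v}

def fix_item(identifier, desired):
    # Imported here so metadata_patch stays usable without the library.
    import internetarchive
    item = internetarchive.get_item(identifier)
    patch = metadata_patch(item.metadata, desired)
    if patch:
        # Sends a metadata write request; needs configured IA credentials.
        return item.modify_metadata(patch)
    return None

# Example (placeholder identifier and URL):
# fix_item("wiki-example_w", {"originalurl": "https://example.org/w/"})
```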

nemobis commented 9 years ago

pir², as you noted at https://archive.org/details/wiki-ropewikicom , perhaps the worst omission is originalurl. Assigning this to you since you're already working on https://github.com/jjjake/ia-wrapper#modifying-metadata-from-python ; once that's done, I'll redownload those wikis at some point and the metadata will be restored.

nemobis commented 4 years ago

It would still be nice to add originalurl and so on to the items that are missing it. There are about a thousand (although in some cases that's OK, e.g. Wikia image dumps); list attached, generated with:

```
~/.local/bin/ia search -f originalurl "collection:wikiteam -subject:wikispaces -collection:wikimediacommons -subject:kiwix" | grep -v originalurl | jq .identifier
```
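The same query can be run through the library's Python API instead of the shell pipeline; a sketch under that assumption (the function names below are my own, not the tool's):

```python
# Sketch: find wikiteam items lacking originalurl via the
# internetarchive library's search API rather than ia/grep/jq.
def without_originalurl(results):
    """Keep identifiers of search results that lack an originalurl."""
    return [r["identifier"] for r in results if not r.get("originalurl")]

def find_missing():
    # Imported here so the pure filter works without the library installed.
    import internetarchive
    query = ("collection:wikiteam -subject:wikispaces "
             "-collection:wikimediacommons -subject:kiwix")
    results = internetarchive.search_items(
        query, fields=["identifier", "originalurl"])
    return without_originalurl(results)
```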

The most tedious thing is to download the files from IA, because that can take ages. We could either mass download the (smaller) items and then extract the files, or write some script with some heuristic.

2020-01_wikiteam-without-originalurl.txt

nemobis commented 4 years ago

> The most tedious thing is to download the files from IA

Actually I'm now wrong, because the 7z viewer has since been implemented! We can now download the siteinfo.json directly, as long as we know the filename of the 7z (which is trivial to obtain with the ia library): https://archive.org/download/wiki-jackmcbarninsomnia247nl_w/jackmcbarninsomnia247nl_w-20141127-wikidump.7z/siteinfo.json
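The URL pattern is just identifier plus dump filename plus the path inside the archive, so a tiny helper can build it (the filename is passed in here; in practice it would come from the ia library's file listing):

```python
# Sketch: build the direct siteinfo.json URL through IA's online 7z
# viewer, given an item identifier and its -wikidump.7z filename.
def siteinfo_url(identifier, dump_filename):
    return (f"https://archive.org/download/{identifier}/"
            f"{dump_filename}/siteinfo.json")

print(siteinfo_url("wiki-jackmcbarninsomnia247nl_w",
                   "jackmcbarninsomnia247nl_w-20141127-wikidump.7z"))
# prints the URL quoted above
```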