nemobis opened 9 years ago
This affected over 400 of the 5800 wikis uploaded so far. SketchCow moved them to the correct collection, but we still need to fix the metadata. So, either we discover how to override previously set metadata via curl, or we switch to ia-wrapper (https://pypi.python.org/pypi/internetarchive ) anyway. Hopefully that will make things more robust, although there is clearly a limit given that we're just scraping that info from random HTML.
pir², as you noted at https://archive.org/details/wiki-ropewikicom , perhaps the worst omission is originalurl. Assigning to you because you're already working on https://github.com/jjjake/ia-wrapper#modifying-metadata-from-python ; once that's done, at some point I'll redownload those wikis and the metadata will be restored.
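For reference, once modifying metadata from Python lands in ia-wrapper, the backfill could look roughly like the sketch below. Only `get_item`/`modify_metadata` come from the internetarchive library; the `build_metadata` helper, the field values, and the item identifier are illustrative assumptions, not the actual fix:

```python
# Sketch: backfill missing metadata on an IA item with the internetarchive
# library (https://pypi.python.org/pypi/internetarchive).
# build_metadata() and the example values are hypothetical.

def build_metadata(originalurl, sitename):
    """Assemble the metadata fields we want to restore on the item."""
    return {
        "originalurl": originalurl,
        "title": "Wiki - " + sitename,
        "collection": "wikiteam",
        "mediatype": "web",
    }

metadata = build_metadata("http://www.ropewiki.com/", "RopeWiki")
print(metadata["originalurl"])

# Actual call (needs IA credentials configured, e.g. via `ia configure`):
# from internetarchive import get_item
# item = get_item("wiki-ropewikicom")
# item.modify_metadata(metadata)
```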
It would still be nice to add originalurl and so on to the items which are missing it. There are about a thousand (although in some cases that's fine, e.g. Wikia image dumps); list attached, generated with `~/.local/bin/ia search -f originalurl "collection:wikiteam -subject:wikispaces -collection:wikimediacommons -subject:kiwix" | grep -v originalurl | jq .identifier`.
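The same filtering as the `ia search ... | grep -v originalurl` pipeline could be done directly in Python. `search_items()` is from the internetarchive library; the filtering helper and the sample results are made-up illustrations:

```python
# Sketch: find items in the search results that lack an originalurl field.
# missing_originalurl() and the sample docs are hypothetical.

def missing_originalurl(results):
    """Yield identifiers of search result docs lacking an originalurl."""
    for doc in results:
        if not doc.get("originalurl"):
            yield doc["identifier"]

sample = [
    {"identifier": "wiki-ropewikicom"},  # originalurl missing
    {"identifier": "wiki-examplecom", "originalurl": "http://example.com/"},
]
print(list(missing_originalurl(sample)))  # ['wiki-ropewikicom']

# Against the live API (assumed usage, untested here):
# from internetarchive import search_items
# results = search_items(
#     'collection:wikiteam -subject:wikispaces'
#     ' -collection:wikimediacommons -subject:kiwix',
#     fields=['identifier', 'originalurl'])
# for identifier in missing_originalurl(results):
#     print(identifier)
```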
The most tedious part is downloading the files from IA, because that can take ages. We could either mass-download the (smaller) items and then extract the files, or write a script with some heuristics.
> The most tedious part is downloading the files from IA
Actually I'm wrong (now), because the 7z viewer was implemented! We can download siteinfo.json directly, as long as we know the filename of the 7z (which is trivial to get with the ia library): https://archive.org/download/wiki-jackmcbarninsomnia247nl_w/jackmcbarninsomnia247nl_w-20141127-wikidump.7z/siteinfo.json
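Building that URL programmatically could look like the sketch below. Only the URL scheme comes from the example above; the filename-picking helper is a hypothetical illustration, and the `item.files` usage in the comments assumes the internetarchive library's file listing:

```python
# Sketch: construct the 7z-viewer URL for siteinfo.json inside an item's
# wikidump. siteinfo_url() is a hypothetical helper.

def siteinfo_url(identifier, filenames):
    """Return the direct siteinfo.json URL for the item's wikidump 7z."""
    for name in filenames:
        if name.endswith("-wikidump.7z"):
            return ("https://archive.org/download/%s/%s/siteinfo.json"
                    % (identifier, name))
    return None

url = siteinfo_url(
    "wiki-jackmcbarninsomnia247nl_w",
    ["jackmcbarninsomnia247nl_w-20141127-wikidump.7z"])
print(url)

# With the ia library, the filenames come from the item itself (assumed):
# from internetarchive import get_item
# item = get_item("wiki-jackmcbarninsomnia247nl_w")
# url = siteinfo_url(item.identifier, [f["name"] for f in item.files])
```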
https://archive.org/details/wiki-jackmcbarninsomnia247nl_w — I note that S3 is currently very flaky, probably overloaded: it sometimes hangs connections for minutes or interrupts uploads before they complete. But this is more likely to be (again) some incorrect escaping of something on our end.