Open spydaz opened 3 months ago
We have two projects on our roadmap that do most of what you want. One is a WARC-to-zip tool that will give you a zip file containing roughly 30,000 web pages plus some spreadsheets with metadata. The other is providing WARCs sorted by language.
We are unlikely to ever directly support the ZIM format. But maybe this will help: https://github.com/openzim/warc2zim
Keep in mind that we're a text-only archive, so playback of the web from our files is probably going to be disappointing.
Is it possible to get this in ZIM file format, for use with https://kiwix.org/en/ ? Kiwix is an offline internet project that enables the creation of ZIM files: archives that can be browsed offline safely, including in places that have no access to the internet, such as remote locations.
I have seen some copies of parts of this archive on the Internet Archive. The problem is that it should be segmented by language and placed into these archives, so that it can be a useful resource for others who are not data scientists but simple teachers who require offline access to such a large data resource. At present the files are for the rich man only, as you need a cloud just to be able to access them, despite their being shared on various platforms. In ZIM format they would be available for all people to access. In the past, the shard files were even corrupt after a painful download (when the internet crawl was much smaller).
Thanks, and please consider this if it is possible. As a use case, we could say these snapshots could then be browsed by archive: the smaller the archives, the easier it is for low-tech people (i.e. each shard should be self-contained and not reliant on the other segments), and hence selectable.