commoncrawl / commoncrawl

Common Crawl support library to access 2008-2012 crawl archives (ARC files)

Different formats? #17

Open spydaz opened 3 months ago

spydaz commented 3 months ago

Is it possible to get this in ZIM file format for use with https://kiwix.org/en/ ? Kiwix is an offline internet project that enables the creation of ZIM files: archives that can be browsed offline safely, including in places that have no internet access, such as remote locations.

I have seen copies of parts of this archive on the Internet Archive. The problem is that it should be segmented by language and placed into these archives, so that it can be a useful resource for others who are not data scientists but simple teachers who need offline access to such a large data resource. At present the files are for the rich man only, since you need a cloud just to be able to access them, despite their being shared on various platforms. In ZIM format they would be available for all people to access. In the past, the shard files were even corrupt after a painful download (when the internet crawl was much smaller).

Thanks, and please consider this if it is possible. As a use case, we could say these snapshots could then be browsed by archive: the smaller the archives, the easier it is for low-tech people (i.e. each shard should be self-contained and not reliant on the other segments), and hence selectable.

wumpus commented 3 months ago

We have two projects on our roadmap that do most of what you want. One is a WARC-to-zip tool that will give you a zip file containing ~ 30,000 webpages plus some spreadsheets with metadata. The other is providing WARCs sorted by language.

We are unlikely to ever directly support the Zim format. But maybe this will help: https://github.com/openzim/warc2zim

Keep in mind that we're a text-only archive, so playback of the web from our files is probably going to be disappointing.
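
For illustration only, here is a minimal Python sketch of the kind of per-language, self-contained zip discussed in this thread. It is not the planned Common Crawl tool. The function name `pack_by_language`, the `pages/` layout, and the `metadata.csv` columns are all assumptions made up for this example, and the input is assumed to be already-extracted page text with a known language, not WARC records.

```python
import csv
import io
import zipfile
from collections import defaultdict
from pathlib import Path


def pack_by_language(pages, out_dir):
    """Write one self-contained zip per language; return {language: zip path}.

    `pages` is an iterable of dicts with "url", "language" and "text" keys
    (a hypothetical input shape; reading WARC files is out of scope here).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Group the pages so that each archive holds a single language.
    by_lang = defaultdict(list)
    for page in pages:
        by_lang[page["language"]].append(page)

    written = {}
    for lang, group in sorted(by_lang.items()):
        zip_path = out_dir / f"pages-{lang}.zip"
        index = io.StringIO()
        writer = csv.writer(index)
        writer.writerow(["file", "url", "language", "characters"])
        with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as zf:
            for i, page in enumerate(group):
                name = f"pages/{i:05d}.txt"
                zf.writestr(name, page["text"])
                writer.writerow([name, page["url"], lang, len(page["text"])])
            # The metadata index is stored inside the archive it describes,
            # so each zip can be used without any of the others.
            zf.writestr("metadata.csv", index.getvalue())
        written[lang] = zip_path
    return written
```

Calling `pack_by_language(pages, "out")` would leave files such as `out/pages-en.zip` and `out/pages-fr.zip`, each openable on its own with any zip tool and a spreadsheet program for `metadata.csv`.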