commoncrawl / cc-pyspark

Process Common Crawl data with Python and Spark
MIT License
406 stars, 86 forks

Test and update examples to work with ARC files of the 2008 - 2012 crawls #20

Open sebastian-nagel opened 4 years ago

sebastian-nagel commented 4 years ago

warcio can also read ARC files, so all examples written for WARC files should, in principle, also run on the ARC files of the 2008 - 2012 crawls. This still needs to be tested, and the examples updated where they fail.