VIDA-NYU / ache

ACHE is a web crawler for domain-specific search.
http://ache.readthedocs.io
Apache License 2.0

Support standard WARC file format #64

Closed (aecio closed this issue 7 years ago)

aecio commented 7 years ago

WARC is a standardized file format for storing web crawl data. It's widely used for large-scale web data collections such as CommonCrawl and ClueWeb12.

A draft of the WARC standard (ISO 28500) is available at: http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf
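
For readers unfamiliar with the format, here is a minimal sketch of what a single WARC/1.0 `response` record looks like on disk. It is an illustration only, not how ACHE implements this: the helper names `build_warc_record` and `parse_warc_record` are hypothetical, the example URL and payload are made up, and the sketch assumes an uncompressed record containing just the commonly used header fields.

```python
import uuid
from datetime import datetime, timezone


def build_warc_record(target_uri: str, http_payload: bytes) -> bytes:
    """Serialize one WARC/1.0 'response' record (hypothetical helper)."""
    headers = [
        ("WARC-Type", "response"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "application/http; msgtype=response"),
        # Content-Length counts only the bytes of the content block.
        ("Content-Length", str(len(http_payload))),
    ]
    # Version line, named fields, then a blank line that ends the header.
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in headers) + "\r\n"
    # Every record is terminated by two CRLF pairs after the content block.
    return head.encode("utf-8") + http_payload + b"\r\n\r\n"


def parse_warc_record(record: bytes):
    """Split a single record back into (version, header fields, content block)."""
    head, _, rest = record.partition(b"\r\n\r\n")
    lines = head.decode("utf-8").split("\r\n")
    version = lines[0]
    fields = dict(line.split(": ", 1) for line in lines[1:])
    length = int(fields["Content-Length"])
    return version, fields, rest[:length]


# Made-up HTTP response standing in for a crawled page.
payload = b"HTTP/1.1 200 OK\r\nContent-Type: text/html\r\n\r\n<html></html>"
record = build_warc_record("http://example.com/", payload)
version, fields, body = parse_warc_record(record)
```

A WARC file is a concatenation of such records, and in practice each record is usually gzip-compressed individually so that tools can seek to a record by byte offset.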

VickyRampin commented 7 years ago

@aecio I have a very good friend who works at Internet Archive if you want to connect. He's all about WARCing :wink:

aecio commented 7 years ago

@VickySteeves Thanks! I'll let you know if I have any questions!

aecio commented 7 years ago

Fixed in PR #117.