Closed: aecio closed this issue 7 years ago
WARC is a standardized file format for storing web crawl data. It is widely used for large-scale web data collections such as CommonCrawl and ClueWeb12.
The WARC ISO 28500 draft is available at: http://bibnum.bnf.fr/warc/WARC_ISO_28500_version1_latestdraft.pdf
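
For anyone unfamiliar with the format, here is a rough sketch of the record layout in Python, using only the standard library. It is not how this project writes WARC files. The helper names `build_record` and `read_records` are hypothetical, and the sketch assumes uncompressed WARC/1.0 `resource` records. Real-world files are usually gzip-compressed and contain several other record types. Each record is a version line, a set of named header fields, a blank line, a content block of exactly `Content-Length` bytes, and two trailing CRLFs.

```python
import io
import uuid
from datetime import datetime, timezone


def build_record(target_uri: str, payload: bytes) -> bytes:
    """Serialize one simplified WARC/1.0 'resource' record (hypothetical helper)."""
    fields = [
        ("WARC-Type", "resource"),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("WARC-Target-URI", target_uri),
        ("Content-Type", "text/plain"),
        ("Content-Length", str(len(payload))),
    ]
    head = "WARC/1.0\r\n" + "".join(f"{k}: {v}\r\n" for k, v in fields) + "\r\n"
    # The content block is followed by two CRLFs that terminate the record.
    return head.encode("utf-8") + payload + b"\r\n\r\n"


def read_records(stream):
    """Yield (version, headers, body) for each record in an uncompressed stream."""
    while True:
        line = stream.readline()
        if not line:
            return  # end of stream
        if not line.strip():
            continue  # skip the blank lines that separate records
        version = line.strip().decode("utf-8")
        headers = {}
        while True:
            line = stream.readline()
            if line in (b"", b"\r\n", b"\n"):
                break  # blank line ends the header section
            # Split on the first colon only: values such as URIs contain colons.
            name, _, value = line.decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # Read the body by length, so payloads may contain CRLFs themselves.
        body = stream.read(int(headers["Content-Length"]))
        yield version, headers, body


if __name__ == "__main__":
    data = build_record("http://example.com/", b"hello world")
    for version, headers, body in read_records(io.BytesIO(data)):
        print(version, headers["WARC-Target-URI"], len(body))
```

Reading the body by `Content-Length` rather than scanning for a delimiter is what lets a single file hold arbitrary binary payloads back to back.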
@aecio I have a very good friend who works at Internet Archive if you want to connect. He's all about WARCing :wink:
@VickySteeves Thanks! I'll let you know if I have any questions!
Fixed in PR #117.