commoncrawl / nutch

Common Crawl fork of Apache Nutch
Apache License 2.0

Add WARC field WARC-Identified-Payload-Type #3

Closed sebastian-nagel closed 7 years ago

sebastian-nagel commented 7 years ago

The WARC response record header field WARC-Identified-Payload-Type is defined to contain the content type / MIME type of the payload "as determined by an independent check". It will complement the HTTP Content-Type header sent by the server, which appears to be noisy.

sebastian-nagel commented 7 years ago

Implemented using Nutch's internal content type (detected by Apache Tika). The WARC and CDX files of the May 2017 crawl will contain the detected type.
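For illustration, a minimal sketch of how the detected type could end up in a response record header. This is not the code from the fork: the class `WarcPayloadTypeSketch` and the method `buildHeader` are hypothetical names, the field set is an incomplete subset (mandatory fields such as WARC-Record-ID, WARC-Date and Content-Length are left out), and the detected type is assumed to be passed in as a plain string by whatever detector is used.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class WarcPayloadTypeSketch {

    /**
     * Builds a partial WARC response record header (hypothetical helper).
     * detectedType is the MIME type found by an independent check,
     * or null if detection gave no result.
     */
    static String buildHeader(String targetUri, String detectedType) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("WARC-Type", "response");
        fields.put("WARC-Target-URI", targetUri);
        // Content type of the record block, i.e. the full HTTP response
        fields.put("Content-Type", "application/http; msgtype=response");
        // Only write the field when a type was actually detected
        if (detectedType != null && !detectedType.isEmpty()) {
            fields.put("WARC-Identified-Payload-Type", detectedType);
        }

        StringBuilder sb = new StringBuilder("WARC/1.0\r\n");
        for (Map.Entry<String, String> e : fields.entrySet()) {
            sb.append(e.getKey()).append(": ").append(e.getValue()).append("\r\n");
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // The server may have sent "Content-Type: text/html" for this URL;
        // the detected type is recorded independently of that header.
        System.out.print(buildHeader("http://example.com/report", "application/pdf"));
    }
}
```

The field is skipped when no type was detected, so consumers can tell "not detected" apart from a detected value.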