Closed sebastian-nagel closed 7 years ago
The WARC response record header field WARC-Identified-Payload-Type is defined to contain the content type / MIME type "as determined by an independent check". It will complement the HTTP Content-Type header, which appears to be noisy.
Implemented using Nutch's internal content type (as detected by Apache Tika). The WARC and CDX files of the May 2017 crawl will contain the detected type.
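For anyone consuming the new field, here is a minimal sketch of comparing the server-declared HTTP Content-Type with the detected WARC-Identified-Payload-Type. The sample header blocks, the URL, and the helper names (`parse_headers`, `media_type`, `compare_types`) are made up for illustration and are not taken from the crawler code; real records would normally be read with a WARC library rather than parsed by hand.

```python
# Hypothetical sample data: a WARC record header and the HTTP header it wraps.
SAMPLE_WARC_HEADER = """\
WARC/1.0
WARC-Type: response
WARC-Target-URI: http://example.com/feed
Content-Type: application/http; msgtype=response
WARC-Identified-Payload-Type: application/rss+xml
"""

SAMPLE_HTTP_HEADER = """\
HTTP/1.1 200 OK
Content-Type: text/html; charset=UTF-8
"""


def parse_headers(block):
    """Parse 'Name: value' lines into a dict keyed by lower-cased name."""
    headers = {}
    for line in block.splitlines():
        name, sep, value = line.partition(":")
        if sep:  # version/status lines have no colon and are skipped
            headers[name.strip().lower()] = value.strip()
    return headers


def media_type(value):
    """Drop parameters such as '; charset=UTF-8' and normalize case."""
    return value.split(";", 1)[0].strip().lower()


def compare_types(warc_block, http_block):
    """Return (declared, detected, agree) for one response record."""
    warc = parse_headers(warc_block)
    http = parse_headers(http_block)
    declared = media_type(http.get("content-type", ""))
    detected = media_type(warc.get("warc-identified-payload-type", ""))
    return declared, detected, declared == detected


declared, detected, agree = compare_types(SAMPLE_WARC_HEADER, SAMPLE_HTTP_HEADER)
print(f"declared={declared} detected={detected} agree={agree}")
```

In this made-up record the server declares `text/html` while the detected type is `application/rss+xml`, which is the kind of mismatch the new field is meant to expose.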