Closed sebastian-nagel closed 8 years ago
Thanks @sebastian-nagel, will apply your patch. In the future could you please open a PR? It will make it easier to review and comment on your contribs. Thanks!
Merged. Thanks for the detective work @sebastian-nagel
The warcinfo record returned by WARCRecordFormat lacks a trailing CRLF, which causes some WARC libraries to fail to read the WARC files; see commoncrawl/news-crawl#11. The WARC spec defines that the payload/block of a record is followed by CRLF CRLF. Strictly speaking that is the case here, but the first CRLF is included when the Content-Length header field is calculated. The widely used practice, however, is three CRLFs: one included in the block and in Content-Length, and two as the record separator.
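To make the byte layout concrete, here is a minimal sketch in Python of a warcinfo record written the "three CRLF" way. It is not the WARCRecordFormat code; the function name `build_warcinfo_record` and the sample field values are made up for illustration. The block is in warc-fields format, so its last line ends with a CRLF that is counted in Content-Length, and the two-CRLF record separator is appended after it.

```python
CRLF = b"\r\n"


def build_warcinfo_record(fields, record_id, date):
    """Serialize a warcinfo record (hypothetical helper, for illustration only)."""
    # Block in warc-fields format: every line, including the last one,
    # ends with CRLF. These bytes are all counted in Content-Length.
    block = b"".join(
        f"{name}: {value}".encode("utf-8") + CRLF for name, value in fields.items()
    )
    header_lines = [
        b"WARC/1.0",
        b"WARC-Type: warcinfo",
        b"WARC-Date: " + date.encode("ascii"),
        b"WARC-Record-ID: <" + record_id.encode("ascii") + b">",
        b"Content-Type: application/warc-fields",
        b"Content-Length: " + str(len(block)).encode("ascii"),
    ]
    header = CRLF.join(header_lines) + CRLF
    # An empty line ends the header, then the block, then the record
    # separator (CRLF CRLF), which is NOT counted in Content-Length.
    return header + CRLF + block + CRLF + CRLF


# Placeholder values, not taken from a real crawl.
record = build_warcinfo_record(
    {"software": "example-crawler", "format": "WARC File Format 1.0"},
    record_id="urn:uuid:00000000-0000-0000-0000-000000000000",
    date="2017-01-01T00:00:00Z",
)
```

A reader that consumes exactly Content-Length bytes of block should then find precisely CRLF CRLF before the next record. If the block's own trailing CRLF were the only thing standing in for the first separator CRLF, such a reader would see a single CRLF and could reject the file.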