DigitalPebble / sc-warc

WARC resources for StormCrawler

Missing trailing CRLF after warcinfo record #11

Closed sebastian-nagel closed 8 years ago

sebastian-nagel commented 8 years ago

The warcinfo record returned by WARCRecordFormat lacks a trailing CRLF, which causes some WARC libraries to fail to read the WARC files, see commoncrawl/news-crawl#11. The WARC spec defines that the payload/block of a record is followed by CRLF CRLF. Strictly speaking, that is the case here, but only because the first CRLF is also counted when the Content-Length header field is calculated. The widely used practice, however, is three CRLFs: one included in the block and in Content-Length, and two as the record separator.
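To illustrate the layout described above, here is a minimal sketch in Python. It is not the WARCRecordFormat code from this repository; the function names `build_warcinfo_record` and `has_valid_trailer`, and the sample field values, are hypothetical and only serve to show where the three CRLFs end up relative to Content-Length.

```python
CRLF = b"\r\n"


def build_warcinfo_record(fields, record_id, date):
    """Serialize a warcinfo record (hypothetical helper, for illustration only)."""
    # Block: one "name: value" line per field, each terminated by CRLF.
    # The CRLF ending the last line belongs to the block and is counted
    # in Content-Length.
    block = b"".join(
        f"{name}: {value}".encode("utf-8") + CRLF for name, value in fields.items()
    )
    header_lines = [
        b"WARC/1.0",
        b"WARC-Type: warcinfo",
        b"WARC-Date: " + date.encode("ascii"),
        b"WARC-Record-ID: <" + record_id.encode("ascii") + b">",
        b"Content-Type: application/warc-fields",
        b"Content-Length: " + str(len(block)).encode("ascii"),
    ]
    # An empty line (CRLF CRLF) separates the header from the block.
    header = CRLF.join(header_lines) + CRLF + CRLF
    # Two more CRLFs after the block act as the record separator.
    return header + block + CRLF + CRLF


def has_valid_trailer(record):
    """Check that exactly CRLF CRLF follows the Content-Length bytes of the block."""
    header, sep, rest = record.partition(CRLF + CRLF)
    if not sep:
        return False
    length = None
    for line in header.split(CRLF)[1:]:  # skip the "WARC/1.0" version line
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip())
    if length is None:
        return False
    return rest[length:] == CRLF + CRLF


record = build_warcinfo_record(
    {"software": "example"},  # sample value, not taken from the issue
    "urn:uuid:00000000-0000-0000-0000-000000000000",
    "2017-01-01T00:00:00Z",
)
print(has_valid_trailer(record))       # well-formed record
print(has_valid_trailer(record[:-2]))  # one trailing CRLF dropped
```

Dropping the last CRLF reproduces the situation reported here: the bytes counted by Content-Length are followed by a single CRLF only, which a reader that checks the separator against the declared length will reject.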

jnioche commented 8 years ago

Thanks @sebastian-nagel, will apply your patch. In the future could you please open a PR? It will make it easier to review and comment on your contribs. Thanks!

jnioche commented 8 years ago

Merged. Thanks for the detective work @sebastian-nagel