N0taN3rd / node-warc

Parse And Create Web ARChive (WARC) files with node.js
MIT License

The content in the warcRecord includes the trailing \r\n #30

Open ikreymer opened 5 years ago

ikreymer commented 5 years ago

The 'content' ArrayBuffer in the record appears to include the trailing \r\n. I tested this with compressed WARCs; it may not be the case for uncompressed ones.
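For context, a WARC record is laid out as a header, a blank line, a block of exactly Content-Length bytes, and then a closing \r\n\r\n, so a parser that reads up to the record terminator rather than by length can pick up extra bytes. The following is a minimal sketch of length-based slicing, assuming the record is already decompressed and held in a Buffer. `extractBlock` is a hypothetical helper written for illustration and is not part of node-warc's API.

```javascript
// Hypothetical helper (not part of node-warc): slice the record block out of
// a raw, already-decompressed WARC record using the header's Content-Length.
function extractBlock(rawRecord) {
  // The WARC header ends at the first blank line (CRLF CRLF).
  const headerEnd = rawRecord.indexOf('\r\n\r\n');
  if (headerEnd === -1) throw new Error('WARC header terminator not found');

  const header = rawRecord.subarray(0, headerEnd).toString('latin1');
  const match = /^Content-Length:[ \t]*(\d+)/im.exec(header);
  if (!match) throw new Error('Content-Length field not found');

  // Take exactly Content-Length bytes, leaving out the trailing CRLF CRLF.
  const start = headerEnd + 4;
  return rawRecord.subarray(start, start + Number(match[1]));
}

// Tiny synthetic record for illustration only.
const raw = Buffer.from(
  'WARC/1.0\r\nWARC-Type: resource\r\nContent-Length: 5\r\n\r\nhello\r\n\r\n',
  'latin1'
);
const block = extractBlock(raw);
console.log(block.length); // 5
```

For a response record the block still contains the HTTP status line and headers, so the HTTP body would need a second split at the first blank line inside the block.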

BubuAnabelas commented 5 years ago

Can you attach an example or give some more information about which record it happened in? It would help to trace the error.

fushihara commented 4 months ago

I created a WARC from this site using wget: https://archive-it.org/post/the-stack-warc-file/ I created two files, an uncompressed WARC and a gzipped WARC.

wget \
  --page-requisites \
  --recursive \
  --level=1 \
  --no-parent \
  -e robots=off \
  --warc-file=output \
  --delete-after \
  --no-directories \
  "https://archive-it.org/post/the-stack-warc-file/"
For the uncompressed file, I added --no-warc-compression to the same command.

The image at the following URL was extracted from both files: https://archive-it.org/wp-content/themes/archive-it_theme/images/facebook.png

When saved with curl, the content-length was 2395 bytes and the CRC-32 was 8BA873CD. The same image extracted from the WARC had a CRC-32 of A3FA8781, for both the uncompressed and compressed versions. I am not sure whether this is a wget issue or an issue in this library.
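One way to narrow down a checksum mismatch like this is to check whether the extracted file is simply the reference file plus a few extra bytes at the end. The sketch below is only an illustration: `crc32` is a plain table-based CRC-32 and `describeDifference` is a hypothetical helper, and neither comes from node-warc. The buffers in the demo are synthetic, not the attached images.

```javascript
// Standard CRC-32 (reflected polynomial 0xEDB88320) lookup table.
const CRC_TABLE = new Uint32Array(256).map((_, n) => {
  let c = n;
  for (let k = 0; k < 8; k++) c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
  return c >>> 0;
});

// Compute the CRC-32 of a Buffer or Uint8Array as an unsigned 32-bit number.
function crc32(buf) {
  let crc = 0xffffffff;
  for (const byte of buf) crc = CRC_TABLE[(crc ^ byte) & 0xff] ^ (crc >>> 8);
  return (crc ^ 0xffffffff) >>> 0;
}

// Hypothetical helper: report whether `extracted` equals `reference`,
// or is `reference` followed by extra trailing bytes.
function describeDifference(reference, extracted) {
  if (reference.equals(extracted)) return 'identical';
  if (
    extracted.length > reference.length &&
    extracted.subarray(0, reference.length).equals(reference)
  ) {
    const extra = extracted.subarray(reference.length);
    return `extra trailing bytes: ${extra.toString('hex')}`;
  }
  return 'content differs beyond a trailing suffix';
}

// Synthetic demo data standing in for the two files.
const reference = Buffer.from('payload');
const extracted = Buffer.from('payload\r\n');
console.log(crc32(reference) === crc32(extracted)); // false
console.log(describeDifference(reference, extracted)); // extra trailing bytes: 0d0a
```

To apply this to real files, the two buffers could be loaded with `fs.readFileSync`, for example from a curl download and a WARC export. A reported suffix of `0d0a` would point at a trailing \r\n being included in the extracted content.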

The attached zip file contains the following files:

facebook-curl.png
facebook-warc-gz.png
facebook-warc-plain.png
output.warc
output.warc.gz

wget was run from Ubuntu under WSL on Windows 10.

$ wget --version
GNU Wget 1.20.3 built on linux-gnu.

Attachment: warc-content-tail.zip
curl download file: facebook-curl
warc export file: facebook-warc-gz