Open ikreymer opened 5 years ago
Can you attach an example or give more information about which record it happened in? That would help trace the error.
I created a WARC from this page using wget: https://archive-it.org/post/the-stack-warc-file/ I created two files, an uncompressed WARC and a gzipped WARC, with the following command:
wget \
--page-requisites \
--recursive \
--level=1 \
--no-parent \
-e robots=off \
--warc-file=output \
--delete-after \
--no-directories \
"https://archive-it.org/post/the-stack-warc-file/"
For the uncompressed file, I added --no-warc-compression to the same command.
The following image URL was extracted from both files: https://archive-it.org/wp-content/themes/archive-it_theme/images/facebook.png
When saved with curl, the content-length was 2395 bytes and the hash value was CRC-32: 8BA873CD. But the hash value of the same image extracted from the WARC was CRC-32: A3FA8781. This was the same for the uncompressed and compressed versions. I am not sure whether this is a wget issue or an issue with this library.
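For anyone who wants to reproduce the comparison, here is a minimal Python sketch, not part of this library. It computes a CRC-32 and reports whether the extracted bytes are just the reference bytes plus something extra at the end. The helper names `crc32_hex` and `describe_difference` are made up for illustration, and the sample bytes are placeholders rather than the real image data.

```python
import zlib


def crc32_hex(data: bytes) -> str:
    """Return the CRC-32 of data as 8 uppercase hex digits."""
    return format(zlib.crc32(data) & 0xFFFFFFFF, "08X")


def describe_difference(reference: bytes, extracted: bytes) -> str:
    """Say how the extracted bytes differ from the reference bytes."""
    if reference == extracted:
        return "identical"
    if extracted.startswith(reference):
        # The extracted copy is the reference plus trailing bytes.
        extra = extracted[len(reference):]
        return f"extracted has {len(extra)} extra trailing byte(s): {extra!r}"
    return "contents differ before the end of the reference"


if __name__ == "__main__":
    # Placeholder data standing in for facebook-curl.png and
    # facebook-warc-gz.png; read the real files with open(path, "rb").
    reference = b"placeholder image bytes"
    extracted = reference + b"\r\n"
    print(crc32_hex(reference), crc32_hex(extracted))
    print(describe_difference(reference, extracted))
```

If the real files report trailing bytes such as `\r\n`, that would point at record parsing rather than at wget.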
The attached zip file contains the following files:
facebook-curl.png
facebook-warc-gz.png
facebook-warc-plain.png
output.warc
output.warc.gz
wget was run on Windows 10 under WSL Ubuntu.
$ wget --version
GNU Wget 1.20.3 built on linux-gnu.
warc-content-tail.zip
The 'content' ArrayBuffer in the record appears to include the trailing \r\n. I tested this with compressed WARCs; it may not be the case for uncompressed ones.
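One way to avoid picking up the record terminator is to slice the content block by the record's Content-Length header instead of reading to the end of the record. The sketch below is only an illustration in Python, under the assumption that the whole record is already in memory as bytes. It is not how this library is implemented, and `read_content_block` is a hypothetical name.

```python
def read_content_block(record: bytes) -> bytes:
    """Return exactly Content-Length bytes of a single WARC record.

    Assumes `record` holds one complete record: the version line,
    the WARC headers, a blank line, the content block, and the
    trailing CRLF CRLF that ends the record.
    """
    header_end = record.index(b"\r\n\r\n")
    header_lines = record[:header_end].split(b"\r\n")

    length = None
    # Skip the first line, which is the version line (e.g. WARC/1.0).
    for line in header_lines[1:]:
        name, _, value = line.partition(b":")
        if name.strip().lower() == b"content-length":
            length = int(value.strip().decode("ascii"))
    if length is None:
        raise ValueError("record has no Content-Length header")

    start = header_end + 4  # skip the blank line after the headers
    # Slicing by length leaves the trailing CRLF CRLF out of the result.
    return record[start:start + length]


if __name__ == "__main__":
    # Tiny hand-made record, not taken from the attached output.warc.
    record = (
        b"WARC/1.0\r\n"
        b"WARC-Type: resource\r\n"
        b"Content-Length: 5\r\n"
        b"\r\n"
        b"hello"
        b"\r\n\r\n"
    )
    print(read_content_block(record))
```

For a response record the content block still contains the HTTP headers, so the image bytes would start after the HTTP header section inside that block.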