commoncrawl / ia-web-commons

Web archiving utility library
Apache License 2.0

WAT extractor: WARC-Date to indicate capture time #21

Closed sebastian-nagel closed 4 years ago

sebastian-nagel commented 4 years ago

The WARC-Date in the WAT record header does not indicate the capture time but the time at which the WAT record was created. The capture time appears only as the WARC-Date in the JSON record payload:

WARC/1.0
WARC-Type: metadata
WARC-Target-URI: CC-MAIN-20200125131641-20200125160641-00449.warc.gz
WARC-Date: 2020-02-02T07:22:51Z
...
Content-Type: application/json
Content-Length: 1238

{"Container":{... "WARC-Date":"2020-01-25T13:16:41Z" ...

According to the WAT spec, the WARC-Date in the metadata record header should be "A 14-digit timestamp that represents the instant of data capture of the primary content".
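
To check how far the two dates drift apart, a small script can compare the header WARC-Date with the WARC-Date found in the JSON payload. This is only a sketch and not part of ia-web-commons. It assumes the record is available as a single string with a blank line between headers and payload, and it searches the payload recursively for the first `WARC-Date` key because the exact nesting is elided in the excerpt above. The `Hypothetical-Nesting` key in the sample and the helper names `find_key`, `split_record`, and `date_lag` are made up for illustration.

```python
import json
from datetime import datetime, timezone

# Trimmed sample record using the dates quoted in this issue.
# "Hypothetical-Nesting" is a placeholder; the real payload layout is not shown here.
SAMPLE = (
    "WARC/1.0\n"
    "WARC-Type: metadata\n"
    "WARC-Target-URI: CC-MAIN-20200125131641-20200125160641-00449.warc.gz\n"
    "WARC-Date: 2020-02-02T07:22:51Z\n"
    "Content-Type: application/json\n"
    "\n"
    '{"Container": {"Hypothetical-Nesting": {"WARC-Date": "2020-01-25T13:16:41Z"}}}'
)


def parse_warc_date(value):
    """Parse a WARC-Date such as 2020-01-25T13:16:41Z into an aware UTC datetime."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def find_key(node, key):
    """Return the first value stored under `key` anywhere in nested JSON data."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = list(node.values())
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def split_record(raw):
    """Split one record into a dict of header fields and the parsed JSON payload."""
    raw = raw.replace("\r\n", "\n")  # WARC files use CRLF line endings
    head, _, body = raw.partition("\n\n")
    headers = {}
    for line in head.splitlines()[1:]:  # skip the "WARC/1.0" version line
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip()] = value.strip()
    return headers, json.loads(body)


def date_lag(raw):
    """Return header WARC-Date minus payload WARC-Date as a timedelta."""
    headers, payload = split_record(raw)
    header_date = parse_warc_date(headers["WARC-Date"])
    capture_date = parse_warc_date(find_key(payload, "WARC-Date"))
    return header_date - capture_date


if __name__ == "__main__":
    print(date_lag(SAMPLE))
```

A non-zero result means the header carries the WAT creation time rather than the capture time, which is the situation described above.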

sebastian-nagel commented 4 years ago

This would contradict iipc/webarchive-commons#43.