N0taN3rd / node-warc

Parse And Create Web ARChive (WARC) files with node.js
MIT License

WARCWriterBase Record-ID and Concurrent-To should be the same #4

Closed BubuAnabelas closed 6 years ago

BubuAnabelas commented 6 years ago

A few days ago I was trying to use headless-chrome-crawler along with node-warc to generate WARCs from headless Chrome, as #2 suggests. I ran a few tests, started tinkering with the code, and ended up exposing the WARCWriterBase class and using it in the code that can be seen in https://github.com/yujiosaka/headless-chrome-crawler/issues/118#issuecomment-368655438.

But when the WARCs were generated I noticed that the WARC-Concurrent-To of the request didn't match the WARC-Record-ID of the response, and that the writeResponseRecord method generates its own record ID. So shouldn't this line use the object's record ID? https://github.com/N0taN3rd/node-warc/blob/a9438cde4e400f1e821bb80d4188175862696319/lib/writers/warcWriterBase.js#L157 That way writeRequestRecord could use the record ID of the response as its WARC-Concurrent-To field and generate its own record ID.

N0taN3rd commented 6 years ago

@BubuAnabelas consider the following from the WARC specification concerning the WARC-Concurrent-To field:

The WARC-Concurrent-To field (or fields) contains the WARC-Record-ID of any records created as part of the same capture event as the current record. A capture event comprises the information automatically gathered by a retrieval against a single WARC-Target-URI; for example, it may be represented by a ‘response’ or ‘revisit’ record plus its associated ‘request’ record. This field may be used to associate records of types ‘request’, ‘response’, ‘resource’, ‘metadata’, and ‘revisit’ with one another when they arise from a single capture event. (When so used, any WARC-Concurrent-To association shall be considered bidirectional even if the header only appears on one record.) The WARC-Concurrent-To field shall not be used in ‘warcinfo’, ‘conversion’, and ‘continuation’ records. As an exception to the general rule, several WARC-Concurrent-To fields may be repeated within the same WARC record.

The WARC-Concurrent-To field is an optional field of WARC records, but the WARC-Record-ID is mandatory. node-warc only adds the WARC-Concurrent-To field to request records in order to associate each request with the root WARC record for a page (warcinfo):

A ‘warcinfo’ record describes the records that follow it, up through end of file, end of input, or until next ‘warcinfo’ record. Typically, this appears once and at the beginning of a WARC file. For a web archive, it often contains information about the web crawl which generated the following records.

https://github.com/N0taN3rd/node-warc/blob/a9438cde4e400f1e821bb80d4188175862696319/lib/writers/warcWriterBase.js#L133-L139 See also Heritrix and the IIPC webarchive-commons libraries WARC format and records (Java).

I hope this information helps in understanding how WARC files are organized.

Is there anything that node-warc can do to make using the library with headless-chrome-crawler easier or more convenient?

BubuAnabelas commented 6 years ago

As far as I understand, and as the specification says, the WARC-Concurrent-To field can only contain the WARC-Record-ID of ‘request’, ‘response’, ‘resource’, ‘metadata’, and ‘revisit’ records.

This field may be used to associate records of types ‘request’, ‘response’, ‘resource’, ‘metadata’, and ‘revisit’ with one another when they arise from a single capture event.

To associate a 'warcinfo' record with the rest of the records there's the WARC-Warcinfo-ID field. See examples 10.1, 10.2 and 10.4 in Annex B.

The ‘warcinfo’ record has its own record ID.


The ‘response’ record has its own record ID and it references the 'warcinfo' record ID.

WARC-Record-ID:
WARC-Warcinfo-ID:

The ‘request’ record has its own record ID, it references the 'warcinfo' record ID, and it also references the 'response' record ID as its concurrent ID.

WARC-Record-ID:
WARC-Warcinfo-ID:
WARC-Concurrent-To:
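A minimal sketch of how those IDs could relate for one capture event. Both helpers (`newRecordId`, `captureEventHeaders`) and the example URL are hypothetical names made up for illustration; they are not part of node-warc's API, and real records would carry more fields (WARC-Date, Content-Type, Content-Length and so on).

```javascript
const { randomUUID } = require('crypto')

// Hypothetical helper (not node-warc API): a fresh WARC-Record-ID value.
function newRecordId () {
  return `<urn:uuid:${randomUUID()}>`
}

// Hypothetical helper: the ID-related header fields for one capture event.
// Each record gets its own WARC-Record-ID, both point at the warcinfo record,
// and the request names the response as its concurrent record.
function captureEventHeaders (warcinfoId, targetURI) {
  const responseId = newRecordId()
  const requestId = newRecordId()
  return {
    response: {
      'WARC-Type': 'response',
      'WARC-Record-ID': responseId,
      'WARC-Warcinfo-ID': warcinfoId,
      'WARC-Target-URI': targetURI
    },
    request: {
      'WARC-Type': 'request',
      'WARC-Record-ID': requestId,
      'WARC-Warcinfo-ID': warcinfoId,
      'WARC-Concurrent-To': responseId,
      'WARC-Target-URI': targetURI
    }
  }
}

// The warcinfo record only needs its own ID.
const warcinfoId = newRecordId()
const headers = captureEventHeaders(warcinfoId, 'http://example.com/')
```

Since the spec treats the association as bidirectional, putting WARC-Concurrent-To on the request alone should be enough to link the pair.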

That's what I meant when I opened the issue, without having read the whole spec.

In the next few days I'll continue experimenting with headless-chrome-crawler, and if there's something that would make it easier I'll tell you!

N0taN3rd commented 6 years ago

@BubuAnabelas I see what you mean and thank you for bringing this to my attention! I had not considered the utility gained by adding the WARC-Warcinfo-ID as a field of the request and response records. Would you be interested in taking this issue and implementing this feature? I am currently tied up with finishing a master's thesis (hence the delay in my responses) and will be unable to add this feature until it is completed.

BubuAnabelas commented 6 years ago

Sure, I'll make a fork and start changing what looks odd to me, asking for your advice along the way. When I think the WARCWriterBase class is ready, we can continue looking for this type of error in the rest of the code.

BubuAnabelas commented 6 years ago

I've almost finished changing the WARCWriterBase class, but I think that when appending to an already created WARC, it should look for the last 'warcinfo' record (if there is one) and use its record ID as the WARC-Warcinfo-ID field.

What do you think?
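To make the idea concrete, here is a rough sketch of that lookup. `lastWarcinfoId` is a hypothetical helper, not something node-warc provides, and it assumes the existing WARC is uncompressed, small enough to hold in a single Buffer, and does not use folded (multi-line) header fields.

```javascript
// Hypothetical helper (not node-warc API): returns the WARC-Record-ID of the
// last 'warcinfo' record in an uncompressed WARC held in a Buffer, or null
// when the buffer contains no warcinfo record.
function lastWarcinfoId (buf) {
  const headerTerminator = Buffer.from('\r\n\r\n')
  let offset = 0
  let lastId = null
  while (offset < buf.length) {
    const headerEnd = buf.indexOf(headerTerminator, offset)
    if (headerEnd === -1) break
    const fields = {}
    // Skip the "WARC/1.0" version line, then split each "Name: value" header.
    const lines = buf.toString('utf8', offset, headerEnd).split('\r\n').slice(1)
    for (const line of lines) {
      const sep = line.indexOf(':')
      if (sep > 0) {
        fields[line.slice(0, sep).trim().toLowerCase()] = line.slice(sep + 1).trim()
      }
    }
    if (fields['warc-type'] === 'warcinfo') {
      lastId = fields['warc-record-id'] || lastId
    }
    const length = parseInt(fields['content-length'], 10)
    if (Number.isNaN(length)) break // malformed record, stop scanning
    // Header block + blank line + content block + the two CRLFs ending a record.
    offset = headerEnd + 4 + length + 4
  }
  return lastId
}
```

It could be called as `lastWarcinfoId(fs.readFileSync(path))` before appending. Walking the records by Content-Length, rather than searching the file for the string "warcinfo", avoids false matches inside captured payloads.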


N0taN3rd commented 6 years ago

@BubuAnabelas would you mind opening a PR so that the review can happen there? I'm liking what I see, especially how you updated the WARC fields generation!