N0taN3rd / node-warc

Parse And Create Web ARChive (WARC) files with node.js
MIT License

WARCWriterBase Record-ID and Concurrent-To should be the same #4

Closed BubuAnabelas closed 6 years ago

BubuAnabelas commented 6 years ago

A few days ago I was trying to use headless-chrome-crawler along with node-warc to generate WARCs from headless Chrome, as #2 suggests. I ran a few tests, started tinkering with the code, and ended up exposing the WARCWriterBase class and using it in the code that can be seen in https://github.com/yujiosaka/headless-chrome-crawler/issues/118#issuecomment-368655438.

But when the WARCs were generated I noticed that the WARC-Concurrent-To of the request didn't match the WARC-Record-ID of the response, and that the writeResponseRecord method generates its own record ID. So shouldn't this line use the object's record id? https://github.com/N0taN3rd/node-warc/blob/a9438cde4e400f1e821bb80d4188175862696319/lib/writers/warcWriterBase.js#L157 That way writeRequestRecord could use the record id of the response as its WARC-Concurrent-To field and generate its own record id.

N0taN3rd commented 6 years ago

@BubuAnabelas consider the following from the WARC specification concerning the WARC-Concurrent-To field:

The WARC-Concurrent-To field (or fields) contains the WARC-Record-ID of any records created as part of the same capture event as the current record. A capture event comprises the information automatically gathered by a retrieval against a single WARC-Target-URI; for example, it may be represented by a ‘response’ or ‘revisit’ record plus its associated ‘request’ record. This field may be used to associate records of types ‘request’, ‘response’, ‘resource’, ‘metadata’, and ‘revisit’ with one another when they arise from a single capture event. (When so used, any WARC-Concurrent-To association shall be considered bidirectional even if the header only appears on one record.) The WARC-Concurrent-To field shall not be used in ‘warcinfo’, ‘conversion’, and ‘continuation’ records. As an exception to the general rule, several WARC-Concurrent-To fields may be repeated within the same WARC record.
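The two rules in that passage (associations are bidirectional, and the field is not allowed on ‘warcinfo’, ‘conversion’, and ‘continuation’ records) can be sketched roughly as follows. This is only an illustration, not node-warc code: the `{ type, id, concurrentTo }` record shape and the `buildConcurrencyMap` name are assumptions made up for the example.

```javascript
// Record types on which the spec says WARC-Concurrent-To shall not be used.
const NO_CONCURRENT_TO = new Set(['warcinfo', 'conversion', 'continuation']);

// Hypothetical input shape: [{ type, id, concurrentTo: [recordId, ...] }, ...]
// Returns a Map from record id to the Set of ids it is associated with.
function buildConcurrencyMap(records) {
  const map = new Map();
  const link = (a, b) => {
    if (!map.has(a)) map.set(a, new Set());
    map.get(a).add(b);
  };
  for (const { type, id, concurrentTo = [] } of records) {
    if (concurrentTo.length > 0 && NO_CONCURRENT_TO.has(type)) {
      throw new Error(`WARC-Concurrent-To is not allowed on '${type}' records`);
    }
    for (const other of concurrentTo) {
      // Record the association in both directions, even though the
      // header only appears on one of the two records.
      link(id, other);
      link(other, id);
    }
  }
  return map;
}
```

With this sketch, a ‘request’ that names a ‘response’ in WARC-Concurrent-To makes the ‘response’ point back at the ‘request’ as well.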

The WARC-Concurrent-To field is an optional field of WARC records, but the WARC-Record-ID is mandatory. node-warc only adds the WARC-Concurrent-To field to request records, in order to associate each request with the root WARC record for a page (warcinfo):

A ‘warcinfo’ record describes the records that follow it, up through end of file, end of input, or until next ‘warcinfo’ record. Typically, this appears once and at the beginning of a WARC file. For a web archive, it often contains information about the web crawl which generated the following records

https://github.com/N0taN3rd/node-warc/blob/a9438cde4e400f1e821bb80d4188175862696319/lib/writers/warcWriterBase.js#L133-L139 See also Heritrix and the IIPC webarchive-commons libraries WARC format and records (Java).

I hope this information helps in understanding how WARC files are organized.

Is there anything that node-warc can do to make using the library with headless-chrome-crawler easier or more convenient?

BubuAnabelas commented 6 years ago

As far as I understand, and as the specification says, the WARC-Concurrent-To field can only contain the WARC-Record-ID of ‘request’, ‘response’, ‘resource’, ‘metadata’, and ‘revisit’ records.

This field may be used to associate records of types ‘request’, ‘response’, ‘resource’, ‘metadata’, and ‘revisit’ with one another when they arise from a single capture event.

To associate a 'warcinfo' record with the rest of the records there's the WARC-Warcinfo-ID field. See the 10.1, 10.2 and 10.4 examples in Annex B.

The ‘warcinfo’ record has its own record id.

WARC-Record-ID:

The ‘response’ record has its own record id and it references the 'warcinfo' record id.

WARC-Record-ID:
WARC-Warcinfo-ID:

The ‘request’ record has its own record id, it references the 'warcinfo' record id, and it also references the 'response' record id as its concurrent id.

WARC-Record-ID:
WARC-Warcinfo-ID:
WARC-Concurrent-To:
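A minimal sketch of how those three records could be linked. It does not use node-warc's actual API: `newRecordId`, `formatHeaders` and `buildCaptureHeaders` are hypothetical helpers, and mandatory fields such as WARC-Date and Content-Length are left out to keep the focus on the ids.

```javascript
const { randomUUID } = require('crypto');

// Hypothetical helper: a fresh record id in the urn:uuid form.
function newRecordId() {
  return `<urn:uuid:${randomUUID()}>`;
}

// Hypothetical helper: serialize named fields as CRLF-terminated header lines.
// WARC-Date, Content-Type and Content-Length are omitted in this sketch.
function formatHeaders(type, fields) {
  const lines = ['WARC/1.0', `WARC-Type: ${type}`];
  for (const [name, value] of Object.entries(fields)) {
    lines.push(`${name}: ${value}`);
  }
  return lines.join('\r\n') + '\r\n';
}

function buildCaptureHeaders() {
  const ids = {
    warcinfo: newRecordId(),
    response: newRecordId(),
    request: newRecordId(),
  };
  return {
    ids,
    // The warcinfo record only carries its own id.
    warcinfo: formatHeaders('warcinfo', { 'WARC-Record-ID': ids.warcinfo }),
    // The response points back at the warcinfo record.
    response: formatHeaders('response', {
      'WARC-Record-ID': ids.response,
      'WARC-Warcinfo-ID': ids.warcinfo,
    }),
    // The request points at the warcinfo record and at its response.
    request: formatHeaders('request', {
      'WARC-Record-ID': ids.request,
      'WARC-Warcinfo-ID': ids.warcinfo,
      'WARC-Concurrent-To': ids.response,
    }),
  };
}
```

The ids are generated up front so that the request can reference the response's id regardless of which record gets written first.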

That's what I meant when I opened the issue, without having read the whole spec.

In the next few days I'll continue experimenting with headless-chrome-crawler, and if there's something that would make it easier I'll tell you!

N0taN3rd commented 6 years ago

@BubuAnabelas I see what you mean, and thank you for bringing this to my attention! I had not considered the utility gained by adding the WARC-Warcinfo-ID as a field of the request and response records. Would you be interested in taking this issue and implementing this feature? I am currently tied up with finishing a master's thesis (hence the delay in my responses) and will be unable to add this feature until it is completed.

BubuAnabelas commented 6 years ago

Sure, I'll make a fork, start changing what looks odd to me, and ask for your advice. When I think the WARCWriterBase class is ready we can continue looking for this type of error in the rest of the code.

BubuAnabelas commented 6 years ago

I have almost finished changing the WARCWriterBase class, but I think that when appending to an already created WARC, it should look for the last 'warcinfo' record (if there is one) and use its record id as the WARC-Warcinfo-ID field.
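One rough way to do that lookup, as a sketch only: it assumes the existing WARC is uncompressed and already loaded into a Buffer (a gzipped WARC would need to be decompressed first), and `findLastWarcinfoId` is a hypothetical name, not something in node-warc.

```javascript
// Walk an uncompressed WARC buffer record by record and return the
// WARC-Record-ID of the last 'warcinfo' record, or null if there is none.
function findLastWarcinfoId(buf) {
  const HEADER_END = Buffer.from('\r\n\r\n');
  let offset = 0;
  let lastId = null;
  while (offset < buf.length) {
    const headerEnd = buf.indexOf(HEADER_END, offset);
    if (headerEnd === -1) break; // truncated or malformed record
    const headerText = buf.toString('utf8', offset, headerEnd);
    const fields = {};
    // Skip the "WARC/1.0" version line, then parse "Name: value" pairs.
    for (const line of headerText.split('\r\n').slice(1)) {
      const idx = line.indexOf(':');
      if (idx > 0) {
        fields[line.slice(0, idx).trim().toLowerCase()] = line.slice(idx + 1).trim();
      }
    }
    const length = Number.parseInt(fields['content-length'], 10);
    if (Number.isNaN(length)) break; // cannot locate the next record
    if (fields['warc-type'] === 'warcinfo' && fields['warc-record-id']) {
      lastId = fields['warc-record-id'];
    }
    // Header block + CRLFCRLF + content block + trailing CRLFCRLF.
    offset = headerEnd + 4 + length + 4;
  }
  return lastId;
}
```

Skipping each content block by its Content-Length, rather than searching for the next "WARC/1.0" string, avoids being fooled by payloads that happen to contain that text.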

What do you think?

https://github.com/BubuAnabelas/node-warc/blob/master/lib/writers/warcWriterBase.js

N0taN3rd commented 6 years ago

@BubuAnabelas would you mind opening a PR so that the review can happen there? I like what I see, especially how you updated the generation of the WARC fields!