N0taN3rd / node-warc

Parse And Create Web ARChive (WARC) files with node.js
MIT License
92 stars 20 forks source link

Every Concurrent-To field is null. #20

Closed BubuAnabelas closed 5 years ago

BubuAnabelas commented 5 years ago

I don't know if you've seen my latest comment in the #18 issue so I open this new one just in case.

When fixing #18 it broke the WARC-Concurrent-To field, make in it <urn:uuid:null>. I used the same code to test this as in #18. This issue may be related to #4.

According to ISO 28500:

This field may be used to associate records of types 'request', 'response', 'resource', 'metadata', and 'revisit' with one another when they arise from a single capture event. (When so used, any WARC-Concurrent-To association shall be considered bidirectional even if the header only appears on one)

So it should be the same if the field is either in the 'request' or the 'response' record.

N0taN3rd commented 5 years ago

@BubuAnabelas thanks for opening the issue. I did miss your comment in #18 I believe I have fixed this and the changes are on the dev branch.

Also per your example code in #18 you do not need to do

capture = new PuppeteerCapturer(page)
....
page.on('request', request => {
  capture.requestWillBeSent(request)
  request.continue()
})

This will cause double entries per request, the capturer does all that for you so below will suffice :)

capture = new PuppeteerCapturer(page)

If that does not fix your issue please let me know

BubuAnabelas commented 5 years ago

That one seems fixed, now I've noticed it adds the WARC-Warcinfo-ID field but there isn't a warcinfo record. In addition, the request header adds a newline just above the WARC-Warcinfo-ID which breaks the header:

https://github.com/N0taN3rd/node-warc/blob/49c34256bccbae8c8eebeea00acf8ebd8247fba1/lib/writers/warcFields.js#L104

Also, for some reason, the request header misses a newline (beneath Content-Length) which leaves the record and the HTTP headers all together.

Both the WARC-Warcinfo-ID nor the missing newline in the request where happening in the previous commit.

N0taN3rd commented 5 years ago

@BubuAnabelas yup should be fixed with https://github.com/N0taN3rd/node-warc/commit/52ca586f554a2cf5254ce7b3f4b241c4c389818a

BubuAnabelas commented 5 years ago

Seems better now, but the WARC-Concurrent-To field is missing. I'll do a PR in the following days to help you with this if you don't mind, to help you with this issue. Besides this, the library works like a charm, very good job!

BubuAnabelas commented 5 years ago

Everything works fine for now, closing this issue.