N0taN3rd / node-warc

Parse And Create Web ARChive (WARC) files with node.js
MIT License

Every response has the same Record ID #18

Closed BubuAnabelas closed 5 years ago

BubuAnabelas commented 5 years ago

I was testing the new features of the library, especially the Puppeteer request capturer and the WARC generator, together with headless-chrome-crawler, using the following script:

const HCCrawler = require('headless-chrome-crawler')
const { PuppeteerCapturer, PuppeteerWARCGenerator } = require('node-warc')

const warc = new PuppeteerWARCGenerator()
warc.initWARC('./test.warc', {appending: true})

const run = async () => {
  const crawler = await HCCrawler.launch({
    customCrawl: async (page, crawl) => {
      const capture = new PuppeteerCapturer(page)
      await page.setRequestInterception(true)

      page.on('request', request => {
        capture.requestWillBeSent(request)
        request.continue()
      })

      const result = await crawl()

      for (let req of capture.iterateRequests()) {
        await warc.generateWarcEntry(req)
      }

      return result
    },
    maxDepth: 0
  })

  await crawler.queue({url: 'http://books.toscrape.com', skipDuplicates: true})
  await crawler.onIdle()
  warc.end()
  await crawler.close()
}

run()

It creates the WARC file without any errors, but when you look into it, all the records have the same WARC-Record-ID. Because of this, all the WARC-Concurrent-To fields are the same too.
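One way to confirm the symptom is to count how often each WARC-Record-ID value occurs. The following is only a minimal sketch: `countRecordIds` is a hypothetical helper (not part of node-warc), the sample headers are made up, and it naively matches any line that starts with `WARC-Record-ID:`, so a payload containing such a line would be counted as well.

```javascript
// Hypothetical helper: count how often each WARC-Record-ID value occurs in WARC text.
function countRecordIds (warcText) {
  const counts = new Map()
  for (const line of warcText.split(/\r?\n/)) {
    const match = /^WARC-Record-ID:\s*(\S+)/.exec(line)
    if (match) counts.set(match[1], (counts.get(match[1]) || 0) + 1)
  }
  return counts
}

// Made-up sample headers that reproduce the symptom (two records, one ID).
const sample = [
  'WARC/1.0',
  'WARC-Type: request',
  'WARC-Record-ID: <urn:uuid:aaaa>',
  '',
  'WARC/1.0',
  'WARC-Type: response',
  'WARC-Record-ID: <urn:uuid:aaaa>'
].join('\r\n')

// Keep only the IDs that appear more than once.
const duplicates = [...countRecordIds(sample)].filter(([, n]) => n > 1)
console.log(duplicates) // [ [ '<urn:uuid:aaaa>', 2 ] ]
```

For a real file, the text could be loaded with `require('fs').readFileSync('./test.warc', 'latin1')` and passed to the same function.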

One way to fix it is to create the generator, init it, write the request and close it for each request like this:

for (let req of capture.iterateRequests()) {
    const warc = new PuppeteerWARCGenerator()
    warc.initWARC('./test5.warc', {appending: true})
    await warc.generateWarcEntry(req)
    warc.end()
}

But that is very inefficient, since it creates a new generator and reopens the file for every request.

N0taN3rd commented 5 years ago

Yes, indeed it does. Thanks for opening the issue! Looks like we're using the warcinfo's ID (this._rid) and not uuid(). Will be fixed shortly.
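To illustrate the difference between reusing a stored ID and generating a new one per record, here is a rough sketch. It is not node-warc's actual code: `SketchWriter` and its methods are hypothetical, and Node's built-in `crypto.randomUUID()` (Node 14.17+) stands in for the `uuid()` call mentioned above.

```javascript
const { randomUUID } = require('crypto')

// Hypothetical stand-in for a WARC writer; not node-warc's real class.
class SketchWriter {
  constructor () {
    // ID of the warcinfo record, generated once per file.
    this.warcInfoId = `<urn:uuid:${randomUUID()}>`
  }

  // Buggy variant: every record reuses the warcinfo ID.
  buggyRecordId () {
    return this.warcInfoId
  }

  // Fixed variant: every record gets its own freshly generated ID.
  freshRecordId () {
    return `<urn:uuid:${randomUUID()}>`
  }
}

const writer = new SketchWriter()
const buggy = [writer.buggyRecordId(), writer.buggyRecordId()]
const fixed = [writer.freshRecordId(), writer.freshRecordId()]

console.log(buggy[0] === buggy[1]) // true: same ID on every record
console.log(fixed[0] === fixed[1]) // false: each record has its own ID
```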

N0taN3rd commented 5 years ago

@BubuAnabelas it is fixed

You can verify it by running the command below (rg is ripgrep)

rg "WARC-Record-ID:" node-warc-generated-warc.warc | cut -c17- | sort | uniq -c | sort -nr | less

BubuAnabelas commented 5 years ago

Now the WARC-Concurrent-To field is always <urn:uuid:null>, when it should be the response's WARC-Record-ID.
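As a sketch of the expected linkage (not the library's implementation): the response's ID is generated before the headers are built, so the request record can reference it. `newRecordId` and `buildPairHeaders` are hypothetical names, only a few header fields are shown, and `crypto.randomUUID()` is assumed as the ID source.

```javascript
const { randomUUID } = require('crypto')

// Hypothetical helper: one fresh record ID in urn:uuid form.
const newRecordId = () => `<urn:uuid:${randomUUID()}>`

// Hypothetical helper: builds a subset of the WARC headers for a
// request/response pair captured from the same URL.
function buildPairHeaders (targetURI) {
  // Both IDs exist before any header is written, so neither can end up null.
  const responseId = newRecordId()
  const requestId = newRecordId()
  return {
    response: [
      'WARC-Type: response',
      `WARC-Record-ID: ${responseId}`,
      `WARC-Target-URI: ${targetURI}`
    ],
    request: [
      'WARC-Type: request',
      `WARC-Record-ID: ${requestId}`,
      // The request points at the response captured with it.
      `WARC-Concurrent-To: ${responseId}`,
      `WARC-Target-URI: ${targetURI}`
    ]
  }
}

const pair = buildPairHeaders('http://books.toscrape.com')
console.log(pair.request.join('\n'))
```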