N0taN3rd / Squidwarc

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
https://n0tan3rd.github.io/Squidwarc/
Apache License 2.0

TypeError in node-warc causes Squidwarc to crash #45

machawk1 closed 5 years ago

machawk1 commented 5 years ago

Are you submitting a bug report or a feature request?

Bug report.

What is the current behavior?

I ran Squidwarc 1a19eed5a1ea0ab1c8b608c2bffe104c84771184 (latest master) using docker-compose up. After a long run, I received a TypeError:

Attaching to squidwarc
squidwarc    | Running Crawl From Config File warcs/conf.json
squidwarc    | With great power comes great responsibility!
squidwarc    | Squidwarc is not responsible for ill behaved user supplied scripts!
squidwarc    |
squidwarc    | Crawler Operating In page-only mode
squidwarc    | Crawler Will Be Preserving 1 Seeds
squidwarc    | Crawler Will Be Generating WARC Files Using the filenamified url
squidwarc    | Crawler Generated WARCs Will Be Placed At /Squidwarc
squidwarc    | Crawler Navigating To https://www.instagram.com/visit_berlin/
squidwarc    | Crawler Navigated To https://www.instagram.com/visit_berlin/
squidwarc    | Running user script
squidwarc    | Crawler Generating WARC
squidwarc    | A Fatal Error Occurred
squidwarc    |   TypeError: Cannot read property 'software' of undefined
squidwarc    |
squidwarc    |   - warcWriterBase.js:204 PuppeteerCDPWARCGenerator.writeWarcInfoRecord
squidwarc    |     [Squidwarc]/[node-warc]/lib/writers/warcWriterBase.js:204:15
squidwarc    |
squidwarc    |   - puppeteer.js:249 PuppeteerCrawler.genWarc
squidwarc    |     /Squidwarc/lib/crawler/puppeteer.js:249:31
squidwarc    |
squidwarc    |   - puppeteerRunner.js:72 puppeteerRunner
squidwarc    |     /Squidwarc/lib/runners/puppeteerRunner.js:72:21
squidwarc    |
squidwarc    |   - next_tick.js:68 process._tickCallback
squidwarc    |     internal/process/next_tick.js:68:7
squidwarc    |
squidwarc    |
squidwarc    | Please Inform The Maintainer Of This Project About It. Information In package.json
squidwarc exited with code 0

What is the expected behavior?

Squidwarc should generate the WARC without crashing.

What's your environment?

macOS 10.14.2, node.js 10.12.0, Docker 18.09.0, Squidwarc 1a19eed5a1ea0ab1c8b608c2bffe104c84771184

Other information

This is an odd error to crash on:

    if (winfo.software == null) {
      winfo.software = `node-warc/${this._version}`
    }

Is this the most reliable way to check whether the software property is defined on winfo? Isolating this code in a separate file with a fabricated example does not crash Node; it runs without error, so I think something else is at play.
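For what it's worth, the message "Cannot read property 'software' of undefined" says that winfo itself is undefined at the call site, not that winfo.software is missing. That would explain why a fabricated example that passes an object runs cleanly. Here is a minimal sketch of both cases. The function names fillSoftware and fillSoftwareGuarded and the version string '1.0.0' are hypothetical stand-ins for illustration, not node-warc's actual API or the fix that was applied:

```javascript
// Hypothetical stand-in for the quoted check; NOT node-warc's real method.
function fillSoftware (winfo, version) {
  if (winfo.software == null) {
    winfo.software = `node-warc/${version}`
  }
  return winfo
}

// Hypothetical guarded variant: fall back to an empty object
// when winfo is null or undefined before touching its properties.
function fillSoftwareGuarded (winfo, version) {
  const info = winfo == null ? {} : winfo
  if (info.software == null) {
    info.software = `node-warc/${version}`
  }
  return info
}

// An object without `software` works, matching the isolated test.
const filled = fillSoftware({}, '1.0.0')
console.log(filled.software) // node-warc/1.0.0

// Passing undefined throws a TypeError, as in the crash log.
let caught = null
try {
  fillSoftware(undefined, '1.0.0')
} catch (err) {
  caught = err
}
console.log(caught instanceof TypeError) // true

// The guarded variant tolerates a missing winfo.
const guarded = fillSoftwareGuarded(undefined, '1.0.0')
console.log(guarded.software) // node-warc/1.0.0
```

If this reading is right, the == null check on the property is fine as written, and the thing to look at is whether the caller (genWarc in the stack trace) always passes a winfo object.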

machawk1 commented 5 years ago

Per #44, this is no longer an issue in the next branch but remains in master.

N0taN3rd commented 5 years ago

fixed via PR #47