N0taN3rd / Squidwarc

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
https://n0tan3rd.github.io/Squidwarc/
Apache License 2.0
166 stars, 26 forks

Fatal error when starting crawl #53

Open nvanderperren opened 3 years ago

nvanderperren commented 3 years ago

Are you submitting a bug report or a feature request?

Bug report

What is the current behavior?

I get an error when I try to start a crawl. This is the output:

Running Crawl From Config File configurations/social-media.json
Crawler Operating In undefined mode
Crawler Will Be Preserving 2 Seeds
Crawler Will Be Generating WARC Files Using the filenamified url
Crawler Generated WARCs Will Be Placed At warcs
Crawler Is Connecting To Chrome On Host localhost
Crawler Is Connecting To Chrome On Port 9222
Crawler Will Be Waiting At Maximum For Navigation To Happen For 8s
Crawler Will Be Waiting After For 2 inflight requests
Crawler Will Be Generating WARC Files Using the filenamified url
Crawler Will Be Generating WARC Files Using the filenamified url
A Fatal Error Occurred
  TypeError: Cannot read property 'length' of undefined

  - chromeFinder.js:275 Function.findChromeDarwin
    /Users/nastasia/Developer/Squidwarc/lib/launcher/chromeFinder.js:275:20

  - chrome.js:90 async Function.launch
    /Users/nastasia/Developer/Squidwarc/lib/launcher/chrome.js:90:28

  - chrome.js:143 async ChromeCrawler.init
    /Users/nastasia/Developer/Squidwarc/lib/crawler/chrome.js:143:22

  - chromeRunner.js:143 async chromeRunner
    /Users/nastasia/Developer/Squidwarc/lib/runners/chromeRunner.js:143:3

  - index.js:31 async runner
    /Users/nastasia/Developer/Squidwarc/lib/runners/index.js:31:5

This is my configuration file:

{
    "mode": "page-only",
    "depth": 1,
    "seeds": [
        "http://www.facebook.com/nastyvdp",
        "http://www.twitter.com/nvanderperren"
    ],
    "warc": {
        "naming": "url",
        "append": "true",
        "output": "warcs"
    },
    "connect": {
        "launch": true,
        "host": "localhost",
        "port": 9222,
        "userDataDir": "/Users/nastasia/Library/Application Support/Google/Chrome"
    },
    "crawlControl": {
        "globalWait": 60000,
        "inflightIdle": 1000,
        "numInflight": 2,
        "navWait": 8000
    }
}   

Because the log reports the mode as undefined, I also placed mode under crawlControl, as suggested in issue #50, but that doesn't solve the problem.

What is the expected behavior?

A starting crawl.

What's your environment?

Node: v14.12.0
Squidwarc: current master
OS: macOS High Sierra 10.13.6
Chrome: Version 86.0.4240.80 (Official Build) (x86_64)

Other information

I don't have this issue if I use puppeteer.

blzbrg commented 1 year ago

I think this is just a typo in the macOS version of the browser-finding code. Here is the relevant code: https://github.com/N0taN3rd/Squidwarc/blob/63026b6f4c30b83541f23fc7531126bc7e8747af/lib/launcher/chromeFinder.js#L275

    let sortedExes = installations
      // assign priorities
      .map(inst => {
        for (const pair of priorities) {
          if (pair.regex.test(inst)) {
            return { path: inst, weight: pair.weight }
          }
        }
        return { path: inst, weight: defaultPriority }
      })
      // sort based on priorities
      .sort((a, b) => b.weight - a.weight)
      // remove priority flag
      .map(pair => pair.path)[0]           // <=== this [0] is only in the mac version of this function
    if (sortedExes.length > 0) {
      return sortedExes[0]
    }

I will try to get access to a Mac to test changing it to .map(pair => pair.path).
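
To make the diagnosis above concrete, here is a minimal, self-contained sketch of the selection logic with and without the trailing `[0]`. It is not the actual Squidwarc source: the `priorities` regexes, the weights, `defaultPriority`, and the function names `rankInstallations`, `pickBuggy`, and `pickFixed` are all hypothetical stand-ins chosen for illustration. The point it tries to show is that with `[0]` in place, `sortedExes` is a single string (or `undefined` when no installation was found) rather than an array, which would be consistent with the `Cannot read property 'length' of undefined` error in the report.

```javascript
// Hypothetical stand-ins for the values used in chromeFinder.js;
// the real regexes and weights in Squidwarc may differ.
const priorities = [
  { regex: /Google Chrome\.app/, weight: 100 },
  { regex: /Chromium\.app/, weight: 50 }
]
const defaultPriority = 10

// Shared ranking step: returns the candidate paths ordered best-first.
function rankInstallations (installations) {
  return installations
    // assign priorities
    .map(inst => {
      for (const pair of priorities) {
        if (pair.regex.test(inst)) {
          return { path: inst, weight: pair.weight }
        }
      }
      return { path: inst, weight: defaultPriority }
    })
    // sort based on priorities, highest weight first
    .sort((a, b) => b.weight - a.weight)
    // remove priority flag
    .map(pair => pair.path)
}

// Mirrors the quoted macOS code: the trailing [0] turns the array into
// a single string, or undefined when the list is empty.
function pickBuggy (installations) {
  const sortedExes = rankInstallations(installations)[0]
  if (sortedExes.length > 0) { // throws TypeError if sortedExes is undefined
    return sortedExes[0] // otherwise this is only the first character
  }
}

// Proposed change from the comment above: keep the array, then take
// its first element.
function pickFixed (installations) {
  const sortedExes = rankInstallations(installations)
  if (sortedExes.length > 0) {
    return sortedExes[0]
  }
  return undefined
}
```

Under these assumptions, `pickBuggy([])` throws a `TypeError`, while `pickFixed([])` returns `undefined` so the caller can fall through to whatever handling it has for "no Chrome found". Note that even when an installation is found, `pickBuggy` returns only the first character of the path, so removing the `[0]` looks necessary in both cases.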