N0taN3rd / Squidwarc

Squidwarc is a high fidelity, user scriptable, archival crawler that uses Chrome or Chromium with or without a head
https://n0tan3rd.github.io/Squidwarc/
Apache License 2.0

TypeError: input.on is not a function #23

Closed peterk closed 5 years ago

peterk commented 6 years ago

Are you submitting a bug report or a feature request?

Bug report

What is the current behavior?

I tried setting up a Docker image based on the zenika/alpine-chrome image and copied Squidwarc into it. Headless Chrome starts correctly, but when I try to run the crawl with the included conf.json, I get the following error:

/usr/src/app/Squidwarc $ node --harmony index.js -c conf.json
Running Crawl From Config File conf.json
Crawler Operating In page-only mode
Crawler Will Be Preserving 1 Seeds
Crawler Will Be Generating WARC Files Using the filenamified url
Crawler Generated WARCs Will Be Placed At /usr/src/app/Squidwarc
Crawler Is Connecting To Chrome On Host localhost
Crawler Is Connecting To Chrome On Port 9222
Crawler Will Be Waiting At Maximum For Navigation To Happen For 8s
Crawler Will Be Waiting After For 2 inflight requests
A Fatal Error Occurred
  TypeError: input.on is not a function

  - readline.js:189 new Interface
    readline.js:189:11

  - readline.js:69 Object.createInterface
    readline.js:69:10

  - launcher.js:436 Promise
    /usr/src/app/Squidwarc/lib/crawler/launcher.js:436:25

  - new Promise

  - launcher.js:435 waitForWSEndpoint
    /usr/src/app/Squidwarc/lib/crawler/launcher.js:435:10

  - launcher.js:255 Function.launch
    /usr/src/app/Squidwarc/lib/crawler/launcher.js:255:31

Please Inform The Maintainer Of This Project About It. Information In package.json
events.js:167
      throw er; // Unhandled 'error' event
      ^

What is the expected behavior?

The crawl script should execute correctly.

What's your environment?

Alpine Linux 3.7

/usr/src/app/Squidwarc $ node --version
v10.6.0

/usr/src/app/Squidwarc $ npm --version
6.1.0

/usr/src/app/Squidwarc $ chromium-browser --version
Chromium 64.0.3282.168

Other information

Thank you for what looks like a great project!

N0taN3rd commented 6 years ago

@peterk thanks for opening the issue. Would you mind sharing, if you are able to, the Dockerfile you used when you encountered this issue?

I am currently attempting to reproduce.

peterk commented 6 years ago

Thanks for looking into it. Here is the Dockerfile:

FROM zenika/alpine-chrome

USER root
RUN apk add git
RUN apk add --update make
RUN apk add --update g++
RUN apk add --update vim
RUN apk add --update bash

RUN git clone https://github.com/N0taN3rd/Squidwarc.git
WORKDIR Squidwarc
RUN npm install

USER chrome
ENTRYPOINT ["chromium-browser", "--headless", "--disable-gpu", "--disable-software-rasterizer", "--no-sandbox"]

(sorry, missed the last line in the first copy/paste)

peterk commented 6 years ago

Process:

docker build -t squidwarc:0.7 .
docker run squidwarc:0.7

(in another terminal, using whatever container ID was assigned when you started it) docker exec -it 809f5544ba2d /bin/bash

(in container) ./run-crawler.sh -c conf.json

N0taN3rd commented 6 years ago

@peterk it appears that zenika/alpine-chrome already launches Chrome, but it does not launch it with --remote-debugging-port=9222, which Squidwarc requires to operate properly.
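One way to confirm whether a running Chrome was started with remote debugging is to query its DevTools HTTP endpoint. The following is a minimal sketch, not part of Squidwarc: the helper names `devtoolsVersionUrl` and `probeDevTools` are made up for illustration, and the host and port defaults are taken from the crawler log above.

```javascript
const http = require('http');

// Build the URL of Chrome's DevTools version endpoint.
// Defaults mirror the host and port shown in the crawler log.
function devtoolsVersionUrl(host = 'localhost', port = 9222) {
  return `http://${host}:${port}/json/version`;
}

// Hypothetical helper: resolves with the parsed JSON if Chrome answers
// on the debugging port, and rejects if nothing is listening there.
function probeDevTools(host = 'localhost', port = 9222, timeoutMs = 2000) {
  return new Promise((resolve, reject) => {
    const req = http.get(devtoolsVersionUrl(host, port), res => {
      let body = '';
      res.setEncoding('utf8');
      res.on('data', chunk => { body += chunk; });
      res.on('end', () => {
        try {
          resolve(JSON.parse(body));
        } catch (err) {
          reject(new Error(`Unexpected response from ${host}:${port}`));
        }
      });
    });
    req.on('error', reject);
    // Abort the request if Chrome does not answer in time.
    req.setTimeout(timeoutMs, () => req.destroy(new Error('Timed out')));
  });
}
```

Calling `probeDevTools().then(info => console.log(info.webSocketDebuggerUrl))` inside the container should print a WebSocket URL when Chrome is listening on port 9222; a rejected promise suggests Chrome was started without the flag.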

My suggestion, when using that image, is to not use the default entry point and instead simply run node index.js -c conf.json. You will also need to modify the conf file to enable headless mode by adding

"headless": true

to it
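As a rough illustration of that change, the sketch below adds the key to an already parsed config object. It assumes the key sits at the top level of conf.json, as the comment above suggests; the `withHeadless` helper and the sample `mode` and `connect` fields are illustrative stand-ins inferred from the crawler log, not the actual contents of the shipped file.

```javascript
// Return a copy of a parsed crawl config with headless mode switched on.
// The original object is left untouched.
function withHeadless(conf) {
  return Object.assign({}, conf, { headless: true });
}

// Hypothetical stand-in for the parsed contents of conf.json.
const conf = { mode: 'page-only', connect: { host: 'localhost', port: 9222 } };

// Print the updated config as it would look when written back to disk.
console.log(JSON.stringify(withHeadless(conf), null, 2));
```

Editing conf.json by hand works just as well; the point is only that the existing settings stay as they are and `"headless": true` is added alongside them.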

N0taN3rd commented 6 years ago

I have also updated the README.md file to explain usage of Squidwarc@1.2.0

peterk commented 6 years ago

Great! Thank you!