USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0
410 stars, 143 forks

Argument '-i -1' does not work. #185

Open MobinRanjbar opened 4 years ago

MobinRanjbar commented 4 years ago

Hi there,

I want to crawl the whole content of a website. When I run the command below, the crawling process does not start. What is wrong?

bin/sparkler.sh crawl -id 1 -i -1

Output:
2020-06-19 12:38:06 INFO Crawler$:153 - Committing crawldb..
2020-06-19 12:38:06 INFO Crawler$:221 - Shutting down Spark CTX..

thammegowda commented 4 years ago

Sparkler does nothing when there are no URLs to crawl, and your output suggests that there are no new URLs to be crawled.
Try injecting some new URLs and then run the crawl again.
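
To illustrate the behaviour described above, here is a minimal toy sketch in Python of a crawl loop that stops as soon as there is nothing pending. It is not Sparkler's code: the function `crawl`, its parameters, and the log messages are all hypothetical, and it does not model how Sparkler interprets a negative `-i` value.

```python
def crawl(pending_urls, iterations, top_n):
    """Toy crawl loop (hypothetical, not Sparkler's implementation).

    Returns the log lines the loop would emit.
    """
    log = []
    frontier = list(pending_urls)
    for i in range(1, iterations + 1):
        # Take up to top_n URLs for this iteration.
        batch, frontier = frontier[:top_n], frontier[top_n:]
        if not batch:
            # Nothing left to fetch: stop iterating early.
            break
        log.append(f"iteration {i}: fetching {len(batch)} URL(s)")
    # Commit and shutdown happen whether or not anything was fetched.
    log.append("committing crawl database")
    log.append("shutting down")
    return log


if __name__ == "__main__":
    # Empty frontier: only the commit and shutdown lines appear.
    print(crawl([], iterations=3, top_n=100))
    # One pending seed: a fetch line appears before commit and shutdown.
    print(crawl(["https://www.nasa.gov/"], iterations=3, top_n=100))
```

With an empty frontier, the sketch goes straight to the commit and shutdown messages, which is the same shape as the output reported in this issue.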

MobinRanjbar commented 4 years ago

Hi,

I had already injected a new URL before running the crawl, as shown below. The same thing happens.

bin/sparkler.sh inject -id 1 -su 'https://www.nasa.gov/'

thammegowda commented 4 years ago

I am guessing there is an error in your setup. Have you tried the Docker image (https://hub.docker.com/r/uscdatascience/sparkler/tags)? Could you please give it a try?

CC @buggtb: do you have any guesses on why, when, or how this case might happen?

MobinRanjbar commented 4 years ago

Hi,

The same thing happened in Docker:

sparkler@292e25536b51:/data/sparkler$ bin/sparkler.sh inject -id 1 -su 'https://www.nasa.gov/'
2020-06-23 07:46:16 INFO Injector$:97 - Injecting 1 seeds
jobId = 1
sparkler@292e25536b51:/data/sparkler$ bin/sparkler.sh crawl -id 1 -tn 100 -i -1
2020-06-23 07:46:35 WARN NativeCodeLoader:60 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2020-06-23 07:46:40 INFO Crawler$:153 - Committing crawldb..
2020-06-23 07:46:40 INFO Crawler$:221 - Shutting down Spark CTX..
sparkler@292e25536b51:/data/sparkler$

Have you ever tried that argument?