IBM-Watson / nutch-indexer-discovery

Watson Discovery Service indexing plugin for Apache Nutch

Crawler stuck on nutch InjectorJob #4

Open havardthom opened 5 years ago

havardthom commented 5 years ago

Hi, I just installed this crawler and I'm having an issue. I'm testing the crawler with a single URL, and it seems to get stuck on the Nutch InjectorJob. Nothing happens after the following output:

[nutch-indexer-discovery]$ ./crawl
Injecting urls from ./seed/urls.txt
./build/apache-nutch-2.3.1/runtime/local/bin/nutch inject ./seed/urls.txt
InjectorJob: starting at 2018-10-23 13:13:36
InjectorJob: Injecting urlDir: seed/urls.txt

Installation and setup went fine, except for a warning when I ran ./gradlew buildPlugin:

[ant:taskdef] Could not load definitions from resource org/sonar/ant/antlib.xml. It could not be found.

Any idea what might be wrong here?

havardthom commented 5 years ago

So it's not stuck, just very, very slow: it took 2 hours to inject one URL. It's currently at this stage:

./build/apache-nutch-2.3.1/runtime/local/bin/nutch inject ./seed/urls.txt
InjectorJob: starting at 2018-10-23 13:19:22
InjectorJob: Injecting urlDir: seed/urls.txt
InjectorJob: Using class org.apache.gora.hbase.store.HBaseStore as the Gora storage class.
InjectorJob: total number of urls rejected by filters: 0
InjectorJob: total number of urls injected after normalization and filtering: 1
Injector: finished at 2018-10-23 15:21:13, elapsed: 02:01:50
Generate urls: 
./build/apache-nutch-2.3.1/runtime/local/bin/nutch generate -topN 5
GeneratorJob: starting at 2018-10-23 15:21:14
GeneratorJob: Selecting best-scoring urls due for fetch.
GeneratorJob: starting
GeneratorJob: filtering: true
GeneratorJob: normalizing: true
GeneratorJob: topN: 5
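
For anyone comparing timings like these, here is a minimal sketch of computing the elapsed time of a job from its start and finish lines. It assumes the timestamp layout seen in the output above; the `elapsed_between` helper and the regexes are illustrative names, not part of Nutch or this plugin.

```python
import re
from datetime import datetime, timedelta

# Timestamp layout copied from the log lines above, e.g. "2018-10-23 13:19:22".
STAMP = r"(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})"
STAMP_FORMAT = "%Y-%m-%d %H:%M:%S"


def elapsed_between(log_text: str) -> timedelta:
    """Time between the first 'starting at' and the first 'finished at' stamp."""
    start = re.search(r"starting at " + STAMP, log_text)
    finish = re.search(r"finished at " + STAMP, log_text)
    if start is None or finish is None:
        # A job that is still running has no 'finished at' line yet.
        raise ValueError("log has no complete start/finish pair yet")
    return (datetime.strptime(finish.group(1), STAMP_FORMAT)
            - datetime.strptime(start.group(1), STAMP_FORMAT))


# Sample lines taken from the inject run above.
LOG = """\
InjectorJob: starting at 2018-10-23 13:19:22
InjectorJob: Injecting urlDir: seed/urls.txt
Injector: finished at 2018-10-23 15:21:13, elapsed: 02:01:50
"""

print(elapsed_between(LOG))  # prints 2:01:51
```

The result computed from the whole-second timestamps is one second longer than the `elapsed: 02:01:50` that the job reports, presumably because the job measures with sub-second precision before formatting.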