Closed: chdoig closed this issue 9 years ago.
I just tried this with the latest nutch from binstar: conda install -c quasiben nutch
and had no issue with the above seed list:
(cc_dev)quasiben@dirty-horse:~$ crawl ~/urls ~/CRAWL_OUTPUT 4
JAVA_HOME is set to '/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home'
~/anaconda/envs/cc_dev/lib/nutch ~
No SOLRURL specified. Skipping indexing.
Injecting seed URLs
/Users/quasiben/anaconda/envs/cc_dev/lib/nutch/bin/nutch inject /Users/quasiben/CRAWL_OUTPUT/crawldb /Users/quasiben/urls
Injector: starting at 2015-03-27 10:31:49
Injector: crawlDb: /Users/quasiben/CRAWL_OUTPUT/crawldb
Injector: urlDir: /Users/quasiben/urls
Injector: Converting injected urls to crawl db entries.
2015-03-27 10:31:49.751 java[3166:1580566] Unable to load realm info from SCDynamicStore
Injector: overwrite: false
Injector: update: false
Injector: Total number of urls rejected by filters: 0
Injector: Total number of urls after normalization: 97
Injector: Total new urls injected: 97
Injector: finished at 2015-03-27 10:31:51, elapsed: 00:00:02
Fri Mar 27 10:31:52 CDT 2015 : Iteration 1 of 4
Generating a new segment
/Users/quasiben/anaconda/envs/cc_dev/lib/nutch/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true /Users/quasiben/CRAWL_OUTPUT/crawldb /Users/quasiben/CRAWL_OUTPUT/segments -topN 50000 -numFetchers 1 -noFilter
This seems to be associated with the conda nutch package, not with this package in particular.
How do I solve this error? I am using Solr 4.10.3, HBase 0.98.19.hadoop, and Nutch 2.3.1.
IndexingJob: starting
SolrIndexerJob: java.lang.RuntimeException: job failed: name=[TestCrawl5]Indexer, jobid=job_local112769475_0001
	at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
	at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
	at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
	at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
	at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)
Error running: /usr/local/apache-nutch-2.3.1/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr/#/collection1 -all -crawlId TestCrawl5
Failed with exit value 255.
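One thing worth double-checking (an observation, not a confirmed fix for this error): the solr.server.url in the failing command points at http://localhost:8983/solr/#/collection1, which is the Solr admin-UI URL; everything after the # is a browser-side fragment and is never sent to the server. Solr's HTTP API expects the core name directly in the path, so the flag would normally be written as:

```
-D solr.server.url=http://localhost:8983/solr/collection1
```

If the job still fails after correcting the URL, the underlying cause should appear in runtime/local/logs/hadoop.log.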
Try using http://github.com/chrismattmann/nutch-python. Note that it works with Nutch 1.x, not 2.x.