ContinuumIO / nutchpy

For interacting with nutch via Python
Apache License 2.0

Error when trying to run nutch crawl #11

Closed. chdoig closed this issue 9 years ago.

chdoig commented 9 years ago
(memex-explorer)cdoig@066-cdoig:~$ crawl ~/work/memex/memex/court_docs/raw_data_no_comments/seed_dir/  ~/work/memex/memex/court_docs/crawl_test 4
JAVA_HOME is set to '/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home'
~/anaconda/envs/memex-explorer/lib/nutch ~
No SOLRURL specified. Skipping indexing.
Injecting seed URLs
/Users/cdoig/anaconda/envs/memex-explorer/lib/nutch/bin/nutch inject /Users/cdoig/work/memex/memex/court_docs/crawl_test/crawldb /Users/cdoig/work/memex/memex/court_docs/raw_data_no_comments/seed_dir/
Injector: starting at 2015-03-27 09:55:10
Injector: crawlDb: /Users/cdoig/work/memex/memex/court_docs/crawl_test/crawldb
Injector: urlDir: /Users/cdoig/work/memex/memex/court_docs/raw_data_no_comments/seed_dir
Injector: Converting injected urls to crawl db entries.
Injector: java.net.UnknownHostException: 066-cdoig: 066-cdoig: nodename nor servname provided, or not known
    at java.net.InetAddress.getLocalHost(InetAddress.java:1473)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:960)
    at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
    at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:910)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1353)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:324)
    at org.apache.nutch.crawl.Injector.run(Injector.java:380)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:370)
Caused by: java.net.UnknownHostException: 066-cdoig: nodename nor servname provided, or not known
    at java.net.Inet6AddressImpl.lookupAllHostAddr(Native Method)
    at java.net.InetAddress$1.lookupAllHostAddr(InetAddress.java:901)
    at java.net.InetAddress.getAddressesFromNameService(InetAddress.java:1293)
    at java.net.InetAddress.getLocalHost(InetAddress.java:1469)
    ... 12 more

Error running:
  /Users/cdoig/anaconda/envs/memex-explorer/lib/nutch/bin/nutch inject /Users/cdoig/work/memex/memex/court_docs/crawl_test/crawldb /Users/cdoig/work/memex/memex/court_docs/raw_data_no_comments/seed_dir/
Failed with exit value 255.
~
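
The root failure above is java.net.InetAddress.getLocalHost() throwing UnknownHostException: Hadoop's JobClient looks up the local hostname at job submission, and OS X cannot resolve the machine's own name (066-cdoig). A common workaround, sketched here on the assumption that you can edit /etc/hosts, is to map that hostname to the loopback address and rerun the crawl:

  # Confirm the hostname Hadoop will try to resolve
  hostname
  # Map it to loopback so InetAddress.getLocalHost() succeeds
  echo "127.0.0.1 066-cdoig" | sudo tee -a /etc/hosts
  # Verify the name resolves before rerunning the crawl
  ping -c 1 066-cdoig
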
aterrel commented 9 years ago

Seed list at: https://github.com/ContinuumIO/memex/blob/master/court_docs/raw_data_no_comments/seednc.txt

quasiben commented 9 years ago

I just tried this with the latest nutch from Binstar (conda install -c quasiben nutch) and had no issue with the above seed list:

(cc_dev)quasiben@dirty-horse:~$ crawl ~/urls ~/CRAWL_OUTPUT 4
JAVA_HOME is set to '/Library/Java/JavaVirtualMachines/jdk1.7.0_45.jdk/Contents/Home'
~/anaconda/envs/cc_dev/lib/nutch ~
No SOLRURL specified. Skipping indexing.
Injecting seed URLs
/Users/quasiben/anaconda/envs/cc_dev/lib/nutch/bin/nutch inject /Users/quasiben/CRAWL_OUTPUT/crawldb /Users/quasiben/urls
Injector: starting at 2015-03-27 10:31:49
Injector: crawlDb: /Users/quasiben/CRAWL_OUTPUT/crawldb
Injector: urlDir: /Users/quasiben/urls
Injector: Converting injected urls to crawl db entries.
2015-03-27 10:31:49.751 java[3166:1580566] Unable to load realm info from SCDynamicStore
Injector: overwrite: false
Injector: update: false
Injector: Total number of urls rejected by filters: 0
Injector: Total number of urls after normalization: 97
Injector: Total new urls injected: 97
Injector: finished at 2015-03-27 10:31:51, elapsed: 00:00:02
Fri Mar 27 10:31:52 CDT 2015 : Iteration 1 of 4
Generating a new segment
/Users/quasiben/anaconda/envs/cc_dev/lib/nutch/bin/nutch generate -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true /Users/quasiben/CRAWL_OUTPUT/crawldb /Users/quasiben/CRAWL_OUTPUT/segments -topN 50000 -numFetchers 1 -noFilter
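
As a quick check that the injection step really took, the crawldb statistics can be dumped with Nutch's readdb tool; the paths below are the ones from the log above:

  /Users/quasiben/anaconda/envs/cc_dev/lib/nutch/bin/nutch readdb /Users/quasiben/CRAWL_OUTPUT/crawldb -stats

The TOTAL urls line it reports should match the 97 URLs the Injector said it injected.
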
chrismattmann commented 9 years ago

This seems to be associated with the conda nutch package, not with this package in particular.
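
If so, it should reproduce without nutchpy involved. A quick way to confirm which build is in play (assuming a conda setup like the one in the logs above):

  # Which crawl script is actually on PATH?
  which crawl
  # Which nutch package/version is installed in the active environment?
  conda list nutch
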

narendrakadari commented 8 years ago

How do I solve this error? I am using Solr 4.10.3, HBase 0.98.19.hadoop, and Nutch 2.3.1.

IndexingJob: starting
SolrIndexerJob: java.lang.RuntimeException: job failed: name=[TestCrawl5]Indexer, jobid=job_local112769475_0001
    at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:120)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:154)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:176)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:202)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:211)

Error running:
  /usr/local/apache-nutch-2.3.1/runtime/local/bin/nutch index -D mapred.reduce.tasks=2 -D mapred.child.java.opts=-Xmx1000m -D mapred.reduce.tasks.speculative.execution=false -D mapred.map.tasks.speculative.execution=false -D mapred.compress.map.output=true -D solr.server.url=http://localhost:8983/solr/#/collection1 -all -crawlId TestCrawl5
Failed with exit value 255.
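
One detail worth checking in that command: solr.server.url is set to http://localhost:8983/solr/#/collection1, which is the browser admin-UI address; the fragment after the # is never sent to the server. Assuming collection1 is the target core, the indexer would normally be pointed at the plain core URL instead, e.g.:

  /usr/local/apache-nutch-2.3.1/runtime/local/bin/nutch index -D solr.server.url=http://localhost:8983/solr/collection1 -all -crawlId TestCrawl5
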

chrismattmann commented 8 years ago

Try using http://github.com/chrismattmann/nutch-python. Also note that it works with Nutch 1.x, not 2.x.
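
For reference, nutch-python drives Nutch through its REST API, so a Nutch 1.x REST server has to be running first. A minimal setup sketch, assuming a Nutch 1.x runtime directory and the default port from the nutch-python docs:

  # Install the client straight from the repo (a PyPI package may also exist)
  pip install git+https://github.com/chrismattmann/nutch-python.git
  # From the Nutch 1.x runtime/local directory, start the REST server
  bin/nutch startserver -port 8081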