chrismattmann / nutch-python

Nutch-Python is a Python binding to the Apache Nutch™ REST services allowing Nutch to be called natively in the Python community.
Apache License 2.0

Error: raise NutchCrawlException nutch.nutch.NutchCrawlException while indexing #19

Status: Open · purushottam-bitle opened this issue 6 years ago

purushottam-bitle commented 6 years ago

I'm trying to run this script but I'm getting an error while indexing the data (Job Status: Failed).

https://github.com/chrismattmann/nutch-python/wiki

Script:

```python
from nutch.nutch import Nutch
from nutch.nutch import SeedClient
from nutch.nutch import Server
from nutch.nutch import JobClient
import nutch

sv = Server('http://localhost:8081')
sc = SeedClient(sv)
seed_urls = ('http://espn.go.com', 'http://www.espn.com')
sd = sc.create('espn-seed', seed_urls)

nt = Nutch('default')
jc = JobClient(sv, 'test', 'default')

cc = nt.Crawl(sd, sc, jc)
while True:
    print("---------------- Printing Job Progress ------- : ", cc.progress)
    job = cc.progress()  # gets the current job if no progress, else iterates and makes progress
    print("---------------- Printing Job Progress ------- : ", job)
    if job is None:
        break
```

Output:

```
---------------- Printing Job Progress ------- :  <nutch.nutch.Job object at 0x7f37df6a2d30>
---------------- Printing Job Progress ------- :  <bound method CrawlClient.progress of <nutch.nutch.CrawlClient object at 0x7f37e42f01d0>>
nutch.py: GET Endpoint: /job/test-default-INDEX-1148278571
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Content-Type': 'application/json', 'Date': 'Thu, 19 Jul 2018 07:20:22 GMT', 'Transfer-Encoding': 'chunked', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {'id': 'test-default-INDEX-1148278571', 'type': 'INDEX', 'confId': 'default', 'args': {'url_dir': 'seedFiles/seed-1531984813255'}, 'result': None, 'state': 'RUNNING', 'msg': 'OK', 'crawlId': 'test'}
@@@@@@@@@@@@@@@@@@@@@@@ Job STATE @@@@@@@@@@@@@@@@ : RUNNING

---------------- Printing Job Progress ------- :  <nutch.nutch.Job object at 0x7f37df6a2d30>
---------------- Printing Job Progress ------- :  <bound method CrawlClient.progress of <nutch.nutch.CrawlClient object at 0x7f37e42f01d0>>
nutch.py: GET Endpoint: /job/test-default-INDEX-1148278571
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Content-Type': 'application/json', 'Date': 'Thu, 19 Jul 2018 07:20:22 GMT', 'Transfer-Encoding': 'chunked', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {'id': 'test-default-INDEX-1148278571', 'type': 'INDEX', 'confId': 'default', 'args': {'url_dir': 'seedFiles/seed-1531984813255'}, 'result': None, 'state': 'FAILED', 'msg': 'ERROR: java.io.IOException: Job failed!', 'crawlId': 'test'}
@@@@@@@@@@@@@@@@@@@@@@@ Job STATE @@@@@@@@@@@@@@@@ : FAILED
Traceback (most recent call last):
  File "test.py", line 18, in <module>
    job = cc.progress()  # gets the current job if no progress, else iterates and makes progress
  File "/home/purushottam/Documents/tech_learn/ex_nutch/nutch-python/nutch/nutch.py", line 563, in progress
    raise NutchCrawlException
nutch.nutch.NutchCrawlException
```
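Note that the bare `NutchCrawlException` is raised once the client sees the job's `state` flip to `FAILED`; the actual failure reason is already present in the job document the REST API returned. A minimal sketch, parsing the final `FAILED` response from the log above (rewritten as JSON) to surface the server-side message:

```python
import json

# Job status document returned by the Nutch REST endpoint /job/<id>,
# copied from the final response in the log above.
payload = '''{
  "id": "test-default-INDEX-1148278571",
  "type": "INDEX",
  "confId": "default",
  "args": {"url_dir": "seedFiles/seed-1531984813255"},
  "result": null,
  "state": "FAILED",
  "msg": "ERROR: java.io.IOException: Job failed!",
  "crawlId": "test"
}'''

job = json.loads(payload)
# Report the server-side failure message instead of only the bare exception.
if job["state"] == "FAILED":
    print(f"Job {job['id']} failed: {job['msg']}")
# → Job test-default-INDEX-1148278571 failed: ERROR: java.io.IOException: Job failed!
```

The `msg` field points at the Hadoop log, which is where the real cause shows up.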

Nutch Hadoop log file:

```
2018-07-19 13:04:16,524 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: test/crawldb
2018-07-19 13:04:16,524 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: test/segments/20180719002543
2018-07-19 13:04:16,528 WARN mapreduce.JobResourceUploader - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2018-07-19 13:04:16,726 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: content dest: content
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: title dest: title
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: host dest: host
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: segment dest: segment
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: boost dest: boost
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: digest dest: digest
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2018-07-19 13:04:16,738 WARN mapred.LocalJobRunner - job_local1889671569_0083
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html.

Error 404 Not Found

HTTP ERROR 404
Problem accessing /solr/update. Reason:
    Not Found

	at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
	at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html.

Error 404 Not Found

HTTP ERROR 404
Problem accessing /solr/update. Reason:
    Not Found

	at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:544)
	at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
	at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
	at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
	at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:482)
	at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463)
	at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:191)
	at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:179)
	at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
	at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
	at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
	at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
	at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
	at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
	at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
	at java.util.concurrent.FutureTask.run(FutureTask.java:266)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

2018-07-19 13:04:17,636 ERROR impl.JobWorker - Cannot run job worker!
java.io.IOException: Job failed!
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
	at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
	at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:96)
	at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:89)
	at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:351)
	at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:73)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)
```
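For what it's worth, the `HTTP ERROR 404 ... Problem accessing /solr/update` in the trace shows the SolrIndexWriter posting to `http://127.0.0.1:8983/solr/update`, i.e. a Solr base URL with no core in the path. Recent Solr versions only serve `/update` under a core, so `solr.server.url` in the Nutch configuration typically needs the core appended (the `nutch` core name below is a hypothetical example, not taken from this setup). A minimal sketch of the difference:

```python
def solr_update_url(base_url, core=None):
    """Build the update endpoint the Solr client derives from a base URL.

    If the configured URL carries no core, the client posts to
    <base>/update, which modern Solr answers with 404 Not Found.
    """
    base_url = base_url.rstrip('/')
    return f"{base_url}/update" if core is None else f"{base_url}/{core}/update"

# What the log shows Nutch posting to (404 Not Found):
print(solr_update_url("http://127.0.0.1:8983/solr"))
# → http://127.0.0.1:8983/solr/update

# What a core-qualified solr.server.url resolves to ('nutch' is hypothetical):
print(solr_update_url("http://127.0.0.1:8983/solr", core="nutch"))
# → http://127.0.0.1:8983/solr/nutch/update
```

Checking that the core-qualified URL answers in a browser (or with `curl`) before re-running the crawl would confirm whether this is the cause.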