Script:

```python
from nutch.nutch import Nutch
from nutch.nutch import SeedClient
from nutch.nutch import Server
from nutch.nutch import JobClient
import nutch

sv = Server('http://localhost:8081')
sc = SeedClient(sv)
seed_urls = ('http://espn.go.com', 'http://www.espn.com')
sd = sc.create('espn-seed', seed_urls)

nt = Nutch('default')
jc = JobClient(sv, 'test', 'default')
cc = nt.Crawl(sd, sc, jc)

while True:
    print("---------------- Printing Job Progress ------- : ", cc.progress)
    job = cc.progress()  # gets the current job if no progress, else iterates and makes progress
    print("---------------- Printing Job Progress ------- : ", job)
    if job is None:
        break
```
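One thing worth noting about the loop: the first `print` passes `cc.progress` without calling it, so it prints the bound method object rather than a job — that is what the `<bound method CrawlClient.progress of ...>` line in the output below is. A minimal illustration, using a stand-in class so it does not need a running Nutch server:

```python
class CrawlClientStub:
    """Stand-in for nutch-python's CrawlClient, just to show the difference."""
    def progress(self):
        return "a Job object"

cc = CrawlClientStub()
print(cc.progress)    # the method object: <bound method CrawlClientStub.progress of ...>
print(cc.progress())  # the actual return value
```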
Output (the job polls as `RUNNING` for a while; the final iteration and the resulting traceback are shown):

```
---------------- Printing Job Progress ------- :  <nutch.nutch.Job object at 0x7f37df6a2d30>
---------------- Printing Job Progress ------- :  <bound method CrawlClient.progress of <nutch.nutch.CrawlClient object at 0x7f37e42f01d0>>
nutch.py: GET Endpoint: /job/test-default-INDEX-1148278571
nutch.py: GET Request data: {}
nutch.py: GET Request headers: {'Accept': 'application/json'}
nutch.py: Response headers: {'Content-Type': 'application/json', 'Date': 'Thu, 19 Jul 2018 07:20:22 GMT', 'Transfer-Encoding': 'chunked', 'Server': 'Jetty(8.1.15.v20140411)'}
nutch.py: Response status: 200
nutch.py: Response JSON: {'id': 'test-default-INDEX-1148278571', 'type': 'INDEX', 'confId': 'default', 'args': {'url_dir': 'seedFiles/seed-1531984813255'}, 'result': None, 'state': 'FAILED', 'msg': 'ERROR: java.io.IOException: Job failed!', 'crawlId': 'test'}
@@@@@@@@@@@@@@@@@@@@@@@ Job STATE @@@@@@@@@@@@@@@@ : FAILED
Traceback (most recent call last):
  File "test.py", line 18, in <module>
    job = cc.progress()  # gets the current job if no progress, else iterates and makes progress
  File "/home/purushottam/Documents/tech_learn/ex_nutch/nutch-python/nutch/nutch.py", line 563, in progress
    raise NutchCrawlException
nutch.nutch.NutchCrawlException
```
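As the traceback shows, `cc.progress()` raises `NutchCrawlException` once the job state becomes `FAILED`, so the bare `while True` loop dies instead of reporting which job failed. A sketch of a more defensive polling loop — it is generic (no running Nutch needed): `progress_fn` stands in for `cc.progress` and `exc_type` for `nutch.nutch.NutchCrawlException`:

```python
def poll_crawl(progress_fn, exc_type):
    """Call progress_fn() until it returns None (crawl finished).

    Returns True on normal completion, False if progress_fn raises
    exc_type (the way nutch-python signals a FAILED job)."""
    while True:
        try:
            job = progress_fn()
        except exc_type as exc:
            print("Crawl step failed:", exc)
            return False
        if job is None:
            return True
        print("Job progress:", job)

# Exercising it with fakes instead of a live CrawlClient:
steps = iter(["INJECT done", "FETCH done", None])
print(poll_crawl(lambda: next(steps), RuntimeError))   # True

def failing_step():
    raise RuntimeError("state FAILED")
print(poll_crawl(failing_step, RuntimeError))          # False
```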
Nutch Hadoop log file:

```
2018-07-19 13:04:16,524 INFO indexer.IndexerMapReduce - IndexerMapReduce: crawldb: test/crawldb
2018-07-19 13:04:16,524 INFO indexer.IndexerMapReduce - IndexerMapReduces: adding segment: test/segments/20180719002543
2018-07-19 13:04:16,528 WARN mapreduce.JobResourceUploader - Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
2018-07-19 13:04:16,726 INFO indexer.IndexWriters - Adding org.apache.nutch.indexwriter.solr.SolrIndexWriter
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: content dest: content
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: title dest: title
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: host dest: host
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: segment dest: segment
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: boost dest: boost
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: digest dest: digest
2018-07-19 13:04:16,728 INFO solr.SolrMappingReader - source: tstamp dest: tstamp
2018-07-19 13:04:16,738 WARN mapred.LocalJobRunner - job_local1889671569_0083
java.lang.Exception: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html.
Error 404 Not Found
HTTP ERROR 404
Problem accessing /solr/update. Reason:
Not Found
    at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:462)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:529)
Caused by: org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://127.0.0.1:8983/solr: Expected mime type application/octet-stream but got text/html.
Error 404 Not Found
HTTP ERROR 404
Problem accessing /solr/update. Reason:
Not Found
    at org.apache.solr.client.solrj.impl.HttpSolrClient.executeMethod(HttpSolrClient.java:544)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:240)
    at org.apache.solr.client.solrj.impl.HttpSolrClient.request(HttpSolrClient.java:229)
    at org.apache.solr.client.solrj.SolrRequest.process(SolrRequest.java:149)
    at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:482)
    at org.apache.solr.client.solrj.SolrClient.commit(SolrClient.java:463)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.commit(SolrIndexWriter.java:191)
    at org.apache.nutch.indexwriter.solr.SolrIndexWriter.close(SolrIndexWriter.java:179)
    at org.apache.nutch.indexer.IndexWriters.close(IndexWriters.java:117)
    at org.apache.nutch.indexer.IndexerOutputFormat$1.close(IndexerOutputFormat.java:44)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.close(ReduceTask.java:502)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:456)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:392)
    at org.apache.hadoop.mapred.LocalJobRunner$Job$ReduceTaskRunnable.run(LocalJobRunner.java:319)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
2018-07-19 13:04:17,636 ERROR impl.JobWorker - Cannot run job worker!
java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:873)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:96)
    at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:89)
    at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:351)
    at org.apache.nutch.service.impl.JobWorker.run(JobWorker.java:73)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
```
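The failure itself is visible in the log: Solr answered `404 Not Found` with an HTML error page for `/solr/update` (hence the "Expected mime type application/octet-stream but got text/html" complaint). That path contains no core or collection name. If I read the setup right, Nutch's `SolrIndexWriter` posts to `<solr.server.url>/update`, so a `solr.server.url` of the bare `http://127.0.0.1:8983/solr` root would produce exactly this request, and the base URL most likely needs to include a concrete core. A tiny sketch of the difference (the core name `nutch` here is hypothetical):

```python
def solr_update_url(solr_server_url):
    # Assumption: the Solr index writer appends /update to the configured base URL.
    return solr_server_url.rstrip('/') + '/update'

# The bare /solr root from the log yields exactly the path that returned 404:
print(solr_update_url('http://127.0.0.1:8983/solr'))
# With a core in the base URL, the request hits that core's update handler:
print(solr_update_url('http://127.0.0.1:8983/solr/nutch'))
```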
I'm trying to run the script above (following the nutch-python wiki: https://github.com/chrismattmann/nutch-python/wiki), but I get an error while indexing the data (job state: FAILED).