commoncrawl / news-crawl

News crawling with StormCrawler - stores content as WARC
Apache License 2.0

Unable to fetch data from Elasticsearch, no content is showing #38

Closed kaal-zathura24 closed 4 years ago

kaal-zathura24 commented 4 years ago

I have used the following GitHub repository: https://github.com/commoncrawl/news-crawl. It uses the following versions of the required libraries:

- Elasticsearch 7.5.0
- Apache Storm 1.2.3
- StormCrawler 1.16
- Maven 3.6.2

I have followed the steps given in the README, but localhost:9200 is not showing any hits. The crawl command runs successfully, but it reports some FETCH_ERROR errors. No content or URLs are shown.
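To see whether the crawl is making progress, you can query the status index directly instead of browsing localhost:9200. A minimal sketch in Python, assuming Elasticsearch is reachable on localhost:9200 and the index uses the StormCrawler default name "status" (adjust both to your es-conf.yaml if you changed them):

```python
import json
from urllib import request

# Count documents per status value (DISCOVERED, FETCHED, FETCH_ERROR, ...)
# via a terms aggregation. Host and index name are assumptions based on
# the StormCrawler defaults; adjust them to your setup.
ES_URL = "http://localhost:9200/status/_search"
query = {
    "size": 0,
    "aggs": {"by_status": {"terms": {"field": "status"}}},
}

req = request.Request(
    ES_URL,
    data=json.dumps(query).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
try:
    with request.urlopen(req, timeout=5) as resp:
        body = json.load(resp)
    for bucket in body["aggregations"]["by_status"]["buckets"]:
        print(bucket["key"], bucket["doc_count"])
except OSError as exc:  # connection refused, timeout, HTTP error, ...
    print("Elasticsearch not reachable:", exc)
```

If every URL ends up as FETCH_ERROR, the per-URL reason is in the worker logs, as described in the reply below the question.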

sebastian-nagel commented 4 years ago

That fetches fail and show up as FETCH_ERROR in the status index isn't unexpected; there are many possible reasons. Could you check the crawler logs (usually /var/log/storm/workers-artifacts/NewsCrawl*/*/worker.log) for errors? Every fetch is logged, and if it fails the reason should be given there.

Also important: the project does not index content into Elasticsearch; only the status index is filled. The content is stored in WARC files, placed in /data/warc/ by default. If you want to index the news content, that is possible, but it won't be part of this project. Have a look at the StormCrawler documentation to figure out how you need to modify the crawler topology to index page content.
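As a rough illustration of the last point: if your topology is defined in Flux, indexing page content typically means wiring StormCrawler's Elasticsearch IndexerBolt into it. The sketch below is an assumption-laden fragment, not part of this project; the bolt id, the upstream component name "fetch", and the parallelism are all illustrative, and the full configuration (index name, field mapping) lives in the stormcrawler-elasticsearch module's documentation.

```yaml
# Hypothetical Flux fragment: add an indexer bolt fed by the fetcher.
# Component ids and parallelism are illustrative, not from this repo.
bolts:
  - id: "index"
    className: "com.digitalpebble.stormcrawler.elasticsearch.bolt.IndexerBolt"
    parallelism: 1

streams:
  - from: "fetch"
    to: "index"
    grouping:
      type: LOCAL_OR_SHUFFLE
```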

kaal-zathura24 commented 4 years ago

Thanks a lot, that helped.

