USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Crawler succeeds but data is not populated into the dashboard or output file #165

Closed kavitasharma21 closed 3 years ago

kavitasharma21 commented 6 years ago

I am new to the Sparkler project. I followed the instructions in the documentation; below are the logs:

2018-06-22 16:09:19 WARN  NativeCodeLoader:62 [main] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-06-22 16:09:22 INFO  Crawler$:150 [main] - Starting the job:j1, task:20180622160922
2018-06-22 16:09:23 INFO  CrawlDbRDD$:75 [main] - selecting 1 out of 1
2018-06-22 16:09:23 WARN  ClosureCleaner:70 [main] - Expected a closure; got edu.usc.irds.sparkler.pipeline.ScoreFunction
2018-06-22 16:09:23 WARN  ClosureCleaner:70 [main] - Expected a closure; got edu.usc.irds.sparkler.solr.SolrUpsert
2018-06-22 16:09:24 INFO  PluginService$:53 [Executor task launch worker-0] - Loading plugins...
2018-06-22 16:09:24 ERROR CompoundPluginDescriptorFinder:71 [Executor task launch worker-0] - Cannot find 'plugin.properties' path
2018-06-22 16:09:24 ERROR CompoundPluginDescriptorFinder:71 [Executor task launch worker-0] - Cannot find 'plugin.properties' path
2018-06-22 16:09:24 ERROR CompoundPluginDescriptorFinder:71 [Executor task launch worker-0] - Cannot find 'plugin.properties' path
2018-06-22 16:09:24 ERROR CompoundPluginDescriptorFinder:71 [Executor task launch worker-0] - Cannot find 'plugin.properties' path
2018-06-22 16:09:24 ERROR CompoundPluginDescriptorFinder:71 [Executor task launch worker-0] - Cannot find 'plugin.properties' path
2018-06-22 16:09:24 INFO  PluginService$:62 [Executor task launch worker-0] - 2 plugin(s) Active: [urlfilter-regex, urlfilter-samehost]
2018-06-22 16:09:24 WARN  PluginService$:65 [Executor task launch worker-0] - 3 extra plugin(s) available but not activated: Set(template-plugin, fetcher-jbrowser, fetcher-htmlunit)
2018-06-22 16:09:24 INFO  PluginService$:79 [Executor task launch worker-0] - Recognised Plugins: Map(urlfilter-regex -> edu.usc.irds.sparkler.plugin.RegexURLFilter, urlfilter-samehost -> edu.usc.irds.sparkler.plugin.UrlFilterSameHost)
2018-06-22 16:09:24 INFO  FetcherDefault:109 [Executor task launch worker-0] - DEFAULT FETCHER https://spark.apache.org/
2018-06-22 16:09:29 WARN  FetcherDefault:153 [Executor task launch worker-0] - FETCH-ERROR https://spark.apache.org/
2018-06-22 16:09:29 INFO  ParseFunction$:49 [Executor task launch worker-0] - PARSING  https://spark.apache.org/
2018-06-22 16:09:30 INFO  PluginService$:106 [Executor task launch worker-0] - Chaining [edu.usc.irds.sparkler.plugin.RegexURLFilter@2de54b17, edu.usc.irds.sparkler.plugin.UrlFilterSameHost@26643e63] using class edu.usc.irds.sparkler.service.RejectingURLFilterChain
2018-06-22 16:09:30 INFO  PluginService$:110 [Executor task launch worker-0] - Initialize class edu.usc.irds.sparkler.plugin.RegexURLFilter as urlfilter-regex
2018-06-22 16:09:30 INFO  PluginService$:110 [Executor task launch worker-0] - Initialize class edu.usc.irds.sparkler.plugin.UrlFilterSameHost as urlfilter-samehost
2018-06-22 16:09:30 INFO  RejectingURLFilterChain:69 [Executor task launch worker-0] - Initializing edu.usc.irds.sparkler.service.RejectingURLFilterChain with 2 extensions: [edu.usc.irds.sparkler.plugin.RegexURLFilter@2de54b17, edu.usc.irds.sparkler.plugin.UrlFilterSameHost@26643e63]
2018-06-22 16:09:30 INFO  SolrUpsert$:51 [Executor task launch worker-0] - Inserting new resources to Solr
2018-06-22 16:09:30 WARN  ClosureCleaner:70 [main] - Expected a closure; got edu.usc.irds.sparkler.solr.SolrStatusUpdate
2018-06-22 16:09:30 INFO  Crawler$:212 [main] - Storing output at j1/20180622160922
2018-06-22 16:09:30 INFO  Crawler$:164 [main] - ===End of iteration 1 Committing crawldb..===
2018-06-22 16:09:30 INFO  Crawler$:170 [main] - Shutting down Spark CTX..
2018-06-22 16:09:31 WARN  PluginService$:49 [Thread-19] - Stopping all plugins... Runtime is about to exit.     

The job succeeds, but no data has been populated into the dashboard or the part-0000 output file. Please let me know what the issue is.

thammegowda commented 6 years ago

@kavitasharma21

2018-06-22 16:09:29 WARN FetcherDefault:153 [Executor task launch worker-0] - FETCH-ERROR https://spark.apache.org/

This line shows that the fetcher couldn't fetch the URL. I'm not sure why; here are my best guesses:

  1. It could be a network outage at the time you ran the crawl job.
  2. The URL is https:// and the security setup on your JVM may be too strict. (Sparkler's default fetcher uses the JVM's URLConnection class to fetch web pages; it has proved fairly reliable in our past experience.)
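
A quick way to test guess 2 is to try the same fetch with the JVM's own URLConnection, outside Sparkler, and see what exception comes back (an SSL handshake failure would point at the JVM's security setup). A minimal sketch; the class name, timeouts, and user-agent here are my own choices, not Sparkler's:

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchCheck {
    // Returns the HTTP status code, or -1 if the fetch failed.
    // The exception is printed so the root cause (DNS, SSL, timeout) is visible.
    static int fetchStatus(String url) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(url).openConnection();
            conn.setRequestMethod("GET");
            conn.setConnectTimeout(10_000);
            conn.setReadTimeout(10_000);
            conn.setRequestProperty("User-Agent", "fetch-check");
            return conn.getResponseCode();
        } catch (IOException e) {
            System.err.println("FETCH-ERROR cause: " + e);
            return -1;
        }
    }

    public static void main(String[] args) {
        String url = args.length > 0 ? args[0] : "https://spark.apache.org/";
        System.out.println(url + " -> " + fetchStatus(url));
    }
}
```

If this prints a `javax.net.ssl` exception, the JVM's security configuration is the likely culprit; a plain connect/timeout error points at the network instead.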

What I suggest: re-run the crawl, watch the logs for any useful messages, and let us know if the issue is reproducible.
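
Besides the logs, you can check directly whether anything made it into the Solr crawldb that the dashboard reads from. A sketch that queries Solr for the document count, assuming the default local core at http://localhost:8983/solr/crawldb (adjust the URL if your setup differs; `CrawldbCheck` and `countDocs` are hypothetical names for illustration):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class CrawldbCheck {
    // Returns numFound from a Solr select response, or -1 if the
    // query failed (Solr down, wrong URL, unparseable response).
    static long countDocs(String selectUrl) {
        try {
            HttpURLConnection conn = (HttpURLConnection) new URL(selectUrl).openConnection();
            conn.setConnectTimeout(5_000);
            conn.setReadTimeout(5_000);
            StringBuilder body = new StringBuilder();
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream()))) {
                String line;
                while ((line = in.readLine()) != null) body.append(line);
            }
            // Pull numFound out of the JSON response without a JSON library.
            Matcher m = Pattern.compile("\"numFound\"\\s*:\\s*(\\d+)").matcher(body);
            return m.find() ? Long.parseLong(m.group(1)) : -1;
        } catch (IOException e) {
            System.err.println("Query failed: " + e);
            return -1;
        }
    }

    public static void main(String[] args) {
        String base = args.length > 0 ? args[0] : "http://localhost:8983/solr/crawldb";
        System.out.println("docs in crawldb: "
                + countDocs(base + "/select?q=*:*&rows=0&wt=json"));
    }
}
```

A count of 0 (or -1) would explain an empty dashboard and an empty part-0000 file: nothing was ever upserted after the fetch error.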