USCDataScience / sparkler

Spark-Crawler: Apache Nutch-like crawler that runs on Apache Spark.
http://irds.usc.edu/sparkler/
Apache License 2.0

Errors and Exceptions when crawling - html.unit Time out exception #123

Closed amensiko closed 6 years ago

amensiko commented 7 years ago

I am new to Sparkler and to crawling in general. I have successfully worked through all the provided tutorials, but I'm having issues when trying to crawl three links. My main goal is to grab all the PDFs from those links.

I have successfully injected the three seeds:

./sparkler.sh inject -sf seeds.txt 
INFO  Injector$:98 [main] - Injecting 3 seeds
jobId = sjob-1498149633557

But when I crawl like this:

./sparkler.sh crawl -id sjob-1498149633557 -m local[*] -i 1

I get the following errors and exceptions:

2017-06-22 09:42:40 WARN  NativeCodeLoader:62 [main] - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2017-06-22 09:42:42 INFO  Crawler$:138 [main] - Starting the job:sjob-1498149633557, task:20170622094242
2017-06-22 09:42:42 INFO  CrawlDbRDD$:74 [main] - selecting 1 out of 1
2017-06-22 09:42:42 WARN  ClosureCleaner:70 [main] - Expected a closure; got edu.usc.irds.sparkler.solr.SolrStatusUpdate
2017-06-22 09:42:43 INFO  PluginService$:140 [Executor task launch worker-0] - Felix Configuration loaded successfully
2017-06-22 09:42:43 INFO  PluginService$:174 [Executor task launch worker-0] - Activated User bundles count = 2
2017-06-22 09:42:43 INFO  PluginService$:184 [Executor task launch worker-0] - Bundle Available fetcher.htmlunit
2017-06-22 09:42:43 INFO  PluginService$:187 [Executor task launch worker-0] - Starting the bundle name...
2017-06-22 09:42:43 INFO  PluginService$:190 [Executor task launch worker-0] - Bundle available but not required : fetcher.jbrowser.
2017-06-22 09:42:43 INFO  PluginService$:184 [Executor task launch worker-0] - Bundle Available urlfilter.regex
2017-06-22 09:42:43 INFO  PluginService$:187 [Executor task launch worker-0] - Starting the bundle name...
2017-06-22 09:42:43 INFO  HtmlUnitJsFetcherActivator:42 [FelixStartLevel] - Activating HtmlUnitFetcher Plugin
2017-06-22 09:42:44 INFO  RegexURLFilterActivator:40 [FelixStartLevel] - Activating RegexURL Plugin
2017-06-22 09:42:44 INFO  HtmlUnitFetcher:80 [Executor task launch worker-0] - Found 3 headers
2017-06-22 09:42:44 INFO  HtmlUnitFetcher:89 [Executor task launch worker-0] - HtmlUnit FETCHER https://web.archive.org/web/20170131094439/https://standards.nasa.gov/
2017-06-22 09:42:55 ERROR DefaultJavaScriptErrorListener:53 [Executor task launch worker-0] - Error loading JavaScript from [https://web.archive.org/web/20170131094439js_/https://standards.nasa.gov/sites/all/modules/extlink/extlink.js?o36afs].
java.net.SocketTimeoutException: Read timed out
    at java.net.SocketInputStream.socketRead0(Native Method)
    at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
    at java.net.SocketInputStream.read(SocketInputStream.java:171)
    at java.net.SocketInputStream.read(SocketInputStream.java:141)
    at sun.security.ssl.InputRecord.readFully(InputRecord.java:465)
    at sun.security.ssl.InputRecord.read(InputRecord.java:503)
    at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:973)
    at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:930)
    at sun.security.ssl.AppInputStream.read(AppInputStream.java:105)
    at org.apache.http.impl.io.SessionInputBufferImpl.streamRead(SessionInputBufferImpl.java:139)
    at org.apache.http.impl.io.SessionInputBufferImpl.fillBuffer(SessionInputBufferImpl.java:155)
    at org.apache.http.impl.io.SessionInputBufferImpl.readLine(SessionInputBufferImpl.java:284)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:138)
    at org.apache.http.impl.conn.DefaultHttpResponseParser.parseHead(DefaultHttpResponseParser.java:56)
    at org.apache.http.impl.io.AbstractMessageParser.parse(AbstractMessageParser.java:261)
    at org.apache.http.impl.DefaultBHttpClientConnection.receiveResponseHeader(DefaultBHttpClientConnection.java:165)
    at org.apache.http.impl.conn.CPoolProxy.receiveResponseHeader(CPoolProxy.java:165)
    at org.apache.http.protocol.HttpRequestExecutor.doReceiveResponse(HttpRequestExecutor.java:272)
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:124)
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272)
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185)
    at org.apache.http.impl.execchain.RetryExec.execute(RetryExec.java:89)
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:111)
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185)
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:72)
    at com.gargoylesoftware.htmlunit.HttpWebConnection.getResponse(HttpWebConnection.java:194)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseFromWebConnection(WebClient.java:1372)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseFromWebConnection(WebClient.java:1422)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponse(WebClient.java:1291)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadJavaScriptFromUrl(HtmlPage.java:1014)
    at com.gargoylesoftware.htmlunit.html.HtmlPage.loadExternalJavaScriptFile(HtmlPage.java:965)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.executeScriptIfNeeded(HtmlScript.java:352)
    at com.gargoylesoftware.htmlunit.html.HtmlScript$2.execute(HtmlScript.java:239)
    at com.gargoylesoftware.htmlunit.html.HtmlScript.onAllChildrenAddedToPage(HtmlScript.java:258)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:781)
    at org.apache.xerces.parsers.AbstractSAXParser.endElement(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.endElement(HTMLParser.java:738)
    at net.sourceforge.htmlunit.cyberneko.HTMLTagBalancer.callEndElement(HTMLTagBalancer.java:1243)
    at net.sourceforge.htmlunit.cyberneko.HTMLTagBalancer.endElement(HTMLTagBalancer.java:1143)
    at net.sourceforge.htmlunit.cyberneko.filters.DefaultFilter.endElement(DefaultFilter.java:226)
    at net.sourceforge.htmlunit.cyberneko.filters.NamespaceBinder.endElement(NamespaceBinder.java:345)
    at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scanEndElement(HTMLScanner.java:3154)
    at net.sourceforge.htmlunit.cyberneko.HTMLScanner$ContentScanner.scan(HTMLScanner.java:2117)
    at net.sourceforge.htmlunit.cyberneko.HTMLScanner.scanDocument(HTMLScanner.java:945)
    at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:521)
    at net.sourceforge.htmlunit.cyberneko.HTMLConfiguration.parse(HTMLConfiguration.java:472)
    at org.apache.xerces.parsers.XMLParser.parse(Unknown Source)
    at com.gargoylesoftware.htmlunit.html.HTMLParser$HtmlUnitDOMBuilder.parse(HTMLParser.java:988)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parse(HTMLParser.java:246)
    at com.gargoylesoftware.htmlunit.html.HTMLParser.parseHtml(HTMLParser.java:188)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createHtmlPage(DefaultPageCreator.java:272)
    at com.gargoylesoftware.htmlunit.DefaultPageCreator.createPage(DefaultPageCreator.java:160)
    at com.gargoylesoftware.htmlunit.WebClient.loadWebResponseInto(WebClient.java:520)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:394)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:311)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:459)
    at com.gargoylesoftware.htmlunit.WebClient.getPage(WebClient.java:444)
    at edu.usc.irds.sparkler.plugin.HtmlUnitFetcher.fetch(HtmlUnitFetcher.java:97)
    at edu.usc.irds.sparkler.util.FetcherDefault.apply(FetcherDefault.java:145)
    at edu.usc.irds.sparkler.util.FetcherDefault.apply(FetcherDefault.java:38)
    at edu.usc.irds.sparkler.util.StreamTransformer.next(StreamTransformer.java:50)
    at scala.collection.convert.Wrappers$JIteratorWrapper.next(Wrappers.scala:43)
    at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:53)
    at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
    at scala.collection.Iterator$$anon$12.next(Iterator.scala:444)
    at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
    at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
    at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
    at org.apache.spark.scheduler.Task.run(Task.scala:89)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
2017-06-22 09:43:01 WARN  HtmlScript:425 [Executor task launch worker-0] - Script is not JavaScript (type: javascript, language: ). Skipping execution.
2017-06-22 09:43:01 ERROR StrictErrorReporter:82 [Executor task launch worker-0] - runtimeError: message=[An invalid or illegal selector was specified (selector: '#views_slideshow_cycle_main_homepageslideshow-block :first' error: Invalid selector: #views_slideshow_cycle_main_homepageslideshow-block :first).] sourceName=[https://web.archive.org/web/20170131094439js_/https://standards.nasa.gov/misc/jquery.js?v=1.4.4] line=[101] lineSource=[null] lineOffset=[0]
2017-06-22 09:43:07 INFO  ParseFunction$:49 [Executor task launch worker-0] - PARSING  https://web.archive.org/web/20170131094439/https://standards.nasa.gov/
2017-06-22 09:43:08 INFO  HtmlUnitFetcher:89 [Executor task launch worker-0] - HtmlUnit FETCHER https://web.archive.org/web/20161128203508/https://standards.nasa.gov/nasa-technical-standards
2017-06-22 09:43:15 ERROR StrictErrorReporter:82 [Executor task launch worker-0] - runtimeError: message=[An invalid or illegal selector was specified (selector: 'a:not(.ext, .mailto), area:not(.ext, .mailto)' error: Invalid selectors: a:not(.ext, .mailto), area:not(.ext, .mailto)).] sourceName=[https://web.archive.org/web/20161128203508js_/https://standards.nasa.gov/misc/jquery.js?v=1.4.4] line=[101] lineSource=[null] lineOffset=[0]
2017-06-22 09:43:15 ERROR StrictErrorReporter:82 [Executor task launch worker-0] - runtimeError: message=[An invalid or illegal selector was specified (selector: 'area:not(.ext, .mailto)' error: Invalid selectors: area:not(.ext, .mailto)).] sourceName=[https://web.archive.org/web/20161128203508js_/https://standards.nasa.gov/misc/jquery.js?v=1.4.4] line=[101] lineSource=[null] lineOffset=[0]
2017-06-22 09:43:15 INFO  ParseFunction$:49 [Executor task launch worker-0] - PARSING  https://web.archive.org/web/20161128203508/https://standards.nasa.gov/nasa-technical-standards
2017-06-22 09:43:16 INFO  HtmlUnitFetcher:89 [Executor task launch worker-0] - HtmlUnit FETCHER https://web.archive.org/web/20161202105123/https://standards.nasa.gov/standard/nasa/nasa-std-3001-vol-1
2017-06-22 09:43:25 WARN  HtmlScript:425 [Executor task launch worker-0] - Script is not JavaScript (type: javascript, language: ). Skipping execution.
2017-06-22 09:43:25 ERROR StrictErrorReporter:82 [Executor task launch worker-0] - runtimeError: message=[An invalid or illegal selector was specified (selector: ':input' error: Invalid selector: :input).] sourceName=[https://web.archive.org/web/20161202105123js_/https://standards.nasa.gov/misc/jquery.js?v=1.4.4] line=[101] lineSource=[null] lineOffset=[0]
2017-06-22 09:43:25 ERROR StrictErrorReporter:82 [Executor task launch worker-0] - runtimeError: message=[An invalid or illegal selector was specified (selector: 'a:not(.ext, .mailto), area:not(.ext, .mailto)' error: Invalid selectors: a:not(.ext, .mailto), area:not(.ext, .mailto)).] sourceName=[https://web.archive.org/web/20161202105123js_/https://standards.nasa.gov/misc/jquery.js?v=1.4.4] line=[101] lineSource=[null] lineOffset=[0]
2017-06-22 09:43:25 ERROR StrictErrorReporter:82 [Executor task launch worker-0] - runtimeError: message=[An invalid or illegal selector was specified (selector: 'area:not(.ext, .mailto)' error: Invalid selectors: area:not(.ext, .mailto)).] sourceName=[https://web.archive.org/web/20161202105123js_/https://standards.nasa.gov/misc/jquery.js?v=1.4.4] line=[101] lineSource=[null] lineOffset=[0]
2017-06-22 09:43:27 INFO  ParseFunction$:49 [Executor task launch worker-0] - PARSING  https://web.archive.org/web/20161202105123/https://standards.nasa.gov/standard/nasa/nasa-std-3001-vol-1
2017-06-22 09:43:27 WARN  ClosureCleaner:70 [main] - Expected a closure; got edu.usc.irds.sparkler.solr.SolrUpsert
2017-06-22 09:43:27 INFO  SolrUpsert$:51 [Executor task launch worker-0] - Inserting new resources to Solr 
2017-06-22 09:43:27 INFO  Crawler$:193 [main] - Storing output at sjob-1498149633557/20170622094242
2017-06-22 09:43:27 INFO  Crawler$:167 [main] - Committing crawldb..
2017-06-22 09:43:28 INFO  Crawler$:172 [main] - Shutting down Spark CTX..
2017-06-22 09:43:29 INFO  PluginService$:119 [Felix-sjob-1498149633557] - Going to stop Services...
2017-06-22 09:43:29 INFO  RegexURLFilterActivator:48 [FelixStartLevel] - Stopping RegexURL Plugin
2017-06-22 09:43:29 INFO  HtmlUnitJsFetcherActivator:51 [FelixStartLevel] - Stopping HtmlUnitFetcher Plugin
2017-06-22 09:43:29 INFO  HtmlUnitFetcher:144 [FelixStartLevel] - Closing the JS browser

Since I am new to the concept, I'm having trouble understanding the errors and what I should fix or do next. Any help or advice would be appreciated. Thank you!

thammegowda commented 7 years ago

@amensiko It looks like HtmlUnit is running into timeout issues. Could you please do the following:

  1. Comment out the line at https://github.com/USCDataScience/sparkler/blob/master/conf/sparkler-default.yaml#L90
  2. Rebuild with mvn clean package
  3. Test again
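
The three steps above can be sketched as a small shell helper. This is only an illustration under assumptions: the function name comment_out_plugin is hypothetical, and it assumes the line in question is a YAML list entry of the form "- fetcher.htmlunit" (the bundle name that appears in the log above). Check the actual line in your copy of conf/sparkler-default.yaml before relying on it.

```shell
#!/bin/sh
# Hypothetical helper (not part of Sparkler): comment out a YAML list entry
# that names a plugin, e.g. "  - fetcher.htmlunit" -> "  # - fetcher.htmlunit".
# Assumption: the plugin is listed as a plain "- name" entry on its own line.
comment_out_plugin() {
    conf_file="$1"
    plugin="$2"
    # Escape dots so they match literally in the regular expression.
    pattern=$(printf '%s' "$plugin" | sed 's/\./\\./g')
    # Keep the original indentation and prefix the entry with "# ".
    sed "s/^\([[:space:]]*\)-[[:space:]]*${pattern}[[:space:]]*\$/\1# - ${plugin}/" \
        "$conf_file" > "$conf_file.tmp" &&
        mv "$conf_file.tmp" "$conf_file"
}

# Possible usage, followed by the rebuild and the same commands as in the report:
#   comment_out_plugin conf/sparkler-default.yaml fetcher.htmlunit
#   mvn clean package
#   ./sparkler.sh inject -sf seeds.txt
#   ./sparkler.sh crawl -id <jobId printed by inject> -m local[*] -i 1
```

Editing the file by hand works just as well; the helper only makes the change repeatable. Entries that are already commented out are left untouched because the pattern requires the line to start with the list dash.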

amensiko commented 7 years ago

@thammegowda yes, that works, thank you!

thammegowda commented 6 years ago

That plugin is now disabled by default: https://github.com/USCDataScience/sparkler/commit/701f61d43b0a168bbe0cf8439084bb991a3b7154#diff-5af38dcbeba9ee44ecb0a9089219b4cbR89

This is the best we can do for now, so I am closing this issue.