Prevents the crawl from breaking on any uncaught exception raised in the URL filter(s) or in the isValid method of the UrlValidator class from Apache Commons.
One issue with isValid has been identified: if the length of the authority component of a URL is greater than 256, it throws java.lang.IndexOutOfBoundsException. The details are below:
java.lang.IndexOutOfBoundsException
at sun.net.idn.Punycode.encode(Punycode.java:188)
at java.net.IDN.toASCIIInternal(IDN.java:320)
at java.net.IDN.toASCII(IDN.java:122)
at java.net.IDN.toASCII(IDN.java:151)
at org.apache.commons.validator.routines.DomainValidator.unicodeToASCII(DomainValidator.java:1764)
at org.apache.commons.validator.routines.UrlValidator.isValidAuthority(UrlValidator.java:389)
at org.apache.commons.validator.routines.UrlValidator.isValid(UrlValidator.java:323)
at edu.usc.irds.sparkler.pipeline.OutLinkFilterFunction$$anonfun$apply$1.apply(OutLinkFilterFunction.scala:42)
at edu.usc.irds.sparkler.pipeline.OutLinkFilterFunction$$anonfun$apply$1.apply(OutLinkFilterFunction.scala:40)
at scala.collection.immutable.HashSet$HashSet1.foreach(HashSet.scala:322)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:978)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:978)
at scala.collection.immutable.HashSet$HashTrieSet.foreach(HashSet.scala:978)
at edu.usc.irds.sparkler.pipeline.OutLinkFilterFunction$.apply(OutLinkFilterFunction.scala:40)
at edu.usc.irds.sparkler.pipeline.OutLinkFilterFunction$.apply(OutLinkFilterFunction.scala:31)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:62)
at edu.usc.irds.sparkler.pipeline.FairFetcher.next(FairFetcher.scala:28)
at scala.collection.Iterator$$anon$12.next(Iterator.scala:397)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:285)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:227)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
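
The guarding idea can be sketched as follows. This is a minimal illustration, not the actual change to OutLinkFilterFunction: the class name `SafeUrlFilter` and the method `guarded` are hypothetical, and the check is modeled as a plain `Predicate<String>` standing in for a URL filter or a call to UrlValidator's isValid. A URL whose check throws is treated as invalid and skipped, so one bad outlink no longer fails the whole task.

```java
import java.util.function.Predicate;

// Hypothetical helper, for illustration only.
final class SafeUrlFilter {

    private SafeUrlFilter() {}

    /**
     * Wraps a URL check so that any runtime exception it throws
     * (for example IndexOutOfBoundsException) rejects the URL
     * instead of propagating to the caller.
     */
    static Predicate<String> guarded(Predicate<String> check) {
        return url -> {
            try {
                return check.test(url);
            } catch (RuntimeException e) {
                // Log and drop the URL; the crawl continues with the next outlink.
                System.err.println("Rejecting URL after "
                        + e.getClass().getName() + ": " + url);
                return false;
            }
        };
    }
}
```

With this wrapper, a call such as `SafeUrlFilter.guarded(validator::isValid).test(url)` would return false for a URL that triggers the exception above, assuming `validator` is a UrlValidator instance.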