The Problem:
The PathFinder/PathTracker components, which build the navigation "path" across web links from page to page starting from the "root site URL" (rootPath), have two issues:
1) Redundant path entries are sometimes formed (which causes over-consumption of memory and CPU cycles)
2) Empty path entries are sometimes formed (which causes exceptions like the following):
Fri Jun 05 13:47:30 UTC 2020:Site crawling failed unknown https://blog.wechat.com/category/news/ java.lang.ArrayIndexOutOfBoundsException: 0,:0
java.lang.ArrayIndexOutOfBoundsException: 0
at net.webstructor.al.Set.get(Set.java:35)
at net.webstructor.self.PathTracker.run(PathTracker.java:136)
at net.webstructor.self.PathTracker.run(PathTracker.java:110)
at net.webstructor.self.PathTracker.run(PathTracker.java:96)
at net.webstructor.self.PathTracker.run(PathTracker.java:58)
at net.webstructor.self.WebCrawler.crawl(WebCrawler.java:66)
at net.webstructor.self.Siter.read(Siter.java:171)
at net.webstructor.self.Spider$1.call(Spider.java:191)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
We need to solve both.
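One possible direction for both issues is to sanitize the collected paths before they are stored or tracked. The following is only a minimal sketch under stated assumptions: it models a path as a List<String> of link steps, which is NOT the actual net.webstructor.al.Set type used by PathTracker, and the PathSanitizer class and its sanitize method are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper, not part of the existing code base.
// Assumption: a "path" is an ordered list of link steps (strings).
final class PathSanitizer {
    private PathSanitizer() {}

    // Returns the paths in their original order, with empty entries
    // and exact duplicates removed.
    static List<List<String>> sanitize(List<List<String>> paths) {
        LinkedHashSet<List<String>> unique = new LinkedHashSet<>();
        if (paths == null) {
            return new ArrayList<>();
        }
        for (List<String> path : paths) {
            // Issue 2: skip empty entries so that later code never
            // reads element 0 of an empty path.
            if (path == null || path.isEmpty()) {
                continue;
            }
            // Issue 1: List.equals/hashCode compare by content, so the
            // set keeps only the first copy of each identical path.
            unique.add(new ArrayList<>(path));
        }
        return new ArrayList<>(unique);
    }
}
```

This only removes exact duplicates; whether two different paths leading to the same page should also count as redundant is a design decision left open here.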
Extra:
In addition, for each of the "sites" configured for crawling, we may have a "crawl mode" option (SMART|FIND|TRACK). When it is set to something other than the default "SMART", the "path" is either never modified and always re-used as configured manually ("TRACK" mode), or never used, so that an exhaustive crawl is performed every time ("FIND" mode).
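The three modes could be modeled roughly as below. This is a sketch, not the existing implementation: the CrawlMode enum, its parse method, and the two predicate methods are hypothetical names. It also assumes that the default "SMART" mode both uses and updates the stored path, and that a missing or unrecognized setting falls back to "SMART".

```java
import java.util.Locale;

// Hypothetical enum for the per-site "crawl mode" option.
enum CrawlMode {
    SMART, // default; assumed to use the stored path and update it
    FIND,  // never use the stored path, always crawl exhaustively
    TRACK; // always re-use the manually configured path, never modify it

    // Assumption: missing or unrecognized values fall back to SMART.
    static CrawlMode parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return SMART;
        }
        try {
            return valueOf(value.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return SMART;
        }
    }

    // True when the crawler should follow the stored path.
    boolean usesStoredPath() {
        return this != FIND;
    }

    // True when the stored path must not be changed by the crawler.
    boolean pathIsReadOnly() {
        return this == TRACK;
    }
}
```

For example, CrawlMode.parse("track").pathIsReadOnly() is true, while CrawlMode.parse(null) yields SMART.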