aigents / aigents-java

Aigents Java Core Platform
MIT License

Web paths formation improvements #23

Open akolonin opened 4 years ago

akolonin commented 4 years ago

The Problem: The PathFinder/PathTracker components, which build the "path" used to navigate web links from page to page starting from the "root site URL" (rootPath), have two issues: 1) Redundant path entries are sometimes formed (which causes over-consumption of memory and CPU cycles). 2) Empty path entries are sometimes formed (which causes exceptions like the following):

Fri Jun 05 13:47:30 UTC 2020:Site crawling failed unknown https://blog.wechat.com/category/news/ java.lang.ArrayIndexOutOfBoundsException: 0,:0
java.lang.ArrayIndexOutOfBoundsException: 0
        at net.webstructor.al.Set.get(Set.java:35)
        at net.webstructor.self.PathTracker.run(PathTracker.java:136)
        at net.webstructor.self.PathTracker.run(PathTracker.java:110)
        at net.webstructor.self.PathTracker.run(PathTracker.java:96)
        at net.webstructor.self.PathTracker.run(PathTracker.java:58)
        at net.webstructor.self.WebCrawler.crawl(WebCrawler.java:66)
        at net.webstructor.self.Siter.read(Siter.java:171)
        at net.webstructor.self.Spider$1.call(Spider.java:191)
        at java.util.concurrent.FutureTask.run(FutureTask.java:266)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)

We need to solve both.
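
As a rough illustration of what solving both could look like, here is a minimal sketch that drops empty and duplicate path entries before they are stored or tracked. It is not the actual PathFinder/PathTracker code: the `PathCleaner` class, its `clean` method, and the representation of a path as a `List<String>` of link steps are all assumptions made for this example only.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical helper, not part of aigents-java.
// Assumes each candidate path is a list of link steps (strings).
final class PathCleaner {

    // Returns the candidate paths without null/empty entries (issue 2)
    // and without exact duplicates (issue 1), keeping first-seen order.
    static List<List<String>> clean(List<List<String>> paths) {
        LinkedHashSet<List<String>> unique = new LinkedHashSet<>();
        for (List<String> path : paths) {
            if (path == null || path.isEmpty()) {
                continue; // an empty entry would later fail on get(0)
            }
            // List equality is content-based, so repeated paths collapse into one.
            unique.add(new ArrayList<>(path));
        }
        return new ArrayList<>(unique);
    }
}
```

Filtering at the point where paths are formed, rather than guarding every later `get(0)` call, would address the wasted memory and the exception in one place, assuming the real data structures allow a comparable equality check.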

Extra: In addition, each of the "sites" configured for crawling may have a "crawl mode" option (SMART|FIND|TRACK) set to something other than the default "SMART". In "TRACK" mode the "path" cannot be modified and is always re-used as configured manually; in "FIND" mode the "path" is never used, so an exhaustive crawl is performed every time.
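
The three modes could be modelled roughly as follows. This is only a sketch under stated assumptions: the `CrawlMode` enum and its methods are hypothetical names, falling back to SMART for missing or unrecognized values is an assumption, and so is the reading that SMART is the only mode allowed to rewrite a stored path.

```java
import java.util.Locale;

// Hypothetical model of the per-site "crawl mode" option; names are illustrative.
enum CrawlMode {
    SMART, FIND, TRACK;

    // Parses a configured value; missing or unknown values fall back to
    // the default SMART (the fallback for unknown values is an assumption).
    static CrawlMode parse(String value) {
        if (value == null || value.trim().isEmpty()) {
            return SMART;
        }
        try {
            return valueOf(value.trim().toUpperCase(Locale.ROOT));
        } catch (IllegalArgumentException e) {
            return SMART;
        }
    }

    // FIND never uses a stored path and always crawls exhaustively.
    boolean usesStoredPath() {
        return this != FIND;
    }

    // TRACK re-uses the manually configured path unchanged; in this sketch
    // only SMART is allowed to modify a stored path (assumption).
    boolean mayUpdatePath() {
        return this == SMART;
    }
}
```

A crawler could then check `mode.usesStoredPath()` before tracking and `mode.mayUpdatePath()` before saving a newly found path.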

akolonin commented 4 years ago

1 & 2 assumed fixed, keep testing...