VIDA-NYU / ache

ACHE is a web crawler for domain-specific search.
http://ache.readthedocs.io
Apache License 2.0

Crawler does not crawl all links of paginated forum list #297

Closed by JuliusHenke 2 years ago

JuliusHenke commented 2 years ago

Hi, I am trying to crawl a forum that uses paginated result lists. When I crawl it with a whitelist regex filter, the crawler does not seem to crawl all relevant pages. I am unsure whether the links are not being recognized or are wrongly left unscheduled. I believe my regex is correct. The crawler is configured to use Tor and is started with Docker Compose.

ache.yml config

target_storage.data_formats:
  - FILES

target_storage.data_format.elasticsearch.rest.hosts:
  - http://elasticsearch:9200

target_storage.data_format.filesystem.compress_data: false
target_storage.hard_focus: false

# Basic configuration in-depth website crawling
link_storage.link_strategy.use_scope: true
link_storage.link_strategy.outlinks: true
link_storage.link_selector: NonRandomLinkSelector

link_storage.scheduler.host_min_access_interval: 1500
link_storage.scheduler.max_links: 1000000
link_storage.max_size_cache_urls: 100000

# Configure ACHE to download .onion URLs through the TOR proxy container
crawler_manager.downloader.torproxy: http://torproxy:8118
crawler_manager.downloader.tor.max_retry_count: 5
crawler_manager.downloader.tor.socket_timeout: 2000000
crawler_manager.downloader.tor.connection_timeout: 2000000
crawler_manager.downloader.tor.connection_request_timeout: 2000000

crawler_manager.downloader.user_agent.string: Mozilla/5.0 (Windows NT 10.0; rv:91.0) Gecko/20100101 Firefox/91.0

crawler_manager.downloader.valid_mime_types:
  - text/html

tor.seeds

https://community.mybb.com/forum-176.html

link_filters.yml

global:
  type: regex
  whitelist:
    - https://community\.mybb\.com/forum-176.*\.html
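A quick way to rule out the whitelist pattern itself is to test it against a few URLs outside the crawler. The following is a minimal Python sketch, not ACHE code: ACHE is written in Java, and the assumption here is that the pattern has to match the entire URL, which `re.fullmatch` approximates. The last two URLs are hypothetical and only there to show what the pattern accepts and rejects.

```python
import re

# Whitelist pattern copied verbatim from the link filter above.
WHITELIST = r"https://community\.mybb\.com/forum-176.*\.html"


def is_whitelisted(url):
    # Assumption: the filter requires the pattern to match the whole URL.
    return re.fullmatch(WHITELIST, url) is not None


urls = [
    "https://community.mybb.com/forum-176.html",
    "https://community.mybb.com/forum-176-page-8.html",
    "https://community.mybb.com/forum-176-page-979.html",
    "https://community.mybb.com/forum-1760.html",  # hypothetical: other forum id
    "https://community.mybb.com/thread-1.html",    # hypothetical: thread page
]

results = {url: is_whitelisted(url) for url in urls}
for url, ok in results.items():
    print(ok, url)
```

Under that assumption the pattern accepts every `forum-176-page-N.html` URL, so the filter alone should not drop pages 8 to 979. It is slightly broader than intended, though: `forum-176.*` would also accept a URL such as `forum-1760.html`.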

Crawled pages, as listed in "/default/data_monitor/crawledpages.csv"

https://community.mybb.com/forum-176.html
https://community.mybb.com/forum-176-page-2.html
https://community.mybb.com/forum-176-page-3.html
https://community.mybb.com/forum-176-page-4.html
https://community.mybb.com/forum-176-page-5.html
https://community.mybb.com/forum-176-page-6.html
https://community.mybb.com/forum-176-page-7.html
https://community.mybb.com/forum-176-page-984.html
https://community.mybb.com/forum-176-page-980.html
https://community.mybb.com/forum-176-page-981.html
https://community.mybb.com/forum-176-page-982.html
https://community.mybb.com/forum-176-page-983.html

The result is missing all pages between 7 and 980. Do you have any idea whether I have misconfigured something? If this is a bug and you can point me in the right direction, I am also willing to fix it. What is the best setup for debugging something like this with the ACHE source code?
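The gap can be measured directly from the list of crawled URLs. Here is a small Python sketch under two assumptions: the unnumbered `forum-176.html` counts as page 1, and the highest page number seen (984) is the last page of the forum. The layout of `crawledpages.csv` is not shown in this report, so the sketch works on a plain list of URLs rather than parsing that file.

```python
import re

# URLs copied from the crawl result above.
crawled = [
    "https://community.mybb.com/forum-176.html",
    "https://community.mybb.com/forum-176-page-2.html",
    "https://community.mybb.com/forum-176-page-3.html",
    "https://community.mybb.com/forum-176-page-4.html",
    "https://community.mybb.com/forum-176-page-5.html",
    "https://community.mybb.com/forum-176-page-6.html",
    "https://community.mybb.com/forum-176-page-7.html",
    "https://community.mybb.com/forum-176-page-984.html",
    "https://community.mybb.com/forum-176-page-980.html",
    "https://community.mybb.com/forum-176-page-981.html",
    "https://community.mybb.com/forum-176-page-982.html",
    "https://community.mybb.com/forum-176-page-983.html",
]

PAGE_RE = re.compile(r"/forum-176(?:-page-(\d+))?\.html$")


def page_number(url):
    """Return the page number encoded in a forum URL, or None if it is not one."""
    match = PAGE_RE.search(url)
    if match is None:
        return None
    # Assumption: the URL without a "-page-N" suffix is page 1.
    return int(match.group(1)) if match.group(1) else 1


seen = {n for n in map(page_number, crawled) if n is not None}
last_page = max(seen)  # assumption: the highest crawled page is the last one
missing = sorted(set(range(1, last_page + 1)) - seen)

print(len(missing), missing[0], missing[-1])  # prints "972 8 979"
```

Rerunning the same check after each configuration change shows at a glance whether the gap has closed.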

Crawler log

-------------------
ACHE Crawler 0.14.0
-------------------

[2022-08-20 21:11:52,839] INFO [main] (LinkFilter.java:112) - Loading link patterns from link_filters.yml file at /config/
[2022-08-20 21:11:52,847] INFO [main] (LinkFilter.java:180) - Loading link filter patterns for top-private domains:
[2022-08-20 21:11:52,872] INFO [main] (LinkFilter.java:184) - global
[2022-08-20 21:11:52,873] INFO [main] (FrontierManagerFactory.java:36) - LINK_SELECTOR: achecrawler.link.frontier.selector.NonRandomLinkSelector
[2022-08-20 21:11:52,878] INFO [main] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:11:52,879] INFO [main] (CrawlScheduler.java:177) - Loaded 0 links.
[2022-08-20 21:11:53,687] INFO [main] (FrontierManager.java:236) - Adding 1 seed URL(s)...
[2022-08-20 21:11:53,698] INFO [main] (FrontierManager.java:248) - Added seed URL: https://community.mybb.com/forum-176.html
WARNING: An illegal reflective access operation has occurred
WARNING: Illegal reflective access by com.esotericsoftware.kryo.util.UnsafeUtil (file:/ache/lib/kryo-4.0.2.jar) to constructor java.nio.DirectByteBuffer(long,int,java.lang.Object)
WARNING: Please consider reporting this to the maintainers of com.esotericsoftware.kryo.util.UnsafeUtil
WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
WARNING: All illegal access operations will be denied in a future release
[2022-08-20 21:11:53,729] INFO [main] (FrontierManager.java:256) - Number of seeds added: 1
[2022-08-20 21:11:53,729] INFO [main] (FrontierManager.java:260) - Using scope of following domains:
[2022-08-20 21:11:53,729] INFO [main] (FrontierManager.java:262) - community.mybb.com
[2022-08-20 21:11:53,730] INFO [main] (TargetRepositoryFactory.java:57) - Loading repository with data_format=FILES from /data/default/data_pages
[2022-08-20 21:11:54,357] INFO [main] (Log.java:170) - Logging initialized @2448ms to org.eclipse.jetty.util.log.Slf4jLog
[2022-08-20 21:11:54,369] INFO [main] (JavalinLogger.kt:22) - Static file handler added: StaticFileConfig(hostedPath=/, directory=/public, location=CLASSPATH, precompress=false, aliasCheck=null, headers={Cache-Control=max-age=0}, skipFileFunction=Function1<javax.servlet.http.HttpServletRequest, java.lang.Boolean>). File system location: 'jar:file:/ache/lib/ache-0.14.0.jar!/public'
[2022-08-20 21:11:54,502] INFO [main] (RestServer.java:137) - ---------------------------------------------
[2022-08-20 21:11:54,502] INFO [main] (RestServer.java:138) - ACHE server available at http://0.0.0.0:8080
[2022-08-20 21:11:54,504] INFO [main] (RestServer.java:139) - ---------------------------------------------
[2022-08-20 21:11:54,508] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:11:54,508] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:11:54,510] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:11:54,523] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 1 links.
[2022-08-20 21:11:54,523] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
[2022-08-20 21:11:55,516] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:11:55,516] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:11:55,516] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:11:55,521] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 0 links.
[2022-08-20 21:11:55,522] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
log4j:WARN No appenders could be found for logger (org.apache.http.impl.conn.PoolingHttpClientConnectionManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
[2022-08-20 21:11:56,520] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:11:56,520] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:11:56,520] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:11:56,543] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 1 links.
[2022-08-20 21:11:56,543] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
[2022-08-20 21:11:57,543] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:11:57,543] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:11:57,544] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:11:57,553] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 1 links.
[2022-08-20 21:11:57,553] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
[2022-08-20 21:11:58,549] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:11:58,550] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:11:58,551] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:11:58,572] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 1 links.
[2022-08-20 21:11:58,572] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
[2022-08-20 21:11:59,560] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:11:59,562] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:11:59,568] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:11:59,579] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 1 links.
[2022-08-20 21:11:59,579] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
[2022-08-20 21:12:00,573] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:12:00,573] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:12:00,574] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:12:00,578] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 1 links.
[2022-08-20 21:12:00,578] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
[2022-08-20 21:12:01,587] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:12:01,587] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:12:01,588] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:12:01,598] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 1 links.
[2022-08-20 21:12:01,599] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
[2022-08-20 21:12:02,590] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:12:02,590] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:12:02,590] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:12:02,597] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 1 links.
[2022-08-20 21:12:02,597] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
[2022-08-20 21:12:03,606] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:12:03,606] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:12:03,613] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:12:03,621] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 1 links.
[2022-08-20 21:12:03,621] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
[2022-08-20 21:12:04,608] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:12:04,608] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:12:04,608] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:12:04,611] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 1 links.
[2022-08-20 21:12:04,612] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
[2022-08-20 21:12:05,611] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:12:05,611] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:12:05,612] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:12:05,618] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 1 links.
[2022-08-20 21:12:05,618] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
[2022-08-20 21:12:06,614] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:12:06,614] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:12:06,615] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:12:06,633] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 1 links.
[2022-08-20 21:12:06,634] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
[2022-08-20 21:12:07,632] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:12:07,632] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:12:07,634] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:12:07,649] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 1 links.
[2022-08-20 21:12:07,650] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
[2022-08-20 21:12:08,638] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:12:08,638] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:12:08,638] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:12:08,641] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 1 links.
[2022-08-20 21:12:08,641] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
[2022-08-20 21:12:09,661] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:12:09,661] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:12:09,660] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:12:09,673] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 0 links.
[2022-08-20 21:12:09,674] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
[2022-08-20 21:12:10,667] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:12:10,667] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:12:10,668] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:12:10,698] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 1 links.
[2022-08-20 21:12:10,699] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
[2022-08-20 21:12:11,684] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:12:11,688] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:12:11,691] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:12:11,701] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 1 links.
[2022-08-20 21:12:11,701] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
[2022-08-20 21:12:12,697] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:12:12,697] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:12:12,698] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:12:12,718] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 1 links.
[2022-08-20 21:12:12,719] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
[2022-08-20 21:12:13,717] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:12:13,718] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:12:13,724] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:12:13,730] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 1 links.
[2022-08-20 21:12:13,730] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
[2022-08-20 21:12:14,727] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:12:14,727] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:12:14,728] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:12:14,734] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 1 links.
[2022-08-20 21:12:14,734] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
[2022-08-20 21:12:15,741] INFO [FrontierLinkLoader] (CrawlScheduler.java:208) - Starting scheduler queues reload...
[2022-08-20 21:12:15,744] INFO [FrontierLinkLoader] (CrawlScheduler.java:92) - Loading more links from frontier into the scheduler...
[2022-08-20 21:12:15,747] INFO [AsyncCrawler] (AsyncCrawler.java:77) - Waiting for links from pages being downloaded...
[2022-08-20 21:12:15,758] INFO [FrontierLinkLoader] (CrawlScheduler.java:177) - Loaded 0 links.
[2022-08-20 21:12:15,758] INFO [FrontierLinkLoader] (CrawlScheduler.java:212) - Reload done.
[2022-08-20 21:12:16,750] INFO [AsyncCrawler] (AsyncCrawler.java:85) - LinkStorage ran out of links, stopping crawler.
[2022-08-20 21:12:16,752] INFO [AsyncCrawler] (AsyncCrawler.java:96) - Starting crawler shutdown...
[2022-08-20 21:12:16,752] INFO [AsyncCrawler] (HttpDownloader.java:214) - Waiting downloads be finalized...
[2022-08-20 21:12:16,757] INFO [AsyncCrawler] (LinkStorage.java:63) - Shutting down FrontierManager...
[2022-08-20 21:12:16,865] INFO [AsyncCrawler] (LinkStorage.java:65) - done.
[2022-08-20 21:12:16,889] INFO [AsyncCrawler] (AsyncCrawler.java:104) - Shutdown finished.
[2022-08-20 21:12:16,889] INFO [main] (JavalinLogger.kt:22) - Stopping Javalin ...
[2022-08-20 21:12:16,901] INFO [main] (JavalinLogger.kt:22) - Javalin has stopped

aecio commented 2 years ago

I think you probably should not be using the NonRandomLinkSelector. It is poorly documented and remains in the codebase only for legacy reasons. Can you try another selector, such as the default TopkLinkSelector, or the MaximizeWebsitesLinkSelector if you are crawling more than one website?

JuliusHenke commented 2 years ago

Thanks for the quick reply. I tried all of these link selectors. As expected, the order in which pages were crawled changed, but the total number of crawled pages did not. I also tried crawling without Tor.

aecio commented 2 years ago

What if you also set the link classifier to something like the following?

link_storage.link_classifier.type: MaxDepthLinkClassifier
link_storage.link_classifier.max_depth: 2147483647

The link classifier (documented here: https://ache.readthedocs.io/en/latest/crawling-strategies.html#link-classifiers) is the class that actually assigns the scores and defines the crawling order. The MaxDepthLinkClassifier will crawl pages in the same order that they are discovered, up to a maximum depth.
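To illustrate what "in the order discovered, up to a maximum depth" means, here is a toy breadth-first crawl in Python. It is a sketch, not ACHE's implementation: the `outlinks` function models a hypothetical pagination bar (two neighbours on each side plus the first and last page), and `crawl` is an illustrative helper. The value 2147483647 used in the config is 2**31 - 1, the largest 32-bit signed integer, so it effectively removes the depth limit.

```python
from collections import deque

LAST_PAGE = 984  # highest page number seen in the crawl results


def outlinks(page):
    # Hypothetical pagination bar: first page, two neighbours each side, last page.
    candidates = [1, page - 2, page - 1, page + 1, page + 2, LAST_PAGE]
    return [p for p in candidates if 1 <= p <= LAST_PAGE and p != page]


def crawl(seed, max_depth):
    """Visit pages breadth-first (discovery order), expanding links up to max_depth."""
    order = []
    depth = {seed: 0}
    queue = deque([seed])
    while queue:
        page = queue.popleft()
        order.append(page)
        if depth[page] == max_depth:
            continue  # pages at the depth limit are fetched but not expanded
        for nxt in outlinks(page):
            if nxt not in depth:
                depth[nxt] = depth[page] + 1
                queue.append(nxt)
    return order


shallow = crawl(1, max_depth=2)
full = crawl(1, max_depth=2**31 - 1)  # 2147483647, as in the config above

print(shallow)    # pages reached within two hops of the seed
print(len(full))  # every page is reachable without a depth limit
```

In this toy model a small depth limit reaches only the first few and last few pages, while an effectively unlimited depth reaches all 984. The model does not establish what caused the gap in the real crawl.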

Another option could be the LinkClassifierRandom.

link_storage.link_classifier.type: LinkClassifierRandom

JuliusHenke commented 2 years ago

Setting this actually did the trick and the crawler downloaded all expected pages:

link_storage.link_classifier.type: MaxDepthLinkClassifier
link_storage.link_classifier.max_depth: 2147483647

Thanks a lot!