A focused crawler aims to discover as many web pages relevant to a target topic as possible while avoiding irrelevant ones, i.e., to maximize the harvest rate. Reinforcement Learning (RL) has been used to optimize the crawling process, yet it must deal with huge state and action spaces, which poses a serious challenge. In this paper, we propose Tree REinforcement Spider (TRES), an end-to-end RL-empowered framework for focused crawling. Unlike other crawling approaches, we properly model the crawling environment as a Markov Decision Process by representing the state as a subgraph of the Web and the actions as its expansion edges. Starting from a few initial keywords related to the target topic, TRES adopts a keyword expansion strategy based on the cosine similarity of word2vec keyword embeddings. To learn a reward function, we propose a deep neural network, called KwBiLSTM, that leverages the keywords discovered in the expansion stage. To reduce the time complexity of selecting the best action, we propose Tree-Frontier, a two-fold decision tree, which also speeds up training by discretizing the state and action spaces. Experimentally, we show that TRES outperforms state-of-the-art methods in terms of harvest rate by at least 58%, while achieving competitive results in the domain maximization setting, i.e., the task of maximizing the number of different fetched web sites.
A preprint of the paper is available on arXiv (arXiv:2112.07620).
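To make the keyword expansion idea from the abstract more concrete, here is a minimal sketch of expanding a seed keyword set by cosine similarity between embeddings. It is not the code in keyword_extract.py: the function names, the similarity threshold, and the tiny hand-made vectors (standing in for real word2vec embeddings) are all assumptions chosen for illustration.

```python
import numpy as np


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(u @ v / denom) if denom else 0.0


def expand_keywords(seed_keywords, embeddings, threshold=0.8):
    """Return (word, score) pairs for words close to at least one seed keyword.

    `embeddings` maps a word to its vector. The threshold is a hypothetical
    value for this sketch, not a setting taken from the TRES configuration.
    """
    seeds = [w for w in seed_keywords if w in embeddings]
    expanded = []
    for word, vec in embeddings.items():
        if word in seed_keywords:
            continue
        # Score a candidate by its best similarity to any seed keyword.
        best = max(
            (cosine_similarity(vec, embeddings[s]) for s in seeds),
            default=0.0,
        )
        if best >= threshold:
            expanded.append((word, best))
    return sorted(expanded, key=lambda pair: pair[1], reverse=True)


# Toy 3-dimensional vectors standing in for word2vec embeddings (assumption).
toy_embeddings = {
    "sports": [1.0, 0.0, 0.0],
    "football": [0.9, 0.1, 0.0],
    "tennis": [0.8, 0.2, 0.1],
    "recipe": [0.0, 1.0, 0.0],
}

new_keywords = expand_keywords(["sports"], toy_embeddings, threshold=0.8)
print([word for word, _ in new_keywords])
```

With these toy vectors, "football" and "tennis" pass the threshold and "recipe" does not. In practice the vectors would come from a trained word2vec model, and the threshold would need tuning per topic.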
git clone https://github.com/ddaedalus/tres.git
After cloning the repository (or extracting the .zip file), change into the project directory and install the requirements.
pip install -r requirements.txt
Download the "files" directory from Google Drive.
Create two files in the "files" directory: (a) seeds.txt, containing the seed URLs from which the crawler starts, and (b) data.txt, containing relevant URLs (around 800-1000 at most) for training KwBiLSTM.
Modify the ./configuration/config.py file with your preferences.
Insert your keywords (and your keyphrases, if any) in the ./configuration/taxonomy.py file.
Run this command to extract new keywords.
python3 keyword_extract.py
Train the KwBiLSTM.
python3 run_classification.py
Start your focused crawling.
python3 run_crawling.py
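The seeds.txt and data.txt files from the setup steps can be prepared with a small script such as the sketch below. The exact file format is not specified here, so this assumes one URL per line; check the repository code before relying on it. The helper name write_url_file and the example.com URLs are hypothetical placeholders.

```python
from pathlib import Path
from urllib.parse import urlparse


def write_url_file(path, urls):
    """Write unique, well-formed http(s) URLs to `path`, one per line.

    Returns the number of URLs written. The one-URL-per-line layout is an
    assumption of this sketch, not a documented TRES format.
    """
    seen = set()
    kept = []
    for url in urls:
        url = url.strip()
        parsed = urlparse(url)
        # Keep only absolute http(s) URLs and drop exact duplicates.
        if parsed.scheme in ("http", "https") and parsed.netloc and url not in seen:
            seen.add(url)
            kept.append(url)
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text("".join(u + "\n" for u in kept), encoding="utf-8")
    return len(kept)


if __name__ == "__main__":
    # Placeholder URLs; replace them with pages relevant to your topic.
    seeds = ["https://example.com/topic/start"]
    relevant = [
        "https://example.com/topic/page-1",
        "https://example.com/topic/page-2",
    ]
    print(write_url_file("files/seeds.txt", seeds), "seed URLs written")
    print(write_url_file("files/data.txt", relevant), "training URLs written")
```

Deduplicating and validating the URLs up front keeps malformed entries out of both the crawl frontier and the KwBiLSTM training data.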
If you find this work helpful in your research, cite:
@article{Kontogiannis2021TreebasedFW,
title={Tree-based Focused Web Crawling with Reinforcement Learning},
author={A. Kontogiannis and Dimitrios Kelesis and Vasilis Pollatos and Georgios Paliouras and George Giannakopoulos},
journal={ArXiv},
year={2021},
volume={abs/2112.07620}
}