ddaedalus / tres

Official code implementation of "Tree-based Focused Web Crawling with Reinforcement Learning" and the TRES framework

Tree-based Focused Web Crawling with Reinforcement Learning

Abstract

A focused crawler aims at discovering as many web pages relevant to a target topic as possible, while avoiding irrelevant ones; i.e., maximizing the harvest rate. Reinforcement Learning (RL) has been utilized to optimize the crawling process, yet it deals with huge state and action spaces, which can constitute a serious challenge. In this paper, we propose Tree REinforcement Spider (TRES), an end-to-end RL-empowered framework for focused crawling. Unlike other crawling approaches, we properly model a crawling environment as a Markov Decision Process, by representing the state as a subgraph of the Web and actions as its expansion edges. Exploiting a few initial keywords, which are related to the target topic, TRES adopts a keyword expansion strategy based on the cosine similarity of keyword word2vec embeddings. To learn a reward function, we propose a deep neural network, called KwBiLSTM, leveraging the keywords discovered in the expansion stage. To reduce the time complexity of selecting the best action, we propose Tree-Frontier, a two-fold decision tree, which also speeds up training by discretizing the state and action spaces. Experimentally, we show that TRES outperforms state-of-the-art methods in terms of harvest rate by at least 58%, while it has competitive results in the domain maximization setting, i.e., the task of maximizing the number of different fetched web sites.
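
The two evaluation quantities named in the abstract can be illustrated with a small sketch. It assumes the conventional definition of harvest rate (relevant fetched pages divided by all fetched pages) and, as a simplification, identifies a "web site" with the URL host name. The helper names and the example URLs are hypothetical and are not taken from this repository.

```python
from urllib.parse import urlparse


def harvest_rate(relevance_flags):
    """Fraction of fetched pages that were relevant (0.0 if nothing was fetched)."""
    flags = list(relevance_flags)
    if not flags:
        return 0.0
    return sum(1 for flag in flags if flag) / len(flags)


def count_distinct_sites(urls):
    """Number of different host names among the fetched URLs.

    Assumption: one host name counts as one web site.
    """
    hosts = {urlparse(url).netloc.lower() for url in urls}
    hosts.discard("")  # ignore strings that have no host part
    return len(hosts)


# Hypothetical crawl log: (fetched URL, was the page relevant?)
fetched = [
    ("https://example.org/a", True),
    ("https://example.org/b", False),
    ("https://example.com/c", True),
    ("https://example.net/d", True),
]

rate = harvest_rate(is_relevant for _, is_relevant in fetched)
sites = count_distinct_sites(url for url, _ in fetched)
```

Here three of the four fetched pages are relevant and they come from three different hosts, so `rate` is 0.75 and `sites` is 3.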

A preprint of the paper is available on arXiv (arXiv:2112.07620).
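
The keyword expansion idea mentioned in the abstract (cosine similarity between word2vec embeddings) can be sketched as follows. This is only an illustration under stated assumptions: the toy three-dimensional vectors stand in for real word2vec embeddings, and the acceptance rule (similarity to any seed keyword at or above a fixed threshold) as well as the names `expand_keywords` and `threshold` are hypothetical. The criterion actually used by TRES is described in the paper and implemented in `keyword_extract.py`.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def expand_keywords(seed_keywords, embeddings, threshold=0.8):
    """Return candidate words that are close to at least one seed keyword.

    Assumption: a candidate is accepted when its best cosine similarity
    to any seed keyword reaches `threshold`.
    """
    seeds = [word for word in seed_keywords if word in embeddings]
    expanded = []
    for word, vector in embeddings.items():
        if word in seed_keywords:
            continue  # seeds are already keywords
        best = max(
            (cosine_similarity(vector, embeddings[seed]) for seed in seeds),
            default=0.0,
        )
        if best >= threshold:
            expanded.append(word)
    return sorted(expanded)


# Toy vectors standing in for real word2vec embeddings.
toy_embeddings = {
    "sports": [1.0, 0.0, 0.0],
    "football": [0.9, 0.1, 0.0],
    "cooking": [0.0, 1.0, 0.0],
}

new_keywords = expand_keywords(["sports"], toy_embeddings, threshold=0.8)
```

With these toy vectors, "football" points in nearly the same direction as the seed "sports" and is accepted, while "cooking" is orthogonal to it and is rejected.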


Results

[Figure: Results_of_Hardware]

Run locally

  1. Clone the repo (or download and extract the .zip file), change into the repository directory, and install the requirements.

    git clone https://github.com/ddaedalus/tres.git
    cd tres

    pip install -r requirements.txt
  2. Download the "files" directory from Google Drive.

  3. Create 2 files in the "files" directory: (a) seeds.txt, containing the seed URLs from which the crawler starts, and (b) data.txt, containing relevant URLs (roughly 800-1000 at most) for training KwBiLSTM.

  4. Modify the ./configuration/config.py file with your preferences.

  5. Insert your keywords (and keyphrases, if any) into ./configuration/taxonomy.py.

  6. Run this command to extract new keywords.

    python3 keyword_extract.py
  7. Train the KwBiLSTM.

    python3 run_classification.py
  8. Start your focused crawling.

    python3 run_crawling.py
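
The last three steps above can also be chained so that a failure in one stage stops the later ones. The following is a minimal sketch, not part of this repository: the function `run_pipeline` and its `runner` parameter are hypothetical, and only the three script names come from the steps above. It assumes it is called from the repository root after the configuration steps have been completed.

```python
import subprocess
import sys

# Script names taken from the steps above, in the order they must run.
PIPELINE = ["keyword_extract.py", "run_classification.py", "run_crawling.py"]


def run_pipeline(scripts=PIPELINE, runner=subprocess.run, python=sys.executable):
    """Run each script in order and stop at the first non-zero exit code.

    `runner` is injectable so the control flow can be exercised without
    launching the real scripts. Returns the scripts that completed.
    """
    completed = []
    for script in scripts:
        result = runner([python, script])
        if result.returncode != 0:
            print(f"{script} exited with code {result.returncode}; stopping.")
            break
        completed.append(script)
    return completed
```

Calling `run_pipeline()` runs keyword extraction, KwBiLSTM training, and crawling in sequence with the current Python interpreter, and returns the list of stages that finished successfully.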

Citation

If you find this work helpful in your research, please cite:

@article{Kontogiannis2021TreebasedFW,
  title={Tree-based Focused Web Crawling with Reinforcement Learning},
  author={A. Kontogiannis and Dimitrios Kelesis and Vasilis Pollatos and Georgios Paliouras and George Giannakopoulos},
  journal={ArXiv},
  year={2021},
  volume={abs/2112.07620}
}