datawizard1337 / ARGUS

GNU General Public License v3.0

ARGUS: Automated Robot for Generic Universal Scraping

ARGUS is an easy-to-use web scraping tool. The program is based on the Scrapy Python framework and is able to crawl a broad range of different websites. On these websites, ARGUS performs tasks like scraping texts or collecting hyperlinks between websites. See related paper: https://link.springer.com/article/10.1007/s11192-020-03726-9

Two scientific papers that use web data scraped with ARGUS:

  • "Predicting Innovative Firms using Web Mining and Deep Learning": http://ftp.zew.de/pub/zew-docs/dp/dp19001.pdf
  • "The Digital Layer: How innovative firms relate on the Web": http://ftp.zew.de/pub/zew-docs/dp/dp20003.pdf

Getting Started

These instructions will get you a copy of ARGUS up and running on your local machine.

Follow these easy steps, which are described in more detail below, to make a successful ARGUS scraping run:

  1. Install Python 3.6 or newer.
  2. Install additional Python packages (see Prerequisites below).
  3. Install cURL and add a cURL environment variable to your system (see below).
  4. Download and extract the ARGUS files.
  5. Start scraping via ARGUS.exe or the ARGUS_noGUI.py file.
  6. Check the scraping process using the web interface and wait until it is finished.
  7. Run postprocessing from ARGUS.exe.

Prerequisites

ARGUS works with Python 3.6 or higher, is based on the Scrapy framework and has the following Python package dependencies:

Installing scrapyd requires the C++ Build Tools to be installed first. Additionally, you need cURL to communicate with the ARGUS user interface. An executable 64-bit Windows version of cURL can be downloaded here, for example. After downloading and extracting it, you need to add a cURL environment variable to your system. See this Stack Overflow thread if you do not know how to do that.

Installing

If you are not using Python yet, the easiest way to install Python 3.6 and most of its crucial packages is to use the Anaconda Distribution. Make sure that a Python environment variable is set during the installation. See this Stack Overflow thread if you do not know how to do that. Do the same for the Anaconda script folder in "Anaconda3\Scripts". After installing Anaconda, you can use pip to install the packages above by typing “pip install package_name” (e.g., “pip install scrapy”) into your system command prompt.

Start Scraping

Warning: Please make sure that ARGUS and your URL file are not located in directories with paths containing ".", e.g. "C:\my.folder\data.txt"!
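The warning above can be checked before starting a run. The following is a minimal sketch, not part of ARGUS: the function name `has_dotted_directory` is hypothetical, and it assumes that only the directory components matter (the file name itself normally carries an extension such as ".txt").

```python
from pathlib import PureWindowsPath


def has_dotted_directory(path_str):
    """Return True if any folder in the path contains a dot.

    Hypothetical helper, not part of ARGUS. PureWindowsPath parses
    Windows-style paths on any operating system.
    """
    path = PureWindowsPath(path_str)
    # parent.parts holds the drive/root and every folder, but not the file name.
    folders = [part for part in path.parent.parts if part != path.anchor]
    return any("." in part for part in folders)


print(has_dotted_directory(r"C:\my.folder\data.txt"))  # True
print(has_dotted_directory(r"C:\my_folder\data.txt"))  # False
```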

Either use the ARGUS GUI to start your scraping run:

ARGUS_GUI

Alternatively, you can edit the parameters in the ARGUS_noGUI.py file and then start it with the shell command python ./ARGUS_noGUI.py 'c:/my_urls.txt'.


File Settings

All required parameters are marked with an asterisk (*) in the GUI. The other parameters are optional.

example url list

Web Scraper Settings

Advanced Settings

Start Scraping

Hit Start Scraping when all your settings are correct. This opens a separate Scrapy server that should not be closed during the scraping run.

Your list of URLs will be split into handy chunks, and a separate job will be started for each chunk to speed up the scraping process. After all jobs have been scheduled, the scrapyd web interface will open in your default web browser (you can also get there by typing “http://127.0.0.1:6800/” into your web browser).
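The chunk-and-schedule idea can be sketched as follows. This is an illustration under assumptions, not the ARGUS implementation: the project name, spider name, and the `url_chunk` spider argument are hypothetical. Only the `schedule.json` endpoint, which accepts `project`, `spider`, and extra spider arguments via POST, comes from scrapyd itself. The request is built but deliberately not sent.

```python
import urllib.parse
import urllib.request

SCRAPYD_URL = "http://127.0.0.1:6800"  # scrapyd address mentioned above


def split_into_chunks(urls, n_chunks):
    """Split urls into at most n_chunks lists of nearly equal size."""
    n_chunks = max(1, min(n_chunks, len(urls)))
    size, remainder = divmod(len(urls), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        # The first `remainder` chunks receive one extra URL.
        end = start + size + (1 if i < remainder else 0)
        chunks.append(urls[start:end])
        start = end
    return chunks


def build_schedule_request(project, spider, chunk_file):
    """Build (but do not send) a POST request for scrapyd's schedule.json."""
    data = urllib.parse.urlencode({
        "project": project,       # hypothetical project name
        "spider": spider,         # hypothetical spider name
        "url_chunk": chunk_file,  # hypothetical spider argument
    }).encode("utf-8")
    return urllib.request.Request(f"{SCRAPYD_URL}/schedule.json", data=data)


urls = ["a.com", "b.com", "c.com", "d.com", "e.com"]
for number, chunk in enumerate(split_into_chunks(urls, 2)):
    request = build_schedule_request("ARGUS", "textspider", f"chunk_{number}.csv")
    print(len(chunk), request.get_method(), request.full_url)
```

Passing a request to `urllib.request.urlopen` would actually schedule the job, provided a scrapyd server is running at that address.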

scrapyd server

Stopping jobs

Sometimes certain jobs stop working or never finish, so you may want to stop and restart them. This can be done by hitting Terminate Job. You will be asked for the ID of the job you want to cancel. The ID is a long hash value that can be found in the “Job” column of the “Jobs” section in the web interface.

You can stop all processes at once by clicking Stop Scraping. You will be asked whether you want to delete the data that has already been scraped. If you decide against deleting the scraped data, you may want to run Postprocessing to process your already scraped data (see below).
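Outside the GUI, jobs can also be inspected and cancelled through scrapyd's HTTP API (`listjobs.json` and `cancel.json`). The sketch below is an assumption about how one might script this, not what the Terminate Job button does internally. The project name "ARGUS" and the sample response are illustrative only. Requests are built but not sent.

```python
import urllib.parse
import urllib.request

SCRAPYD_URL = "http://127.0.0.1:6800"


def running_job_ids(listjobs_response):
    """Extract the IDs of running jobs from a parsed listjobs.json response."""
    return [job["id"] for job in listjobs_response.get("running", [])]


def build_cancel_request(project, job_id):
    """Build (but do not send) a POST request for scrapyd's cancel.json."""
    data = urllib.parse.urlencode({"project": project, "job": job_id}).encode("utf-8")
    return urllib.request.Request(f"{SCRAPYD_URL}/cancel.json", data=data)


# Illustrative response in the shape scrapyd returns; the IDs are made up.
sample_response = {
    "status": "ok",
    "pending": [],
    "running": [{"id": "0123abcd", "spider": "textspider"}],
    "finished": [],
}

for job_id in running_job_ids(sample_response):
    request = build_cancel_request("ARGUS", job_id)  # "ARGUS" is a hypothetical project name
    print(request.get_method(), request.full_url, request.data.decode("utf-8"))
```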

Postprocessing

When all jobs are finished (or you decide to Stop Scraping), you need to hit Postprocessing, which writes your scraped data to the directory of your input data and does some cleanup. Depending on the size of your output, this might take some time.

Aggregate Webpage Texts

By default, texts downloaded will be stored at the webpage level (see Output data below). If you need your texts aggregated at the website level, run Aggregate Webpage Texts.
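Conceptually, aggregating from the webpage level to the website level means grouping rows by website ID and joining their texts. The sketch below shows that idea only. The column names `ID` and `text` and the sample rows are assumptions for illustration, and the function is not the ARGUS implementation.

```python
from collections import defaultdict


def aggregate_by_website(rows, id_key="ID", text_key="text"):
    """Join the texts of all webpages that share the same website ID.

    Column names are assumptions; adjust them to the actual output file.
    """
    texts = defaultdict(list)
    for row in rows:
        texts[row[id_key]].append(row[text_key])
    # One entry per website, page texts separated by a single space.
    return {site_id: " ".join(parts) for site_id, parts in texts.items()}


# Made-up webpage-level rows: one row per webpage.
rows = [
    {"ID": "1", "text": "Welcome to our firm."},
    {"ID": "1", "text": "Our products."},
    {"ID": "2", "text": "Contact us."},
]
print(aggregate_by_website(rows))
```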

Output data

The output file can be found in the same directory as your original website address file.

One row equals one webpage and n (n ≤ Scrape limit) webpages equal one website (identified by its ID).

ARGUS textspider output

How ARGUS works

An ARGUS crawl is based on a list of user-provided firm website addresses (URLs) and proceeds as follows:

  1. The first webpage (a website’s main page) is requested using the first address in the given URL list.
  2. A collector item is instantiated, which is used to collect the website’s content, meta-data (e.g. timestamps, number of scraped URLs etc.) and a so-called URL stack.
  3. The main page is processed:
    • Content from the main page is extracted and stored in the collector item.
    • URLs which refer to subpages of the same website (i.e. domain) are extracted and stored in the collector item’s URL stack.
  4. The algorithm continues to request subpages of the website using URLs from the URL stack. In doing so, it can use a simple heuristic that gives higher priority to short URLs and to URLs that refer to subpages in a predefined language.
    • Content and URLs are collected from the subpage and stored in the collector item.
    • The next URL in the URL stack is processed.
  5. The algorithm stops processing a domain when all subpages or a predefined number of subpages per domain have been processed.
  6. The collected content is processed and written to an output file.
  7. The next website is processed by requesting the next URL from the user-provided URL list. The described process continues until all websites from the list have been processed.
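The steps above can be sketched for a single website as follows. This is a simplified illustration, not the ARGUS code: the function names, the `fetch` callback, the fake site, and the concrete priority rule (language match first, then URL length) are all assumptions. A heap stands in for the URL stack so that the highest-priority URL is always processed next.

```python
import heapq
from urllib.parse import urljoin, urlparse


def url_priority(url, preferred_language="en"):
    """Lower tuples are processed first: preferred-language URLs, then short URLs."""
    language_penalty = 0 if f"/{preferred_language}/" in url else 1
    return (language_penalty, len(url))


def crawl_website(start_url, fetch, scrape_limit=5, preferred_language="en"):
    """Crawl one website; fetch(url) must return (text, list_of_links)."""
    domain = urlparse(start_url).netloc
    # Collector item: scraped pages plus the URL stack (kept as a heap).
    collector = {
        "pages": [],
        "url_stack": [(url_priority(start_url, preferred_language), start_url)],
    }
    seen = {start_url}
    while collector["url_stack"] and len(collector["pages"]) < scrape_limit:
        _, url = heapq.heappop(collector["url_stack"])
        text, links = fetch(url)
        collector["pages"].append({"url": url, "text": text})
        for link in links:
            absolute = urljoin(url, link)
            # Only follow unseen links that stay on the same domain.
            if urlparse(absolute).netloc == domain and absolute not in seen:
                seen.add(absolute)
                entry = (url_priority(absolute, preferred_language), absolute)
                heapq.heappush(collector["url_stack"], entry)
    return collector["pages"]


# Made-up site used instead of real HTTP requests.
FAKE_SITE = {
    "http://example.com/": ("Home", ["/en/about", "/de/ueber-uns-und-mehr", "http://other.org/"]),
    "http://example.com/en/about": ("About", ["/"]),
    "http://example.com/de/ueber-uns-und-mehr": ("Ueber uns", []),
}


def fake_fetch(url):
    return FAKE_SITE.get(url, ("", []))


pages = crawl_website("http://example.com/", fake_fetch, scrape_limit=2)
print([page["url"] for page in pages])
# ['http://example.com/', 'http://example.com/en/about']
```

With a scrape limit of 2, the main page and the English subpage are processed, while the longer German URL and the external link are left out.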

FAQ

Why does ARGUS.exe not open?

ARGUS opens and starts scraping, but there is no output. What is the problem?

Why does ARGUS not open on my Unix-based system (e.g. Mac)?

I get an error message stating that ‘ID invalid’ (or similar). What is the issue?

One job is running for hours without scraping any data. What went wrong?