EPU index
At the Applied Data Mining research group at the University of Antwerp, a
classifier was developed to classify news articles as Economic Policy Uncertainty (EPU) related or not. The EPU index is
the number of EPU-related articles per day divided by the number of news journals that were crawled. To keep the EPU
index updated daily, a number of scrapers were developed that scrape Belgian (Flemish) news journals every day. The
resulting EPU index data is publicly available here.
The application consists of three parts: the web scrapers, a web application, and a front end.
- Scrapers: eight scrapers were developed using Python's Scrapy framework. Scrapy is well documented
here, and a tutorial will guide you through its main concepts. The crawlers that do the actual crawling work are
called spiders; the spiders that were developed are documented here. A minimal spider sketch follows this list.
- Web application: The web application is where all articles and their EPU classification scores
are stored. It is developed using Django, and its most important part is the models. The data is served to the
front end using Django's REST framework; a sketch of a model and serializer also follows this list.
- Front end: The front end consists purely of HTML and JavaScript and uses the C3 and
d3-cloud libraries. The data needed to generate the charts is fetched from the web application's REST endpoints.
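As a rough illustration of what one of the spiders looks like, here is a minimal Scrapy sketch. The spider name, start URL, CSS selectors, and item fields are hypothetical and not taken from the actual crawlers:

```python
import scrapy


class ExampleJournalSpider(scrapy.Spider):
    # All names and selectors below are placeholders, not the real spiders'.
    name = "example_journal"
    start_urls = ["https://www.example-journal.be/economie"]

    def parse(self, response):
        # Follow every article link found on the section overview page.
        for href in response.css("article a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_article)

    def parse_article(self, response):
        # Emit one item per article; a pipeline can then score and store it.
        yield {
            "url": response.url,
            "title": response.css("h1::text").get(),
            "text": " ".join(response.css("p::text").getall()),
        }
```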
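Similarly, the core of the web application can be summarized as a model plus a REST serializer along the following lines. This is a minimal sketch; the field names are assumptions, and the real models are documented separately:

```python
# models.py -- minimal sketch; field names are assumptions, not the real schema.
from django.db import models


class Article(models.Model):
    url = models.URLField(unique=True)
    title = models.CharField(max_length=500)
    published = models.DateField()
    epu_score = models.FloatField()  # score assigned by the EPU classifier


# serializers.py -- how such a model could be exposed via Django REST framework.
from rest_framework import serializers


class ArticleSerializer(serializers.ModelSerializer):
    class Meta:
        model = Article
        fields = ["url", "title", "published", "epu_score"]
```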
Installation
Check out the installation documentation.
Configuration
The application exposes a number of configurable parameters. Most notably:
- Journal authentication settings: these should be set in the crawling settings
file. See the crawlers' documentation for more information about these settings.
- Period and term to scrape: these can also be found in the crawling settings
file.
- Model file: this file should contain a comma-separated list of words and their weights, used to score an
article. It should include the header `word,weight`. The corresponding setting in the crawling settings
file points to the model file. Since the model file is used by the scraper, only newly scraped articles are
affected when a new file is supplied. A sketch of how such a file might be applied is shown below this list.
- EPU Score Cutoff: this cutoff defines the score above which articles are considered positive. You can alter this
cutoff in the models file, but note that you will then have to re-run the custom Django command
`calculate_daily_epu` for all dates already in the database.
- Stopwords: the stopwords are defined as a tuple in the models file. To generate
such a tuple from a text file, you can use the stand-alone script `stopwords_to_tuple.py` and
paste the result into the models file; a sketch of such a script is shown below this list.
- Email notifications: the application checks every day whether the scrapers are still working. This is done by
checking whether no articles have been returned for a number of consecutive days. This cutoff is defined
here, and the email recipients should be added via the `ALERT_EMAIL_TO` setting in the same file. A number of
other settings regarding the email alerts (host, port, etc.) are set in production only. A sketch of the
consecutive-days check is shown below this list.
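For illustration, the model file could be loaded and applied roughly as follows. This README does not specify the scoring function, so the weighted-sum scoring, the file name, and the cutoff value below are all assumptions:

```python
import csv


def load_model(path):
    """Read the model file: a CSV with the header `word,weight`."""
    with open(path, newline="") as f:
        return {row["word"]: float(row["weight"]) for row in csv.DictReader(f)}


def score_article(text, weights):
    # Assumed scoring rule: sum the weights of all model words in the article.
    tokens = text.lower().split()
    return sum(weights.get(token, 0.0) for token in tokens)


weights = load_model("model.csv")  # hypothetical file name
score = score_article("economic policy uncertainty rises", weights)
is_epu = score >= 0.5              # hypothetical cutoff value
```

An article is then counted as EPU-related for the daily index whenever its score passes the configured cutoff.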
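The following is a guess at what `stopwords_to_tuple.py` does, based on its name and stated purpose: read one stopword per line from a text file and print a Python tuple ready to be pasted into the models file. The real script may differ in its details:

```python
"""Sketch: convert a text file with one stopword per line into a Python tuple."""
import sys


def main(path):
    with open(path) as f:
        words = [line.strip() for line in f if line.strip()]
    # The variable name STOPWORDS is an assumption about the models file.
    print("STOPWORDS = (")
    for word in words:
        print(f"    {word!r},")
    print(")")


if __name__ == "__main__":
    main(sys.argv[1])
```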
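Finally, the daily health check could look roughly like the sketch below. Only the consecutive-days idea comes from this README; the function name and the way article dates are obtained are assumptions:

```python
from datetime import date, timedelta


def scraper_seems_broken(article_dates, cutoff_days, today=None):
    """Flag a scraper as broken when none of the last `cutoff_days` days
    (including today) produced any articles."""
    today = today or date.today()
    recent = {today - timedelta(days=n) for n in range(cutoff_days)}
    return recent.isdisjoint(article_dates)


# Example: no articles in the last 3 days -> an alert email would be sent.
dates = {date.today() - timedelta(days=5)}
assert scraper_seems_broken(dates, cutoff_days=3)
```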