The Newsy project is an experiment in making current world events, and the news that covers them, easier to understand.
The primary user-facing tool at this point is Worldview, a map-based frontend that overlays a heatmap on a world map to indicate how active each country is in world media compared to the past.
The backend pipelines (Newsy itself, primarily) can be used for a variety of other analysis tasks.
The Newsy project consists of three different projects:
Scraper's only required input is a News API key, which should be specified via the NEWS_API_KEY environment variable. Once set in motion, the scraper will stay in motion until stopped, and will attempt to download any unrecognized articles once every few minutes.
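
The behavior described above can be sketched roughly as follows. This is not the scraper's actual code: the function names (`load_api_key`, `poll_once`, `run_forever`), the injected `fetch_article_urls` and `download` callables, and the five-minute interval are all assumptions made for illustration. Only the NEWS_API_KEY variable name comes from this README.

```python
import os
import time


def load_api_key(environ=os.environ):
    """Return the key from NEWS_API_KEY, or raise if it is unset or empty."""
    key = environ.get("NEWS_API_KEY")
    if not key:
        raise RuntimeError("NEWS_API_KEY must be set before starting the scraper")
    return key


def poll_once(fetch_article_urls, download, seen):
    """Download every URL not already in `seen`; return the newly downloaded URLs."""
    new_urls = []
    for url in fetch_article_urls():
        if url in seen:
            continue  # already recognized, skip it
        download(url)
        seen.add(url)
        new_urls.append(url)
    return new_urls


def run_forever(fetch_article_urls, download, interval_seconds=300):
    """Poll until the process is stopped. The interval is an assumed value."""
    seen = set()
    while True:
        poll_once(fetch_article_urls, download, seen)
        time.sleep(interval_seconds)
```

Keeping a single polling pass in `poll_once` separate from the endless loop makes the "only download unrecognized articles" rule easy to exercise without waiting on the timer.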
newsy.py is the primary data extraction script and processes a news.json file (produced by scraper). This process will read in raw HTML and attempt to extract the body text, headline, publication date, authors, source, and other information related to the article. It will also create a vanilla RethinkDB document (no labels) containing raw metadata.
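
To give a feel for the extraction step, here is a minimal sketch that pulls a headline and body text out of raw HTML using only Python's standard `html.parser`. It is an assumption-laden illustration, not how newsy.py works: the real script may rely on a dedicated extraction library, and the `ArticleParser` class, the `extract_document` function, and the output field names are hypothetical. The returned dict stands in for an unlabeled document and is not written to RethinkDB here.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collect the <title> text as the headline and <p> text as the body."""

    def __init__(self):
        super().__init__()
        self._tag = None  # "title" or "p" while inside one of those tags
        self.headline = ""
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in ("title", "p"):
            self._tag = tag
            if tag == "p":
                self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag == self._tag:
            self._tag = None

    def handle_data(self, data):
        if self._tag == "title":
            self.headline += data
        elif self._tag == "p":
            self.paragraphs[-1] += data


def extract_document(raw_html, source):
    """Return an unlabeled document (hypothetical field names) for one article."""
    parser = ArticleParser()
    parser.feed(raw_html)
    parser.close()
    body = "\n".join(p.strip() for p in parser.paragraphs if p.strip())
    return {"headline": parser.headline.strip(), "body": body, "source": source}
```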
label_country.py and other label_* scripts read the available RethinkDB documents and attempt to annotate them with various labels. These processes only label (mutate) existing records and do not have any other side effects.
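
The "label only, no other side effects" contract can be illustrated with plain dictionaries standing in for RethinkDB documents. Everything in this sketch is hypothetical: the `COUNTRY_KEYWORDS` table, the keyword-matching approach, and the `labels`/`countries` field names are assumptions, and the real label_country.py may detect countries in an entirely different way.

```python
# Hypothetical lookup table; the real script may use a different method.
COUNTRY_KEYWORDS = {
    "France": ("france", "paris"),
    "Japan": ("japan", "tokyo"),
}


def label_country(doc):
    """Return a sorted list of countries whose keywords appear in the article."""
    text = (doc.get("headline", "") + " " + doc.get("body", "")).lower()
    return sorted(
        country
        for country, words in COUNTRY_KEYWORDS.items()
        if any(word in text for word in words)
    )


def label_all(documents):
    """Annotate each existing document in place; never add or remove records."""
    for doc in documents:
        doc.setdefault("labels", {})["countries"] = label_country(doc)
```

Because `label_all` only writes under the `labels` key of records that already exist, running it repeatedly leaves the extracted metadata untouched.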
render.py generates two different forms of output:
Build dist-ready assets by running gulp && npm build. Assets will be available in the public/ directory. Note that the frontend depends on data generated from newsy's render.py to show a populated map.