anyweez / newsy

News content extraction.

The Newsy project

The Newsy project is an experiment in helping people better understand current world events.

High-level goals

The goal is to make news easier to understand. The primary user-facing tool at this point is Worldview, a map-based frontend that overlays a heatmap on a world map to indicate how active each country is in world media compared to its past coverage.

The backend pipelines (Newsy itself, primarily) can be used for a variety of other analysis tasks.

System overview

The Newsy project consists of three different projects: scraper, newsy, and worldview.

Running the pipeline

scraper

Scraper's only required input is a News API key, which should be specified via the NEWS_API_KEY environment variable. Once set in motion, the scraper will stay in motion until stopped, and will attempt to download any unrecognized articles once every few minutes.
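The behavior described above can be sketched roughly as follows. This is a minimal illustration, not the scraper's actual code: the function names (require_api_key, select_unseen, poll), the "url" field, and the 300-second default interval are all assumptions made for the example. Only the NEWS_API_KEY variable name comes from this README.

```python
import os
import time


def require_api_key(env=os.environ):
    """Return the key from NEWS_API_KEY, or fail early if it is unset."""
    key = env.get("NEWS_API_KEY", "").strip()
    if not key:
        raise RuntimeError("NEWS_API_KEY is not set")
    return key


def select_unseen(articles, seen_urls):
    """Return articles whose URL is not in seen_urls, recording them as seen."""
    fresh = []
    for article in articles:
        url = article["url"]  # assumed field name
        if url not in seen_urls:
            seen_urls.add(url)
            fresh.append(article)
    return fresh


def poll(fetch_listing, download, interval_seconds=300, max_cycles=None,
         sleep=time.sleep):
    """Repeatedly fetch a listing and download only unrecognized articles.

    fetch_listing and download are caller-supplied callables (hypothetical).
    With max_cycles=None the loop runs until the process is stopped.
    """
    seen_urls = set()
    cycle = 0
    while max_cycles is None or cycle < max_cycles:
        for article in select_unseen(fetch_listing(), seen_urls):
            download(article)
        cycle += 1
        if max_cycles is None or cycle < max_cycles:
            sleep(interval_seconds)
```

Passing the fetch and download steps in as callables keeps the loop testable without network access; a real run would call require_api_key() once at startup and leave max_cycles unset.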

newsy

newsy.py is the primary data extraction script and processes a news.json file (produced by scraper). This process will read in raw HTML and attempt to extract the body text, headline, publication date, authors, source, and other information related to the article. It will also create a vanilla RethinkDB document (no labels) containing raw metadata.
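To give a feel for the extraction step, here is a small sketch that pulls a headline and body text out of raw HTML using only the Python standard library. It is not how newsy.py itself works: the ArticleParser class, the extract_article function, and the choice to treat the title element as the headline and p elements as body text are assumptions for illustration. Real article pages usually need more robust heuristics.

```python
from html.parser import HTMLParser


class ArticleParser(HTMLParser):
    """Collect the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.headline = ""
        self.paragraphs = []
        self._current = None  # tag currently being captured, if any
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        # Start capturing only when not already inside a captured element.
        if tag in ("title", "p") and self._current is None:
            self._current = tag
            self._buffer = []

    def handle_data(self, data):
        if self._current is not None:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag != self._current:
            return  # inline tags such as </b> do not end the capture
        text = " ".join("".join(self._buffer).split())  # collapse whitespace
        if tag == "title":
            self.headline = text
        elif text:
            self.paragraphs.append(text)
        self._current = None


def extract_article(raw_html):
    """Return a dict with the headline and body text found in raw_html."""
    parser = ArticleParser()
    parser.feed(raw_html)
    parser.close()
    return {"headline": parser.headline, "body": "\n\n".join(parser.paragraphs)}
```

The returned dict resembles the kind of unlabeled document described above, although the actual field names stored in RethinkDB are not specified here.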

label_country.py and other label_* scripts read the available RethinkDB documents and attempt to annotate them with various labels. These processes only label (mutate) existing records and do not have any other side effects.
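The "label only" contract can be illustrated with a sketch that annotates an in-memory document and touches nothing else. The keyword table, the label_countries function, and the "labels" and "countries" field names are hypothetical. The real label_country.py may use a different matching method and schema, and it reads from and writes to RethinkDB rather than plain dicts.

```python
import re

# Hypothetical keyword table, for illustration only.
COUNTRY_KEYWORDS = {
    "FR": ("france", "paris"),
    "JP": ("japan", "tokyo"),
}


def label_countries(document, keywords=COUNTRY_KEYWORDS):
    """Add a sorted 'countries' label to the document in place and return it.

    Only the 'labels' field is written; existing fields are left untouched.
    """
    text = f"{document.get('headline', '')} {document.get('body', '')}".lower()
    words = set(re.findall(r"[a-z]+", text))
    matches = sorted(code for code, terms in keywords.items()
                     if words & set(terms))
    document.setdefault("labels", {})["countries"] = matches
    return document
```

Because the function only adds to the labels field, it can be rerun on the same document and gives the same result, which fits the description of the label_* scripts as mutating existing records without other side effects.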

render.py generates two different forms of output:

worldview

Build dist-ready assets by running gulp && npm build. Assets will be available in the public/ directory. Note that the frontend depends on data generated from newsy's render.py to show a populated map.