Refactor GDELT mining - Githubissues

The [GDELT project](https://blog.gdeltproject.org/), in the words of the creator himself:

> monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.

Essentially, it's a vast, deep and rich data lake, and it is easy to drown in that much data. So we need to be very focused on what we want to accomplish with it, and very targeted about which exact data points we are interested in.

For us, the criteria of what we want to source, from GDELT and other wells, are as follows:

- Protest events that are happening in North America
- All attributes that can give us a rich understanding of those events (e.g. geography, time frame, attendance numbers, relationship to other events)
- Media coverage of those events
- Actors that are involved in those events
- Narratives from those actors

With that in mind, we identified 3 feeds that we want to source from GDELT. All 3 come from the [core GDELT 2.0 release](https://blog.gdeltproject.org/gdelt-2-0-our-global-world-in-realtime/):

- The events table, which provides an exhaustive list of protest events and attributes. See the relevant section in the [GDELT 2.0 codebook](http://data.gdeltproject.org/documentation/GDELT-Event_Codebook-V2.0.pdf)
- The mentions table, which provides an exhaustive list of all media articles that refer to those events
- The [Global Knowledge Graph (GKG) table](http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_Codebook-V2.1.pdf), which provides additional attributes for events through counts, themes and other fields extracted from media articles

Note that all 3 feeds above have an English-only source as well as a translingual one (meaning all languages other than English). For this version, we are going to work with the English source only, and eventually incorporate the translingual version.

Mining strategy

As mentioned in the introduction, we want to be focused and targeted in how we mine the GDELT data source. For us that means extracting only relevant events first (events table), then relevant media articles (mentions table), and finally relevant media article attributes (GKG table).

That means we need to mine in sequence, keeping only the relevant events, articles and attributes (GKG) at each step. The output of each step becomes an additional filter for the next mining operation downstream.

For example, we would mine only North American protest events in the first mining operation. The output of that is a list of events, which we then use to mine only the articles that refer to those events.
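
The sequence described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the current implementation: the function names and sample rows are made up, "North America" is assumed to mean the FIPS country codes `US`, `CA` and `MX`, protests are assumed to be identified by CAMEO root code `14`, and mentions are assumed to link to GKG records through the article URL (`MentionIdentifier` matching `DocumentIdentifier`).

```python
# Illustrative sketch: each step's output filters the next step's input.
# Field names follow the GDELT 2.0 codebooks; the sample rows are made up.
PROTEST_ROOT_CODE = "14"            # CAMEO root code for PROTEST
NORTH_AMERICA = {"US", "CA", "MX"}  # assumption: FIPS codes treated as North America


def filter_events(events):
    """Step 1: keep protest events located in the target countries."""
    return [
        e for e in events
        if e["EventRootCode"] == PROTEST_ROOT_CODE
        and e["ActionGeo_CountryCode"] in NORTH_AMERICA
    ]


def filter_mentions(mentions, kept_events):
    """Step 2: keep mentions that refer to an event kept in step 1."""
    kept_ids = {e["GlobalEventID"] for e in kept_events}
    return [m for m in mentions if m["GlobalEventID"] in kept_ids]


def filter_gkg(gkg_records, kept_mentions):
    """Step 3: keep GKG records for articles kept in step 2 (assumed URL match)."""
    kept_urls = {m["MentionIdentifier"] for m in kept_mentions}
    return [g for g in gkg_records if g["DocumentIdentifier"] in kept_urls]


events = [
    {"GlobalEventID": "1", "EventRootCode": "14", "ActionGeo_CountryCode": "US"},
    {"GlobalEventID": "2", "EventRootCode": "14", "ActionGeo_CountryCode": "FR"},
    {"GlobalEventID": "3", "EventRootCode": "04", "ActionGeo_CountryCode": "CA"},
]
mentions = [
    {"GlobalEventID": "1", "MentionIdentifier": "https://example.com/a"},
    {"GlobalEventID": "2", "MentionIdentifier": "https://example.com/b"},
]
gkg_records = [
    {"DocumentIdentifier": "https://example.com/a", "Themes": "PROTEST"},
    {"DocumentIdentifier": "https://example.com/b", "Themes": "PROTEST"},
]

kept_events = filter_events(events)
kept_mentions = filter_mentions(mentions, kept_events)
kept_gkg = filter_gkg(gkg_records, kept_mentions)
print(len(kept_events), len(kept_mentions), len(kept_gkg))
```

Only event `1` survives the first step, so only its article and that article's GKG record are kept downstream.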

For this story

At the moment, we are only mining from the events table. We would like to extend how we mine GDELT so that we get richer data.

To consider this project completed, we want to:

- [ ] Expose the filtering of which events to mine as a set of configurations in a global yml file
- [ ] Using the package configurations, review the events table mining, keeping only the relevant records
- [ ] Using the output from the previous step, mine only the relevant records from the mentions table
- [ ] Using the output from the previous step, mine only the relevant records from the GKG table
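
As a rough idea of what configuration-driven filtering could look like, here is a minimal Python sketch. The `config` dict stands in for whatever a YAML loader would return from the global yml file; the `event_filters` key, its layout, and the `build_event_filter` helper are all hypothetical and only meant to show that the filter values live in configuration rather than in the mining code.

```python
# Stand-in for the parsed global yml file. All keys are hypothetical:
# each entry maps an events table field to the values we want to keep.
config = {
    "event_filters": {
        "EventRootCode": ["14"],                      # CAMEO root code for PROTEST
        "ActionGeo_CountryCode": ["US", "CA", "MX"],  # assumed North America codes
    }
}


def build_event_filter(filters):
    """Turn the configured field/value lists into a single predicate."""
    allowed = {field: set(values) for field, values in filters.items()}

    def is_relevant(event):
        # An event is kept only if every configured field has an allowed value.
        return all(event.get(field) in values for field, values in allowed.items())

    return is_relevant


is_relevant = build_event_filter(config["event_filters"])

sample_events = [
    {"GlobalEventID": "1", "EventRootCode": "14", "ActionGeo_CountryCode": "CA"},
    {"GlobalEventID": "2", "EventRootCode": "14", "ActionGeo_CountryCode": "FR"},
]
relevant = [e for e in sample_events if is_relevant(e)]
```

With this shape, changing which events are mined means editing the configuration only, and the same predicate can feed the downstream mentions and GKG steps.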

discursus-data / saf_gdelt

Refactor GDELT mining #2
