discursus-data / saf_gdelt

GDELT resource library for the Social Analytics Framework
MIT License

Refactor GDELT mining #2

Closed olivierdupuis closed 2 years ago

olivierdupuis commented 2 years ago

The [GDELT project](https://blog.gdeltproject.org/), in the words of the creator himself:

> monitors the world's broadcast, print, and web news from nearly every corner of every country in over 100 languages and identifies the people, locations, organizations, themes, sources, emotions, counts, quotes, images and events driving our global society every second of every day, creating a free open platform for computing on the entire world.

Essentially, it’s a vast, deep and rich data lake, and it is easy to drown in that much data. It is therefore necessary to stay focused on what you want to accomplish with the data and to target exactly the data points you are interested in.

For us, the criteria of what we want to source, from GDELT and other wells, are as follows:

With that in mind, we identified 3 feeds that we want to source from GDELT. All 3 come from the [core GDELT 2.0 release](https://blog.gdeltproject.org/gdelt-2-0-our-global-world-in-realtime/):

  1. The events table, which provides an exhaustive list of protest events and their attributes. See the relevant section in the [GDELT 2.0 codebook](http://data.gdeltproject.org/documentation/GDELT-Event_Codebook-V2.0.pdf)
  2. The mentions table, which provides an exhaustive list of all media articles that refer to those events
  3. The [Global Knowledge Graph (GKG) table](http://data.gdeltproject.org/documentation/GDELT-Global_Knowledge_Graph_Codebook-V2.1.pdf), which adds attributes to events by extracting counts, themes, etc. from media articles.

Note that all 3 feeds above have an English-only source as well as a translingual one (covering all languages other than English). For this version, we will work with the English source only and incorporate the translingual version later.
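As a rough illustration of how the three English feeds might be located, the sketch below parses an update listing into one URL per feed. It assumes that GDELT 2.0 publishes such a listing (commonly `lastupdate.txt`, with a `lastupdate-translation.txt` counterpart for the translingual feeds), that each line has the form `size hash url`, and that the file names contain `.export.`, `.mentions.` and `.gkg.`. None of this is stated in this issue, and `parse_update_listing` is a hypothetical helper, so check the format against the GDELT documentation before relying on it.

```python
from typing import Dict

# Assumed file-name markers for the three feeds (not taken from this issue).
FEED_MARKERS = {
    "events": ".export.",
    "mentions": ".mentions.",
    "gkg": ".gkg.",
}


def parse_update_listing(listing: str) -> Dict[str, str]:
    """Map each feed name to its URL, given 'size hash url' lines."""
    feeds: Dict[str, str] = {}
    for line in listing.splitlines():
        parts = line.split()
        if len(parts) != 3:
            # Skip blank or malformed lines instead of failing.
            continue
        url = parts[2]
        lowered = url.lower()
        for feed, marker in FEED_MARKERS.items():
            if marker in lowered:
                feeds[feed] = url
    return feeds
```

Switching to the translingual source later would then only mean passing a different listing to the same function, provided the file names follow the same pattern.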

Mining strategy

As mentioned in the introduction, we need to be focused and targeted in how we mine the GDELT data source. For us, that means extracting only relevant events first (events table), then relevant media articles (mentions table), and finally relevant media article attributes (GKG table).

That means we need to mine in sequence, keeping only the relevant events, articles and attributes (GKG) at each step. The output of each step becomes an additional filter for the next mining operation downstream.

For example, we would mine only North American protest events in the first mining operation. The output of that is a list of events, which we then use to mine only the articles relevant to those events.
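The sequence described above can be sketched in plain Python, with each step's output becoming the filter for the next one. This is a minimal sketch, not the project's implementation: rows are modelled as dictionaries, the function names are hypothetical, and the filter itself (CAMEO root code `14` for protests, country codes `US`, `CA` and `MX` for North America) is an assumption about what "North American protest events" means here. The column names (`GLOBALEVENTID`, `EventRootCode`, `ActionGeo_CountryCode`, `MentionIdentifier`, `DocumentIdentifier`) follow the GDELT codebooks as I understand them and should be verified against the actual files.

```python
from typing import Dict, Iterable, List, Set, Tuple

Row = Dict[str, str]

# Assumed filter values (not specified in this issue).
PROTEST_ROOT_CODE = "14"
NORTH_AMERICA = {"US", "CA", "MX"}


def mine_events(events: Iterable[Row]) -> List[Row]:
    """Step 1: keep only protest events located in North America."""
    return [
        e for e in events
        if e["EventRootCode"] == PROTEST_ROOT_CODE
        and e["ActionGeo_CountryCode"] in NORTH_AMERICA
    ]


def mine_mentions(mentions: Iterable[Row], event_ids: Set[str]) -> List[Row]:
    """Step 2: keep only articles that mention a retained event."""
    return [m for m in mentions if m["GLOBALEVENTID"] in event_ids]


def mine_gkg(gkg: Iterable[Row], article_urls: Set[str]) -> List[Row]:
    """Step 3: keep only GKG records for retained articles."""
    return [g for g in gkg if g["DocumentIdentifier"] in article_urls]


def mine_in_sequence(
    events: Iterable[Row], mentions: Iterable[Row], gkg: Iterable[Row]
) -> Tuple[List[Row], List[Row], List[Row]]:
    """Run the three steps, feeding each output into the next filter."""
    kept_events = mine_events(events)
    event_ids = {e["GLOBALEVENTID"] for e in kept_events}
    kept_mentions = mine_mentions(mentions, event_ids)
    article_urls = {m["MentionIdentifier"] for m in kept_mentions}
    kept_gkg = mine_gkg(gkg, article_urls)
    return kept_events, kept_mentions, kept_gkg
```

Passing sets of identifiers from one step to the next keeps each lookup cheap and means the mentions and GKG feeds never need to be retained in full.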


For this story

At the moment, we only mine the events table. We would like to extend how we mine GDELT to get richer data.

To consider this project completed, we want: