bryankolano / gdelt_pipeline_google_cloud

To get some practice with the ETL tool Prefect, this repo grabs events from the Global Database of Events, Language, and Tone (GDELT) and moves them through an ETL pipeline that ends in Google BigQuery.

ETL Pipeline of GDELT data using Prefect and Google Cloud Platform (GCP)

A series of Python scripts that grab data from the GDELT database and write it to GCP
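The extract step can be sketched in a few lines, under one assumption: the pipeline pulls the GDELT 1.0 daily event export, which is published at a predictable URL of the form `http://data.gdeltproject.org/events/YYYYMMDD.export.CSV.zip` (the helper name below is illustrative, not the repo's actual code):

```python
from datetime import date, timedelta

GDELT_EVENTS_BASE = "http://data.gdeltproject.org/events"

def daily_export_url(day: date) -> str:
    """Return the download URL for one day's GDELT 1.0 events export."""
    return f"{GDELT_EVENTS_BASE}/{day:%Y%m%d}.export.CSV.zip"

# The previous day's file, as targeted by the scheduled pipeline:
print(daily_export_url(date.today() - timedelta(days=1)))
```

From there, the scripts would download and unzip the file before cleaning and loading it.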

By: Bryan Kolano, Original repo creation: March 20th, 2023


Background

Getting into data engineering has led me to explore various ETL tools. In the past few months, I started to learn a little about the ETL tool Prefect. I had used GDELT before for an R class I used to teach, and I thought it would be a good dataset to run through a pipeline with Prefect.

Files in this repo

Data cleaning

I discovered several data quality issues in the GDELT data. Some of the issues I found and fixed include:
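As an illustration of what such a cleaning pass can look like (the column names `GLOBALEVENTID`, `SQLDATE`, and `AvgTone` come from the GDELT events codebook, but the specific issues and fixes shown here are hypothetical, not necessarily the ones in this repo):

```python
import pandas as pd

# Toy rows mimicking the GDELT events schema; the quality issues shown
# (duplicate event IDs, inconsistent dates, non-numeric tone) are illustrative.
raw = pd.DataFrame({
    "GLOBALEVENTID": [1001, 1001, 1002],                # duplicate event ID
    "SQLDATE": ["20230319", "20230319", "2023-03-19"],  # inconsistent date format
    "AvgTone": ["1.5", "1.5", "bad"],                   # non-numeric tone value
})

clean = (
    raw.drop_duplicates(subset="GLOBALEVENTID")
       .assign(
           # normalize both date styles to a proper datetime column
           SQLDATE=lambda df: pd.to_datetime(
               df["SQLDATE"].str.replace("-", ""), format="%Y%m%d"
           ),
           # coerce unparseable tone values to NaN instead of failing the load
           AvgTone=lambda df: pd.to_numeric(df["AvgTone"], errors="coerce"),
       )
)
print(clean)
```

Coercing bad values to `NaN` rather than dropping rows keeps the load idempotent while still flagging records for later inspection.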

Steps for Prefect

To take advantage of Prefect's functionality, the following steps were taken:

  1. Start the Prefect Orion server
  2. Create a Google Credentials block
  3. Create a Google Cloud Storage block
  4. Create a Google BigQuery block
  5. Create a Prefect deployment
  6. Schedule the Prefect deployment to run each day for the previous day's data