IFRCGo / GCDB

Global Crisis Data Bank
Creative Commons Zero v1.0 Universal
1 stars 0 forks source link

ETL Monty

Montandon - Global Crisis Data Bank

Welcome to the repository for the realtime Extraction, Transformation and Loading (ETL) scripts of the Monty database. This repository mostly deals with extracting the data from all of the external databases, whereby it then processes and transforms this data into a common hazard, exposure, impact, and response taxonomy classification system and then pushes this data onto the Monty database. However, this repository also contains the analysis scripts that were required to provide the data for, for example, the World Disaster Report 2024 or the Monty dashboards. Here is where the magic happens!

About Monty

Monty, an abbreviated name for the Montandon - Global Crisis Data Bank, is a database that brings in hazard and impact data for current, historical and forecasted disasters around the globe. By combining lots of different sources of information, Monty aims to fill-in-the-gaps and provide a more complete picture of disaster risk for the National Societies. For more information about the Montandon project, please check out this 5-minute video. You can directly access the Monty data through the API here. Please note that the API code is stored in a separate GitHub repository, here.

Code Layout

Here we describe the layout of the code. The code base is located in the folder RCode. Inside this folder, you will find the following folders and files:

More information about each folder and files is detailed below.

Analysis

This folder contains the scripts used to analyse the Monty data.

Key Files

GCDB_Objects

This folder contains the main elements that define and manipulate Monty objects and instances.

Key Files

MainlyHazardData

This folder contains all of the scripts to access, download and transform all of the hazard-predominant databases.

Key Files

MainlyImpactData

This folder contains all of the scripts to access, download and transform all of the impact-predominant databases

Key Files

Other

These files are a bit of a random eclectic mix of different files that may or may not be useful anymore...

Key Files

Setup

This folder is where all of the basic setup scripts are found. This is where the installation scripts are and some of the fundamental functions that are required to make Monty run.

Key Files

Spatio-Infra-Political

This section is all about political administrative boundaries and other geospatial infrastructural datasets.

Key Files

Installation

Please use the file InstallationChecks.R which can be found in the Setup folder. This has two lines to it: the first installs all of the correct packages and loads all of the scripts into the global environment, the second then tries to make a simple call to extract IFRC EA and DREF dataset and transform it into a Monty instance. Note that the first line should always work, no matter what. However, the second line is not always guaranteed to work as IFRC APIs are the data model of these datasets can be modified or updated and therefore this function may not work in future.

DIY wrangling a new dataset into Monty

For this final section, an explanation is provided on how to extract future datasets such that they conform to the Monty data model. Generally speaking, this is straight-forward, but there are a few key moments whereby it is easy to make a mistake. This section will cover the following: converting the database hazard and impact taxonomies to conform with Monty and then the remainder of what is required in order to convert the rest of the database into a Monty object.

Hazard taxonomy classification

In order to convert the hazard taxonomy from one organisations database to Montys, this conversion takes place using database-specific excel spreadsheets that are located in the folder Taxonomies/ConvertFromDatabases/. There can be many columns to the table in each of these spreadsheets, but there must be the following Monty-specific hazard taxonomy layers: haz_Ab, haz_type, haz_cluster and haz_spec. These are used to essentially do a left-join of the spreadsheet to the Monty data. For each unique taxonomy classification of the external database, the specific hazards from the UNDRR-ISC Hazard Information Profiles (HIPs) must be classified to it. Note that there are almost always multiple HIP specific hazards associated to a single hazard taxonomy of an external database.

Impact taxonomy classification

In order to convert the impact taxonomy from one organisations database to Monty, this conversion takes place using an excel spreadsheet that covers all databases: Taxonomies/ConvertFromDatabases/ConvertImpact_Taxonomy.xlsx. This file, similar to the conversion for the hazard taxonomies, has a line for each impact record that is labelled in the external database and is then converted with respect to the Monty exposure taxonomy, the impact type taxonomy and the impact unit taxonomy. All of these taxonomies can be found in the file ImpactInformationProfiles.xlsx.

Converting to the Monty object data model

The first thing that is required is to transform the metadata of the external database. This is mainly just renaming the variables in the dataframe to match with what it should correspond to the different elements in the Monty. Note that some variables that are located in different dataframes in the Monty are actually nested lists which need to be properly handled when you export to a tabular (e.g. Excel) format. An example of this is the ev_ISO3s, imp_ISO3s and haz_ISO3s variables of the event_info, impact_data and hazard_data elements of the Monty object, respectively. Each disaster record can have multiple countries/territories associated to it, hence the requirement for nested lists. One tricky thing to get around is the inclusion of external IDs that we want to include in the data transferred to the end user. For example, a single entry in GDACS may also refer to a DFO ID and a GLIDE number, which would result in three external IDs (including the GDACS ID). Please use-case examples of the functions Add_hazIDlink_Monty and Add_impIDlink_Monty for some of the databases already wrangled to see how the external ID is handled for hazard and impact data.