DevLab-Duke / mlp-data-intro

Repository for: Tracking Civic Space in Developing Countries with a High-Quality Corpus of Domestic Media and Language Models

Required Data #1

Open jrspringman opened 1 month ago

jrspringman commented 1 month ago

We need the following data, as well as scripts that pull this data from the respective folders/repositories. With the exception of the shock detection results, this should all be in the final, processed data that is ingested by forecast-surges-pipeline. We would just need to delete the TE data.

For the civic space and RAI data, there's an easy approach and a harder approach.

Civic Space Data

Easy approach

An adaptation of this script should work:

It includes the raw counts, normalized counts, article totals, and source entry flags. It also includes the TE data, which needs to be removed, so you'll need to figure out a simple way to scrub that from the country datasets as you bind them together. The only thing you cannot get from this script is the set of RAI variables that disaggregate each specific indicator for Russia/China.
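One way to scrub the TE data while binding the country datasets could look like the sketch below. It is only an illustration in Python using the standard library: the `te_` column prefix, the function name `bind_without_te`, and the assumption that each country dataset is a CSV file are all hypothetical and would need to be matched to the real files.

```python
import csv

# Hypothetical marker: assumes every TE column shares this name prefix.
TE_PREFIX = "te_"


def bind_without_te(csv_paths, te_prefix=TE_PREFIX):
    """Row-bind country CSVs, dropping columns whose names start with te_prefix.

    Returns (columns, rows), where columns is the union of kept column names
    in first-seen order and rows is a list of dicts keyed by those columns.
    """
    rows = []
    columns = []
    for path in csv_paths:
        with open(path, newline="", encoding="utf-8") as handle:
            for record in csv.DictReader(handle):
                # Keep only named, non-TE columns.
                kept = {
                    name: value
                    for name, value in record.items()
                    if name is not None and not name.startswith(te_prefix)
                }
                for name in kept:
                    if name not in columns:
                        columns.append(name)
                rows.append(kept)
    # Columns missing from a given country are filled with empty strings.
    filled = [{name: row.get(name, "") for name in columns} for row in rows]
    return columns, filled
```

Filtering by column name at read time means the TE values never enter the combined dataset, so there is no separate clean-up step to forget later.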

Harder but better approach

Adding a few lines to the ML4P-Civic-Space-Forecasting processing code would be better. You could find the part of the code that writes out to the ml4p.forecasting/2-model-data Dropbox folder and have it write out a slightly different version without the TE data.

Shock Detection Results

This data is stored in dated subfolders within forecast-surges-pipeline/data. I think the best method will be to add a line of code to the Python script so that it also writes its output to a subfolder in the ml4p.forecasting Dropbox folder. That way, you can avoid the dated subfolders and just pull from there.
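A rough sketch of that idea, in Python: find the most recent dated subfolder and copy its contents to a fixed, undated location. The folder naming scheme (`YYYY-MM-DD`), the function names, and the destination path are assumptions for illustration, not details taken from forecast-surges-pipeline.

```python
import shutil
from datetime import datetime
from pathlib import Path


def latest_dated_subfolder(data_dir, date_format="%Y-%m-%d"):
    """Return the subfolder of data_dir whose name parses to the latest date."""
    dated = []
    for child in Path(data_dir).iterdir():
        if not child.is_dir():
            continue
        try:
            dated.append((datetime.strptime(child.name, date_format), child))
        except ValueError:
            continue  # skip folders that are not named as dates
    if not dated:
        raise FileNotFoundError(f"no dated subfolders found in {data_dir}")
    return max(dated, key=lambda pair: pair[0])[1]


def publish_latest(data_dir, dest_dir):
    """Copy the newest dated subfolder's contents to a fixed destination."""
    source = latest_dated_subfolder(data_dir)
    dest = Path(dest_dir)
    # dirs_exist_ok lets repeated runs overwrite files of the same name.
    shutil.copytree(source, dest, dirs_exist_ok=True)
    return dest
```

Note that files left over from an earlier run are not deleted by this copy, so if the set of output files changes between runs the destination may need to be cleared first.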

Disaggregate by domestic/regional+international

Do you have code that does this? For my Remedios project, I have a repo that takes a modified version of the core functions from the ml4p.forecast package and writes out source-level data. That might be a helpful starting point. Let me know if you want me to add you to the repo. You'll be looking at code/sample_frame.R and code/mlp_functions.R.
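For reference, once source-level data is available, the split itself is a simple grouped sum. The Python sketch below is only a guess at the shape of that step: the field names (`country`, `month`, `event`, `scope`, `count`) and the scope labels are hypothetical and are not taken from the ml4p.forecast package or the Remedios repo.

```python
from collections import defaultdict

# Hypothetical scope labels mapped to the two output groups.
SCOPE_GROUP = {
    "domestic": "domestic",
    "regional": "regional_international",
    "international": "regional_international",
}


def disaggregate(source_rows):
    """Sum counts per (country, month, event, group) from source-level rows.

    Raises KeyError if a row carries a scope label that is not in SCOPE_GROUP,
    so unlabeled sources are not silently assigned to either group.
    """
    totals = defaultdict(int)
    for row in source_rows:
        group = SCOPE_GROUP[row["scope"]]
        key = (row["country"], row["month"], row["event"], group)
        totals[key] += row["count"]
    return dict(totals)
```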

RAI Data

At some point, I need to modify the rai.atari package to output the disaggregated Russia/China results.

dmoratz commented 1 month ago

I've created a data_update.R script that adds the civic space counts data. I did this by piggybacking off the ML4P-Civic-Space-Forecasting infrastructure, so updating this data requires first running the data update for that repository and then running data_update.R for this one.

jrspringman commented 2 weeks ago

I pulled in the shock detection data by hand. We need to add something to data_update.R that will pull this from the shock repo automatically (or add something to the shock repo that publishes the shock data to our shared Dropbox folder, and then pull from there).