IBM / mixed-migration-forecasting

Forecasting mixed migration for the Danish Refugee Council.
Apache License 2.0

Source file 'prm-datasets/indicators/worldbank/WDI/WDIData.csv' not found. #22

Open umbe1987 opened 1 year ago

umbe1987 commented 1 year ago

https://github.com/IBM/mixed-migration-forecasting/blob/5047c748b60b3f7c3621e0174200007865cc2933/server/executor.py#L42

Hello and thanks for sharing this repository first of all :)

This line in executor.py throws an error because it tries to read a file that does not exist ('prm-datasets/indicators/worldbank/WDI/WDIData.csv').

This is set as the source file. Is that correct? The folder prm-datasets/indicators does not exist in the project when you download it, whereas the prm-datasets/processed folder does. Maybe it should be the other way round (just a guess)?


rahulnair23 commented 1 year ago

Hi @umbe1987 - executor.py is aimed at standardising the datasets from various sources, i.e. generating the data found in processed.

The main reason to run this would be to (a) update the source data, e.g. more recent datasets, or (b) include additional datasets. If you are updating the source data, you need to provide the raw datasets yourself. If you are including additional data, you can pass your own configuration, similar to configuration.json.

The source/raw data files have not been packaged with the application.

umbe1987 commented 1 year ago

Thanks for your feedback @rahulnair23

My aim was to update the datasets. Let me check whether I understood correctly.

This project does not let me download new data, only standardise new raw data. Is that correct?

If so, is there a document somewhere with links for downloading the source (raw) files, i.e. those used to generate the files in the processed folder that I would need to update?

Also, I guess I am supposed to create a folder prm-datasets/indicators and place there the various raw data with the exact names I see in configuration.json, is that correct?

Thanks in advance for clarifying.

rahulnair23 commented 1 year ago

(with apologies for the late response).

> This project does not allow me to download new data, but to standardise new raw data, is it correct?

That is correct. The reason is that we can't redistribute the source data without adequate permissions. While most datasets are openly available (e.g. UNHCR, WHO, World Bank), some others are not (e.g. EMDAT).

> If this is the case, is there a document with the various links to download the source (raw) files somewhere (those that were used to generate the ones in the processed folder that I would need to update)?

Have a look at configuration.json, which is probably the most descriptive record of the (raw) sources. Unfortunately, we do not have direct URL links. Data publishing in this sector is typically ad hoc and may not be consistent year on year (with exceptions).

> Also, I guess I am supposed to create a folder prm-datasets/indicators and place there the various raw data with the exact names I see in configuration.json, is that correct?

Yes. You can edit the configuration to remove any sources you can't find or don't want, or to change the base path of the source files. You can also include additional sources and specify custom transformers for them.
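
For anyone wanting to see which raw files still need to be placed before running executor.py, here is a minimal sketch. It is not part of the repository. It makes no assumption about the actual schema of configuration.json: it simply walks whatever JSON it is given and treats any string ending in a data-file extension as a source path. The function names, the extension list, and the idea that paths are relative to a single base directory are all assumptions made for illustration.

```python
from pathlib import Path

# Extensions treated as raw data files. This list is an assumption; adjust as needed.
DATA_SUFFIXES = {".csv", ".xls", ".xlsx", ".zip"}


def iter_path_like_strings(node):
    """Yield every string in a nested JSON structure that ends in a data-file suffix."""
    if isinstance(node, dict):
        for value in node.values():
            yield from iter_path_like_strings(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_path_like_strings(item)
    elif isinstance(node, str) and Path(node).suffix.lower() in DATA_SUFFIXES:
        yield node


def missing_sources(config, base_dir="."):
    """Return, sorted, the path-like strings in `config` that are not files under `base_dir`."""
    base = Path(base_dir)
    return sorted(
        {p for p in iter_path_like_strings(config) if not (base / p).is_file()}
    )
```

Loading the configuration with `json.load` and passing the result to `missing_sources` gives the list of files to obtain, or the entries to remove from the configuration if a source is not wanted. Because the walk is schema-agnostic, it may also pick up strings that are not source files, so the output is worth a quick manual check.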