devinit / digital-platform

PostgreSQL/analyst → MongoDB → Development Data Hub
http://data.devinit.org:8888/#!/ & http://data.devinit.org/#!/
3 stars 12 forks source link

Data Warehouse check for all automated data series complete, @robtew & @timstrawson to sign off #248

Closed dw8547 closed 7 years ago

dw8547 commented 8 years ago

Good morning @xriss, @notshi, @bill-anderson, @timstrawson, @robtew, @jenclaydon. I have now finished checking the content in all of the automated data series in DDW that replace the corresponding GitHub files from https://github.com/devinit/digital-platform/tree/master/country-year. I have documented the discrepancies that I have found as % error, DH relative to DDW i.e., we are assuming DH contains the right data (we know it does not in some instances). I now hand over to @robtew & @timstrawson. We'll work on resolving any issues they have with the data checks and what they bring up. This is a summary of where we are up to & what came out in the wash.

I have checked:

  1. OECD CRS ODA data series with unbundled aid fact.oda_2012 ⇔ https://github.com/devinit/digital-platform/blob/master/country-year/oda.csv % errors documented here: https://github.com/devinit/digital-platform/issues/243
  2. OECD CRS ODA individual donor data series fact.oda_donor_2012 ⇔ all files in https://github.com/devinit/digital-platform/tree/master/country-year/oda-donor % errors documented here: https://github.com/devinit/digital-platform/issues/247
  3. World Bank (WB) World Development Indicators (WDI) based data series 1) fact.gdp_usd_current ⇔ https://github.com/devinit/digital-platform/tree/master/country-year/gdp-usd-current.csv 2) fact.gni_pc_usd_current ⇔ https://github.com/devinit/digital-platform/tree/master/country-year/gni-pc-usd-current.csv 3) fact.gni_usd_current ⇔ https://github.com/devinit/digital-platform/tree/master/country-year/gni-usd-current.csv 4) fact.income_share_bottom_20pc ⇔ https://github.com/devinit/digital-platform/tree/master/country-year/income-share-bottom-20pc.csv 5) fact.income_share_by_quintile ⇔ https://github.com/devinit/digital-platform/tree/master/country-year/income-share-by-quintile.csv 6) fact.life_expectancy_at_birth ⇔ https://github.com/devinit/digital-platform/tree/master/country-year/life-expectancy-at-birth.csv 7) fact.maternal_mortality ⇔ https://github.com/devinit/digital-platform/tree/master/country-year/maternal-mortality.csv 8) fact.population_total ⇔ https://github.com/devinit/digital-platform/tree/master/country-year/population-total.csv 9) fact.population_rural ⇔ https://github.com/devinit/digital-platform/tree/master/country-year/population-rural.csv 10) fact.population_urban ⇔ https://github.com/devinit/digital-platform/tree/master/country-year/population-urban.csv 11) fact.population_rural_urban ⇔ https://github.com/devinit/digital-platform/tree/master/country-year/population-rural-urban.csv 12) fact.population_by_age ⇔ https://github.com/devinit/digital-platform/tree/master/country-year/population-by-age.csv 13) fact.population_by_age_0_14 ⇔ https://github.com/devinit/digital-platform/tree/master/country-year/population-0-14.csv 14) fact.population_by_age_15_64 ⇔ https://github.com/devinit/digital-platform/tree/master/country-year/population-15-64.csv 15) fact.population_by_age_65_and_above ⇔ https://github.com/devinit/digital-platform/tree/master/country-year/population-65-.csv % errors documented here: https://github.com/devinit/digital-platform/issues/246

We have tried to keep the names of the tables the same as the name of the .csv files with the following exceptions:

no .csv file name ddw table name
1 oda.csv fact.oda_2012
2 oda-donor/oda-[id_from].csv fact.oda_donor_2012 WHERE from_di_id = 'id_from'
3 population-0-14.csv fact.population_by_age_0_14
4 population-15-64.csv fact.population_by_age_15_64
5 population-65-.csv fact.population_by_age_65_and_above
6 total-revenue-pct-GDP.csv data_series.total_revenue_pct_gdp
7 total-revenue-PPP-capita.csv data_series.total_revenue_ppp_capita

In 6 & 7 we've changed uppercase to lowercase only.

It is my understanding that 205 of the .csv files in https://github.com/devinit/digital-platform/tree/master/country-year/ are actually used by the DH. So far, we have automated 77 of these (38%). Note that fact.oda_donor_2012 accounts for 62 of these alone. Because we want the DH to use the DDW as its data source and sever the link with the GitHub repository, we read the remaining 128 .csv files into tables.

Therefore, the above data checks concern only the automated data series i.e., the 77 & they all check out.

As the remaining tables were created from the .csv files without applying any changes to them, there is not need to data check them. With those, the only possible source of error is me having read in the wrong file into a table.

Note there is a number of outstanding, unresolved issues concerning the development of the DDW which affect its contents. They are:

1) https://github.com/devinit/ddw-data/issues/162. This affects fact.oda_2012 and leads to 25 rows where 'channel' = NULL. 2) https://github.com/devinit/ddw-data/issues/159. This affects the conversion of values given in current prices to to constant prices. The conversion rules were changed with a very short notice & I have not yet been implemented. 3) The new price conversion rules rely on World Bank (WB) delfators which have not yet been implemented (https://github.com/devinit/ddw-data/issues/135). 4) @timstrawson is to provide updated .csv files with 2014 data to be used with non automated tables concerning Wider Resource Flows. 5) @elmetodo is to provide updated .csv files with 2014 data to be used with non automated tables concerning Domestic Resources.

With all this in mind, we may not be able to updated all of the tables that are not automated and some may not contain 2014 data until a later, as of yet unspecified date.

There are two outstanding enhancements/data checks:

1) https://github.com/devinit/ddw-data/issues/151. Investigating error intriduced by rounding, making sure total $ ODA is the same in fact.oda & fact.oda_donor. 2) https://github.com/devinit/ddw-data/issues/152. Investigating missing number of people.

robtew commented 8 years ago

I have checked part 1 (ODA series with unbundled aid) and can confirm that this is all OK. @dw8547 and I have also checked the current/constant price conversion which also works fine.

robtew commented 8 years ago

There is a potential issue with part 2. If these are the files used to drive the ODA part of the resource bar chart in the donor profiles, then they have to include multilateral as well as bilateral ODA. The current files only include bilateral ODA. We therefore need to clarify which visualisation(s) these individual donor files are being used for and re-do with multilateral ODA if necessary. This is urgent.

robtew commented 8 years ago

I have checked and the files from part 2 do only need to count bilateral ODA, so these are fine after all.

robtew commented 8 years ago

Have now checked the WDI-sourced files from part 3 of this issue. I can confirm that these files are also OK. Therefore all files named in this issue can be signed off.

dw8547 commented 8 years ago

Hi @xriss, @notshi, @AnnaAbuhelal, @RebeccaHills, @timstrawson, @robtew & @bill-anderson.

@robtew has now checked & OKed all of the automated data series in ddw. I have therefore uploaded an updated ddw DB to 213.168.251.124 with all the automated tables which are to be used in the file replacements as in https://github.com/devinit/digital-platform/issues/248#issue-131321023 & issue #245.

We have also updated 5 of the static tables in schema data_series. They are:

1) domestic 2) gov_revenue_pc_gdp 3) grants_pct_totalrevenue 4) intl_flows_donors 5) intl_flows_recipeints

@xriss & @notshi, please get in touch if there are any problems.

notshi commented 8 years ago

Thanks, @dw8547 will do.

dw8547 commented 7 years ago

Dead issue.