datasets / gdp

Country, regional and world GDP in current US Dollars ($)
https://datahub.io/core/gdp

running process.py to update gdp.csv #2

Closed bgrayburn closed 9 years ago

bgrayburn commented 9 years ago

running process.py to update gdp.csv

rufuspollock commented 9 years ago

Great to have this - thank-you.

As with population, any idea why there are so many changes to the file (so many that it is actually impossible to view the diff)?

bgrayburn commented 9 years ago

Yeah, I was hoping to see the diff too. I'm guessing the large diff numbers are double-counting lines (removing then adding the same lines); I've seen this happen before (in smaller numbers) with code. Since rerunning these files, I've used the new data (both GDP and population) as inputs to some old code successfully, so I'm confident the formatting is at least correct.

It is also possible the population estimates were updated by the World Bank, but this seems unlikely to account for the magnitude of changes reported.

On Aug 17, 2015 05:10, "Rufus Pollock" notifications@github.com wrote:

Great to have this - thank-you.

As per population any idea why so many changes to the file (so many that it is actually impossible to view the diff).

— Reply to this email directly or view it on GitHub https://github.com/datasets/gdp/pull/2#issuecomment-131739482.

rufuspollock commented 9 years ago

@bgrayburn generally we should still not get so many changes. My one thought is that it has to do with line endings or similar (e.g. we have accidentally ended up switching from Unix line endings to DOS).

What does git diff show you? Also the tip at the bottom of http://data.okfn.org/doc/csv may be useful.
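
One way to test the line-ending theory directly is to count each kind of line ending in the raw bytes of a file. This is only a minimal sketch: the `count_line_endings` helper and the two sample byte strings are illustrative assumptions and are not part of process.py.

```python
def count_line_endings(data: bytes) -> dict:
    """Count CRLF (DOS), bare LF (Unix) and bare CR line endings in raw bytes."""
    crlf = data.count(b"\r\n")
    lf = data.count(b"\n") - crlf   # LF not preceded by CR
    cr = data.count(b"\r") - crlf   # CR not followed by LF
    return {"crlf": crlf, "lf": lf, "cr": cr}


# Hypothetical two-row samples standing in for two versions of gdp.csv.
unix_sample = b"Argentina,ARG,1999,283523022848\nArgentina,ARG,2000,284203745280\n"
dos_sample = unix_sample.replace(b"\n", b"\r\n")

print("unix:", count_line_endings(unix_sample))
print("dos: ", count_line_endings(dos_sample))
```

For a real file, pass the result of `open(path, "rb").read()`. If the counts move from `lf` to `crlf` between two versions, every line will show up as changed even when the values are identical.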

bgrayburn commented 9 years ago

thanks! that diff tip was awesome, I've always loathed using git for data for this very reason!

here's a snippet of that diff (I scrolled to a country instead of a region to try to validate using google):

Argentina,ARG,1991,[-189719989668.103^M-]{+189719990237.65^M+}
Argentina,ARG,1992,[-228779382768.151^M-]{+228779383808.031^M+}
Argentina,ARG,1993,[-236753563469.871^M-]{+236753564542.77^M+}
Argentina,ARG,1994,[-257439956992^M-]{+257440000000^M+}
Argentina,ARG,1995,[-258031878144^M-]{+258031750000^M+}
Argentina,ARG,1996,[-272149757952^M-]{+272149750000^M+}
Argentina,ARG,1997,[-292858888192^M-]{+292859000000^M+}
Argentina,ARG,1998,[-298948362240^M-]{+298948250000^M+}
Argentina,ARG,1999,[-283523022848^M-]{+283523000000^M+}
Argentina,ARG,2000,[-284203745280^M-]{+284203750000^M+}
Argentina,ARG,2001,[-268696715264^M-]{+268696750000^M+}
Argentina,ARG,2002,[-102040334258.58^M-]{+102040287018.716^M+}
Argentina,ARG,2003,[-129597103033.807^M-]{+129597154100.492^M+}
Argentina,ARG,2004,[-153129481873.143^M-]{+183295704170.331^M+}
Argentina,ARG,2005,[-183193408940.742^M-]{+222910837452.42^M+}
Argentina,ARG,2006,[-214066231201.821^M-]{+263042487638.256^M+}
Argentina,ARG,2007,[-260768703249.434^M-]{+329761479745.779^M+}
Argentina,ARG,2008,[-326582808527.135^M-]{+406003733991.082^M+}
Argentina,ARG,2009,[-307155148184.324^M-]{+378506370535.235^M+}
Argentina,ARG,2010,[-368736062143.669^M-]{+462843782844.29^M+}
Argentina,ARG,2011,[-446044143596.268^M-]{+559849040366.128^M+}
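
The `[-old-]{+new+}` markers above are git's word-diff notation, and `^M` is a carriage return; it appears on both the removed and the added side of every row. As a rough sketch, the size of each revision can be computed from such lines. The regular expression and the `parse_row` helper are assumptions made for illustration, and the three sample rows are copied from the snippet above.

```python
import re

# One changed CSV row in git's plain word-diff output: name,code,year,[-old-]{+new+}
ROW = re.compile(
    r"^(?P<name>.+),(?P<code>[A-Z]{3}),(?P<year>\d{4}),"
    r"\[-(?P<old>.+?)-\]\{\+(?P<new>.+?)\+\}$"
)


def parse_row(line):
    """Return (code, year, old, new, relative_change), or None if the line does not match."""
    m = ROW.match(line.strip())
    if m is None:
        return None
    # "^M" is the carriage return as rendered in the pasted snippet; drop it before parsing.
    old = float(m["old"].replace("^M", ""))
    new = float(m["new"].replace("^M", ""))
    return m["code"], int(m["year"]), old, new, (new - old) / old


# Three rows copied from the snippet above.
SNIPPET = """\
Argentina,ARG,1991,[-189719989668.103^M-]{+189719990237.65^M+}
Argentina,ARG,1994,[-257439956992^M-]{+257440000000^M+}
Argentina,ARG,2004,[-153129481873.143^M-]{+183295704170.331^M+}
"""

rows = [parse_row(line) for line in SNIPPET.splitlines()]
for code, year, old, new, rel in rows:
    print(f"{code} {year}: {rel:+.6%}")
```

Sorting the parsed rows by the absolute relative change separates rounding-level differences from the much larger revisions in later years.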

At first glance it looks like the World Bank is changing its estimates. The disconcerting part to me is that a quick Google search for "population Argentina 2001" shows a population of 40.73 million based on World Bank numbers. This doesn't jibe with the old or new data here. Any idea what's going on?

rufuspollock commented 9 years ago

This is the gdp repo - population is the other one. My guess here is that this is "real" GDP and they are deflating the data with a different base year on each release (a common issue). For more recent years these may actually be adjustments, but some of the adjustments are pretty massive. Hmmm.

bgrayburn commented 9 years ago

Whoops! Sorry, multitasking is bad. The newer numbers align more closely with what Google suggests for several years of Argentine GDP. I can't explain the big change between datasets though. Will continue pondering; I haven't found any mention of the update in their list of updates here, but there's a lot there.

Just shot an email to data@worldbank.org asking for help. Will update.

rufuspollock commented 9 years ago

OK, good to ask them. In the meantime, I think it is good to merge, though is there a chance you could give me a bit more detail for the commit message - e.g. does this cover new years or ...

bgrayburn commented 9 years ago

This includes dates up to 2014 and (speculatively) updates historical values

rufuspollock commented 9 years ago

@bgrayburn what year did we have before (i.e. up until what year was the data before this update)?

bgrayburn commented 9 years ago

1. Min year for both is 1960

rufuspollock commented 9 years ago

What I meant was: before you did the update what was the latest date we had? I.e. what new years are there now with the update?

bgrayburn commented 9 years ago

Sorry, that "1." was supposed to read 2010; thanks for the patience. To be clear, 2010 was the latest date prior to this pull request.
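
The question of which years an update adds can also be answered mechanically by comparing the set of years in the old and new files. The sketch below assumes a `Year` column and a `Country Name,Country Code,Year,Value` header. The rows are hypothetical placeholders (not real GDP values) chosen only to mirror the years mentioned in this thread.

```python
import csv
import io


def years_in(csv_text, year_column="Year"):
    """Return the set of years found in the given CSV text."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return {int(row[year_column]) for row in reader}


# Hypothetical placeholder rows; the header layout is an assumption.
HEADER = "Country Name,Country Code,Year,Value\n"
old_csv = HEADER + "Example,EXM,1960,0\nExample,EXM,2010,0\n"
new_csv = HEADER + "Example,EXM,1960,0\nExample,EXM,2010,0\nExample,EXM,2014,0\n"

old_years = years_in(old_csv)
new_years = years_in(new_csv)

print("old range:", min(old_years), "to", max(old_years))
print("new range:", min(new_years), "to", max(new_years))
print("years added:", sorted(new_years - old_years))
```

Running the same comparison on the file contents before and after the update gives the detail needed for a commit message.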

bgrayburn commented 9 years ago

world bank got back to me. I asked: "Sorry for the basic question, but I was just comparing a pull of GDP values from World Bank made in July of last year to a pull made now. It seems as though a large number of historical values have changed. Is this due to a change in how estimates are made? I've looked here (http://data.worldbank.org/about/data-updates-errata) but couldn't find any specific mention of updating gdp values.

For example, Argentina 2004: the pull from last year says GDP is 153129481873.143; the pull from now says GDP is 183295704170.331.

Any idea why the discrepancy?"

they responded: "All data in the WDI database is reviewed and revised (where necessary) each quarter. Revisions are due to new data coming in, usually. We advise users to go with the most recent dataset as it is the most accurate. I hope this helps."

Not the most specific answer, but it makes sense as an explanation. I'm not going to follow up on population because their answer seems to cover that dataset as well.

rufuspollock commented 9 years ago

But has old data changed too? That would be kind of odd if 1960 data had changed ...

bgrayburn commented 9 years ago

Old data has changed; indeed, I am seeing changes back to 1960 (e.g. Caribbean small states in 1960: old reported GDP 1859229137.12282, new reported GDP 1859229127.1255).

My suspicion would be that the underlying way GDP is calculated (or estimated) has changed over time and, to keep the dataset consistent, old values are revised; or, as the World Bank's response indicates, new data about past years has been used to adjust historical values.
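
To separate tiny revisions like the 1960 example from substantive ones like Argentina 2004, one option is a relative-tolerance comparison. This is a sketch only: the `is_material` helper and the `1e-6` threshold are arbitrary assumptions, and the four values are the ones quoted in this thread.

```python
import math


def is_material(old, new, rel_tol=1e-6):
    """True when old and new differ by more than rel_tol relative to the larger value."""
    return not math.isclose(old, new, rel_tol=rel_tol)


# (label, year, old value, new value), all taken from the comments in this thread.
examples = [
    ("Caribbean small states", 1960, 1859229137.12282, 1859229127.1255),
    ("Argentina", 2004, 153129481873.143, 183295704170.331),
]

for label, year, old, new in examples:
    rel = (new - old) / old
    verdict = "material" if is_material(old, new) else "negligible"
    print(f"{label} {year}: relative change {rel:+.3e} ({verdict})")
```

With this threshold the 1960 change counts as negligible and the 2004 change as material; a different tolerance would move that boundary.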

rufuspollock commented 9 years ago

@bgrayburn would it be worth asking them about that?

bgrayburn commented 9 years ago

@rgrp No problem, I will specifically mention the case I just mentioned to you and also ask if the underlying way GDP is calculated has changed, but I think they've already expressed that new data can impact previous estimates.

bgrayburn commented 9 years ago

@rgrp World Bank response: "Yes, all historical data is also frequently revised. Calculation does change sometimes; for example the base year used for currencies can change (last year we switched to 2005 constant USD as a base year for example). You can find this information in the Data Updates and Errata page, or in the metadata (downloadable from DataBank)."

rufuspollock commented 9 years ago

Thanks @bgrayburn :-) Please keep 'em coming and great work digging into the changes and getting clarity!