bgrayburn closed 9 years ago
Great to have this - thank-you.
As with population, any idea why there are so many changes to the file (so many that it is actually impossible to view the diff)?
Yeah, I was hoping to see the diff too. I'm guessing the large diff numbers are double counting lines (removing and then adding the same lines); I've seen this happen before (in smaller numbers) with code. Since rerunning these files, I've used the new data (both GDP and population) as inputs to some old code successfully, so I'm confident the formatting is at least correct.
It is also possible the population estimates were updated by the World Bank, but this seems unlikely to account for the magnitude of changes reported.
@bgrayburn generally we should still not get so many changes. My one thought is that it has to do with line endings or similar (e.g. we have accidentally ended up switching from Unix line endings to DOS ones).
What does git diff show you? Also the tip at the bottom of http://data.okfn.org/doc/csv may be useful.
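One way to check the line-ending theory is to count the terminators in each version of the file. This is a minimal sketch: the function name `count_line_endings`, the header row, and the sample bytes are assumptions made for illustration, not the real contents of the repo's files.

```python
from collections import Counter


def count_line_endings(data: bytes) -> Counter:
    """Count CRLF, bare LF and bare CR line terminators in raw bytes."""
    crlf = data.count(b"\r\n")
    counts = Counter()
    counts["crlf"] = crlf
    counts["lf"] = data.count(b"\n") - crlf  # LF not preceded by CR
    counts["cr"] = data.count(b"\r") - crlf  # CR not followed by LF
    return counts


# Hypothetical samples: same shape of data, different line endings.
unix_sample = b"Country Name,Country Code,Year,Value\nArgentina,ARG,1994,257439956992\n"
dos_sample = b"Country Name,Country Code,Year,Value\r\nArgentina,ARG,1994,257440000000\r\n"

unix_counts = count_line_endings(unix_sample)
dos_counts = count_line_endings(dos_sample)
print(dict(unix_counts))
print(dict(dos_counts))
```

To run it on a real file, read the file in binary mode (for example `open(path, "rb").read()`) so that Python does not translate the line endings before they are counted. If the old and new versions report different dominant terminators, every line will show up as changed in a plain line-based diff.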
thanks! that diff tip was awesome, I've always loathed using git for data for this very reason!
here's a snippet of that diff (I scrolled to a country instead of a region to try to validate using google):
Argentina,ARG,1991,[-189719989668.103^M-]{+189719990237.65^M+}
Argentina,ARG,1992,[-228779382768.151^M-]{+228779383808.031^M+}
Argentina,ARG,1993,[-236753563469.871^M-]{+236753564542.77^M+}
Argentina,ARG,1994,[-257439956992^M-]{+257440000000^M+}
Argentina,ARG,1995,[-258031878144^M-]{+258031750000^M+}
Argentina,ARG,1996,[-272149757952^M-]{+272149750000^M+}
Argentina,ARG,1997,[-292858888192^M-]{+292859000000^M+}
Argentina,ARG,1998,[-298948362240^M-]{+298948250000^M+}
Argentina,ARG,1999,[-283523022848^M-]{+283523000000^M+}
Argentina,ARG,2000,[-284203745280^M-]{+284203750000^M+}
Argentina,ARG,2001,[-268696715264^M-]{+268696750000^M+}
Argentina,ARG,2002,[-102040334258.58^M-]{+102040287018.716^M+}
Argentina,ARG,2003,[-129597103033.807^M-]{+129597154100.492^M+}
Argentina,ARG,2004,[-153129481873.143^M-]{+183295704170.331^M+}
Argentina,ARG,2005,[-183193408940.742^M-]{+222910837452.42^M+}
Argentina,ARG,2006,[-214066231201.821^M-]{+263042487638.256^M+}
Argentina,ARG,2007,[-260768703249.434^M-]{+329761479745.779^M+}
Argentina,ARG,2008,[-326582808527.135^M-]{+406003733991.082^M+}
Argentina,ARG,2009,[-307155148184.324^M-]{+378506370535.235^M+}
Argentina,ARG,2010,[-368736062143.669^M-]{+462843782844.29^M+}
Argentina,ARG,2011,[-446044143596.268^M-]{+559849040366.128^M+}
At first glance it looks like the World Bank is changing their estimates. The disconcerting part to me is that a quick google for "population Argentina 2001" shows a population of 40.73mil based on World Bank numbers. This doesn't jibe with the old or new data here. Any idea what's going on?
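For a data file, a keyed comparison can be easier to read than a text diff. The sketch below matches rows on (country code, year) and reports the relative change in each value. It uses a handful of the Argentina rows from the snippet above; the helper names `load_values` and `relative_changes` are made up for this example, and it assumes headerless four-column rows.

```python
import csv
import io


def load_values(text):
    """Map (country_code, year) -> value from headerless 4-column CSV text."""
    rows = csv.reader(io.StringIO(text))
    return {(code, int(year)): float(value) for _name, code, year, value in rows}


def relative_changes(old, new):
    """Relative change (new - old) / old for keys present in both mappings."""
    shared = old.keys() & new.keys()
    return {k: (new[k] - old[k]) / old[k] for k in shared if old[k] != 0}


# A few rows copied from the diff snippet (old values, then new values).
OLD = """Argentina,ARG,1991,189719989668.103
Argentina,ARG,1994,257439956992
Argentina,ARG,2004,153129481873.143
Argentina,ARG,2011,446044143596.268
"""
NEW = """Argentina,ARG,1991,189719990237.65
Argentina,ARG,1994,257440000000
Argentina,ARG,2004,183295704170.331
Argentina,ARG,2011,559849040366.128
"""

changes = relative_changes(load_values(OLD), load_values(NEW))
for key in sorted(changes):
    print(key, f"{changes[key]:+.6%}")
```

On these rows the 1991 and 1994 values move by a tiny fraction of a percent, while 2004 and 2011 move by roughly a fifth to a quarter. That separates what look like rounding or precision differences from substantive revisions.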
This is the gdp repo - population is the other one. My guess here is that this is "real" GDP and they are deflating the data with a different base year on each release (a common issue). For more recent years these may actually be adjustments, but some of the adjustments are pretty massive. Hmmm.
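A rough way to probe the base-year guess: if the only change were a rebasing applied with a single deflator, the new series would be approximately a constant multiple of the old one. The sketch below checks that on a few of the Argentina values from the diff. It is a simplification (chain-linked series need not rescale by exactly one factor), so treat it as a sanity check rather than a definitive test.

```python
# Old and new values for selected years, copied from the diff snippet.
old = {
    1991: 189719989668.103,
    2003: 129597103033.807,
    2004: 153129481873.143,
    2008: 326582808527.135,
    2011: 446044143596.268,
}
new = {
    1991: 189719990237.65,
    2003: 129597154100.492,
    2004: 183295704170.331,
    2008: 406003733991.082,
    2011: 559849040366.128,
}

# Under a pure single-factor rebasing these ratios would all be about equal.
ratios = {year: new[year] / old[year] for year in sorted(old)}
spread = max(ratios.values()) - min(ratios.values())

for year, ratio in ratios.items():
    print(year, f"{ratio:.6f}")
print("spread:", f"{spread:.6f}")
```

For these values the ratio stays very close to 1 through 2003 and sits well above 1 from 2004 onward, which a single rescaling factor would not produce.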
whoops! sorry, multitasking is bad. The newer numbers align more closely with what google suggests for several years of Argentine GDP. Can't say I can explain the big change between datasets though. Will continue pondering; I haven't found any mention of the update in their list of updates here, but there's a lot there.
Just shot an email to data@worldbank.org asking for help. Will update.
OK, good to ask them. In the meantime, I think it's good to merge, though is there a chance you could give me a bit more detail for the commit message - e.g. does this cover new years or ...
This includes dates up to 2014 and (speculatively) updates historical values
@bgrayburn what year did we have before (i.e. up until what year was the data before this update)?
What I meant was: before you did the update what was the latest date we had? I.e. what new years are there now with the update?
sorry, that 1. was supposed to read 2010, thanks for the patience. To be clear, 2010 was the latest date prior to this pull request
world bank got back to me. I asked: "Sorry for the basic question, but I was just comparing a pull of GDP values from World Bank made in July of last year to a pull made now. It seems as though a large number of historical values have changed. Is this due to a change in how estimates are made? I've looked here (http://data.worldbank.org/about/data-updates-errata) but couldn't find any specific mention of updating gdp values.
For example: Argentina 2004: pull from last year says GDP is 153129481873.143 pull from now says GDP is 183295704170.331
Any idea why the discrepancy?"
they responded: "All data in the WDI database is reviewed and revised (where necessary) each quarter. Revisions are due to new data coming in, usually. We advise users to go with the most recent dataset as it is the most accurate. I hope this helps."
not the most specific answer, but it makes sense as an explanation. I'm not going to follow up on population because their answer seems to cover that dataset as well.
But has old data changed too? That would be kind of odd if 1960 data had changed ...
Old data has changed; indeed, I am seeing changes back to 1960 (e.g. Caribbean small states in 1960, old reported GDP: 1859229137.12282, new reported GDP: 1859229127.1255)
My suspicion would be that the underlying way GDP is calculated(/estimated) has changed over time, and to keep the dataset consistent, old values are revised, or as the world bank's response indicates, new data about past years has been used to adjust historical values.
@bgrayburn would it be worth asking them about that?
@rgrp No problem, I will specifically mention the case I just mentioned to you and also ask if the underlying way GDP is calculated has changed, but I think they've already expressed that new data can impact previous estimates.
@rgrp World Bank response: "Yes, all historical data is also frequently revised. Calculation does change sometimes; for example the base year used for currencies can change (last year we switched to 2005 constant USD as a base year for example). You can find this information in the Data Updates and Errata page, or in the metadata (downloadable from DataBank)."
Thanks @bgrayburn :-) Please keep 'em coming and great work digging into the changes and getting clarity!
running process.py to update gdp.csv