PardeeCenterDU / IFs-Issues-Tracking

This repository only holds the list of bugs that have been reported for IFs. Anyone may add a bug report, but please look to see if your issue has already been added!

Historical data question #325

Open jonathandmoyer opened 2 weeks ago

jonathandmoyer commented 2 weeks ago

Hi Yutang: I have a question about the process for updating historical data, prompted by the mental health data and by a series where some historical data appear to have gone missing. Energy intensity (two variables in the plot below) used to have data going back into the 80s. You could see, for example, how energy-inefficient Soviet and Chinese production were. Do you know where those data might have gone? Can you do some digging to see if there's a previous version that had the data, which was then somehow erased? I'm sure it was erased because WDI no longer keeps it in their database, since they think the data are poor. That's fine, but I don't think we should simply erase data whenever WDI does, because we might want it for a different purpose. So the question is: when we update data and the source has changed the historical record, what do you do? And in light of the mental health question: when you see big changes to a data series that's important for the model, what do you do?

[Image: chart - 2024-07-05T111132 242]

quciet commented 2 weeks ago

Hi Jon, @jonathandmoyer

Regarding this specific data issue, I just want to make sure: the graph you are looking at is based on Energy Demand and GDP PPP, right? GDP PPP has good coverage, so I don't think that's the problem. I believe energy demand in IFs is calculated as Production + Import - Export. These tables are from IEA. I picked two conventional energy types (coal and crude oil) and checked their coverage in the latest database. For China, I do see data values from 1971 onward in these tables (EnProdOilIEA & EnProdCoalIEA). For import and export (EnImOilIEA & EnExOilIEA, for example), I also see values in the 70s. Unless energy demand is initialized through other data tables in IFs, we should have values for China all the way back to the 70s (same for Russia). This is probably not a data issue. But I could be wrong.

quciet commented 2 weeks ago

@jonathandmoyer and on your questions regarding the data pulling process, my quick answer is that we have some general rules that I will describe below but situations vary by data source/data table. Mistakes and issues happen but we try our best to document things to prevent that from happening again.

During the data vetting process, RAs check for several things:

1) Countries that do not have values at all. This might be caused by country concordance or a change made by the data source. This is fairly easy to resolve.

2) Historical years that used to have data values in the old table but not anymore in the new table (like the WDI situation you described in the ticket). If no specific reasons are provided by the data source, RAs then check the consistency between values from the old table and values from the new table. If values are generally consistent across years, then values from the old table are merged into the new table. If not, we normally overwrite the old table. An exception is when the data source changes the methodology. Then we preserve the old table (normally by adding Old as the suffix in the table name) and import the new table. Note that we did not do this for the FAO food production table even though they changed their methodology in 2013.

3) Big year-to-year jumps or big discrepancies in the same country-year between old and new data. This one is even harder to solve because some might just be data errors (percentages exceeding 100%) or some countries just have bad data. Our general rule is to take whatever is provided by the data source unless a developer is strongly against it and wants to have some values put in manually. To detect impacts on the model, we used to put the data into IFs and rebuild the base to see the changes. But I only have two people on the team now and simply don't have that capacity anymore.
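The checks in the list above could be sketched roughly as follows. This is only an illustration in Python, not the RAs' actual tooling: the function names, the 5% agreement tolerance, and the jump ratio of 2 are all assumptions, since the thread does not state what counts as "generally consistent" or as a "big" jump.

```python
from typing import Dict, List, Optional, Tuple

Series = Dict[int, Optional[float]]  # year -> value; None or a missing key means no data


def dropped_years(old: Series, new: Series) -> List[int]:
    """Years that had a value in the old table but have none in the new one."""
    return sorted(y for y, v in old.items() if v is not None and new.get(y) is None)


def is_consistent(old: Series, new: Series, rel_tol: float = 0.05) -> bool:
    """True if old and new agree within rel_tol on every year both tables cover.

    rel_tol is a hypothetical threshold chosen for this sketch.
    """
    shared = [y for y in old if old[y] is not None and new.get(y) is not None]
    if not shared:
        return False  # nothing to compare, so consistency cannot be confirmed
    return all(
        abs(new[y] - old[y]) <= rel_tol * max(abs(old[y]), abs(new[y]), 1e-12)
        for y in shared
    )


def merge_old_into_new(old: Series, new: Series, rel_tol: float = 0.05) -> Series:
    """Backfill dropped years from the old table only when the overlap agrees.

    If the overlap does not agree, the new table is returned unchanged,
    which corresponds to overwriting the old table.
    """
    merged = dict(new)
    if is_consistent(old, new, rel_tol):
        for y in dropped_years(old, new):
            merged[y] = old[y]
    return merged


def big_jumps(series: Series, max_ratio: float = 2.0) -> List[Tuple[int, int]]:
    """Pairs of successive observed years whose magnitudes differ by more than max_ratio."""
    years = sorted(y for y, v in series.items() if v is not None)
    flagged = []
    for prev, cur in zip(years, years[1:]):
        a, b = abs(series[prev]), abs(series[cur])
        lo, hi = min(a, b), max(a, b)
        if hi > 0 and (lo == 0 or hi / lo > max_ratio):
            flagged.append((prev, cur))
    return flagged


# Made-up values for illustration only.
old_table = {1985: 10.0, 1986: 10.5, 1990: 12.0, 1991: 12.4}
new_table = {1990: 12.1, 1991: 12.3}
print(dropped_years(old_table, new_table))       # [1985, 1986]
print(merge_old_into_new(old_table, new_table))  # 1985 and 1986 are backfilled
```

A flagged jump or a failed consistency check would still need a person to decide whether it is a data error, a methodology change, or a real shift.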

As for the IHME update, I failed to communicate the changes we made to the death cause mapping to the modeling pod; that's my fault. As for the tables used by historical analog, I was not aware of them and will update them. Thanks!

jonathandmoyer commented 2 weeks ago

Howdy:

Before you update them, I don't think we have an agreement on what should be included in the mental health category. I think the last email was from me to you about neurological disorders.

Hopefully Jose can look at the energy intensity question because those values had shown up previously.

Thank you

Jonathan D Moyer, PhD Associate Professor Director Frederick S. Pardee Institute for International Futures Josef Korbel School of International Studies University of Denver

PardeeCenterIFs commented 1 week ago

On the energy intensity question: it's a large formula, and one null element can null out the entire year for a specific country. In the case of China, that seems to be EnProdBiodieselIEA, which starts its data in 2005, so everything before that is null because biodiesel is null. This is the formula for ENDEM: [image]

I'm not sure when we last had data for this, but at least in the web version (8.06) we have the same situation, where data starts to show in 2005.
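To make the null propagation concrete, here is a minimal Python sketch. The actual ENDEM formula is only shown in the image, so a plain sum of components stands in for it here. The table names come from the thread, while the numbers and function names are invented for illustration.

```python
from typing import Dict, List, Optional

Series = Dict[int, Optional[float]]  # year -> value; None or a missing key means null


def strict_sum(components: Dict[str, Series], year: int) -> Optional[float]:
    """Sum all components for a year; any null component makes the result null."""
    values = [series.get(year) for series in components.values()]
    if any(v is None for v in values):
        return None
    return sum(values)


def blocking_components(components: Dict[str, Series], year: int) -> List[str]:
    """Names of the components that have no value for the given year."""
    return sorted(name for name, s in components.items() if s.get(year) is None)


# Illustrative values only; a simplified stand-in for the real formula.
tables = {
    "EnProdOilIEA": {2004: 3.0, 2005: 3.1},
    "EnProdCoalIEA": {2004: 8.0, 2005: 8.4},
    "EnProdBiodieselIEA": {2005: 0.01},  # series starts in 2005
}

print(strict_sum(tables, 2004))           # None, because biodiesel has no 2004 value
print(blocking_components(tables, 2004))  # ['EnProdBiodieselIEA']
print(strict_sum(tables, 2005))           # a number, since every component is present
```

Listing the blocking components per year is a quick way to see which table sets the first year for which the whole formula has data.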

quciet commented 1 week ago

Thanks José! Do you think we can update the function to treat nulls as 0s in this energy demand formula?

PardeeCenterIFs commented 1 week ago

This is how it looks using nulls as 0s for that formula. First, ENDEM: [image: ENDEMhf]

Second, using flex displays as in the first image: [image: ENDEMRelGDPhf]

quciet commented 1 week ago

I'm not seeing imgs.

PardeeCenterIFs commented 1 week ago

Having trouble uploading them, still trying.

PardeeCenterIFs commented 1 week ago

Now you should see them. For some reason, it wouldn't let me upload them while I was on VPN.

PardeeCenterIFs commented 1 week ago

Jonathan, could you please confirm you're OK with this change so that I can make it permanent for the next installation?

jonathandmoyer commented 1 week ago

Which country is the blue energy intensity dot at the top in 2020?

PardeeCenterIFs commented 1 week ago

That is Turkmenistan at 7% of GDP.

jonathandmoyer commented 1 week ago

OK--I guess the only way to resolve some of these issues (the transients) is to make a broad estimate to fill in nulls, right? It looks like that is perhaps the Soviet Union in the blue line for energy demand in the 80s? Is that actual energy being removed, or a grouping issue (more countries in the Soviet Union than just Russia)?

PardeeCenterIFs commented 1 week ago

If you mean the option that we have in the menu to fill holes, that only works for groups, which is not the case for the graph we're showing here. If you mean replacing nulls with estimates in the data, I guess that would do it, but we probably don't want to do that.

Yes, that's the Soviet Union/Russia in the blue line. I guess the question about energy being removed versus groupings is a data question for Yutang.

quciet commented 1 week ago

Hi @PardeeCenterIFs @jonathandmoyer, the table below shows the total energy production for all USSR countries (the column on the right is the sum of the 15 countries). As you can see, the grouping issue happened between 1989 and 1990: total energy production from the USSR was distributed across 15 countries during the dissolution, while the sum remained almost the same. Russia's production then went down in the following years as well. [image]

Following the energy demand formula, I then checked energy trade. As you might expect, Russia's net energy exports surged from 1989 to 1990 and then gradually went down. [image]
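One way to separate a grouping effect from a real drop is to compare the change in a single member's series with the change in the group total over the same years. The Python sketch below uses invented numbers and three placeholder members rather than the 15 actual countries. It also assumes, as the discussion suggests, that the pre-1990 total is recorded under one member.

```python
from typing import Dict, Optional

Series = Dict[int, Optional[float]]  # year -> value; None or a missing key means null


def group_total(members: Dict[str, Series], year: int) -> Optional[float]:
    """Sum of all members that report a value in that year; None if none do."""
    present = [s[year] for s in members.values() if s.get(year) is not None]
    return sum(present) if present else None


def pct_change(before: Optional[float], after: Optional[float]) -> Optional[float]:
    """Percentage change from before to after; None if it cannot be computed."""
    if before is None or after is None or before == 0:
        return None
    return 100.0 * (after - before) / before


# Hypothetical values: the whole pre-1990 total sits under "Russia",
# then is split across members from 1990 onward.
members = {
    "Russia": {1989: 100.0, 1990: 60.0},
    "MemberB": {1990: 25.0},
    "MemberC": {1990: 13.0},
}

russia_only = pct_change(members["Russia"][1989], members["Russia"][1990])
whole_group = pct_change(group_total(members, 1989), group_total(members, 1990))
print(russia_only)  # -40.0
print(whole_group)  # -2.0
```

A large drop for the single member alongside a small change in the group total points to a redistribution across the group rather than energy disappearing.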

jonathandmoyer commented 1 week ago

Are you saying you just made this change? Or something else?

quciet commented 1 week ago

Is this for me or José? I was just checking the data for Russia during 1985-1995 in case you wanted to know what the underlying data behind the blue line are. I did not make any changes to the model or data.

jonathandmoyer commented 1 week ago

I see, I totally misread this.

I'm still on the fence about including this because of the complexity of reconciling these different series with nulls and geographic transitions (Soviet to Russia). But I think if it's used appropriately, it's beneficial. Perhaps we just include it and revisit the complicated bits at a later stage.

quciet commented 1 week ago

Understood. One thing I can think of is that, if this is included, we need to make sure that countries with no values at all yield nulls instead of 0s. But of course, there could be other complications (hopefully not for this formula....)

PardeeCenterIFs commented 1 week ago

That's currently not possible. It's either nulls as 0s or nulls that null out the entire formula for that year; I don't have an easy way to check whether all members of the formula are null before making the change.
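For reference, the three behaviors discussed in this thread can be written out as follows. This is a Python sketch of the logic only, not of how IFs evaluates its formulas. The `combine` function and the policy names are invented here, and a plain sum stands in for the real formula.

```python
from typing import Optional, Sequence


def combine(values: Sequence[Optional[float]], policy: str) -> Optional[float]:
    """Sum one country-year's formula members under a chosen null policy.

    "strict":               any null member makes the result null
    "nulls_as_zero":        null members count as 0, so all-null gives 0
    "zero_unless_all_null": null members count as 0, but all-null stays null
    """
    present = [v for v in values if v is not None]
    if policy == "strict":
        return sum(present, 0.0) if len(present) == len(values) else None
    if policy == "nulls_as_zero":
        return sum(present, 0.0)
    if policy == "zero_unless_all_null":
        return sum(present, 0.0) if present else None
    raise ValueError(f"unknown policy: {policy}")


partly_reported = [3.0, None, 1.0]  # one member missing
nothing_reported = [None, None, None]  # country with no values at all

for policy in ("strict", "nulls_as_zero", "zero_unless_all_null"):
    print(policy, combine(partly_reported, policy), combine(nothing_reported, policy))
```

Only the third policy keeps a country with no values at all as null while still filling partial gaps with 0, which is the case raised above.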