ProjectDrawdown / solutions

The mission of Project Drawdown is to help the world reach “Drawdown”— the point in the future when levels of greenhouse gases in the atmosphere stop climbing and start to steadily decline, thereby stopping catastrophic climate change — as quickly, safely, and equitably as possible.
https://www.drawdown.org/

Pandas 1.0 support #4

Closed DentonGentry closed 4 years ago

DentonGentry commented 4 years ago

For most of the time spent developing with Pandas, we have been using 0.25.x releases. Pandas 1.0 is now available and has turned a number of deprecation warnings into errors. We've corrected most of them, but at least a few errors remain in model/tam.py. Using conda or pipenv to switch back and forth between 0.25.x and 1.0.x should allow the remaining hiccups with Pandas 1.0 support to be resolved.

DentonGentry commented 4 years ago

The main problem remaining appears to be due to regional data sources.

data_sources = {
    'Baseline Cases': {
        'A': str(datadir.joinpath('tam_all_one.csv')),
        'B': str(datadir.joinpath('tam_all_two.csv')),
    },
    'Region: OECD90': {
        'Baseline Cases': {
            'C': str(datadir.joinpath('tam_all_three.csv')),
        },
    },
}

model/interpolation.py:matching_data_sources() for "ALL SOURCES" returns ["A", "B"] as it should, but it also looks inside "Region: OECD90" and returns "Baseline Cases". There is no source named Baseline Cases, so Pandas raises a KeyError when that non-existent column name is passed to .loc[].
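The failure mode can be reproduced in isolation. The snippet below is a minimal sketch, not code from the repository: the DataFrame `df` and its values are made up for illustration, with one column per source name, and it assumes Pandas 1.0 or later is installed.

```python
import pandas as pd

# Hypothetical stand-in for a table with one column per data source.
df = pd.DataFrame({"A": [1.0, 2.0], "B": [3.0, 4.0]}, index=[2014, 2015])

# Selecting only real source names works.
selected = df.loc[:, ["A", "B"]]

# Including a group name that is not a column raises KeyError on Pandas >= 1.0.
try:
    df.loc[:, ["A", "B", "Baseline Cases"]]
    missing_label_raised = False
except KeyError as err:
    missing_label_raised = True
    print("KeyError:", err)
```

Any list passed to .loc[] therefore has to contain only names that exist as columns.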

tpltnt commented 4 years ago

Is there a test case with the expected behaviour so we can see/check when the issue is solved?

tpltnt commented 4 years ago

I looked at the documentation of matching_data_sources(). It states that "data_sources: a dict() of group names which contain dicts of data source names." An example is

           {
             'Ambitious Cases': {'Study Name A': 'filename A', 'Study Name B': 'filename B', ...},
             'Baseline Cases': {'Study Name C': 'filename C', 'Study Name D': 'filename D', ...},
             'Conservative Cases': {'Study Name E': 'filename E', 'Study Name F': 'filename F', ...},
           }

The keys of the outer dict are the names of groups (e.g. 'Ambitious Cases'). The keys of the inner dicts are the names of individual data sources (e.g. 'Study Name B' in the group 'Ambitious Cases'). I am trying to build test cases, so I wonder:

  1. What is 'Region: OECD90' in the example given by @DentonGentry?
  2. Is it another group? Or a data source?
  3. Why the change/shift in hierarchy?
  4. Is the (JSON) format documented somewhere?

Any help/hint is appreciated.

DentonGentry commented 4 years ago

The initial implementation of the data sources was just one level of hierarchy of groups:

{
    'Ambitious Cases': {'Study Name A': 'filename A', 'Study Name B': 'filename B', ...},
    'Baseline Cases': {'Study Name C': 'filename C', 'Study Name D': 'filename D', ...},
    'Conservative Cases': {'Study Name E': 'filename E', 'Study Name F': 'filename F', ...},
}

A particular solution, like Solar Farms, might use the Ambitious Cases data sources because solar photovoltaics have grown robustly for many years. A solution like Concentrated Solar (i.e. mirrors focussing the sun on a reservoir of molten salt) might use the Conservative Cases to reflect that the growth rate has not been so strong.

The model produces outputs at the level of:

In the initial implementation it was assumed that the set of sources would be the same for all regions, so that 'Study Name A', for example, would be a source used in every region. For the energy solutions this was true: the IEA and other sources produce results for all regions of the world.

However, this isn't true for all solutions, particularly agricultural solutions. Some use one set of sources for the world, a different set of sources for the Middle East and Africa, etc.

Additionally, though the model currently has its regions at the scale of the whole planet, there is a desire to not hard-code this and allow the model to eventually support running where, for example, the US or India is the top level and the major regions are within that country.

So the implementation added a level of hierarchy:

{
    'Ambitious Cases': {'Study Name A': 'filename A', 'Study Name B': 'filename B', ...},
    'Baseline Cases': {'Study Name C': 'filename C', 'Study Name D': 'filename D', ...},
    'Conservative Cases': {'Study Name E': 'filename E', 'Study Name F': 'filename F', ...},
    'Region: OECD90': {
        'Ambitious Cases': {'Study Name A': 'filename A', 'Study Name G': 'filename G', ...},
        'Baseline Cases': {'Study Name H': 'filename H', 'Study Name D': 'filename D', ...},
        'Conservative Cases': {'Study Name E': 'filename E', 'Study Name I': 'filename I', ...},
    }
}

If a specific "Region: Name" exists it will be used for that region, otherwise the top-level sources used for the World will be used for that region.

On the whole I think this works, in that it allows region-specific sources when needed but still allows the simple case of one set of sources, and it does not make assumptions about which regions will be present.

One problem, which this issue concerns, is in handling "ALL SOURCES" (which many solutions use if they just want an average from all available sources). In the original structure:

{
    'Ambitious Cases': {'Study Name A': 'filename A'},
    'Baseline Cases': {'Study Name B': 'filename B'},
    'Conservative Cases': {'Study Name C': 'filename C'},
}

It would return {'Study Name A', 'Study Name B', 'Study Name C'}.

If there are regions present:

{
    'Ambitious Cases': {'Study Name A': 'filename A'},
    'Baseline Cases': {'Study Name B': 'filename B'},
    'Conservative Cases': {'Study Name C': 'filename C'},
    'Region: OECD90': {
        'Ambitious Cases': {'Study Name D': 'filename D'}
    }
}

It returns {'Study Name A', 'Study Name B', 'Study Name C', 'Ambitious Cases'}.

This worked before because in Pandas 0.25, even though there was no column named 'Ambitious Cases', it would emit a deprecation warning but continue on. With Pandas 1.0 it fails, because passing a non-existent column name to .loc[] now raises a KeyError.
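As an illustration of how a group name can leak into the result, here is a hypothetical sketch. It is not the code in model/interpolation.py; `naive_all_sources` is an invented name for a flattening that treats every top-level value as a group of sources.

```python
def naive_all_sources(data_sources):
    """Collect the keys of every top-level value, assuming each is a group."""
    names = set()
    for group in data_sources.values():
        # For 'Region: OECD90' the inner keys are group names, not sources.
        names.update(group.keys())
    return names


data_sources = {
    'Ambitious Cases': {'Study Name A': 'filename A'},
    'Baseline Cases': {'Study Name B': 'filename B'},
    'Conservative Cases': {'Study Name C': 'filename C'},
    'Region: OECD90': {
        'Ambitious Cases': {'Study Name D': 'filename D'},
    },
}

result = naive_all_sources(data_sources)
print(sorted(result))
# ['Ambitious Cases', 'Study Name A', 'Study Name B', 'Study Name C']
```

The regional entry sits one level deeper than the flattening expects, so its group name is collected as if it were a source.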

tpltnt commented 4 years ago

Hmm ... so the regions example

{
    'Ambitious Cases': {'Study Name A': 'filename A'},
    'Baseline Cases': {'Study Name B': 'filename B'},
    'Conservative Cases': {'Study Name C': 'filename C'},
    'Region: OECD90': {
        'Ambitious Cases': {'Study Name D': 'filename D'}
    }
}

should return ['Study Name A', 'Study Name B', 'Study Name C']? To me "all sources" would imply ['Study Name A', 'Study Name B', 'Study Name C', 'Study Name D']. Which one is the desired outcome?

(I am asking because both cases break a bunch of tests.)

DentonGentry commented 4 years ago

model/interpolation.py:matching_data_sources() was also implemented before we realized that sometimes the data sources for a region will be different than for the World. We've gotten away with it so far because only the World region uses "ALL SOURCES", but really I think it would be better if matching_data_sources() took an argument of what region it was looking for.

If this is added as an argument, then:

{
    'Ambitious Cases': {'Study Name A': 'filename A'},
    'Baseline Cases': {'Study Name B': 'filename B'},
    'Conservative Cases': {'Study Name C': 'filename C'},
    'Region: OECD90': {
        'Ambitious Cases': {'Study Name D': 'filename D', 'Study Name E': 'filename E'}
    }
}

When looking for the World region, it should return ['Study Name A', 'Study Name B', 'Study Name C']

When looking for the OECD90 region, it should return ['Study Name D', 'Study Name E']

When looking for the 'Latin America' region, as there is no 'Region: Latin America', it would fall back to the top level sources and return ['Study Name A', 'Study Name B', 'Study Name C']
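A rough sketch of that behaviour, which could also serve as a starting point for a test case, might look like the following. The function name `all_sources_for_region`, its `region` parameter, and the `REGION_PREFIX` constant are assumptions made for illustration; the real matching_data_sources() signature may differ.

```python
REGION_PREFIX = "Region: "  # key prefix used in the examples in this thread


def all_sources_for_region(data_sources, region="World"):
    """Return the source names "ALL SOURCES" would cover for `region`."""
    regional = data_sources.get(REGION_PREFIX + region)
    if regional is None:
        # No region-specific entry: fall back to the top-level groups,
        # skipping every 'Region: ...' key.
        groups = {name: sources for name, sources in data_sources.items()
                  if not name.startswith(REGION_PREFIX)}
    else:
        groups = regional
    names = []
    for sources in groups.values():
        for name in sources:
            if name not in names:  # keep first occurrence, preserve order
                names.append(name)
    return names


data_sources = {
    'Ambitious Cases': {'Study Name A': 'filename A'},
    'Baseline Cases': {'Study Name B': 'filename B'},
    'Conservative Cases': {'Study Name C': 'filename C'},
    'Region: OECD90': {
        'Ambitious Cases': {'Study Name D': 'filename D',
                            'Study Name E': 'filename E'},
    },
}

world = all_sources_for_region(data_sources, "World")
oecd90 = all_sources_for_region(data_sources, "OECD90")
latin_america = all_sources_for_region(data_sources, "Latin America")
```

In this sketch a source listed under several groups is returned once, which is a design choice rather than something the thread specifies.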

Does that sound reasonable?