IAMconsortium / pyam

Analysis & visualization of energy & climate scenarios
https://pyam-iamc.readthedocs.io/
Apache License 2.0
234 stars 120 forks source link

Add `get_data_column()`, refactor filtering by the time domain #562

Closed danielhuppmann closed 3 years ago

danielhuppmann commented 3 years ago

Please confirm that this PR has done the following:

Description of PR

This PR adds a utility function get_data_column(name) as short-hand (and more efficient implementation) for df.data[name], because it avoids casting the internal _data pd.Series to a pd.DataFrame.

This utility function is then used to make filtering by the time domain more performant.

codecov[bot] commented 3 years ago

Codecov Report

Merging #562 (4407f85) into main (666da23) will increase coverage by 0.0%. The diff coverage is 100.0%.

Impacted file tree graph

@@          Coverage Diff          @@
##            main    #562   +/-   ##
=====================================
  Coverage   93.7%   93.7%           
=====================================
  Files         50      50           
  Lines       5322    5332   +10     
=====================================
+ Hits        4987    4997   +10     
  Misses       335     335           
Impacted Files Coverage Δ
pyam/core.py 94.3% <100.0%> (+<0.1%) :arrow_up:
tests/test_core.py 100.0% <100.0%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 666da23...4407f85. Read the comment docs.

danielhuppmann commented 3 years ago

While working on this PR, I noticed that pyam also has a getter df[column], which returns the same as the new function df.get_data_column(column) (if column is a dimension of data).

I see three options:

Any thoughts @gidden or anyone else?

gidden commented 3 years ago

For reference, the implementation is reproduced below:

    def __getitem__(self, key):
        _key_check = [key] if isstr(key) else key
        if set(_key_check).issubset(self.meta.columns):
            return self.meta.__getitem__(key)
        else:
            return self.data.__getitem__(key)

In which case if get_data_column() is indeed faster than self.data.__getitem__(key) and is guaranteed to have the same outcome in all situations, then I would simply replace the last line here.

I would advise against deprecation, as this is a pretty innocuous feature and we have no idea how many users are actually using it. Not to mention, it would be almost impossible to update legacy code...

danielhuppmann commented 3 years ago

I would advise against deprecation, as this is a pretty innocuous feature and we have no idea how many users are actually using it. Not to mention, it would be almost impossible to update legacy code...

Don't quite agree that it's so innocuous... If you do df["region"], it's intuitive what you get (the column from df.data), if you do df["category"], it's also intuitive (the column from df.meta, if it exists). But what do you get if you do df["model"]?

But if you think a DeprecationWarning is too strong, how about a FutureWarning - so this will give users probably a year's worth of time to update any code...?

gidden commented 3 years ago

I would strongly prefer not removing/deprecating. Have we heard anyone who has raised an issue with this?

danielhuppmann commented 3 years ago

Ok, let's leave it - wil simply implement the more efficient approach.

gidden commented 3 years ago

Great, thank you!

danielhuppmann commented 3 years ago

Implemented the more efficient approach for the direct getter IamDataFrame[<column>] and extended the corresponding unit test for good measure...

gidden commented 3 years ago

Looks great, thanks!