pjuergens commented 3 years ago

Please confirm that this PR has done the following:

[x] Tests Added
[x] Documentation Added
[x] Name of contributors Added to AUTHORS.rst
[x] Description in RELEASE_NOTES.md Added

Description of PR

Add an option to skip existing intermediate variables when aggregating recursivly as discussed in #525

codecov[bot] commented 3 years ago

Codecov Report

Merging #532 (25edae0) into main (8d8aa6b) will increase coverage by 0.0%. The diff coverage is 100.0%.

@@          Coverage Diff          @@
##            main    #532   +/-   ##
=====================================
  Coverage   93.5%   93.5%           
=====================================
  Files         47      48    +1     
  Lines       5228    5257   +29     
=====================================
+ Hits        4891    4920   +29     
  Misses       337     337

Impacted Files	Coverage Δ
pyam/_aggregate.py	`99.0% <100.0%> (+<0.1%)`	:arrow_up:
pyam/_compare.py	`100.0% <100.0%> (ø)`
pyam/core.py	`92.6% <100.0%> (-0.1%)`	:arrow_down:
tests/conftest.py	`100.0% <100.0%> (ø)`
tests/test_feature_aggregate.py	`98.8% <100.0%> (+<0.1%)`	:arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 8d8aa6b...25edae0. Read the comment docs.

pjuergens commented 3 years ago

I don't understand why stickler-ci is failing as it doesn't tell me the line where the PR differs from the black code style

danielhuppmann commented 3 years ago

You can directly commit my suggestion to your branch directly, this should appease Stickler. Going forward, I recommend that you install a Black linter utility on your machine and follow this workflow:

write the code
commit the changes
run the linter
commit the changes, looking carefully at what parts of the code were changed by the linter
then push to GitHub

After a while, you'll not need to run the linter anymore because you'll be coding black naturally...

pjuergens commented 3 years ago

thanks, that did the job :)

danielhuppmann commented 3 years ago

Looking at the code implementation in a bit more detail, two thoughts:

having thought about this feature again, I think it makes more sense to not add an extra kwarg skip_intermediate but instead do the following:
- if an intermediate variable exists, check that the values are in line with the aggregate of its components (rather than simply skipping the computation) to ensure internal consistency along the variable hierarchy
- if it doesn't exist, compute the aggregate of the components and append the data
I'm a bit concerned with the proposed implementation about unexpected behavior if a user has an IamDataFrame with multiple scenarios, where one scenario (scen_a) has some intermediate variables and another (scen_b) has only the lower-level components. With the proposed implementation, scen_b would not be complete.

The (in my opinion expected) behavior could be tested by adding the following line
```
df2.aggregate("Secondary Energy|Electricity|Wind", append=True)
```
in the function test_aggregate_recursive() before appending df2 to df.

This way, the existing test could be modified and you wouldn't need to add an almost-duplicate test.

pjuergens commented 3 years ago

I changed the implementation from comparing variables beforehand to comparing indices after aggregating, which should now be able to deal with different variable sets in different scenarios, regions or even years.

Concerning your first thought: I'd prefer to not check internal consistency while skipping intermediate variables but rather let the user check it afterwards. When explicitly using this feature it should be clear that the intermediate variables are not aggregated by pyam and should be checked. Running check_internal_consistency then gives the user more information about which variables differ and if it's due to rounding errors. In my opinion more useful for potential debugging.

pjuergens commented 3 years ago

It seems like stickler uses slightly different definitions of black than the built-in black-linter in Spyder.

danielhuppmann commented 3 years ago

Thanks for continuing to work on this, sorry for the issues with difference versions of black...

Re your comment:

When explicitly using this feature it should be clear that the intermediate variables are not aggregated by pyam and should be checked.

There is a tradeoff here:

advanced users will know what they are doing - for them, the increased efficiency of skipping validation is a bonus
for non-expert users, pyam should have sensible guardrails to prevent them from doing stupid things or running into unexpected behavior - for them, pyam should raise loud and specific error messages

And either way, first running recursive-aggregation and the checking consistency is not an efficient approach, because it requires computing a lot of data twice.

So let me modify my earlier suggestion:

drop the skip_intermediate keyword argument - skipping should be the default, but with validation
allow recursive to be a boolean (as it is now) or a string "skip-validate" - and skip the validation in that case. So your expert-level function could be called with
```
df.aggregate(<variable>, recursive="skip-validate")
```
but novice users would get an error if the hierarchy is not consistent. In core.py, you only have to change if recursive is True: to if recursive:, then the recursive-aggregation will be selected both if it's True or a dictionary.

Later extensions could then also do something like recursive={rtol=0.2} where the contents of the dictionary are passed to np.isclose() in the validation to avoid errors if data is almost identical but not quite.

pjuergens commented 3 years ago

I implemented my approach for the aggregation check. I will change the interface with dropping the skip_intermediate keyword after the weekend.

IAMconsortium / pyam

Add an option to skip existing intermediate variables when aggregating recursivly #532

Please confirm that this PR has done the following:

Description of PR

Codecov Report