IAMconsortium / pyam

Analysis & visualization of energy & climate scenarios
https://pyam-iamc.readthedocs.io/
Apache License 2.0

Support logical-and in `require_data()` #768

Closed danielhuppmann closed 10 months ago

danielhuppmann commented 10 months ago

Background

The method require_data() applies a logical-or for each dimension, so a scenario is ok if any variable/region from a given list is present.

We should add a keyword argument "method" or "logical" that allows a user to select whether any or all elements should be present.

Context

In @l-welder & @coroa's pathways-ensemble-analysis tool, the following code is used (simplified):

def require_variables(df, list_of_variables):
    for v in list_of_variables:
        df.require_data(variable=v)

The following would be much more straightforward:

df.require_data(variable=list_of_variables, how="all")

Default usage

I think that how="all" is the more common use case, so we could set the default to None (for pyam ^2.0) with a warning that all is used by default. In pyam 3.0, we can change the default to all and remove the warning.
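A minimal sketch of that transition, assuming the warning fires only when `how` is not passed and that "all" is then applied. The function `check_required` and its arguments are hypothetical stand-ins written in plain Python, not the pyam implementation:

```python
import warnings


def check_required(present, required, how=None):
    """Hypothetical stand-in for the proposed keyword, not pyam code."""
    if how is None:
        # Transition period: tell the user which behaviour is applied.
        warnings.warn(
            "No value given for `how`, using how='all'.",
            FutureWarning,
            stacklevel=2,
        )
        how = "all"
    if how not in ("all", "any"):
        raise ValueError(f"Unknown value for `how`: {how!r}")

    hits = [item in present for item in required]
    return all(hits) if how == "all" else any(hits)


present = {"Primary Energy"}
required = ["Primary Energy", "Primary Energy|Gas"]

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    default_result = check_required(present, required)  # warns, uses "all"

any_result = check_required(present, required, how="any")  # no warning
```

Once the default is changed to "all", the `if how is None` branch could simply be dropped.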

@phackstock

phackstock commented 10 months ago

Is there really a logical or being applied? I just ran the following to double check:

import pandas as pd
from pyam import IamDataFrame, IAMC_IDX

df = IamDataFrame(
    pd.DataFrame(
        [
            ["model_a", "scen_a", "World", "Primary Energy", "EJ/yr", 1],
        ],
        columns=IAMC_IDX + [2020],
    )
)

print(df.require_data(variable=["Primary Energy", "Primary Energy|Gas"]))
# returns model_a, scen_a indicating that there are variables missing

and I get a return for model_a, scen_a which according to the docs (https://pyam-iamc.readthedocs.io/en/stable/api/iamdataframe.html#pyam.IamDataFrame.require_data) means that for those indices the criteria are not satisfied. Granted the output could be more precise, indicating that only Primary Energy|Gas is missing, but in principle the logical operation currently used is an and, right?

phackstock commented 10 months ago

So in the above example from the pathways-ensemble-analysis tool just running:

df.require_data(variable=list_of_variables)

should work in my view. It might not be super useful though, as you might be checking a list of ten variables where only one is missing. You would get an error in this case, but no indication of how many or which variables are missing.

danielhuppmann commented 10 months ago

To be more precise, indeed all items are required per dimension, so as you correctly point out, the method will fail if a variable from that list is not included.

This issue discusses how to deal with multiple dimensions.

df.require_data(
    variable=["Primary Energy", "Primary Energy|Gas"],
    region=["Austria", "Germany"],
)

This will currently pass if there is (only) data for "Primary Energy" in Austria and for "Primary Energy|Gas" in Germany.

However, I think a more common use case would be to require that both variables exist for both regions...
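To make the two readings concrete, here is a small sketch in plain Python. It does not call pyam. The set `available` and both helper functions are made up for illustration and only contrast "each item appears somewhere" with "every variable exists for every region":

```python
from itertools import product

# Hypothetical data: (variable, region) pairs present for one scenario.
available = {
    ("Primary Energy", "Austria"),
    ("Primary Energy|Gas", "Germany"),
}

variables = ["Primary Energy", "Primary Energy|Gas"]
regions = ["Austria", "Germany"]


def satisfies_per_dimension(pairs, variables, regions):
    """Each required variable and each required region appears somewhere."""
    seen_variables = {v for v, _ in pairs}
    seen_regions = {r for _, r in pairs}
    return set(variables) <= seen_variables and set(regions) <= seen_regions


def satisfies_all_combinations(pairs, variables, regions):
    """Every required variable exists for every required region."""
    return all(pair in pairs for pair in product(variables, regions))


per_dimension = satisfies_per_dimension(available, variables, regions)
all_combinations = satisfies_all_combinations(available, variables, regions)
print(per_dimension, all_combinations)  # True False
```

The first check accepts the scenario, the second rejects it because ("Primary Energy", "Germany") and ("Primary Energy|Gas", "Austria") are absent.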

phackstock commented 10 months ago

Ah right, ok that makes sense. That would be a nice addition indeed. It should be more in line with the principle of least surprise: in the example you gave above, the user most likely wants to ensure that both variables are present for both regions. The reason I got confused was that in the first example you just had a list of variables:

def require_variables(df, list_of_variables):
    for v in list_of_variables:
        df.require_data(variable=v)

In this case it should still work by running just df.require_data(variable=list_of_variables), unless of course there are additional region requirements.

danielhuppmann commented 10 months ago

True - my example was too stylized, sorry...

phackstock commented 10 months ago

No problem, all good. I'd also strongly support making and/all the default. Maybe while we're at it, we could also change the output to report more precisely what is missing. So instead of reporting just model, scenario, it would report model, scenario, variable, region, ... with just as many dimensions as needed to uniquely identify what is actually missing.
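As a rough illustration of such a report, the sketch below lists every missing (model, scenario, variable, region) combination. It is plain Python on a set of tuples. The helper `find_missing` is hypothetical and says nothing about what pyam actually returns:

```python
from itertools import product


def find_missing(rows, variables, regions):
    """List every (model, scenario, variable, region) absent from `rows`."""
    # Scenarios are taken from the data itself, so a scenario without any
    # rows at all would not show up in this report.
    scenarios = sorted({(model, scenario) for model, scenario, _, _ in rows})
    return [
        (model, scenario, variable, region)
        for model, scenario in scenarios
        for variable, region in product(variables, regions)
        if (model, scenario, variable, region) not in rows
    ]


# Hypothetical data mirroring the single-row example earlier in the thread.
rows = {("model_a", "scen_a", "Primary Energy", "World")}

missing = find_missing(
    rows,
    variables=["Primary Energy", "Primary Energy|Gas"],
    regions=["World"],
)
print(missing)
# [('model_a', 'scen_a', 'Primary Energy|Gas', 'World')]
```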

danielhuppmann commented 10 months ago

So I just played around a bit and realized that I already implemented require_data() with the logical-and approach...

But I'll see if I can improve the return-object.