Closed danielhuppmann closed 1 year ago
Is there really a logical or being applied? I just ran the following to double check:
import pandas as pd
from pyam import IamDataFrame, IAMC_IDX

df = IamDataFrame(
    pd.DataFrame(
        [
            ["model_a", "scen_a", "World", "Primary Energy", "EJ/yr", 1],
        ],
        columns=IAMC_IDX + [2020],
    )
)
print(df.require_data(variable=["Primary Energy", "Primary Energy|Gas"]))
# returns model_a, scen_a indicating that there are variables missing
and I get a return for model_a, scen_a, which according to the docs (https://pyam-iamc.readthedocs.io/en/stable/api/iamdataframe.html#pyam.IamDataFrame.require_data) means that the criteria are not satisfied for those indices.
Granted, the output could be more precise, indicating that only Primary Energy|Gas is missing, but in principle the logical operation currently used is an and, right?
So in the above example from the pathways-ensemble-analysis tool just running:
df.require_data(variable=list_of_variables)
should work in my view. It might not be super useful though as you might be checking a list of ten variables and only one is missing. You would get an error in this case but no indication how many or which variables are missing.
To be more precise, indeed all items are required per dimension, so as you correctly point out, the method will fail if a variable from that list is not included.
This issue discusses how to deal with multiple dimensions.
df.require_data(
    variable=["Primary Energy", "Primary Energy|Gas"],
    region=["Austria", "Germany"],
)
This will currently pass if there is (only) data of "Primary Energy" for Austria and "Primary Energy|Gas" for Germany.
However, I think a more common use case would be to require that both variables exist for both regions...
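The two readings can be compared in a toy model that does not use pyam at all. This is only a sketch under stated assumptions: the helpers `passes_per_dimension` and `passes_all_combinations` are hypothetical names, not pyam functions, and a scenario's data is reduced to a set of (variable, region) pairs.

```python
from itertools import product


def passes_per_dimension(rows, variables, regions):
    """Each required item must appear somewhere in its own dimension."""
    seen_variables = {variable for variable, _ in rows}
    seen_regions = {region for _, region in rows}
    return set(variables) <= seen_variables and set(regions) <= seen_regions


def passes_all_combinations(rows, variables, regions):
    """Every (variable, region) combination must be present."""
    return set(product(variables, regions)) <= set(rows)


# Toy data for one scenario: each variable exists for only one region.
rows = {("Primary Energy", "Austria"), ("Primary Energy|Gas", "Germany")}
variables = ["Primary Energy", "Primary Energy|Gas"]
regions = ["Austria", "Germany"]

per_dimension = passes_per_dimension(rows, variables, regions)
all_combinations = passes_all_combinations(rows, variables, regions)
print(per_dimension, all_combinations)
```

With this toy data the per-dimension check passes while the all-combinations check fails, which is the gap described above.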
Ah right, ok that makes sense. That would be a nice addition indeed. It would be more in line with the principle of least surprise: in the example you gave above, the user most likely wants to ensure that both variables are present for both regions. The reason I got confused was that in the first example you just had a list of variables:
def require_variables(df, list_of_variables):
    for v in list_of_variables:
        df.require_data(variable=v)
In this case it should still work by running just df.require_data(variable=list_of_variables), unless of course there are additional region requirements.
True - my example was too stylized, sorry...
No problem, all good. I'd also strongly support making and/all the default. Maybe while we're at it, we could also change the output to report more precisely what is missing. So instead of reporting just model, scenario, it would report model, scenario, variable, region, ... with just as many dimensions as needed to uniquely identify what is actually missing.
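A more precise report could look roughly like the sketch below. It is plain Python, not pyam's return object: `missing_combinations` is a hypothetical helper, the data is a set of (model, scenario, variable, region) tuples, and for simplicity it always reports all four dimensions rather than only as many as needed.

```python
from itertools import product


def missing_combinations(rows, variables, regions):
    """Return the required (model, scenario, variable, region) tuples that are absent."""
    present = set(rows)
    # Every (model, scenario) pair found in the data is checked separately.
    scenarios = sorted({(model, scenario) for model, scenario, _, _ in rows})
    return [
        (model, scenario, variable, region)
        for (model, scenario), variable, region in product(scenarios, variables, regions)
        if (model, scenario, variable, region) not in present
    ]


# Toy data: "Primary Energy|Gas" is missing for Austria only.
rows = {
    ("model_a", "scen_a", "Primary Energy", "Austria"),
    ("model_a", "scen_a", "Primary Energy", "Germany"),
    ("model_a", "scen_a", "Primary Energy|Gas", "Germany"),
}
missing = missing_combinations(
    rows,
    variables=["Primary Energy", "Primary Energy|Gas"],
    regions=["Austria", "Germany"],
)
print(missing)
```

Here the only entry reported is the combination of "Primary Energy|Gas" and "Austria" for model_a, scen_a, so the user sees which item to fix instead of only which scenario failed.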
So I just played around a bit and realized that I already implemented require_data() with the logical-and approach...
But I'll see if I can improve the return-object.
Background
The method require_data() applies a logical-or for each dimension, so a scenario is ok if any variable/region from a given list is present. We should add a keyword argument "method" or "logical" that allows a user to select whether any or all elements should be present.
Context
In @l-welder & @coroa's pathways-ensemble-analysis tool, the following code is used (simplified):
The following would be much more straightforward:
Default usage
I think that how="all" is the more common use case, so we could set the default to None (for pyam ^2.0) with a warning that all is used by default. In pyam 3.0, we can change the default to all and remove the warning.
@phackstock
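The proposed transition could be sketched as follows. This is an illustration only, not pyam code: `resolve_how` and `has_required` are hypothetical helpers, the check covers a single dimension, the choice of `FutureWarning` is an assumption, and the fallback follows the reading that an unset `how` behaves as "all" while warning the user.

```python
import warnings


def resolve_how(how=None):
    """Resolve the hypothetical `how` argument, warning when it is left unset."""
    if how is None:
        warnings.warn(
            "No value given for `how`; using 'all'. "
            "Pass how='all' or how='any' explicitly.",
            FutureWarning,
            stacklevel=2,
        )
        return "all"
    if how not in ("all", "any"):
        raise ValueError(f"`how` must be 'all' or 'any', got {how!r}")
    return how


def has_required(present, required, how=None):
    """Check the `required` items of one dimension against the `present` items."""
    check = all if resolve_how(how) == "all" else any
    return check(item in present for item in required)


present = {"Primary Energy"}
required = ["Primary Energy", "Primary Energy|Gas"]

strict = has_required(present, required, how="all")
lenient = has_required(present, required, how="any")
print(strict, lenient)
```

Once the default is switched to "all" in a later major version, the `None` branch and its warning could simply be dropped.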