IAMconsortium / pyam

Analysis & visualization of energy & climate scenarios
https://pyam-iamc.readthedocs.io/
Apache License 2.0

Question: Is there a multivariate filtering feature? #848

Closed pkufubo closed 7 months ago

pkufubo commented 7 months ago

Hello pyam community,

First of all, thank you for your ongoing efforts in developing and maintaining the pyam package.

I'd like to discuss the potential for an enhanced feature related to multivariate filtering. In my recent projects, I've encountered cases where it would be beneficial to select data that meets multiple variable criteria simultaneously. For instance, I often need to select scenarios that include both 'Emissions|CH4' and 'Emissions|NOx'.

Currently, I manually filter for each variable and then compute the intersection of these filters to find scenarios that include all specified variables. This process can be quite cumbersome and error-prone, especially with a large number of variables.

Is there an existing feature that simplifies this process? If not, I believe adding a multivariate filtering feature that allows users to specify multiple variables and returns scenarios containing all these variables would be extremely helpful. Such a feature would enhance the usability and efficiency of data handling within the pyam framework.

Thank you for considering this suggestion. I look forward to your comments and any potential updates.

Best regards, Bo Fu

danielhuppmann commented 7 months ago

Thank you, @pkufubo, for reaching out.

If I understand your question correctly, then the simple answer is that you can pass a list as the filter argument, i.e.

df.filter(variable=["Emissions|CH4", "Emissions|NOx"])

This is documented for the slice() method, which is used by the filter() method.

danielhuppmann commented 7 months ago

@pkufubo, did my suggestion answer your question? If yes, please close this issue, or clarify.

pkufubo commented 7 months ago

@danielhuppmann Thank you for your response, and my apologies for the delay in getting back to you. I'm afraid I didn't make my requirements clear. I want to select the scenarios where both "Emissions|CH4" and "Emissions|NOx" are available. If a scenario provides only one of these variables, or neither, I would like to exclude it from consideration.

The code you gave, df.filter(variable=["Emissions|CH4", "Emissions|NOx"]), seems to return the union of df.filter(variable=["Emissions|CH4"]) and df.filter(variable=["Emissions|NOx"]).
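The difference between the two selections can be sketched in plain Python, independent of pyam. This is only an illustration of "any variable present" (union) versus "all variables present" (intersection); the records and model/scenario names below are made up, and nothing here reflects how pyam implements filtering internally.

```python
from collections import defaultdict

# Hypothetical (model, scenario, variable) records, invented for illustration.
records = [
    ("model_a", "scen_1", "Emissions|CH4"),
    ("model_a", "scen_1", "Emissions|NOx"),
    ("model_a", "scen_2", "Emissions|CH4"),
    ("model_b", "scen_3", "Emissions|NOx"),
    ("model_b", "scen_3", "Primary Energy"),
]
required = {"Emissions|CH4", "Emissions|NOx"}

# Collect the set of variables reported by each model-scenario pair.
variables_by_scenario = defaultdict(set)
for model, scenario, variable in records:
    variables_by_scenario[(model, scenario)].add(variable)

# "Any" semantics (union): at least one required variable is present.
has_any = sorted(k for k, v in variables_by_scenario.items() if v & required)

# "All" semantics (intersection): every required variable is present.
has_all = sorted(k for k, v in variables_by_scenario.items() if required <= v)

print(has_any)
print(has_all)
```

Here all three model-scenario pairs report at least one of the required variables, but only ("model_a", "scen_1") reports both, which is the selection I am after.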

I wrote a script to meet such requirements, but it's long.

import pyam

# `df` is an existing pyam.IamDataFrame
variable_list = ['Emissions|CH4', 'Emissions|NOx']
model_scenario_list = []
for model, scenario in df.index:
    df_filter = df.filter(model=model, scenario=scenario)
    # if any variable is missing, drop the model-scenario
    if 0 in [len(df_filter.filter(variable=var)) for var in variable_list]:
        continue
    else:
        # keep the selected model-scenario
        model_scenario_list.append(df_filter)
data_sel = pyam.concat(model_scenario_list)

Does pyam have an API for a selection like this? Thank you again.

danielhuppmann commented 7 months ago

Thanks for the clarification. Indeed, there is an easier option:

variable_list = ['Emissions|CH4', 'Emissions|NOx']
df.require_data(variable=variable_list, exclude_on_fail=True)
df_sel = df.filter(exclude=False)

See the docs of df.require_data() for more info.

Also, your code could be simplified like this:

    if all(v in df_filter.variable for v in variable_list):
        model_scenario_list.append(df_filter)

pkufubo commented 7 months ago

Thank you for your kind reply. My question has been perfectly answered, and I will close this issue.