IAMconsortium / pyam

Analysis & visualization of energy & climate scenarios
https://pyam-iamc.readthedocs.io/
Apache License 2.0
231 stars 119 forks source link

How to filter for variables containing substring at arbitrary positions? #593

Closed stefaneidelloth closed 2 years ago

stefaneidelloth commented 2 years ago

Lets assume I have many variables and that some contain the key word "Total" at several positions:

foo|Total
foo|baa|Total
foo|baa|Total|qux

=>How can I filter for all variables that contain a specific substring?

I tried to

a) use a regular expression ".(Total)." (can be checked with https://pythex.org/): that.

df.filter(variable=".*(Total).*")

b) use a filter function.

df.filter(variable = lambda variable: 'Total' in variable)

c) use pandas datframe filtering

df.loc[df['variable'].str.contains('Total')]

However, pyam does not seem to support those options.

Edit: As a workaroun,d I could get the data as pandas dataframe, filter it and wrap it again in a pyam data frame:

new_data = df.data.loc[df['variable'].str.contains('Oil')]
new_df = pyam.IamDataFrame(new_data)
new_df.timeseries()

If there is a better solution, please let me know.

stefaneidelloth commented 2 years ago

Following code works, using regular expression:

df.filter(variable=".*(Total).*", regexp= True)\
.timeseries()

I first did not understand what the doc says:

'regexp=True' disables pseudo-regexp syntax in pattern_match()

I would write instead:

'regexp=True' Enables regular expression syntax.

danielhuppmann commented 2 years ago

Thanks for raising this issue, glad that you managed to sort it out yourself.

In addition to the option regexp=True, which you found, you can also use simply

df.filter(variable="*(Total)*").timeseries()

i.e., * is a simple wildcard similar to .* in regexp.

PRs to improve the docs are always welcome.

By the way, we also have a Slack workspace with a helpdesk channel for these kinds of questions.

stefaneidelloth commented 2 years ago

If one wants to ensure that an expression occurs at a distinct level, following regular expression might be helpful:

df.filter(variable="([^(|)](|)){1}Total.", regexp=True)\ .timeseries()

=>Ensures that "Total" occurs at level 1