IAMconsortium / nomenclature

A package to work with IAMC-style variable templates
https://nomenclature-iamc.readthedocs.io/
Apache License 2.0

Allow attribute filtering in nomenclature.yaml for importing definitions from external repo #326

Open phackstock opened 8 months ago

phackstock commented 8 months ago

When importing from an external repository we should be able to filter by attributes. This way we don't import the whole definition if it's not needed. The first three use cases that come to mind are:

The question would be how to integrate this into the existing nomenclature.yaml structure. I've tried a few things now and this is my current favorite:

repositories:
  common-definitions:
    url: https://github.com/IAMconsortium/common-definitions.git/
definitions:
  region:
    repository: common-definitions
    repository-filters:
      hierarchy: R5
  variable:
    repository: common-definitions
    repository-filters:
      name: Final Energy*
    country: true

This would import all R5 regions from common-definitions and all variables starting with Final Energy*.
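To make the intended behaviour concrete, here is a minimal sketch of what such a filter could do once the definitions are loaded. This is not the nomenclature implementation: `apply_filter` is a hypothetical helper, the sample codes are made up for illustration, and I'm assuming shell-style wildcards (via the standard-library `fnmatch` module) for string values.

```python
from fnmatch import fnmatchcase

# Hypothetical stand-ins for codes loaded from an external repository.
variables = [
    {"name": "Final Energy"},
    {"name": "Final Energy|Electricity"},
    {"name": "Primary Energy|Coal"},
]
regions = [
    {"name": "OECD & EU (R5)", "hierarchy": "R5"},
    {"name": "Austria", "hierarchy": "Country"},
]


def apply_filter(codes, **criteria):
    """Keep the codes whose attributes match every criterion.

    String criteria are treated as shell-style wildcard patterns (an assumption),
    all other values are compared for equality.
    """

    def matches(code):
        for key, pattern in criteria.items():
            value = code.get(key)
            if isinstance(pattern, str):
                if not (isinstance(value, str) and fnmatchcase(value, pattern)):
                    return False
            elif value != pattern:
                return False
        return True

    return [code for code in codes if matches(code)]


selected_variables = apply_filter(variables, name="Final Energy*")
selected_regions = apply_filter(regions, hierarchy="R5")
```

With this sample data, only the two "Final Energy" variables and the single R5 region are kept.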

The above format would also allow for more complex filtering such as:

repositories:
  common-definitions:
    url: https://github.com/IAMconsortium/common-definitions.git/
  legacy-definitions:
    url: https://github.com/IAMconsortium/legacy-definitions.git/
definitions:
  variable:
    repository: common-definitions
    repository-filters:
      - repository: common-definitions
        tier: 1
      - name: Final Energy*
    country: true

Here we have multiple filters for the variable dimension:

  1. We take all variables from common-definitions that have the attribute tier with the value 1.
  2. We take all variables from common-definitions and legacy-definitions (no repository filter) that match the pattern Final Energy*

Would love to hear your thoughts @danielhuppmann, @dc-almeida.

danielhuppmann commented 2 months ago

This looks great, but I'm wondering about two issues.

  1. Wouldn't it be more intuitive to have the filters as an attribute of the repository, instead of repeating the repository attribute many times?
  2. Not clear whether the list of filters would work as AND or OR...?

See a more explicit example:

definitions:
  variable:
    repository:
      common-definitions:
        filters:
          - name: Primary Energy*
            tier: 1
          - name: Final Energy*

to get all final-energy variables and only primary-energy-variables at tier 1.

phackstock commented 2 months ago

Good points.

Regarding your first point, you're right, it does look better to me as well. The reason I intentionally opted against it in my proposed structure is that it would require bigger changes to the code. Nothing crazy, but more difficult to implement than just adding another attribute at the repository level. I do agree though that it's nicer that way.

For your second point, I'd take your example exactly the way you suggested. Meaning that within a filter entry it's an AND and between filters it's an OR.
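As a rough sketch of that AND/OR logic (not actual nomenclature code; `matches_entry` and `matches_any` are hypothetical helper names, the sample codes are invented, and I'm assuming shell-style wildcards via `fnmatch` for string values):

```python
from fnmatch import fnmatchcase


def matches_entry(code, entry):
    """AND: every key of a single filter entry has to match."""
    for key, wanted in entry.items():
        value = code.get(key)
        if isinstance(wanted, str) and isinstance(value, str):
            if not fnmatchcase(value, wanted):
                return False
        elif value != wanted:
            return False
    return True


def matches_any(code, filters):
    """OR: a code is kept if at least one filter entry matches."""
    return any(matches_entry(code, entry) for entry in filters)


# The filter list from the example above.
filters = [
    {"name": "Primary Energy*", "tier": 1},
    {"name": "Final Energy*"},
]

# Hypothetical codes, for illustration only.
codes = [
    {"name": "Primary Energy", "tier": 1},
    {"name": "Primary Energy|Coal", "tier": 2},
    {"name": "Final Energy|Heat", "tier": 2},
    {"name": "Emissions|CO2", "tier": 1},
]

kept = [code["name"] for code in codes if matches_any(code, filters)]
```

Here "Primary Energy|Coal" is dropped because it fails the tier condition of the first entry and does not match the second entry at all.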

One remaining point to cover is whether we allow lists as filter values, and if so, how they're evaluated:

definitions:
  variable:
    repository:
      common-definitions:
        filters:
          - name: Primary Energy*
            tier: [1, 2]

That is, would the above translate to "Everything that starts with Primary Energy and has the tier attribute [1, 2]" or to "Everything that starts with Primary Energy and has the tier attribute 1 or 2"? In this example only the latter makes sense, but there might be attributes where we actually want to match a list.

Alternatively, we could also only allow for single values, so if you wanted to achieve the above you'd have to use:

definitions:
  variable:
    repository:
      common-definitions:
        filters:
          - name: Primary Energy*
            tier: 1
          - name: Primary Energy*
            tier: 2

In this example we could even allow list values, but then they would have to match exactly.
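The two readings of a list-valued filter can be compared side by side in a small sketch. Again, this is only an illustration under assumptions: `select` and its `list_means_any_of` flag are hypothetical, the codes are invented (including one with a list-valued tier, purely to show the exact-match case), and name matching uses `fnmatch` wildcards.

```python
from fnmatch import fnmatchcase

# Hypothetical codes, for illustration only.
codes = [
    {"name": "Primary Energy", "tier": 1},
    {"name": "Primary Energy|Coal", "tier": 2},
    {"name": "Primary Energy|Coal|Lignite", "tier": 3},
    {"name": "Primary Energy|Other", "tier": [1, 2]},
]


def select(codes, name_pattern, tier, list_means_any_of):
    """Return the names of codes matching the name pattern AND the tier filter.

    If list_means_any_of is True, a list-valued filter means "any of these values";
    otherwise the attribute has to equal the filter value exactly.
    """
    selected = []
    for code in codes:
        if not fnmatchcase(code["name"], name_pattern):
            continue
        if list_means_any_of and isinstance(tier, list):
            tier_ok = code["tier"] in tier
        else:
            tier_ok = code["tier"] == tier
        if tier_ok:
            selected.append(code["name"])
    return selected


any_of = select(codes, "Primary Energy*", [1, 2], list_means_any_of=True)
exact = select(codes, "Primary Energy*", [1, 2], list_means_any_of=False)
```

Under the "any of" reading the tier 1 and tier 2 codes are selected; under the exact reading only the code whose tier attribute is itself the list [1, 2] is selected.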

danielhuppmann commented 2 months ago

I guess we will quickly run into a use case like "give me primary energy, final energy, CO2 emissions, GDP, ..." from an upstream repo, so repeating the "name" attribute many times will be tedious. So I would say the following logic makes most sense:

phackstock commented 2 months ago

Sounds good, that should cover what we need. I cannot think of a use case where we'd need to explicitly match for a list anyway.