IAMconsortium / nomenclature

A package to work with IAMC-style variable templates
https://nomenclature-iamc.readthedocs.io/
Apache License 2.0

Allow wildcard-items in Code #432

danielhuppmann commented 6 days ago

To allow more flexibility for reporting technical parameters, we want to allow "wildcard codes", echoing the wildcard implementation in pyam using *.

Concept: a VariableCode can be defined as

```yaml
- Capital Cost|Electricity|Coal|*:
    description: Technology-specific capital cost of a newly installed plant to generate
      electricity from coal
    unit: USD_2010/kW
```

The validation method should then accept any variable that matches the code name, with the wildcard standing in for any string.

This can follow the implementation by @phackstock here https://github.com/IAMconsortium/nomenclature/blob/f210213ccf51e8f70e70cd3e1715f273cb100f0c/nomenclature/config.py#L34
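
A minimal sketch of such a matching rule (the helper name `code_matches` is hypothetical, assuming pyam-style semantics where * matches any substring):

```python
import re

def code_matches(code_name: str, variable: str) -> bool:
    """Return True if `variable` matches `code_name`, treating "*" as a wildcard."""
    # Escape regex metacharacters (notably "|") so they are taken literally,
    # then translate the escaped "*" into ".*" and anchor the match.
    pattern = re.escape(code_name).replace(r"\*", ".*")
    return re.fullmatch(pattern, variable) is not None

# code_matches("Capital Cost|Electricity|Coal|*",
#              "Capital Cost|Electricity|Coal|Lignite")  -> True
```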

To be explicit, any of the following variables should pass validation:
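- Capital Cost|Electricity|Coal|Supercritical
- Capital Cost|Electricity|Coal|IGCC
- Capital Cost|Electricity|Coal|Lignite

(Hypothetical technology names, for illustration only.)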

danielhuppmann commented 5 days ago

Maybe this was already implemented in #397, please double-check.

dc-almeida commented 5 days ago

Indeed, I was checking that today with some tests; I will confirm tomorrow.

danielhuppmann commented 4 days ago

Follow-up, because I did some tests myself: a wildcard * in variable names works, but the units are not checked. Also, there may be some difficulties here because there can be multiple potentially matching VariableCode items for a variable, e.g.,

```yaml
- Capital Cost|Hydrogen|*:
    description: ...
    unit: USD_2010/kW
- Capital Cost|Hydrogen|Fossil*:
    description: ...
    unit: EUR_2020/kW
```

Not saying that this makes sense, but if there is now a variable "Capital Cost|Hydrogen|Fossil|Coal" in an IamDataFrame, it's not clear which unit should apply...
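
Using the hypothetical code_matches helper sketched above, both patterns indeed match such a variable:

```python
variable = "Capital Cost|Hydrogen|Fossil|Coal"
code_matches("Capital Cost|Hydrogen|*", variable)        # True -> USD_2010/kW
code_matches("Capital Cost|Hydrogen|Fossil*", variable)  # True -> EUR_2020/kW
```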

phackstock commented 14 hours ago

@danielhuppmann, thanks for checking. I also looked at the code in detail now, and I think there are a couple of different ways we could approach the unit-ambiguity issue that you mentioned.

  1. Skip the unit check for any wildcard variable (simple, but potentially dangerous and confusing later down the line).
  2. Add additional units to cover all options.
  3. As an additional check on the VariableCodeList itself (without any input data), make sure that no wildcard pattern matches any other code inside the code list. This way the matches are unambiguous, and we could enforce specific units after all.

In the interest of keeping patterns simple and avoiding ambiguity as much as possible, I'd suggest option 3.
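
A rough sketch of what option 3 could look like (hypothetical function, reusing the code_matches helper from above):

```python
def check_unambiguous(code_names: list[str]) -> None:
    """Raise if any wildcard code also matches another code in the list."""
    for pattern in (name for name in code_names if "*" in name):
        for other in code_names:
            if other != pattern and code_matches(pattern, other):
                raise ValueError(f"wildcard code '{pattern}' also matches '{other}'")
```

For the example above, "Capital Cost|Hydrogen|*" also matches "Capital Cost|Hydrogen|Fossil*", so that code list would be rejected.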

danielhuppmann commented 14 hours ago

3 is a nice idea, but probably takes a bit more time to implement.

So I suggest implementing a simple rule: "if the variable to be validated matches the wildcard codelist, the unit must match" (which might cause issues in corner cases, but these are probably not that relevant in practice anyway).
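
A minimal sketch of that simple rule (hypothetical names, assuming the codelist is a mapping of code names to units):

```python
def validate_with_wildcards(variable: str, unit: str, codes: dict[str, str]) -> bool:
    """Check a variable-unit pair against a codelist that may contain wildcards."""
    for code_name, expected_unit in codes.items():
        if code_matches(code_name, variable):
            # The first matching code wins -- exactly the corner-case
            # ambiguity discussed above.
            return unit == expected_unit
    return False  # no code matches the variable
```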

Then add a sanity check, to be called during validate-project, that wildcard codes do not have well-defined duplicates.

phackstock commented 13 hours ago

> Then add a sanity check, to be called during validate-project, that wildcard codes do not have well-defined duplicates.

I have read your suggestion a couple of times now, but I fail to understand how that is different from my point 3. What I was describing as this additional check is what I believe you are calling a "sanity check". I'd implement it as a (surprise, surprise) pydantic validator for VariableCodeList: you'd check whether any wildcard variable matches any other wildcard variable.
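
For illustration, such a validator might look roughly like this (a sketch assuming pydantic v2 and the hypothetical check_unambiguous helper from above; the actual VariableCodeList is structured differently):

```python
from pydantic import BaseModel, model_validator

class WildcardCodeList(BaseModel):
    """Hypothetical stand-in for VariableCodeList."""

    names: list[str]

    @model_validator(mode="after")
    def no_wildcard_overlap(self):
        check_unambiguous(self.names)  # raises ValueError on overlapping codes
        return self
```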

danielhuppmann commented 13 hours ago

Sorry for not being clear. Parsing a DataStructureDefinition for large projects already takes quite some time, so adding yet another pydantic validator (executed every time) might not be the smartest move.

Hence my suggestion to implement that as a validation method that is not executed when initializing the DataStructureDefinition, but only as part of the validate-project CLI (for example, as part of GitHub Actions in a workflow repository).

phackstock commented 13 hours ago

> Parsing a DataStructureDefinition for large projects already takes quite some time, so adding yet another pydantic validator (executed every time) might not be the smartest move.

Without having run any benchmarks on that, doesn't reading in data, which we usually do when using nomenclature, typically take order(s) of magnitude longer? Where is the performance of the validators an issue currently? Do you mean in the scenario processing, in the testing of PRs, running locally, ...?