Add validation function that returns invalid items rather than logging+raising

korsbakken commented 10 months ago

The current processing workflow, and in particular the validation.validate function currently raises a ValueError and writes invalid item names to log if any item in any dimension is not contained in the corresponding codelist. Which is great for a pipeline that just validates input and rejects invalid input. But sometimes you rather want a function that returns the invalid items in a form that you can work with further in Python.

Could you add such a function? E.g., validate.get_invalid_items or something like that? Basically the same as validation.validate, but which returns, e.g., a dict with the dimension names as keys and lists of the invalid items as values or something similar. Codelist.validate_items already does that for a single dimension, so a workaround is to call that manually for each dimension. But using a single function to get the invalid items for all or for a specified list of dimensions would be a lot more convenient.

Variable/unit combos would be a special case here. If any of those fail the check, they could maybe be returned in a dict item with key 'invalid_units' and values that are tuples of (variable name, invalid unit), or something like that.

danielhuppmann commented 10 months ago

Thanks Jan Ivar for that comment. Can you explain a bit further what your vision for the "working further in Python with the invalid items" actually is? I'm a bit worried that defining a usable return-type is not obvious...

Also, as you write yourself, getting the invalid items is only one-and-a-half lines of code, like

for dim in dimensions:
    invalid = getattr(dsd, dim).validate_items(getattr(df, dim))
    ...

so you might as well include that line in your processing workflow rather than having to parse the returned object from a new validation-method.

korsbakken commented 10 months ago

Yes, that's what I would do for dimensions. But I had to search through the source code to find that solution. And it's not sufficient for checking variable/unit combos (see below).

A dedicated method would be more convenient for anyone who isn't already familiar with the codebase. For the dimensions, it could just return {dim: getattr(dsd, dim).validate_items(getattr(df, dim)) for dim in dimensions} I.e., the return type would be dict[str, str|int] (although alternative value types might be needed for custom dimensions).

Checking variable/unit combinations, is less straight-forward. The core part of validation.validate that performs that check is 12 lines long, and could be hard to come up with on your own without searching through the source code. It would be convenient to have a method that returns the variable invalid_units that results from those lines, rather than keep it internal and raise an error.

It could either be a separate item in the dict returned by the method that checks individual dimensions (e.g., with key 'invalid_units'. Or it could be returned by a separate method (e.g., .get_invalid_units), to avoid introducing a custom key like that in the dict. In the latter case, the return type would be dict[str, str], with the keys being variable names and the values being unit names.

If you want, I can add the necessary code myself and submit a pull request (I would probably make a branch and do that for my own use anyway). But in that case it would be good to get your feedback on the solution I've sketched here.

danielhuppmann commented 10 months ago

I'd still be curious to better understand your intended use case.

But yes, happy to review a PR and provide feedback - please create a fork and tinker away...

IAMconsortium / nomenclature

Add validation function that returns invalid items rather than logging+raising #302