23andMe / Yamale

A schema and validator for YAML.
MIT License
680 stars 88 forks source link

[Enhancement] : Better error message with line number #117

Open GilShoshan94 opened 4 years ago

GilShoshan94 commented 4 years ago

Hi @mildebrandt and the other contributors. First thank you for this amazing library. I honestly think this should be the standard way to validate YAML schema. It would be amazing if you can reach "ruamel.yaml" and "pyyaml" so they could include your library and set the way to validate schema in YAML.

Anyway. I started using Yamale and found myself to wish to get the line number in the yaml file where I have errors. I am using ruamel.yaml, (I don't really now pyyaml but I guess it should be possible too). I use ruamel.yaml in round-trip mode and extract the line number of each keys and I copy your format (with the '.' for nested keys). I get a one level dict with the line number.

Here below my code with comments as well as an example with output. I hope you will find it a nice addition, ideally I see that you add the line numbers into yamale.YamaleError.results.errors

from pathlib import Path
from typing import Dict

import yamale
from ruamel.yaml import YAML
from ruamel.yaml.comments import CommentedMap

def _get_lc_dict_helper(data: CommentedMap, dict_key_line: Dict[str, int], parentkey: str = "") -> Dict[str, int]:
    """
    Recursive helper function to fetch the line infos of each keys in the config yaml file.

    Built to be called inside of `_get_lc_dict`.
    """
    sep = "."  # Don't modify, it is to match the "keys" return in the errors of the yamale lib.
    keys_indexes = None

    try:
        if len(data) > 0:
            keys_indexes = range(len(data))
    except TypeError:
        pass
    try:
        keys = data.keys()
        keys_indexes = keys
    except AttributeError:
        pass

    if keys_indexes is None:
        return dict_key_line  # return condition from recursion

    for key in keys_indexes:
        if parentkey != "":
            keyref = parentkey + sep + str(key)
        else:
            keyref = str(key)
        try:
            lnum = data.lc.data[key][0] + 1
            if keyref in dict_key_line:
                print(
                    f"WARNING : key '{keyref}' is NOT UNIQUE, at lines {dict_key_line[keyref]:>4} and {lnum:>4}."
                    f" (overwriting)."
                )
            dict_key_line[keyref] = lnum
            # print(f"line {lnum:<3} : {keyref}")
            _get_lc_dict_helper(data[key], dict_key_line, keyref)  # recursion
        except AttributeError:
            pass

    return dict_key_line

def _get_lc_dict(path: Path) -> Dict[str, int]:
    """
    Helper function to trace back the line number in the yaml file for each keys.

    Built to be called inside of `validate`.

    Parameters
    ----------
    path : Path
        Path to the config yaml file (not the schema).

    Returns
    -------
    Dict[str, int]
        Maps the keys to their line number, the line counter (lc).
        This dictionary is only 1 level and the keys corresponds to the ones report by the yamale lib.
    """
    dict_key_line: Dict[str, int] = {}
    with YAML(typ="rt") as yaml:
        for data in yaml.load_all(path):
            dict_key_line = _get_lc_dict_helper(data, dict_key_line)
    return dict_key_line

def validate(path_schema: Path, path_data: Path):
    """
    Validates the config yaml file according to the schema yaml file.

    Will be silent if good and will exit the program if there is an error,
    and will output an detailed error message to fix the config file.

    Parameters
    ----------
    path_schema : Path
        Path to the schema yaml file.
    path_data : Path
        Path to the config yaml file.
    """
    # Create a schema object
    schema = yamale.make_schema(path=path_schema, parser="ruamel")

    # Create a Data object
    config = yamale.make_data(path=path_data, parser="ruamel")
    # Validate data against the schema. Throws a ValueError if data is invalid.
    try:
        yamale.validate(schema, config)
        print("Validation success!👍")
    except yamale.YamaleError as e:
        errmsg = "Validation failed!\n"
        lc = _get_lc_dict(path_data)
        for result in e.results:
            title1 = "Schema"
            title2 = "Config"
            sep = f"{'-'*40}\n"
            errmsg += f"{title1:<10} : {result.schema}\n{title2:<10} : {result.data}\n{sep}"
            for error in result.errors:
                keyerr = error.split(":", 1)
                keypath = keyerr[0]
                err = keyerr[1]
                l_num = lc.get(keypath, "?")
                errmsg += f"* line {l_num:>4}:  {keypath:<40} : {err}\n"
            errmsg += f"{sep}"

        print(errmsg)
        exit(1)

Then in another file I use it like that :

curr_path = Path(__file__).parent
path_schema = (curr_path / "schema.yaml").resolve()
path_data = (curr_path / "data.yaml").resolve()

validate(path_schema=path_schema, path_data=path_data)

I guess the code can be improved, I tested quiet a lot but I admit I did not try special cases. But it should not crash as I handled errors, maximum you don't get the line number (just a '?'). I tried to be careful to use only "public" method from ruamel.yaml (CommentedMap.lc.data[key][0]). (I am on Python 3.8, ruamel.yaml 0.16.6 and yamale 3.0.1)

Here below an example: schema.yaml:

list_with_two_types: list(str(), include('variant'))
questions: list(include('question'))
---
variant:
  rsid: str()
  name: str()
# Comment 1
question:
  choices: list(include('choices')) # Comment 10
  questions: list(include('question'), required=False)

choices:
  id: str()
---
variant2:
  rsid2: str()
  name2: str()
  extra:
    abc: int()
    def: num()
# Comment 1
question2:
  choices2: list(include('choices')) # Comment 10
  questions2: list(include('question'), required=False)

data.yaml :

list_with_two_types:
  - name: "some SNP"
    rsid: "rs123"
  - "some"
  - "thing"
  - rsid: "rs312"
    name: 35
questions:
  - choices:
      - id: "id_str"
      - id: "id_str1"
    questions:
      - choices:
          - id: "id_str"
          - id: 66
---
list_with_two_types2:
  - name2: "some SNP"
    rsid2: "rs123"
  - "some2"
  - "thing2"
  - rsid2: "rs312"
    name2: 35
questions2:
  - choices2:
      - id2: "id_str"
      - id2: "id_str1"
    questions2:
      - choices2:
          - id2: "id_str"
          - id2: "id_str1"

And this is the output I am getting:

Validation failed!
Schema     : C:\Users\PC-G\Documents\Work\Workspace\sens\config\schema.yaml
Config     : C:\Users\PC-G\Documents\Work\Workspace\sens\config\data.yaml
----------------------------------------
* line    6:  list_with_two_types.3                    :  '{'rsid': 'rs312', 'name': 35}' is not a str.
* line    7:  list_with_two_types.3.name               :  '35' is not a str.
* line   15:  questions.0.questions.0.choices.1.id     :  '66' is not a str.
----------------------------------------
Schema     : C:\Users\PC-G\Documents\Work\Workspace\sens\config\schema.yaml
Config     : C:\Users\PC-G\Documents\Work\Workspace\sens\config\data.yaml
----------------------------------------
* line   24:  questions2                               :  Unexpected element
* line   17:  list_with_two_types2                     :  Unexpected element
* line    1:  list_with_two_types                      :  Required field missing
* line    8:  questions                                :  Required field missing
----------------------------------------
mildebrandt commented 4 years ago

Thanks for your interest in Yamale. I agree, the line number would be helpful. To be complete, we'd probably want to update the error class to hold the line numbers separately. It'll take a little thought. Thanks for the start towards that.

For your code, be careful using . as the separator since that can be part of the key. We use that as a separator for the output, but internally we use something else. I can see how that may cause confusion when reading the output, and we may need to revisit that later.

Thanks!

GilShoshan94 commented 4 years ago

My pleasure !

Should I close this "issue" ? I don't really know what people usually do, this is my first "contribution" in an open source project.

mildebrandt commented 4 years ago

No, we'll leave it open as a reminder that people would like to have a line number in the output.

ohare93 commented 4 years ago

Would very much appreciate this feature :+1:

mechie commented 2 years ago

Would a PR along these lines be welcome? Our yamale fork now has "promoted" error messages, which hold a few separate properties (error message, ancestry list, pointer to data evaluated, etc).

I need to look closer at the above, because having line numbers as another property would be useful for displaying to users.

Anyhow, with a few changes and cleanup, I think it'd be PR-able--if it's not stepping on a WIP, @mildebrandt.

mildebrandt commented 2 years ago

@mechie That would be very welcome....it seems there is interest and we haven't been working towards this feature internally.

ckadner commented 7 months ago

Great proposal. We are interested in this feature as well! :-)