23andMe / Yamale

A schema and validator for YAML.
MIT License
670 stars 88 forks source link

Feature Adds Day/Timestamp format Constraint #152

Closed zbremmer closed 2 years ago

zbremmer commented 3 years ago

This PR introduces the format constraint to both the Day and Timestamp validators.

The PyYAML and ruamel.yaml loaders have been modified to not automatically convert strings to dates or datetimes. Instead, dates and datetimes are loaded as strings. During the validation step, the string values are passed to the Day and Timestamp validators which try to coerce to a date/datetime using the value passed in the format argument of the schema definition file. If the format argument is not used, a default date/datetime parser util.parse_default_date() is used. This parser mimics the default PyYAML and ruamel.yaml behavior.

This PR also includes updates to the README file and new and updated tests.

lgtm-com[bot] commented 3 years ago

This pull request introduces 1 alert when merging 780e2c761c224d26225510a45014b7a49576b571 into 856c355d8762031c4f22167255e7ce9fb15bf6e5 - view on LGTM.com

new alerts:

mildebrandt commented 3 years ago

Thanks, this is looking good. Please take a glance at the build output. Also, I'd like to see the following tests:

For some of these, you can add to yamale/tests/fixtures/keywords.yaml and yamale/tests/fixtures/types.yaml. The types.yaml doesn't even include dates....must have missed that when that validator was added.

Thanks so much!

lgtm-com[bot] commented 3 years ago

This pull request introduces 1 alert when merging 6d067691c54840fe937440a7c179b389815df223 into 856c355d8762031c4f22167255e7ce9fb15bf6e5 - view on LGTM.com

new alerts:

lgtm-com[bot] commented 3 years ago

This pull request introduces 1 alert when merging 573e67c49c99b2230455021d3c5a5e7a147b0fed into 856c355d8762031c4f22167255e7ce9fb15bf6e5 - view on LGTM.com

new alerts:

lgtm-com[bot] commented 3 years ago

This pull request introduces 1 alert when merging 21ed4d9cd755b85147ca6f70db316d17c75594b6 into 856c355d8762031c4f22167255e7ce9fb15bf6e5 - view on LGTM.com

new alerts:

lgtm-com[bot] commented 3 years ago

This pull request introduces 1 alert when merging 6dd5bdd7cc395e0f55d4d95c4dd70c984ad068fa into 856c355d8762031c4f22167255e7ce9fb15bf6e5 - view on LGTM.com

new alerts:

lgtm-com[bot] commented 3 years ago

This pull request introduces 1 alert when merging 2ea1974a7b0ad0448e87174575bf2854b835d5cc into 856c355d8762031c4f22167255e7ce9fb15bf6e5 - view on LGTM.com

new alerts:

lgtm-com[bot] commented 3 years ago

This pull request introduces 1 alert when merging 3cab1a12d12b5260c8d587feea9af7cbc7479df7 into 856c355d8762031c4f22167255e7ce9fb15bf6e5 - view on LGTM.com

new alerts:

lgtm-com[bot] commented 3 years ago

This pull request introduces 1 alert when merging baccaaa506365b7afde88290a7d5718f1248d7a3 into 856c355d8762031c4f22167255e7ce9fb15bf6e5 - view on LGTM.com

new alerts:

zbremmer commented 3 years ago

I added in the first two sets of tests. I'm not sure where to add in the "bad format (doesn't correspond to strptime)" test. As written, the format argument takes a string but does not test that it can be used in strptime until Day.validate or Timestamp.validate is run. If someone passes in something like Day(format='not valid format') an error will not be thrown until the yamale.validate() function is kicked off.

I don't see any good hook to validate that the value passed to format is usable by strptime. I actually can't even find a way to validate that a format string is usable by strptime without calling strptime on some date value, which will just give a ValueError since the date won't match the nonsense format. But it won't necessarily indicate that the format is wrong, just that they don't match. Is it enough that the error will be thrown at runtime, or do we need to check that the value passed in the schema is valid? Any suggestions would be welcome.

I also see that the build is failing one the test_ruamel_base_case test but when I run it locally it completes successfully. I'm not sure what the issue is here--any help would be appreciated.

mildebrandt commented 3 years ago

For the build failure, are you running make test to execute the tests? I get the same error on my side.

zbremmer commented 3 years ago

No, I was running the tests through pytest. I'm not familiar with the make test command, can you give me some guidance?

mildebrandt commented 3 years ago

Sure, make is one of the original build tools that is still around today: https://en.wikipedia.org/wiki/Make_(software)

The input is a file called Makefile. You'll see this at the root of the yamale directory: https://github.com/23andMe/Yamale/blob/master/Makefile

When someone runs make, they provide a target....in this case test. For yamale, that will run the tox command:

test:
    @tox

You can get the same behavior by just running tox. You'll need to have both python3.6 and python3.8 executables in your path.

mildebrandt commented 3 years ago

Here are some examples of what comes back from a successful strptime, a format that doesn't match, and a bad format:

>>> datetime.strptime("21 June, 2018", "%d %B, %Y")
datetime.datetime(2018, 6, 21, 0, 0)

>>> datetime.strptime("21 June, 2018", "%d %B, %Y %a")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python@3.9/3.9.1_6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/_strptime.py", line 568, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/usr/local/Cellar/python@3.9/3.9.1_6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/_strptime.py", line 349, in _strptime
    raise ValueError("time data %r does not match format %r" %
ValueError: time data '21 June, 2018' does not match format '%d %B, %Y %a'

>>> datetime.strptime("21 June, 2018", "%d %B, %Y %L")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Cellar/python@3.9/3.9.1_6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/_strptime.py", line 568, in _strptime_datetime
    tt, fraction, gmtoff_fraction = _strptime(data_string, format)
  File "/usr/local/Cellar/python@3.9/3.9.1_6/Frameworks/Python.framework/Versions/3.9/lib/python3.9/_strptime.py", line 341, in _strptime
    raise ValueError("'%s' is a bad directive in format '%s'" %
ValueError: 'L' is a bad directive in format '%d %B, %Y %L'

In the tests, it would be good to check that we're getting back an error message that includes "is a bad directive in format" in the error string.

For the runtime itself, since both types of errors return a ValueError, there's no real way to pre-check. We can rely on the runtime to throw the appropriate message.

mildebrandt commented 3 years ago

Oh, and make sure to take a look at the alert in the LGTM output: https://github.com/23andMe/Yamale/pull/152/checks?check_run_id=1751370323

It passed, but we'd rather it pass with no alerts. Thanks!

zbremmer commented 3 years ago

Thanks for the help @mildebrandt. I added in the bad directive test but I had to change the default ValueError in Day.validate() to return the 'bad directive' message.

I fixed the LGTM alert by removing the try...except block in _pyyaml(). Since the NoDatesSafeLoader is now the default loader and is defined in yaml_reader.py we shouldn't need to worry about catching an exception here.

Thanks for the info on make, I've not used that before. I got this up and running and am still getting the error in yamale/tests/test_readers.py. I've tried running the actual test function in both python3.6 and python3.8 and both of them come back with no errors. Below is the modified function to run in the console:

>>> def test_ruamel_base_case(): 
...     # Test that util function converts dates to the same as ruamel base loader
...     r_yaml = YAML(typ='safe')
...     for i in DATE_LIST: 
...         date_one = parse_default_date(i)
...         print(type(date_one))
...         date_two = list(r_yaml.load_all(i))[0]
...         print(type(date_two))
...         print(type(date_one) == type(date_two))
... 
>>> test_ruamel_base_case()
<class 'datetime.date'>
<class 'datetime.date'>
True
<class 'str'>
<class 'str'>
True
<class 'datetime.datetime'>
<class 'datetime.datetime'>
True
<class 'datetime.datetime'>
<class 'datetime.datetime'>
True
<class 'datetime.datetime'>
<class 'datetime.datetime'>
True
<class 'str'>
<class 'str'>
True
<class 'str'>
<class 'str'>
True

I'm stumped. Could there be something weird going on during the build that is changing the type returned by r_yaml.load_all() from a datetime to a string?

mildebrandt commented 3 years ago

Ah, sorry I didn't notice that before. We depend on CSafeLoader for fast processing of our yaml files. Can you ensure your changes include the option to use that load if it's available?

Are you saying that running the tests with tox locally, you're still not getting the same errors as the build system? I ran with your print statements through tox and got the following output:

<class 'datetime.date'>
<class 'str'>
False
<class 'str'>
<class 'str'>
True
<class 'datetime.datetime'>
<class 'str'>
False
<class 'datetime.datetime'>
<class 'str'>
False
<class 'datetime.datetime'>
<class 'str'>
False
<class 'str'>
<class 'str'>
True
<class 'str'>
<class 'str'>
True

Hope that output helps.

zbremmer commented 3 years ago

For CSafeLoader I can add that back in but the default will still be NoDatesSafeLoader. That's the custom loader I built to prevent yaml.load() from parsing dates into datetime objects.

I am getting the same errors through tox, but it shouldn't be giving the output above. For example, the first date run through the loop in test_ruamel_base_case() is 2010-01-01. When we feed that through util.parse_default_date() it returns a datetime.date object as shown in the first line of your results. This is correct. However, if I open a python shell and run that date through ruamel I get the following:

>>> from ruamel.yaml import YAML
>>> list(YAML(typ='safe').load_all('2010-01-01'))
[datetime.date(2010, 1, 1)]
>>> type(list(YAML(typ='safe').load_all('2010-01-01'))[0])
<class 'datetime.date'>

Yet in tox it lists the output as type string. That's what I don't understand. If ruamel is returning it as a datetime.date why is the output saying the type is a string for all dates in DATE_LIST?

mildebrandt commented 3 years ago

So, your NoDatesSafeLoader is extending the SafeLoader class. Please detect if CSafeLoader is available, and extend from that. If it's not available, then extend from SafeLoader.

As for the actual date parsing errors when using tox, I'm not too sure what to say. Unfortunately I don't have a lot of time to help debug ruamel. If I do get some free cycles this week, I'll jump in.

zbremmer commented 3 years ago

Understood. I updated NoDatesSafeLoader to extend CSfaeLoader if found.

I think there may be something goofy going on in the namespace causing the error. In yaml_reader.py I have the following:

def _ruamel(f):
    from ruamel.yaml import YAML
    yaml = YAML(typ='safe')

    # Replace timestamp constructor to prevent converting to datetime obj
    yaml.constructor.yaml_constructors[u'tag:yaml.org,2002:timestamp'] = \
        yaml.constructor.yaml_constructors[u'tag:yaml.org,2002:str']

    return list(yaml.load_all(f))

The yaml.constructor.yaml_constructor... line is what changes the loader to not parse datetime objects and instead keep them as strings.

At the beginning of test_readers.py I am have the following imports:

from .. import readers
from ruamel.yaml import YAML

The odd thing is that if I comment out the yaml.constructor.yaml_constructor... assignment in the _ruamel method above, the error with test_ruamel_base_case disappears.

Could it be that when I import readers I am also importing the updated instance of ruamel.yaml which is being called instead of the fresh import that I execute in test_readers.py? I've noticed that if I remove the import ruamel.yaml as YAML line from test_readers.py I don't get any error when running r_yaml = YAML(typ='safe') so it's got to be coming in somewhere.

mildebrandt commented 3 years ago

Hi @zbremmer! It's been really busy over here, but I did find a little time on a Saturday morning. So I grabbed a milk tea, a chocolate croissant, and took a quick look. Here's where the issue is:

$ python
Python 3.9.1 (default, Jan  8 2021, 17:17:43) 
[Clang 12.0.0 (clang-1200.0.32.28)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> from ruamel.yaml import YAML 
>>> r_yaml = YAML(typ='safe')
>>> i = "2010-01-01"
>>> list(r_yaml.load_all(i))[0]
datetime.date(2010, 1, 1)
>>> r_yaml.constructor.yaml_constructors[u'tag:yaml.org,2002:timestamp'] = r_yaml.constructor.yaml_constructors[u'tag:yaml.org,2002:str']
>>> list(r_yaml.load_all(i))[0]
'2010-01-01'
>>> r_yaml2 = YAML(typ='safe')
>>> list(r_yaml2.load_all(i))[0]
'2010-01-01'
>>> 

It seems setting the yaml constructor like this sets it globally for all instances of ruamel. So when another instance is created, (r_yaml2 = YAML(typ='safe')), it still uses the constructors set on the previous instance.

I hope that leads you in the right direction. Good luck!

mildebrandt commented 2 years ago

Closing due to lack of activity.