aboutcode-org / saneyaml

Cleaner, simpler, safer and saner YAML parsing/serialization in Python, for YAML meant to be readable first, on top of PyYAML

is_iso_date pattern is too liberal #14

Closed ferdnyc closed 5 months ago

ferdnyc commented 5 months ago

I don't know if it's actually used anywhere, but the pattern used for is_iso_date:

https://github.com/nexB/saneyaml/blob/40e5fa7c0b6e0012452053839184e5cd29802063/src/saneyaml.py#L330

...is too liberal, because alternation (|) has lower precedence than nearly all other RE syntax, so without parentheses it splits the entire pattern into two alternatives. As a result, the pattern will match either anything that starts with 19, or a date of the form 20[0-9][0-9]-[01][0-9]-[0123]?[1-9].

To alternate only 19 and 20, but not the rest of the pattern, they need to be enclosed in parentheses: (19|20)

>>> import re
>>> is_iso_date = re.compile(
... r'19|20[0-9]{2}-[0-1][0-9]-[0-3]?[1-9]').match
>>>
>>> is_iso_date('1994-01-01')
<re.Match object; span=(0, 2), match='19'>
>>> is_iso_date('2004-01-01')
<re.Match object; span=(0, 10), match='2004-01-01'>
>>> is_iso_date('2004-01')
>>> is_iso_date('193')
<re.Match object; span=(0, 2), match='19'>
>>> is_iso_date('1992-bibble')
<re.Match object; span=(0, 2), match='19'>
>>>
>>> fixed_iso_date = re.compile(
... r'(19|20)[0-9]{2}-[0-1][0-9]-[0-3]?[1-9]').match
>>>
>>> fixed_iso_date('1994-01-01')
<re.Match object; span=(0, 10), match='1994-01-01'>
>>> fixed_iso_date('2004-01-01')
<re.Match object; span=(0, 10), match='2004-01-01'>
>>> fixed_iso_date('2004-01')
>>> fixed_iso_date('1994-01')
>>> fixed_iso_date('193')

...The tail end of the pattern is questionable as well. Because the last character accepts only [1-9] and the digit preceding it is optional, the match is truncated for any day ending in 0: only the first 9 characters (out of 10) match. This means that, for example, 2004-01-30 matches, but only as far as 2004-01-3. Therefore, broken strings like 2004-01-3a will also match.

>>> # Continuing from above
>>> is_iso_date('2004-01-10')
<re.Match object; span=(0, 9), match='2004-01-1'>
>>> is_iso_date('2004-01-1')
<re.Match object; span=(0, 9), match='2004-01-1'>
>>> is_iso_date('2004-01-30')
<re.Match object; span=(0, 9), match='2004-01-3'>
>>> is_iso_date('2004-01-31')
<re.Match object; span=(0, 10), match='2004-01-31'>
>>> is_iso_date('2004-01-0a')
>>> is_iso_date('2004-01-1a')
<re.Match object; span=(0, 9), match='2004-01-1'>

The pattern should be:

r'(19|20)[0-9]{2}-[0-1][0-9]-[0-3][0-9]'

with a non-optional leading day digit, since ISO-8601 doesn't recognize strings of the form 2004-1-1 or 2004-01-1 as valid dates; all 8 digits are required.
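
One further caveat, offered as a sketch and not as what saneyaml does or should do: .match only anchors at the start of the string, so even the corrected pattern would still accept trailing junk such as 2004-01-30abc, and it accepts impossible values such as month 19 or day 39. The example below assumes the goal is to accept only complete YYYY-MM-DD strings. The names ISO_DATE, looks_like_iso_date and is_real_iso_date are hypothetical and do not exist in saneyaml. It uses fullmatch to anchor both ends, a non-capturing group for the century, and date.fromisoformat from the standard library as an optional calendar check.

```python
import re
from datetime import date

# Hypothetical names for illustration; none of these exist in saneyaml.
# (?:19|20) limits the alternation to the century prefix only.
ISO_DATE = re.compile(r'(?:19|20)[0-9]{2}-[01][0-9]-[0-3][0-9]')


def looks_like_iso_date(s):
    """Shape check only: the whole string must be YYYY-MM-DD in 19xx or 20xx."""
    # fullmatch requires the pattern to consume the entire string,
    # so trailing characters are rejected.
    return ISO_DATE.fullmatch(s) is not None


def is_real_iso_date(s):
    """Shape check plus calendar check (rejects e.g. 2004-13-01, 2004-02-30)."""
    if not looks_like_iso_date(s):
        return False
    try:
        date.fromisoformat(s)
    except ValueError:
        return False
    return True
```

Whether the calendar check is worth it depends on what the caller does with the result; if the function only decides whether a YAML scalar should be treated as a date-like string, the shape check alone may be enough.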