UniversalDependencies / tools

Various utilities for processing the data.
GNU General Public License v2.0

Regular expressions in validator with new Python #98

Closed. Stormur closed this issue 7 months ago

Stormur commented 10 months ago

It seems that the new Python 3.12 requires regular expressions to be written as raw strings r'...' instead of plain strings '...'.

So, when calling validate.py, SyntaxWarnings are issued (e.g. for line 144: sentid_re=re.compile('^# sent_id\s*=\s*(\S+)$'))

If this is correct, probably the code needs this small update? It should be backward compatible, right?

nschneid commented 10 months ago

I think that's about the \s and \S escapes: they should be \\s and \\S unless it's an r-string.

foxik commented 10 months ago

I do not think that Python 3.12 would enforce raw strings for regular expressions. On the other hand, the current line https://github.com/UniversalDependencies/tools/blob/5363b77142778cba2d6cc1a50f74d010331508cf/validate.py#L144 either should use raw strings or should use double backslashes.
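To illustrate the two options mentioned above, here is a minimal sketch using the standard library `re` module (the validator itself may import a different regex module; that is an assumption here). The sample comment line `# sent_id = train-s1` is made up for the example.

```python
import re

# The same pattern text, spelled two ways in Python source.
raw_pattern = r'^# sent_id\s*=\s*(\S+)$'            # raw string
escaped_pattern = '^# sent_id\\s*=\\s*(\\S+)$'      # doubled backslashes

# Both literals produce exactly the same string object contents.
assert raw_pattern == escaped_pattern

sentid_re = re.compile(raw_pattern)

# Hypothetical sample line, only for illustration.
match = sentid_re.match('# sent_id = train-s1')
sent_id = match.group(1)
```

Either spelling avoids the warning; the raw string just keeps the pattern looking like the regular expression it represents.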

Stormur commented 10 months ago

There are more lines like this. The problem surfaced on a Mac with Python 3.12, while on my Linux system with Python 3.10 it does not come up.
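The difference between the two machines is consistent with how Python reports unrecognized escape sequences: since 3.6 they produce a DeprecationWarning, which is usually not displayed, and since 3.12 they produce a SyntaxWarning, which is. A rough sketch for checking a piece of source text on any version follows; the helper name `escape_warnings` is hypothetical and not part of the validator.

```python
import warnings

def escape_warnings(source):
    """Compile `source` and return the names of the warning categories raised."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter('always')   # record even normally hidden warnings
        compile(source, '<snippet>', 'exec')
    return [w.category.__name__ for w in caught]

# Source text as it would appear in a .py file (backslashes doubled here
# only because these are themselves ordinary Python string literals).
plain_source = "sentid_re = '^# sent_id\\s*=\\s*(\\S+)$'"
raw_source = "sentid_re = r'^# sent_id\\s*=\\s*(\\S+)$'"

plain_warnings = escape_warnings(plain_source)   # warns about '\s'
raw_warnings = escape_warnings(raw_source)       # no warning for a raw string
```

The category name in `plain_warnings` depends on the interpreter version, which would explain why the same file is silent on 3.10 and noisy on 3.12.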

(As a personal note, given the choice I find raw strings more readable than escaped characters.)

AngledLuffa commented 6 months ago

There is still one left in the current validate.py:

tools/validate.py:684: SyntaxWarning: invalid escape sequence '\p'
  edeprelpart_resrc = '[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(_[\p{Ll}\p{Lm}\p{Lo}\p{M}]+)*';

dan-zeman commented 6 months ago

Thanks! Fixed.
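The thread does not show the actual change. One plausible form of it, assuming the literal is simply turned into a raw string, does not alter the pattern: Python keeps the backslash of an unrecognized escape such as `\p`, so both spellings evaluate to the same text. The sketch below checks this by evaluating the two literals from their source text. Note that `\p{...}` property classes are not supported by the standard `re` module, so the pattern is only compared here, not compiled; the validator presumably uses a module that understands them, which is an assumption.

```python
import ast
import warnings

# Source text of the literal from the warning, and of its raw-string variant.
plain_src = r"'[\p{Ll}\p{Lm}\p{Lo}\p{M}]+(_[\p{Ll}\p{Lm}\p{Lo}\p{M}]+)*'"
raw_src = 'r' + plain_src

with warnings.catch_warnings():
    # Silence the invalid-escape warning while evaluating the old spelling.
    warnings.simplefilter('ignore')
    plain_value = ast.literal_eval(plain_src)

raw_value = ast.literal_eval(raw_src)

# True when switching to a raw string leaves the pattern text unchanged.
same_value = plain_value == raw_value
```

This relies on current behavior; a future Python version may turn the invalid escape into an error, in which case only the raw form would evaluate at all.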