Faulty unicode escape handling leads to tokenizing failure

c2nes / javalang

Pure Python Java parser and tools

MIT License


It seems that javalang replaces Unicode escapes with the raw characters they encode (as pointed out in issue #58) in the pre_tokenize method, before tokenizing.

I don't see why this replacement is necessary (the pre_tokenize method has been there since the initial commit), and it can cause failures in rare cases, such as when an escape inside a string literal decodes to a double quote.

Example:

>>> import javalang
>>> javalang.parse.parse(r'class Foo { String bar = "\u0022"; }')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Program Files\Python38\lib\site-packages\javalang\parse.py", line 52, in parse
    parser = Parser(tokens)
  File "C:\Program Files\Python38\lib\site-packages\javalang\parser.py", line 95, in __init__
    self.tokens = util.LookAheadListIterator(tokens)
  File "C:\Program Files\Python38\lib\site-packages\javalang\util.py", line 92, in __init__
    self.list = list(iterable)
  File "C:\Program Files\Python38\lib\site-packages\javalang\tokenizer.py", line 535, in tokenize
    self.read_string()
  File "C:\Program Files\Python38\lib\site-packages\javalang\tokenizer.py", line 201, in read_string
    self.error('Unterminated character/string literal')
  File "C:\Program Files\Python38\lib\site-packages\javalang\tokenizer.py", line 576, in error
    raise error
javalang.tokenizer.LexerError: Unterminated character/string literal at """, line 1: class Foo { String bar = """;
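
The mechanism can be sketched without javalang at all. The snippet below is a minimal illustration, not javalang's actual pre_tokenize code: the helper name decode_unicode_escapes and its regex are assumptions made for this example. It decodes \uXXXX escapes before any tokenizing happens, which is enough to turn the string literal into three consecutive quotes.

```python
import re

# Hypothetical stand-in for a "decode escapes first" step.
# This is NOT javalang's implementation; it also ignores backslash
# parity (e.g. an escaped backslash followed by u0022), which a real
# implementation would have to handle.
_UNICODE_ESCAPE = re.compile(r'\\u+([0-9a-fA-F]{4})')


def decode_unicode_escapes(source):
    # Replace every backslash-u escape with the character it denotes.
    return _UNICODE_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), source)


java_source = r'class Foo { String bar = "\u0022"; }'
decoded = decode_unicode_escapes(java_source)

# The escape has become a literal double quote, so a tokenizer that runs
# on the decoded text sees an empty string followed by an unterminated one.
print(decoded)  # class Foo { String bar = """; }
```

The printed text matches the source fragment quoted in the LexerError message above.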

PR #96 fixes this issue; maybe we should merge it?

$ java -version
openjdk version "11.0.10" 2021-01-19
OpenJDK Runtime Environment AdoptOpenJDK (build 11.0.10+9)
OpenJDK 64-Bit Server VM AdoptOpenJDK (build 11.0.10+9, mixed mode)
$ cat Foo.java
class Foo { String bar = "\u0022"; }
$ javac Foo.java
Foo.java:1: error: unclosed string literal
class Foo { String bar = "\u0022"; }
^
Foo.java:1: error: reached end of file while parsing
class Foo { String bar = "\u0022"; }
^
2 errors

Faulty unicode escape handling leads to tokenizing failure #99