dabeaz / ply

Python Lex-Yacc
http://www.dabeaz.com/ply/index.html

extreme slowness #289

Closed: florianschanda closed this 6 months ago

florianschanda commented 1 year ago

Hi, I am trying to write a C++ lexer with PLY 3.11, and my lexer basically hangs on the following input:

kitten_1::kitten_2::Kitten_3{kitten_1::kitten_2::kitten_4{"foo_bar_baz"},
                 cat_1::cat_2::cat_3()})};
std::unique_ptr<mmmmmmmm_mmmmmm::TtttTtttttttTttttttt>

Removing 5 t's from the end of the last identifier allows the program to terminate in ~3.7 seconds, and removing just 4 t's allows it to terminate in 7.5 seconds.

I am not sure what the tool is doing, but Ctrl+C shows it's here:

Traceback (most recent call last):
  File "lexer.py", line 132, in <module>
    sanity_test()
  File "lexer.py", line 128, in sanity_test
    token = lexer.token()
  File "lexer.py", line 120, in token
    return self.lex.token()
  File "${HOME}/.local/lib/python3.8/site-packages/ply/lex.py", line 327, in token
    m = lexre.match(lexdata, lexpos)
KeyboardInterrupt

I've never had this issue with PLY before, so I am really confused. This is my lexer so far. Note that I've made sure that none of the interesting regexes match the empty string.

import sys
import re

import ply.lex

class CPP_Lexer:
    tokens = (
        "IDENTIFIER",
        "STRING",
        "INTEGER",
        "BRA", "KET",
        "C_BRA", "C_KET",
        "S_BRA", "S_KET",
        "PP_DIRECTIVE",
        "OPERATOR",
        "COMMA", "SEMI",
        "DOT",
        "ASSIGN",
        "NAMESPACE",
        "COLON",
    )

    def t_IDENTIFIER(self, t):
        r"[a-zA-Z_][a-zA-Z0-9_]*"
        return t

    t_BRA   = r"\("
    t_KET   = r"\)"
    t_C_BRA = r"\{"
    t_C_KET = r"\}"
    t_S_BRA = r"\["
    t_S_KET = r"\]"
    # Operators: a single alternation assembled from the pieces below.
    t_OPERATOR = "|".join("(%s)" % op
                          for op in (r"<[=<]?",
                                     r">[=>]?",
                                     r"&&?",
                                     r"\|\|?",
                                     r"\?",
                                     r"\+\+", r"--",
                                     r"[-+*/%^!]=?",
                                     r"&=", r"\|=",
                                     r"<<=", ">>="))
    assert re.match(t_OPERATOR, "") is None
    t_COMMA = r","
    t_SEMI  = r";"
    t_DOT   = r"\."
    t_ASSIGN = r"="
    t_NAMESPACE = r"::"
    t_COLON = r":"
    t_PP_DIRECTIVE = r"\#[a-zA-Z_][a-zA-Z0-9]* [^\n]*"

    # Integer literals: binary/octal/decimal/hex, allowing ' digit separators.
    int_bin = r"0[bB][01]('?[01])*"
    int_oct = r"0[0-7]('?[0-7])*"
    int_dec = r"[1-9]('?[0-9])*"
    int_hex = r"0[xX][0-9a-fA-F]('?[0-9a-fA-F])*"
    int_suffix = r"([uU]([lL]|ll|LL)?)|(([lL]|ll|LL)[uU]?)"
    int_regex = r"(%s)(%s)?" % ("|".join("(%s)" % s
                                         for s in ("0",
                                                   int_bin,
                                                   int_oct,
                                                   int_dec,
                                                   int_hex)),
                                int_suffix)
    assert re.match(int_regex, "") is None

    @ply.lex.TOKEN(int_regex)
    def t_INTEGER(self, t):
        return t

    # String literal pieces: escape-sequence bodies, combined below with
    # any non-backslash character.
    str_simple_escape = r'''['"?\abfnrtv]'''
    str_octal_escape  = r'''[0-7]{1,3}'''
    str_hex_escape    = r'''x[0-9a-fA-F]{1,4}'''
    str_regex = r'''"((%s)|(%s))*"''' % ("|".join("(%s)" % s
                                                  for s in (str_simple_escape,
                                                            str_octal_escape,
                                                            str_hex_escape)),
                                         r"[^\\]")
    assert re.match(str_regex, "") is None

    @ply.lex.TOKEN(str_regex)
    def t_STRING(self, t):
        return t

    def t_newline(self, t):
        r"\n+"
        t.lexer.lineno += len(t.value)
        return None

    t_ignore = " \t"

    def t_c_comment(self, t):
        r'/\*(.|\n)*?\*/'
        t.lexer.lineno += t.value.count('\n')
        return None

    def t_cpp_comment(self, t):
        r'//[^\n]*'
        return None

    def t_error(self, t):
        print("%s:%u: illegal char: %s" % (self.filename,
                                           t.lexer.lineno,
                                           repr(t.value[0])))
        sys.exit(1)

    def __init__(self, filename, content=None):
        assert isinstance(filename, str)
        assert isinstance(content, str) or content is None
        self.lex      = ply.lex.lex(module=self)
        self.filename = filename
        if content is None:
            with open(filename, "r", encoding="UTF-8") as fd:
                self.content = fd.read()
        else:
            self.content = content
        self.lex.input(self.content)

    def token(self):
        return self.lex.token()

def sanity_test():
    lexer = CPP_Lexer(sys.argv[1])
    token = lexer.token()
    while token:
        print(token)
        token = lexer.token()

if __name__ == "__main__":
    sanity_test()
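
Run with the input file as the first argument, e.g. python3 lexer.py sample.cpp, where sample.cpp is any file containing the snippet above.
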
dabeaz commented 1 year ago

I'm not able to deep-dive into this, but there's not a whole lot to the PLY lexer other than repeatedly calling re.match(). That said, the re module is known to have pathological performance for certain kinds of regex. It's possible that you've somehow fallen into that by accident. Fixing it will require some further investigation.
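
For illustration only, here is a tiny sketch of the classic pathological shape (not taken from your lexer): a repetition over an alternation whose branches can match the same text. On a failing match, each extra ambiguous character roughly doubles the work, which would fit the doubling you observed:

import re
import time

# Both branches of the alternation can match a plain "t", so on a failing
# match the backtracking engine explores every way of splitting the input
# between them: roughly 2**n paths for n ambiguous characters.
pattern = re.compile(r'"(([tv])|([^\\]))*"')

for n in range(18, 25):
    text = '"' + "t" * n              # opening quote, but no closing quote
    start = time.perf_counter()
    pattern.match(text)               # returns None, but only eventually
    print(n, round(time.perf_counter() - start, 2))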

florianschanda commented 1 year ago

OK, thanks for the fast response. I can try to figure out whether it's really a pathological regex as such, or whether the way PLY combines the regexes makes it pathological.
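
A rough plan, sketched below (assuming the file above is saved as lexer.py, as in the traceback): time each rule's regex on its own against the offending snippet, then the combined alternation. A rule that hangs by itself is the culprit; if each is fine alone, it's the combination.

import re
import time

from lexer import CPP_Lexer   # the module shown above

SNIPPET = '''kitten_1::kitten_2::Kitten_3{kitten_1::kitten_2::kitten_4{"foo_bar_baz"},
                 cat_1::cat_2::cat_3()})};
std::unique_ptr<mmmmmmmm_mmmmmm::TtttTtttttttTttttttt>
'''

def time_rule(name, pattern):
    # Walk the input the way ply.lex does: re.match at each position.
    regex = re.compile(pattern)
    start = time.perf_counter()
    pos = 0
    while pos < len(SNIPPET):
        m = regex.match(SNIPPET, pos)
        pos = m.end() if m and m.end() > pos else pos + 1
    print("%-10s %.3fs" % (name, time.perf_counter() - start))

# If one of these hangs on its own, that rule is the problem.
rules = {"IDENTIFIER": r"[a-zA-Z_][a-zA-Z0-9_]*",
         "INTEGER":    CPP_Lexer.int_regex,
         "OPERATOR":   CPP_Lexer.t_OPERATOR,
         "STRING":     CPP_Lexer.str_regex}
for name, pattern in rules.items():
    time_rule(name, pattern)
time_rule("combined", "|".join("(?:%s)" % p for p in rules.values()))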

yoosofan commented 1 year ago

Aside from your performance issue, there is a logical problem in your code. If you want your lexical analyzer to handle C++ templates, then your rule for operators like >[=>]? will mistake two closing template brackets for the >> operator (bitwise right shift, also used with ostream cout). This was a problem for C++ compilers from 1999 to 2011 because of an unclear decision by the C++ committee; C++ programmers had to add an extra space and write > > to close nested templates.
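
A quick way to see this with the OPERATOR rule from the lexer above:

import re

# The optional "[=>]" is greedy, so ">>" is consumed as one two-character
# operator rather than as two closing angle brackets.
print(re.match(r">[=>]?", ">>").group())   # prints ">>"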

florianschanda commented 1 year ago

Did I mention that I hate C++ with a burning passion? :)

What are the actual lexing rules in this case? Does the lexer need to be "template aware"? Because that's MATLAB levels of weirdness right there.

yoosofan commented 1 year ago

> Did I mention that I hate C++ with a burning passion? :)
>
> What are the actual lexing rules in this case? Does the lexer need to be "template aware"? Because that's MATLAB levels of weirdness right there.

I wanted to point out a problem that might affect you. AFAIK, this problem can't be solved at the lexical level.