dabeaz / sly

Sly Lex Yacc
Other
816 stars 107 forks source link

Help with reduce/reduce conflict while parsing SQL #108

Closed leandropls closed 1 year ago

leandropls commented 1 year ago

I'm trying parse some simple SQL clauses and build them into SQLAlchemy objects. While parsing the example expressions below seem to work fine, I can't solve the generated reduce/reduce conflict. Can someone help me elucidate this?

This is a very simplified version to exemplify the problem I'm having:

from secrets import token_hex
from typing import Any

import sqlalchemy as sa
from sly import Lexer, Parser
from sqlalchemy.sql.elements import BindParameter, ClauseElement, ColumnClause

# noinspection PyUnresolvedReferences,PyUnboundLocalVariable,PyPep8Naming
class SQLLexer(Lexer):
    tokens = {
        BETWEEN,
        NUMBER,
        BOOLEAN,
        AND,
        COLUMN,
    }

    ignore = " \n"

    BETWEEN = r"BETWEEN"
    NUMBER = r"(\d+([.]\d*)?([eE][+-]?\d+)?|[.]\d+([eE][+-]?\d+)?)"
    BOOLEAN = r"TRUE|FALSE"
    AND = r"AND"
    COLUMN = r'"(?:[^"]|"")*"|[a-zA-Z_][a-zA-Z0-9_]*'

# noinspection PyUnresolvedReferences,PyUnusedLocal
class SQLParser(Parser):
    debugfile = "parser.txt"
    tokens = SQLLexer.tokens

    precedence = (
        ("left", AND),
        ("nonassoc", BETWEEN),
    )

    # Schema Item
    @_("COLUMN")
    def expr(self, p: tuple[str]) -> ColumnClause:
        return sa.column(p[0])

    # Number
    @_("NUMBER")
    def expr(self, p: tuple[str]) -> BindParameter:
        try:
            return wrap_value(int(p[0]))
        except ValueError:
            return wrap_value(float(p[0]))

    # Boolean
    @_("BOOLEAN")
    def expr(self, p: tuple[str]) -> BindParameter:
        if p[0].lower() == "true":
            return wrap_value(True)
        elif p[0].lower() == "false":
            return wrap_value(False)
        else:
            raise NotImplementedError("unknown boolean value")

    # AND
    @_("expr AND expr")
    def expr(
        self, p: tuple[ClauseElement, str, ClauseElement]
    ) -> tuple[ClauseElement, ClauseElement]:
        return sa.and_(p[0], p[2])

    # BETWEEN
    @_("expr BETWEEN expr AND expr")
    def expr(
        self, p: tuple[ClauseElement, str, ClauseElement, str, ClauseElement]
    ) -> tuple[ClauseElement, ClauseElement]:
        return p[0].between(p[2], p[4])

def wrap_value(value: Any) -> BindParameter:
    return sa.bindparam("param_" + token_hex(4), value)

if __name__ == "__main__":
    samples = [
        "myFlag AND FALSE",
        "rowNumber BETWEEN 1 AND 10",
    ]
    lexer = SQLLexer()
    parser = SQLParser()
    for sample in samples:
        print(f"input: {sample}", flush=True)
        tokens = list(lexer.tokenize(sample))
        print(f"tokens: {[(t.type, t.value) for t in tokens]}", flush=True)
        generated = parser.parse(iter(tokens))
        print(f"generated: {generated}", flush=True)
        print(flush=True)

And here's the output:

WARNING: 2 reduce/reduce conflicts
Parser debugging for SQLParser written to parser.txt
input: myFlag AND FALSE
tokens: [('COLUMN', 'myFlag'), ('AND', 'AND'), ('BOOLEAN', 'FALSE')]
generated: "myFlag" AND :param_da45c008

input: rowNumber BETWEEN 1 AND 10
tokens: [('COLUMN', 'rowNumber'), ('BETWEEN', 'BETWEEN'), ('NUMBER', '1'), ('AND', 'AND'), ('NUMBER', '10')]
generated: "rowNumber" BETWEEN :param_199c82be AND :param_01991b92

The conflict info:

Conflicts:

reduce/reduce conflict in state 10 resolved using rule expr -> expr AND expr  [precedence=left, level=1]
rejected rule (expr -> expr BETWEEN expr AND expr  [precedence=left, level=1]) in state 10

And state 10 info:

state 10

    (1) expr -> expr BETWEEN expr AND expr .
    (2) expr -> expr AND expr .
    (1) expr -> expr . BETWEEN expr AND expr
    (2) expr -> expr . AND expr
  ! reduce/reduce conflict for AND resolved using rule 2 (expr -> expr AND expr .)
  ! reduce/reduce conflict for BETWEEN resolved using rule 2 (expr -> expr AND expr .)
    $end            reduce using rule 1 (expr -> expr BETWEEN expr AND expr .)
    AND             reduce using rule 2 (expr -> expr AND expr .)
    BETWEEN         shift and go to state 5

I've had similar issues with expr IS expr and expr IS NOT expr which I wanted to treat as a.is_(b) and a.isnot(b) respectively, which I tried to solve with %prec with no success. But I'm assuming that figuring out this more complex (above) case will also handle the simpler one.

leandropls commented 1 year ago

Using @bugengine PR #89, I could see that it breaks because of the following construction:

   ╭╴
   │ expr BETWEEN expr AND expr ♦ AND expr
   │              ╰expr─────────╯
   │ ╰expr───────────────────────────────╯
   ├╴
   │ expr BETWEEN expr BETWEEN expr AND expr ♦ AND expr AND expr
   │                           ╰expr─────────╯
   │              ╰expr───────────────────────────────╯
   │ ╰expr─────────────────────────────────────────────────────╯
   ╰╴

Like,

select 4 between 5 between 1 and 10 and 9;

which would be a syntactic error in PostgreSQL. It'd need to be written as:

select 4 between (5 between 1 and 10) and 9;

But it isn't clear to me how to forbid such construction.

bugengine commented 1 year ago

ooooh my code was somewhat useful :) When looking at the SQL grammar (http://forcedotcom.github.io/phoenix/index.html#expression) I can see that an expression can expand to logic operation between optional conditions which are written as operand BETWEEN operand AND operand but your rules are different and allow recursion: expr BETWEEN expr AND expr which means that a condition could appear inside another one as one of the operands.

There are two ways out of this:

Rewritten grammar sample where I have expanded expressions to avoid recursion (I use left recursion to allow chaining of operations)


from typing import Any
import sys
sys.path.insert(0, 'sly')

import sqlalchemy as sa
from sly import Lexer, Parser
from sqlalchemy.sql.elements import BindParameter, ClauseElement, ColumnClause

# noinspection PyUnresolvedReferences,PyUnboundLocalVariable,PyPep8Naming
class SQLLexer(Lexer):
    tokens = {
        BETWEEN,
        NUMBER,
        BOOLEAN,
        AND,
        OR,
        COLUMN,
    }

    ignore = " \n"

    BETWEEN = r"BETWEEN"
    NUMBER = r"(\d+([.]\d*)?([eE][+-]?\d+)?|[.]\d+([eE][+-]?\d+)?)"
    BOOLEAN = r"TRUE|FALSE"
    AND = r"AND"
    OR = r"OR"
    COLUMN = r'"(?:[^"]|"")*"|[a-zA-Z_][a-zA-Z0-9_]*'

# noinspection PyUnresolvedReferences,PyUnusedLocal
class SQLParser(Parser):
    debugfile = "parser.txt"
    tokens = SQLLexer.tokens
    start = 'expr'

    precedence = (
        ("left", AND),
        ("nonassoc", BETWEEN),
    )

    # Schema Item
    @_("COLUMN")
    def term(self, p: tuple[str]) -> ColumnClause:
        return sa.column(p[0])

    # Number
    @_("NUMBER")
    def term(self, p: tuple[str]) -> BindParameter:
        try:
            return wrap_value(int(p[0]))
        except ValueError:
            return wrap_value(float(p[0]))

    # Boolean
    @_("BOOLEAN")
    def term(self, p: tuple[str]) -> BindParameter:
        if p[0].lower() == "true":
            return wrap_value(True)
        elif p[0].lower() == "false":
            return wrap_value(False)
        else:
            raise NotImplementedError("unknown boolean value")

    # OR
    @_("expr OR andCondition")
    def expr(
        self, p: tuple[ClauseElement, str, ClauseElement]
    ) -> tuple[ClauseElement, ClauseElement]:
        return sa.or_(p[0], p[2])

    @_("andCondition")
    def expr(
        self, p: tuple[ClauseElement]
    ) -> ClauseElement:
        return p[0]

    # AND
    @_("andCondition AND condition")
    def andCondition(
        self, p: tuple[ClauseElement, str, ClauseElement]
    ) -> tuple[ClauseElement, ClauseElement]:
        return sa.and_(p[0], p[2])

    @_("condition")
    def andCondition(
        self, p: tuple[ClauseElement]
    ) -> ClauseElement:
        return p[0]

    # BETWEEN
    @_("term BETWEEN term AND term")
    def condition(
        self, p: tuple[BindParameter, str, BindParameter, str, BindParameter]
    ) -> tuple[ClauseElement, ClauseElement]:
        return p[0].between(p[2], p[4])

    @_("term")
    def condition(
        self, p: tuple[BindParameter]
    ) -> BindParameter:
        return p[0]

def wrap_value(value: Any) -> BindParameter:
    return sa.bindparam("param_" + token_hex(4), value)

if __name__ == "__main__":
    samples = [
        "myFlag AND FALSE AND 1 AND 0 OR 2",
        "rowNumber BETWEEN 1 AND 10",
        "4 BETWEEN 5 BETWEEN 1 AND 10 AND 9",
    ]
    lexer = SQLLexer()
    parser = SQLParser()
    for sample in samples:
        print(f"input: {sample}", flush=True)
        tokens = list(lexer.tokenize(sample))
        print(f"tokens: {[(t.type, t.value) for t in tokens]}", flush=True)
        generated = parser.parse(iter(tokens))
        print(f"generated: {generated}", flush=True)
        print(flush=True)```
leandropls commented 1 year ago

It certainly did! The general idea of your solution does makes sense to me. I'm trying to wrap my mind around the details of it now.

leandropls commented 1 year ago

@bugengine for my complete use case their grammar still has some ambiguities. But your approach shed a light on the fact that I should really be using a well known SQL grammar instead of trying to write my own. I ended up using Antlr4's one. Thank you!

And your counterexample generator does help a lot!

lelit commented 1 year ago

I should really be using a well known SQL grammar instead of trying to write my one

In that mindset, especially if you are focused on PostgreSQL, yet another approach could be using pglast: it uses the real PG grammar, and exposes it as an AST of nodes.

leandropls commented 1 year ago

Thanks @lelit, I'll look into it!

My plan is to support only a small subset of the where clause. In general lines, the app has a table where the user will be able to do custom filtering by typing in the criteria.

I'm parsing this criteria provided by the user and building it into SQLAlchemy objects to later safely merge with the rest of the query, avoiding SQL injection problems that could arise if I used the user provided string directly.