antlr / grammars-v4

Grammars written for ANTLR v4; expectation that the grammars are free of actions.
MIT License
10.17k stars 3.7k forks source link

TSQL: double quoted strings are not accepted #3027

Open cceelen opened 1 year ago

cceelen commented 1 year ago

The current TSQL lexer rule for strings does not allow double quoted strings: STRING options { caseInsensitive=false; } : 'N'? '\'' (~'\'' | '\'\'')* '\'' ; Microsoft uses only single quoted strings in their grammar description. This means that the following is rejected: CREATE PROCEDURE bla_bla AS BEGIN EXEC ("DELETE " + @param + 'more' + @param); END Rejecting this, means that input accepted by a number of other DB engines gets rejected by the generated parser.

cceelen commented 1 year ago

I was trying to replace it with this rule: STRING options { caseInsensitive=false; } : 'N'? ('\'' (~'\'' | '\'\'')* '\'' | '"' (~'"' | '\\"' )* '"' ); But this does not seem to work either.

kaby76 commented 1 year ago

Rejecting this, means that input accepted by a number of other DB engines gets rejected by the generated parser.

Is this Microsoft TSQL or a product from another company? The MariaDB grammar was forked from the MySQL grammar because MariaDB is not MySQL, and there were no semantic predicates for selecting the MariaDB extensions to the MySQL grammar, nor even commented as an extension. https://github.com/antlr/grammars-v4/pull/2940

I don't think Antlr supports defining a grammar with import of tsql and then allow for rules to be overridden.

kaby76 commented 1 year ago

Actually, it looks like Antlr does support overriding rules of an imported grammar. Below, SuperArithmetic.g4 imports Arithmetic.g4, and overrides rules in Arithmetic.g4.

grammar SuperArithmetic;

import Arithmetic;

file_ : expression EOF; // override parser rule to allow only one expression.
STRING : '"' .*? '"' ; // override lexer rule to only allow double quotes for a string.
grammar Arithmetic;

file_ : expression (SEMI expression)* EOF;
expression : expression POW expression | expression (TIMES | DIV) expression | expression (PLUS | MINUS) expression | LPAREN expression RPAREN | (PLUS | MINUS)* atom ;
atom : scientific | variable | string ;
scientific : SCIENTIFIC_NUMBER ;
variable : VARIABLE ;
string : STRING ;

STRING : '\'' .*? '\'';
VARIABLE : VALID_ID_START VALID_ID_CHAR* ;
SCIENTIFIC_NUMBER : NUMBER (E SIGN? UNSIGNED_INTEGER)? ;
LPAREN : '(' ;
RPAREN : ')' ;
PLUS : '+' ;
MINUS : '-' ;
TIMES : '*' ;
DIV : '/' ;
GT : '>' ;
LT : '<' ;
EQ : '=' ;
POINT : '.' ;
POW : '^' ;
SEMI : ';' ;
WS : [ \r\n\t] + -> channel(HIDDEN) ;

fragment VALID_ID_START : ('a' .. 'z') | ('A' .. 'Z') | '_' ;
fragment VALID_ID_CHAR : VALID_ID_START | ('0' .. '9') ;
fragment NUMBER : ('0' .. '9') + ('.' ('0' .. '9') +)? ;
fragment UNSIGNED_INTEGER : ('0' .. '9')+ ;
fragment E : 'E' | 'e' ;
fragment SIGN : ('+' | '-') ;
cceelen commented 1 year ago

Hi @kaby76 i was using the t-sql grammars and not the specific one for maria DB.

https://github.com/antlr/grammars-v4/tree/master/sql/tsql

The SQL I want to parse is from Sybase. So it's good to know that we can just go in and override some of the rules without the need to fork or rewrite them for a different engine.

I was trying to solve it in this way:

STRING options { caseInsensitive=false; } : 'N'? '\'' (~'\'' | '\'\'')* '\''  ;
DOUBLE_QUOTED_STRING options { caseInsensitive=false; } : '"' (~'"' | '\\"' )* '"' ;

execute_var_string
    : LOCAL_ID (OUTPUT | OUT)? ('+' LOCAL_ID ('+' execute_var_string)?)?
    | STRING ('+' LOCAL_ID ('+' execute_var_string)?)?
    | DOUBLE_QUOTED_STRING ('+' LOCAL_ID ('+' execute_var_string)?)?
    ;

This kind of worked when using that in the IDE plugin. It failed me once i generated the C++ grammar. Yet that might be a separate issue.

I tried in a few other ways as well (see initial comment) where i changed the just the STRING rule that did not work. So right now I lack experience with antlr to repaire this and send in a PR. Would you have some ideas what solution would be preferrable? Cheers

kaby76 commented 1 year ago

It looks like it can be done by import. See this preliminary code. https://github.com/kaby76/AntlrExamples/tree/master/sybase

I don't know yet how to include the tsql grammar from another directory, but will check.

This seems like it should work with C++ (see my new "desc.xml" description files, which currently shows what targets work with the grammar).

kaby76 commented 1 year ago

There is no xml element in the pom.xml for the Antlr4 Maven test extension. https://github.com/antlr/antlr4test-maven-plugin. So, that testing framework will only work if there's a copy of TSql*.g4 in the sybase grammar. Yuck.

But, that seems to be the way to go for now. So, your PR should add a directory in the repo for "sql/sybase", copies the preliminary source I gave, and adds an "examples/" directory with some examples. Might also want to add a readme.md explaining all this, copyright, license, your name, etc.

cceelen commented 1 year ago

I get different errors using different antlr tools. The JetBrains plugin cannot even find the new lexer rule while the VSCode plugin cannot deal with the tsql base grammar but at least identifies the new lexer rule to be referenced by a parser rule.

I decided to change the execute_var_string rule to match the Sybase definition much closer, which is not as restrictive as the one for the Microsoft definition.

The current result sadly rejects the main test (Execute.sql) where the IDE tooling and the java test both give me different results. Can someone point me into the right direction here?

kaby76 commented 1 year ago

Anything that deviates from a generated and compiled parser should be considered a bug. No IDE plugin should not be used as the definitive "result". This is because they use a different "atn interpreter" than the "gold standard" which is a generated parser, compiled, and linked. I never liked this idea of having two engines. It's confusing because one gets different results, and there are two code bases in which there one could have bugs, and not the other. Really bad. Java supports a compiler API in the OpenSDK. I think the Antlr Maven Tester uses the compiler dynamically.

You'll have to test just using generated parsers, not through any IDE or the online Antlr Lab.

cceelen commented 1 year ago

I think i have found one of the issues. The tools seem to use different options during grammar generations. Some are lenient and allow generation with warnings and other turn warnings into errors. After i started fixing the generated errors the behavior became much more consistent between the different tools. I should have the fix done by Monday end of day.

cceelen commented 1 year ago

I updated the code. It now works. IMO the best way was to override the rule STRING as it avoids change a all other rules accepted single quoted strings.