antlr / grammars-v4

Grammars written for ANTLR v4; expectation that the grammars are free of actions.
MIT License
10.25k stars 3.72k forks source link

Request: Add grammar for Julia lang. #2076

Open achjaj opened 3 years ago

achjaj commented 3 years ago

I would like to request a grammar file for Julia language if it is possible.

kaby76 commented 3 years ago

@achjaj I cannot find a spec on the Julia website, e.g., here or here both returning basically nothing useful. As far as I can tell, there is no specification of the language, which makes this a hard thing to do. See this open issue since 2013. Yes, 8 years ago! Apparently, this is how so-called "great developers" define things nowadays. Totally ignorant. It might be possible to scrape the code to derive a CFG. But, who knows what the static semantics would be.

achjaj commented 3 years ago

@kaby76 Would this help?

kaby76 commented 3 years ago

@achjaj I haven't had much time to look at this, but I was planning to scrape the grammar from the implementation. It looks like the grammar for Julia is an LR grammar in an ".scm" file here. Presumably that's "Essence" LR parser generator format? I haven't read the build files. Do you know?

achjaj commented 3 years ago

@kaby76 I am sorry, but I don't know.

kaby76 commented 3 years ago

Additional information for anyone interested:

https://docs.julialang.org/en/v1/devdocs/eval/#dev-parsing. The parser is based on "femtolisp", a small lisp interpreter hack project. The code seems to be a flavor of Scheme, and it is integrated in the Julia source flisp.c. The original interpreter is in Github, with the flisp.c file here. The source--of course--diverges from each other significantly. And, there doesn't seem to be any debugging tools for the interpreter other than a C debugger. But, Julia's Femtolisp seems to have undergone a Windows port. No telling if it still works.

The ASTs are wrapped up with the "grammar"--of course, without any scraping tool that partitions the two specifications from each other as what should be done for every damn parser generator.

ShalokShalom commented 1 year ago

Julia switches to a new parser. https://github.com/JuliaLang/JuliaSyntax.jl#status And here is a CST parser: https://github.com/julia-vscode/CSTParser.jl

I welcome the addition of Julia, thanks. ๐Ÿ‘๐Ÿป

kaby76 commented 1 year ago

Julia switches to a new parser. https://github.com/JuliaLang/JuliaSyntax.jl#status And here is a CST parser: https://github.com/julia-vscode/CSTParser.jl

I welcome the addition of Julia, thanks. ๐Ÿ‘๐Ÿป

Thanks, that may be helpful. As far as I can tell, though, there is still no formal grammar that defines the concrete syntax. The language is defined "operationally". For the lexer, the rules are encoded in code. It has to be reverse engineered. This has always been the main problem.

But, this is what ChatGPT4 provided when asked to reverse engineer the code.

<lex_input> ::= <token>* <eof>

<token> ::= <whitespace>
         | <keyword>
         | <identifier>
         | <operator>
         | <punctuation>
         | <literal>
         | <comment>
         | <miscellaneous>

<whitespace> ::= <space>+ | <newline> | <tab> | ...

<keyword> ::= 'function' | 'abstract' | 'begin' | ...  (and all the others listed in the code)

<identifier> ::= <letter> <alphanumeric>*

<operator> ::= '=' | '==' | '===' | '!=' | ...  (and all the others listed in the code)

<punctuation> ::= '[' | ']' | '{' | '}' | '(' | ')' | ',' | ';' | ...

<literal> ::= <integer> | <float> | <hex_int> | <bin_int> | <oct_int> | <char> | <string>

<integer> ::= <digit>+
<float> ::= <digit>* '.' <digit>+ ('e' <digit>+)?
<hex_int> ::= '0x' <hex_digit>+
<bin_int> ::= '0b' <bin_digit>+
<oct_int> ::= '0o' <oct_digit>+
<char> ::= '\'' (any character except '\'') '\'' | '\' escape_sequence '\'
<string> ::= '"' (any character)* '"' | '"""' (any character)* '"""'

<comment> ::= '#' (any character)* <newline>

<miscellaneous> ::= ... (any other tokens handled by the lexer but not explicitly covered above)

<letter> ::= 'a' | 'b' | ... | 'z' | 'A' | ... | 'Z'
<digit> ::= '0' | ... | '9'
<hex_digit> ::= <digit> | 'a' | ... | 'f' | 'A' | ... | 'F'
<bin_digit> ::= '0' | '1'
<oct_digit> ::= '0' | ... | '7'

I don't think it is an exhaustive and explicit reverse engineering of that code because it gives "..." for several of the rules. And, it does not actually give a rule for LSQUARE.

When prompted to do so, along with instructions to be as explicit as possible, this is what it outputed.

Lexer       ::= {Whitespace | Token | Operator | Punctuation | Comment} EndMarker ;
Whitespace  ::= {Space | Tab | NewLine} ;
Token       ::= Identifier | StringLiteral | CharLiteral | NumericLiteral | Comment ;
Operator    ::= ArithmeticOp | LogicalOp | RelationalOp ;
Punctuation ::= '[' | ']' | '{' | '}' | '(' | ')' | ';' | ',' ;
Comment     ::= '#' {AnyChar - NewLine} NewLine ;

Identifier      ::= {Letter | Digit | '_'} ;
StringLiteral   ::= '"' {AnyChar - '"'} '"' | '"""' {AnyChar} '"""' ;
CharLiteral     ::= '\'' AnyChar '\'' ;
NumericLiteral  ::= Integer | Float ;
Integer         ::= Digit {Digit} ;
Float           ::= Digit {Digit} '.' {Digit} ;
ArithmeticOp    ::= '+' | '-' | '*' | '/' | '^' ;
LogicalOp       ::= '&&' | '||' ;
RelationalOp    ::= '<' | '>' | '<=' | '>=' | '==' | '!=' ;

Space    ::= ' ' ;
Tab      ::= '\t' ;
NewLine  ::= '\n' ;
Letter   ::= 'a'..'z' | 'A'..'Z' ;
Digit    ::= '0'..'9' ;
AnyChar  ::= ? any visible character ? ;
EndMarker ::= EOF ;

But, now we have a grammar that defines the input as containing only one token, followed by the EOF.

It would be worth trying this on some of the newer LLMs coming out soon.

ShalokShalom commented 3 months ago

Hi there :)

Treesitter has an implementation, does this maybe count as formal syntax that describes the concrete grammar?

I assume Claude 3.5 Sonnet could be able to transpile to ANTLR from Treesitter maybe?

ShalokShalom commented 3 months ago

I got this from Claude:

grammar Julia;

// Parser Rules
program: statement* EOF;

statement
    : expression
    | declaration
    | assignment
    | controlFlow
    | moduleDefinition
    | importStatement
    | usingStatement
    | macroDefinition
    | quoteExpression
    ;

expression
    : literal
    | identifier
    | functionCall
    | operator
    | parenthesizedExpression
    | arrayLiteral
    | dictLiteral
    | comprehension
    | rangeExpression
    | lambdaExpression
    | typeAssertion
    | ternaryExpression
    | pipeExpression
    ;

declaration
    : functionDeclaration
    | variableDeclaration
    | typeDeclaration
    | abstractTypeDeclaration
    ;

assignment: (identifier | tupleUnpacking) '=' expression;

controlFlow
    : ifStatement
    | forLoop
    | whileLoop
    | tryStatement
    | doStatement
    ;

moduleDefinition: 'module' identifier statement* 'end';

importStatement: 'import' importList;
usingStatement: 'using' importList;

importList: (identifier ('.' identifier)* (',' identifier ('.' identifier)*)*)? (': ' identifier (',' identifier)*)?;

functionDeclaration: 'function' identifier typeParameters? parameterList functionBody 'end';

variableDeclaration: ('local' | 'global' | 'const')? identifier (':' type)? ('=' expression)?;

typeDeclaration: 'struct' identifier typeParameters? fields 'end';
abstractTypeDeclaration: 'abstract type' identifier typeParameters? 'end';

macroDefinition: 'macro' identifier parameterList macroBody 'end';

quoteExpression: 'quote' statement* 'end' | ':(' expression ')';

comprehension: '[' expression forClause+ (ifClause)* ']';

rangeExpression: expression ':' expression (':' expression)?;

lambdaExpression: parameterList '->' expression;

typeAssertion: expression '::' type;

ternaryExpression: expression '?' expression ':' expression;

pipeExpression: expression '|>' expression;

// Precedence and Associativity Rules
expression
    : <assoc=right> expression '^' expression
    | expression ('*' | '/' | '%') expression
    | expression ('+' | '-') expression
    | <assoc=right> expression '::' type
    | expression COMPARISON_OPERATOR expression
    | expression '&&' expression
    | expression '||' expression
    | <assoc=right> expression '?' expression ':' expression
    ;

// Lexer Rules
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_!]*;
INTEGER: [0-9]+ | '0x' [0-9a-fA-F]+ | '0b' [01]+;
FLOAT: [0-9]+ '.' [0-9]* ([eE] [+-]? [0-9]+)? | [0-9]+ [eE] [+-]? [0-9]+;
STRING: '"' .*? '"' | '"""' .*? '"""';
CHAR: '\'' . '\'' | '\'' '\\' . '\'';

COMMENT: '#' .*? '\r'? '\n' -> skip;
WS: [ \t\r\n]+ -> skip;

// Operators
OPERATOR: [+\-*/^%&|<>]=? | '==' | '!=' | '<=' | '>=' | '::' | '->' | '&&' | '||' | '<<' | '>>' | '>>>' | '.+' | '.-' | '.*' | './' | '.^' | '\\' | 'โˆ˜' | 'โ‹…' | 'ร—' | 'รท' | 'โˆช' | 'โˆฉ' | 'โˆš' | 'โˆ›' | 'โˆœ';
COMPARISON_OPERATOR: '==' | '!=' | '<' | '<=' | '>' | '>=';

// Keywords
KEYWORD: 'function' | 'end' | 'if' | 'else' | 'elseif' | 'for' | 'while' | 'try' | 'catch' | 'finally' | 'return' | 'break' | 'continue' | 'struct' | 'mutable' | 'abstract' | 'type' | 'const' | 'let' | 'do' | 'macro' | 'quote' | 'where';

// Standard Library Functions
STDLIB_FUNCTION: 'println' | 'print' | 'push!' | 'pop!' | 'length' | 'sort' | 'map' | 'filter' | 'reduce' | 'sum' | 'maximum' | 'minimum';
kaby76 commented 3 months ago

Well, if you take that grammar and try it, you'll find it doesn't work.

$ antlr4 Julia.g4
error(50): Julia.g4:104:23: syntax error: '=' came as a complete surprise to me while looking for lexer rule element
08/31-06:37:55 ~/issues/g4-2076

On line 104, the character set for OPERATOR is wrong because ']' has to be escaped. Correcting that, I try again:

$ antlr4 Julia.g4
warning(184): Julia.g4:105:0: One of the token COMPARISON_OPERATOR values unreachable. == is always overlapped by token OPERATOR
warning(184): Julia.g4:105:0: One of the token COMPARISON_OPERATOR values unreachable. != is always overlapped by token OPERATOR
warning(184): Julia.g4:105:0: One of the token COMPARISON_OPERATOR values unreachable. <= is always overlapped by token OPERATOR
warning(184): Julia.g4:105:0: One of the token COMPARISON_OPERATOR values unreachable. >= is always overlapped by token OPERATOR
error(51): Julia.g4:82:0: rule expression redefinition; previous at line 18
error(56): Julia.g4:19:6: reference to undefined rule: literal
error(56): Julia.g4:20:6: reference to undefined rule: identifier
error(56): Julia.g4:21:6: reference to undefined rule: functionCall
error(56): Julia.g4:22:6: reference to undefined rule: operator
error(56): Julia.g4:23:6: reference to undefined rule: parenthesizedExpression
error(56): Julia.g4:24:6: reference to undefined rule: arrayLiteral
error(56): Julia.g4:25:6: reference to undefined rule: dictLiteral
error(56): Julia.g4:41:13: reference to undefined rule: identifier
error(56): Julia.g4:41:26: reference to undefined rule: tupleUnpacking
error(56): Julia.g4:44:6: reference to undefined rule: ifStatement
error(56): Julia.g4:45:6: reference to undefined rule: forLoop
error(56): Julia.g4:46:6: reference to undefined rule: whileLoop
error(56): Julia.g4:47:6: reference to undefined rule: tryStatement
error(56): Julia.g4:48:6: reference to undefined rule: doStatement
error(56): Julia.g4:51:27: reference to undefined rule: identifier
error(56): Julia.g4:56:13: reference to undefined rule: identifier
error(56): Julia.g4:56:29: reference to undefined rule: identifier
error(56): Julia.g4:56:47: reference to undefined rule: identifier
error(56): Julia.g4:56:63: reference to undefined rule: identifier
error(56): Julia.g4:56:86: reference to undefined rule: identifier
error(56): Julia.g4:56:102: reference to undefined rule: identifier
error(56): Julia.g4:58:32: reference to undefined rule: identifier
error(56): Julia.g4:58:43: reference to undefined rule: typeParameters
error(56): Julia.g4:58:59: reference to undefined rule: parameterList
error(56): Julia.g4:58:73: reference to undefined rule: functionBody
error(56): Julia.g4:60:53: reference to undefined rule: identifier
error(56): Julia.g4:60:69: reference to undefined rule: type
error(56): Julia.g4:62:26: reference to undefined rule: identifier
error(56): Julia.g4:62:37: reference to undefined rule: typeParameters
error(56): Julia.g4:62:53: reference to undefined rule: fields
error(56): Julia.g4:63:41: reference to undefined rule: identifier
error(56): Julia.g4:63:52: reference to undefined rule: typeParameters
error(56): Julia.g4:65:25: reference to undefined rule: identifier
error(56): Julia.g4:65:36: reference to undefined rule: parameterList
error(56): Julia.g4:65:50: reference to undefined rule: macroBody
error(56): Julia.g4:69:30: reference to undefined rule: forClause
error(56): Julia.g4:69:42: reference to undefined rule: ifClause
error(56): Julia.g4:73:18: reference to undefined rule: parameterList
error(56): Julia.g4:75:31: reference to undefined rule: type
error(56): Julia.g4:86:36: reference to undefined rule: type
08/31-06:38:44 ~/issues/g4-2076

Most of the grammar is missing ("reference to undefined rule...").

I repeated the experiment from the website, and got a different answer. J2.g4.txt After changing the grammar name to correspond to the file as per Antlr4 Tool required, and correcting the character set problem, the Antlr output a few errors.

$ antlr4 J2.g4
error(160): J2.g4:399:71: reference to parser rule interpolationExpression in lexer rule SCOPED_IDENTIFIER
error(160): J2.g4:399:97: reference to parser rule quoteExpression in lexer rule SCOPED_IDENTIFIER
error(56): J2.g4:14:51: reference to undefined rule: operator
error(56): J2.g4:31:6: reference to undefined rule: operator
error(56): J2.g4:115:26: reference to undefined rule: letBinding
error(56): J2.g4:115:57: reference to undefined rule: letBinding
error(56): J2.g4:179:26: reference to undefined rule: scopedIdentifier
error(56): J2.g4:184:6: reference to undefined rule: macroIdentifier
error(56): J2.g4:185:6: reference to undefined rule: operator
error(56): J2.g4:192:6: reference to undefined rule: scopedIdentifier
error(56): J2.g4:300:27: reference to undefined rule: operator
error(56): J2.g4:308:31: reference to undefined rule: macroIdentifier
error(56): J2.g4:316:26: reference to undefined rule: namedArgument
error(56): J2.g4:336:75: reference to undefined rule: operator
error(56): J2.g4:336:177: reference to undefined rule: operator
error(56): J2.g4:336:193: reference to undefined rule: operator
error(56): J2.g4:352:17: reference to undefined rule: operator
error(56): J2.g4:356:6: reference to undefined rule: operator
error(56): J2.g4:388:24: reference to undefined rule: operator
08/31-06:45:49 ~/issues/g4-2076

So while it's syntactically an Antlr4 grammar, the grammar still needs further corrections.

But, as I pointed out in Antlr4 Discussions, LLMs are presented via a website, which makes automated scraping nearly impossible. And, any script that corrects the problems in the LLM-translated tree-sitter grammar may not work because the LLM is constantly changing, and providing a different translation to Antlr4.

It would be easier to just write a tool to read the tree-sitter grammar (actually, an AST that people somehow write in), then convert that to Antlr representation. This would eliminate the conversion being subjected to changes in the LLM. I was working on this a couple of years ago, but it's low priority.

ShalokShalom commented 3 months ago

I do know next to nothing about Antlr and parsers, while I do know a bit about how to work with AI driven tools by now. Usually, the most success is to be found, when the human and the LLM discuss things back and forth. By giving it the feedback that you just gave me, and by respecting the context window, I find good results. And yes, they still hallucinate quite a bit, but this is exactly where the feedback comes into play.

ShalokShalom commented 3 months ago

I responded with the feedback that you provided to me, and this is the result:

grammar Julia;

// Parser Rules
program: statement* EOF;

statement
    : expression
    | declaration
    | assignment
    | controlFlow
    | moduleDefinition
    | importStatement
    | usingStatement
    | macroDefinition
    | quoteExpression
    ;

expression
    : literal
    | identifier
    | functionCall
    | operator
    | parenthesizedExpression
    | arrayLiteral
    | dictLiteral
    | comprehension
    | rangeExpression
    | lambdaExpression
    | typeAssertion
    | ternaryExpression
    | pipeExpression
    ;

declaration
    : functionDeclaration
    | variableDeclaration
    | typeDeclaration
    | abstractTypeDeclaration
    ;

assignment: (identifier | tupleUnpacking) '=' expression;

controlFlow
    : ifStatement
    | forLoop
    | whileLoop
    | tryStatement
    | doStatement
    ;

moduleDefinition: 'module' identifier statement* 'end';

importStatement: 'import' importList;
usingStatement: 'using' importList;

importList: (identifier ('.' identifier)* (',' identifier ('.' identifier)*)*)? (': ' identifier (',' identifier)*)?;

functionDeclaration: 'function' identifier typeParameters? parameterList functionBody 'end';

variableDeclaration: ('local' | 'global' | 'const')? identifier (':' type)? ('=' expression)?;

typeDeclaration: 'struct' identifier typeParameters? fields 'end';
abstractTypeDeclaration: 'abstract type' identifier typeParameters? 'end';

macroDefinition: 'macro' identifier parameterList macroBody 'end';

quoteExpression: 'quote' statement* 'end' | ':(' expression ')';

comprehension: '[' expression forClause+ (ifClause)* ']';

rangeExpression: expression ':' expression (':' expression)?;

lambdaExpression: parameterList '->' expression;

typeAssertion: expression '::' type;

ternaryExpression: expression '?' expression ':' expression;

pipeExpression: expression '|>' expression;

// Additional rule definitions
literal: INTEGER | FLOAT | STRING | CHAR | BOOLEAN;
identifier: IDENTIFIER;
functionCall: identifier '(' (expression (',' expression)*)? ')';
parenthesizedExpression: '(' expression ')';
arrayLiteral: '[' (expression (',' expression)*)? ']';
dictLiteral: '{' (keyValuePair (',' keyValuePair)*)? '}';
keyValuePair: expression '=>' expression;
forClause: 'for' identifier 'in' expression;
ifClause: 'if' expression;
tupleUnpacking: '(' identifier (',' identifier)* ')';
typeParameters: '{' identifier (',' identifier)* '}';
parameterList: '(' (parameter (',' parameter)*)? ')';
parameter: identifier (':' type)?;
functionBody: statement*;
macroBody: statement*;
type: identifier | parameterizedType;
parameterizedType: identifier '{' type (',' type)* '}';
fields: (identifier (':' type)? ('=' expression)?)*;

ifStatement: 'if' expression statement* ('elseif' expression statement*)* ('else' statement*)? 'end';
forLoop: 'for' identifier 'in' expression statement* 'end';
whileLoop: 'while' expression statement* 'end';
tryStatement: 'try' statement* ('catch' identifier? statement*)? ('finally' statement*)? 'end';
doStatement: 'do' parameterList statement* 'end';

// Lexer Rules
IDENTIFIER: [a-zA-Z_][a-zA-Z0-9_!]*;
INTEGER: [0-9]+ | '0x' [0-9a-fA-F]+ | '0b' [01]+;
FLOAT: [0-9]+ '.' [0-9]* ([eE] [+-]? [0-9]+)? | [0-9]+ [eE] [+-]? [0-9]+;
STRING: '"' .*? '"' | '"""' .*? '"""';
CHAR: '\'' . '\'' | '\'' '\\' . '\'';
BOOLEAN: 'true' | 'false';

COMMENT: '#' .*? '\r'? '\n' -> skip;
WS: [ \t\r\n]+ -> skip;

// Operators (with escaped ']')
OPERATOR: [+\-*/^%&|<>]=? | '==' | '!=' | '<=' | '>=' | '::' | '->' | '&&' | '||' | '<<' | '>>' | '>>>' | '.+' | '.-' | '.*' | './' | '.^' | '\\' | 'โˆ˜' | 'โ‹…' | 'ร—' | 'รท' | 'โˆช' | 'โˆฉ' | 'โˆš' | 'โˆ›' | 'โˆœ' | '\]';
COMPARISON_OPERATOR: '==' | '!=' | '<' | '<=' | '>' | '>=';

// Keywords
KEYWORD: 'function' | 'end' | 'if' | 'else' | 'elseif' | 'for' | 'while' | 'try' | 'catch' | 'finally' | 'return' | 'break' | 'continue' | 'struct' | 'mutable' | 'abstract' | 'type' | 'const' | 'let' | 'do' | 'macro' | 'quote' | 'where';

// Standard Library Functions
STDLIB_FUNCTION: 'println' | 'print' | 'push!' | 'pop!' | 'length' | 'sort' | 'map' | 'filter' | 'reduce' | 'sum' | 'maximum' | 'minimum';

I dont understand, why you want automated scraping. Isnt it done, once this small file is correct? I am confused about this

kaby76 commented 2 months ago

I responded with the feedback that you provided to me, and this is the result:

I don't understand why you want automated scraping. Isn't it done once this small file is correct?

The problem is it's never done. Every programming language has discrete releases. The grammar here has to be changed for each major release.

Also, there are several important things to consider for the scraping source.

The tree-sitter grammar for Julia is not versioned for each release of the language. The grammar is for GLR, not ALL(*). I can only find six test files (https://github.com/tree-sitter/tree-sitter-julia/tree/7223af20260119c12c8b546ad81165c0e8bcd4b0/test/corpus) for the tree-sitter grammar, so my confidence is low that it is correct. The parser at https://github.com/JuliaLang/JuliaSyntax.jl/tree/main/src is versioned, complete, correct, authoritative, and it is recursive descent (i.e., LL).

Generative AI is very likely not the best tool to generate an Antlr4 grammar for each release. Assuming Claude is used to generate a completely functional grammar, we can't easily replicate the process for a new version of the tree-sitter grammar or use a different generative AI tool. When I asked Claude to enumerate every refactoring it made, the output was a very imprecise description. I can't take the description that Claude outputs and replicate it in ChatGPT. Generative AI is currently only good for one-shot problems. I use ChatGPT and Copilot many times every day.

ShalokShalom commented 2 months ago

I see, thanks a lot!