antlr / grammars-v4

Grammars written for ANTLR v4; expectation that the grammars are free of actions.
MIT License

VBA grammar - line length #3838

Open ckrueger1979 opened 11 months ago

ckrueger1979 commented 11 months ago

Hi,

I think this grammar https://github.com/antlr/grammars-v4/blob/master/vba/vba.g4 has a problem with long lines.

This obfuscator https://github.com/oriolOrnaque/VBAObfuscator/ creates lines that are too long.

The length limit of a line is 1023 characters: https://learn.microsoft.com/en-us/office/vba/language/reference/user-interface-help/line-too-long

I didn't find any reference to the line length in the grammar (see LINE_CONTINUATION and UNDERSCORE).

Greetings, Carsten

KvanTTT commented 11 months ago

Could you clarify the following things:

  1. What do you mean by long lines? Are they lines in the generated lexer/parser?
  2. How does the obfuscator relate to the ANTLR grammar and the generated code?

ckrueger1979 commented 11 months ago

  1. Lines that are longer than 1023 characters.
  2. Take a long line of unobfuscated code; the obfuscator lengthens it beyond 1023 characters, which results in broken VBA.

I would expect that the ANTLR parser shouldn't output illegal VBA code

kaby76 commented 11 months ago

> I would expect that the ANTLR parser shouldn't output illegal VBA code

Please be precise. Antlr does not "output illegal VBA code." The job of Antlr is to parse input (valid or not), output error messages, and return a parse tree.

The way to add this check would be to override the Emit() method of the lexer's base class. The override could check the start and stop indices of the token, call Lexer.Emit(), and report an error if the limit is exceeded. We already do something like this in other grammars, e.g., lua. It's an easy fix. However, the change will mean the grammar must be split, and target-specific code added for each target.
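The shape of such an override can be sketched without the ANTLR runtime. In the Python sketch below, `Token`, `BaseLexer`, and `LineLengthLexer` are hypothetical stand-ins written for illustration, not ANTLR classes, and the field names are assumptions. It only shows the idea of checking each token's position as it is emitted and then delegating to the base class.

```python
from dataclasses import dataclass

MAX_LINE_LENGTH = 1023  # limit cited earlier in this thread


@dataclass
class Token:
    """Stand-in for a lexer token (not the ANTLR Token class)."""
    text: str
    line: int    # 1-based line on which the token starts
    column: int  # 0-based offset of the token's first character in that line
    start: int   # index of the token's first character in the input
    stop: int    # index of the token's last character in the input


class BaseLexer:
    """Stand-in for a generated lexer: emit() simply collects tokens."""

    def __init__(self):
        self.tokens = []

    def emit(self, token):
        self.tokens.append(token)
        return token


class LineLengthLexer(BaseLexer):
    """Overrides emit() to record an error when a token ends past the limit."""

    def __init__(self, limit=MAX_LINE_LENGTH):
        super().__init__()
        self.limit = limit
        self.errors = []
        self._reported_lines = set()

    def emit(self, token):
        length = token.stop - token.start + 1
        end_column = token.column + length
        # Tokens containing a line break are skipped to keep the sketch simple.
        spans_lines = "\n" in token.text or "\r" in token.text
        if (not spans_lines and end_column > self.limit
                and token.line not in self._reported_lines):
            self._reported_lines.add(token.line)
            self.errors.append(
                f"line {token.line}: exceeds {self.limit} characters"
            )
        return super().emit(token)
```

A real implementation would live in the target-specific base class of the generated lexer, and would also have to decide how to treat skipped whitespace and line-continuation tokens, which this sketch ignores.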

ckrueger1979 commented 11 months ago

My compiler construction lecture was 20 years ago, sorry that I've mixed something up.

I thought the parser should be able to parse and emit only valid language and otherwise create an error.

PS: What do you mean by target-specific code? Specific to VBA?

kaby76 commented 11 months ago

> I thought the parser should be able to parse and emit only valid language and otherwise create an error.

Parsers do not emit code! Conceptually, a parser is a function with the signature boolean parse(string input): it takes a string and outputs true if the string is valid in the language described by the grammar, and false otherwise.

So,

parse("Public Sub Module()
    Dim sd As Boolean
End Sub")

returns true. It does not output VBA code.
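To make the "function from string to boolean" view concrete, here is a toy recognizer in Python. The grammar it accepts (`expr : NUMBER ('+' NUMBER)*`) is made up for this illustration; it is not VBA and not ANTLR-generated code. The point is only that the function answers yes or no and produces no program text.

```python
import re

# One token: optional leading whitespace, then a number or a plus sign.
TOKEN = re.compile(r"\s*(\d+|\+)")


def tokenize(text):
    """Return the list of tokens, or None if an unknown character is found."""
    tokens, pos = [], 0
    text = text.rstrip()
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            return None
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def parse(text):
    """Return True if text matches NUMBER ('+' NUMBER)*, else False."""
    tokens = tokenize(text)
    if not tokens or len(tokens) % 2 == 0:
        return False
    for index, token in enumerate(tokens):
        if index % 2 == 0 and not token.isdigit():
            return False
        if index % 2 == 1 and token != "+":
            return False
    return True
```

Under these assumptions, `parse("1 + 2")` is accepted, while `parse("1 +")` and `parse("1 + x")` are rejected. A parser generated by ANTLR additionally reports error messages and builds a parse tree, as noted earlier in the thread.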

> What do you mean by target-specific code? Specific to VBA?

Antlr generates a parser for the VBA grammar in a programming language that you compile and link into a program. The current targets are CSharp (C#), Cpp (C++), Dart2 (Dart), Go, Java, JavaScript, PHP, Python3, and TypeScript. If you don't tell the parser generator what target you want, it will output a parser in Java.

The generated parser code can reference other code that you write to support the parser. That support code has to be in the target programming language. If you generate the parser for C#, you have to write the support code in C#. This is important because you cannot use grammars that require support code in the Antlr IntelliJ extension or on lab.antlr.org.

ckrueger1979 commented 11 months ago

Thanks for the detailed explanation!

The parser for VBA will accept code with lines that are too long, correct? It returns true even if a line is longer than 1023 characters?

kaby76 commented 11 months ago

> The parser for VBA will accept code with lines that are too long, correct? It returns true even if a line is longer than 1023 characters?

Yes, you are right. The parser for the VBA grammar accepts lines over 1023 characters. I'll write a fix today or tomorrow.
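Independently of any grammar change, the limit can also be checked on the raw source text before parsing. The Python sketch below is one possible pre-check, not the fix referred to above. The function name `find_long_lines` is hypothetical, and it assumes the limit applies to each physical line, excluding the line terminator.

```python
import re

MAX_LINE_LENGTH = 1023  # limit cited earlier in this thread


def find_long_lines(source, limit=MAX_LINE_LENGTH):
    """Return (line_number, length) for every physical line longer than limit."""
    # Split on CRLF, CR, or LF so the terminator is not counted in the length.
    lines = re.split(r"\r\n|\r|\n", source)
    return [
        (number, len(line))
        for number, line in enumerate(lines, start=1)
        if len(line) > limit
    ]
```

For example, a source whose second line has 1024 characters yields `[(2, 1024)]`, and a line of exactly 1023 characters is not reported.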