antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License
17.11k stars 3.28k forks source link

The generated comment unicode-escapes the last backslash for windows paths #2569

Open Vampire opened 5 years ago

Vampire commented 5 years ago

The generated files here have this comment:

// Generated from net\kautler\command\u005Cusage\Usage.g4 by ANTLR 4.7.2

but I think it should actually be

// Generated from net\kautler\command\usage\Usage.g4 by ANTLR 4.7.2
sharwell commented 5 years ago

What target language is this? Java requires the escape (#164).

Vampire commented 5 years ago

Ah, right, Unicode escape, but I think then either double the backslashes to not need to use a Unicode escape for the "u", or use forward slashes. With the Unicode escape it reads totally strange. :-)

sharwell commented 5 years ago

Double backslashes works in a string literal, but would have a different meaning in a comment. This is the only way to represent this Windows-style path in a Java comment.

Vampire commented 5 years ago

I'm sorry to have to tell you, but you are wrong.

Excerpt from the Java Language Specification chapter "3.3 Unicode Escapes":

In addition to the processing implied by the grammar, for each raw input character that is a backslash \, input processing must consider how many other \ characters contiguously precede it, separating it from a non-\ character or the start of the input stream. If this number is even, then the \ is eligible to begin a Unicode escape; if the number is odd, then the \ is not eligible to begin a Unicode escape. If an eligible \ is not followed by u, then it is treated as a RawInputCharacter and remains part of the escaped Unicode stream. If an eligible \ is followed by u, or more than one u, and the last u is not followed by four hexadecimal digits, then a compile-time error occurs.

And to make super sure I even tried,

// Generated from net\kautler\command\\usage\Usage.g4 by ANTLR 4.7.2

compiles perfectly fine. And so does

// Generated from net\\kautler\\command\\usage\\Usage.g4 by ANTLR 4.7.2

While I still think always using forward slashes might be the nicest and most readable solution

// Generated from net/kautler/command/usage/Usage.g4 by ANTLR 4.7.2
sharwell commented 5 years ago
// Generated from net\kautler\command\\usage\Usage.g4 by ANTLR 4.7.2

There are two different kinds of \ in this line:

  1. The first, second, third, and fifth \ are eligible \ which are not followed by u, and therefore treated as RawInputCharacter and remain part of the escaped Unicode stream
  2. The fourth \ is not an eligible \, and therefore is also treated as RawInputCharacter and remains part of the escaped Unicode stream

De-duplication of escaped backslashes in strings is covered by EscapeSequence in section 3.10.6. However, no such de-duplication applies to comments in source code (section 3.7).

This leaves us with the weird situation current produced by the tool: the only way to represent a single \ followed by a lowercase u in a Java source code comment is to use a Unicode escape for the \ character.

Vampire commented 5 years ago

I don't really get your point. Of course there is no "de-duplication" of the backslash. Which also is not necessary, as it is only a comment, not something that is interpreted for runtime like a String where you need a way to express one backslash character.

So having two backslash characters in a comment is of course different from having two backslash characters in a String, but it is still a valid way to have a u after a backslash character.

Or you follow my other suggestion and just use forward slashes.


Or let me throw in another question. What is the comment meant for? My guess is, for a human reading it to see from which file it was generated. And a human has significant problems to read this string if there is a unicode escape sequence in it.

sharwell commented 5 years ago

I don't really get your point.

I'm just explaining how it came to pass that the code produces this result. It probably won't change mostly because the product is in maintenance mode and the current implementation isn't broken (even if it's ugly).

Vampire commented 5 years ago

I don't really get your point.

I'm just explaining how it came to pass that the code produces this result.

Ah, ok, that's clear to me of course. :-)

It probably won't change mostly because the product is in maintenance mode and the current implementation isn't broken (even if it's ugly).

That's very sad. :-(

sschuberth commented 1 year ago

BTW, the "illegal unicode escape" issue seems to be back with ANTLR 4.12 as I'm getting

C:\Users\Sebastian\Development\GitHub\oss-review-toolkit\ort\utils\spdx\build\generated-src\antlr\main\org\ossreviewtoolkit\utils\spdx\SpdxExpressionVisitor.java:1: error: illegal unicode escape
// Generated from org\ossreviewtoolkit\utils\spdx\SpdxExpression.g4 by ANTLR 4.12.0
sschuberth commented 1 year ago

Ah, sorry, I just realized this issue is actually complaining about the Unicode encoding (which is actually required to make the "illegal unicode escape" disappear). I should have re-opened https://github.com/antlr/antlr4/issues/164 instead.