ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
ANTLR's error handling is fairly good, but the error messages that it reports could be more friendly:
ANTLR's messages describe the problem in terms of the parser's control flow, when they should be telling the user what was wrong with their input
The messages use "robot voice" -- unfamiliar jargon, bizarre punctuation, and barely literate sentence fragments
As a motivating example, consider the JSON.g4 demo grammar. It is a very small grammar for a very well-behaved input language. It should be easy to parse. Let's look at the error messages that a user might be confronted with.
Case 1
Input containing a mistake: { true }
What ANTLR says:
line 1:2 no viable alternative at input '{true'
Problems:
For an ordinary person, "no viable alternative" is an absolutely meaningless phrase. Alternative to what? The word "viable" is somewhat obscure in English and probably outside the vocabulary of a non-native speaker.
"No viable alternative" is not a complete sentence, or even a natural way of talking.
This is a bit pedantic, but the input was "{ true", not "{true" -- why was the space omitted?
The string {true is quoted using apostrophes (') instead of double quotes ("). This is common in certain programming languages, but standard American English always uses double quotes.
Possible improved wording:
[line 1, col 3] The keyword "true" cannot be used here.
Case 2
Input containing a mistake: (an empty string)
What ANTLR says:
line 1:0 mismatched input '<EOF>' expecting {'{', '[', 'true', 'false', 'null', STRING, NUMBER}
Problems:
0 is a nonexistent column number. In every single popular text editor today, the file starts with line #1 and column #1.
The string '<EOF>' is awkward. We can expect that an old-school software engineer will remember that EOF is short for "end of file." But that's not the audience for this error message. A person writing a JSON data file may have no idea what EOF means, especially in situations where there is no "file" involved.
The "EOF" symbol is both enclosed in angle brackets <> and also quoted (why??)
What does "mismatched input" mean? Is that a bad thing? What would an empty input be expected to "match"? "Mismatched" probably has some special meaning for grammaticians, but for an ordinary person we might as well have said "bad input" or "missing input"
Calling out every single allowable token '{', '[', 'true', 'false', 'null', STRING, NUMBER is possibly useful -- but it is confusing to use the notation { X, Y, Z } to group three alternatives X, Y, and Z when one of those alternatives is a { character
Possible improved wording:
[line 1, col 1] The input ended unexpectedly; expecting to see an expression such as "{", "[", or "true".
Case 3
Input containing a mistake: // comment
What ANTLR says:
line 1:0 token recognition error at: '/'
line 1:1 token recognition error at: '/'
line 1:3 token recognition error at: 'c'
line 1:4 token recognition error at: 'o'
line 1:5 token recognition error at: 'm'
line 1:6 token recognition error at: 'm'
line 1:7 token recognition error at: 'e'
line 1:8 token recognition error at: 'nt'
line 2:0 mismatched input '<EOF>' expecting {'{', '[', 'true', 'false', 'null', STRING, NUMBER}
Problems:
For a regular person, "token recognition error" is an absolutely meaningless phrase. What the heck is a "token"? In the ANTLR world perhaps there is an important technical difference between "recognition" versus "matching", but this nuance will be lost on the unsuspecting person who was writing some JSON and didn't realize that comments aren't allowed.
The error gets reported redundantly for each character -- except for "nt", which is oddly grouped together (why??)
Again, the column numbers are all calculated incorrectly -- column numbers start at #1!
Possible improved wording:
[line 1, col 1] The character "/" cannot be used here.
[line 1, col 4] The letter "c" cannot be used here.
[line 2, col 1] The input ended unexpectedly; expecting to see an expression such as "{", "[", or "true".
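To make the "one message per run of bad characters" idea concrete, here is a minimal sketch in Python. It works purely on the default message strings rather than inside ANTLR, and both the function name `collapse_token_errors` and the rule "errors at adjacent columns on the same line belong to one run" are assumptions made for illustration, not existing ANTLR behavior.

```python
import re

# Matches the default lexer complaint, e.g. "line 1:3 token recognition error at: 'c'"
TOKEN_ERROR = re.compile(r"^line (\d+):(\d+) token recognition error at: '(.*)'$")


def collapse_token_errors(lines):
    """Keep only the first message of each run of adjacent unrecognized characters."""
    kept = []
    prev_line = None  # line number of the previous token recognition error
    next_col = None   # column where an adjacent follow-up error would start
    for raw in lines:
        m = TOKEN_ERROR.match(raw)
        if m is None:
            # Not a lexer complaint: keep it and end the current run.
            kept.append(raw)
            prev_line = None
            continue
        line, col, text = int(m.group(1)), int(m.group(2)), m.group(3)
        if not (line == prev_line and col == next_col):
            kept.append(raw)  # first error of a new run
        prev_line, next_col = line, col + len(text)
    return kept
```

Fed the nine messages from Case 3, this keeps three of them: the first "/" at 1:0, the "c" at 1:3 (the skipped space breaks the run), and the final "mismatched input" message.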
Three options for improving this
Option A: Tweak the default generated messages
The above cases suggest some straightforward fixes to the default error messages:
Replace robot voice like "no viable alternative" with intelligible sentences using commonplace English phrasing
Replace programmer punctuation with normal English punctuation
Measure column numbers correctly
Implement a heuristic to suppress explosions of duplicate error messages
Implement a heuristic to avoid long lists of suggested alternatives
These all seem like relatively easy fixes. Would you accept a PR to implement some of these ideas?
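As a rough illustration of how small these changes could be, here is a sketch in Python that rewrites the default message strings after the fact. It is not how ANTLR builds its messages internally; the names `friendly` and `summarize`, the regular expressions, and the limit of three suggestions are all assumptions chosen to reproduce the improved wordings proposed above. Messages it does not recognize only get the position prefix rewritten.

```python
import re

POSITION = re.compile(r"^line (\d+):(\d+) (.*)$")
MISMATCHED_EOF = re.compile(r"^mismatched input '<EOF>' expecting \{(.*)\}$")
TOKEN_ERROR = re.compile(r"^token recognition error at: '(.*)'$")


def summarize(expected, limit=3):
    # Keep at most `limit` alternatives and swap apostrophes for double quotes.
    items = [item.strip("'") for item in expected.split(", ")][:limit]
    quoted = [f'"{item}"' for item in items]
    if len(quoted) < 3:
        return " or ".join(quoted)
    return ", ".join(quoted[:-1]) + ", or " + quoted[-1]


def friendly(raw):
    m = POSITION.match(raw)
    if m is None:
        return raw  # unknown format: leave untouched
    line, col, msg = int(m.group(1)), int(m.group(2)) + 1, m.group(3)  # 1-based column
    prefix = f"[line {line}, col {col}] "
    eof = MISMATCHED_EOF.match(msg)
    if eof:
        return (prefix + "The input ended unexpectedly; expecting to see an "
                "expression such as " + summarize(eof.group(1)) + ".")
    tok = TOKEN_ERROR.match(msg)
    if tok:
        text = tok.group(1)
        if len(text) != 1:
            kind = "text"
        elif text.isalpha():
            kind = "letter"
        else:
            kind = "character"
        return prefix + f'The {kind} "{text}" cannot be used here.'
    return prefix + msg  # fallback: only the position is rewritten


print(friendly("line 1:0 token recognition error at: '/'"))
```

A string-level rewrite like this cannot recover the offending token for "no viable alternative" (the message only contains the input consumed so far), which is one reason the fix belongs inside ANTLR rather than in a wrapper.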
Option B: Tell people to roll their own ErrorListener
In theory, most of these problems can be solved by writing a bunch of custom code that hooks into the ErrorListener API. For that to be workable, we would need to greatly improve the documentation and examples. Doing this right is an advanced undertaking.
But doesn't that feel like a half-baked answer? Isn't it pushing responsibility to the end user, for a problem that really should be ANTLR's responsibility? The whole point of ANTLR is to avoid handcoding parser algorithms. So it seems there should be some declarative way to author contextual error messages in the grammar file.
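For reference, the kind of hand-written listener Option B asks of users might look roughly like the Python sketch below. To keep it runnable without the ANTLR runtime installed, the class is a plain stand-in that only mirrors the `syntaxError(recognizer, offendingSymbol, line, column, msg, e)` callback of the Python runtime's `ErrorListener`; the class name, the wording of the messages, and the use of `SimpleNamespace` as a fake token are assumptions for illustration.

```python
from types import SimpleNamespace


class FriendlyErrorListener:
    """Stand-in listener; with the real runtime it would subclass ErrorListener."""

    def __init__(self):
        self.messages = []

    def syntaxError(self, recognizer, offendingSymbol, line, column, msg, e):
        where = f"[line {line}, col {column + 1}]"  # report 1-based columns
        text = getattr(offendingSymbol, "text", None)
        if text is None:
            # Lexer errors arrive without an offending token.
            detail = "This part of the input could not be read."
        elif text == "<EOF>":
            detail = "The input ended unexpectedly."
        else:
            detail = f'"{text}" cannot be used here.'
        self.messages.append(f"{where} {detail}")


# Simulate the Case 1 callback with a fake token object.
listener = FriendlyErrorListener()
listener.syntaxError(None, SimpleNamespace(text="true"), 1, 2,
                     "no viable alternative at input '{true'", None)
print(listener.messages[0])
```

With a generated parser, the listener would be registered by calling `removeErrorListeners()` and then `addErrorListener(listener)` on the parser (and on the lexer, for token recognition errors). Even this toy version has to know about EOF tokens and lexer-versus-parser differences, which supports the point that doing it properly is an advanced undertaking.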
Option C: Provide a declarative mechanism for customizing errors
I'm not sure how this would work exactly. Starting from some simple grammars, maybe we could work out a number of example inputs where the default messages are insufficient, and then provide examples of ErrorListener overrides that report a customized message. This exercise might motivate a natural way to declare these custom messages in the ANTLR grammar language.
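Purely as a thought experiment, the end result of that exercise might boil down to a lookup table like the Python sketch below. Nothing here is existing ANTLR syntax or API: the table `CUSTOM_MESSAGES`, the function `lookup_message`, and the idea of keying on (parser rule, offending text) are hypothetical, and the rule name "obj" is assumed to match the object rule in JSON.g4.

```python
# Hypothetical declarative table: (parser rule, offending text) -> custom message.
# A text of None acts as a wildcard for any offending input inside that rule.
CUSTOM_MESSAGES = {
    ("obj", "true"): 'An object member must start with a quoted name, '
                     'for example {"key": true}.',
    ("obj", None): 'Expected a quoted member name or "}" inside this object.',
}


def lookup_message(rule, offending_text, default):
    """Return the most specific custom message, or the default if none applies."""
    for key in ((rule, offending_text), (rule, None)):
        if key in CUSTOM_MESSAGES:
            return CUSTOM_MESSAGES[key]
    return default
```

If a table like this turns out to cover the interesting cases, the entries could plausibly move into the grammar file as annotations on rules or alternatives, with ANTLR generating the lookup.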