antlr / antlr4

ANTLR (ANother Tool for Language Recognition) is a powerful parser generator for reading, processing, executing, or translating structured text or binary files.
http://antlr.org
BSD 3-Clause "New" or "Revised" License

Improve ANTLR's error messages to be understandable by typical humans #3153

Open octogonz opened 3 years ago

octogonz commented 3 years ago

Summary

ANTLR's error handling is fairly good, but the error messages that it reports could be more friendly:

  1. ANTLR's messages describe the problem in terms of the parser's control flow, when they should be telling the user what was wrong with their input
  2. The messages use "robot voice" -- unfamiliar jargon, bizarre punctuation, and barely literate sentence fragments

As a motivating example, consider the JSON.g4 demo grammar. It is a very small grammar for a very well-behaved input language. It should be easy to parse. Let's look at the error messages that a user might be confronted with.

Case 1

Input containing a mistake: { true }

What ANTLR says:

line 1:2 no viable alternative at input '{true'

Problems:

Possible improved wording:

[line 1, col 2] The keyword "true" cannot be used here.

Case 2

Input containing a mistake: (an empty string)

What ANTLR says:

line 2:0 mismatched input '<EOF>' expecting {'{', '[', 'true', 'false', 'null', STRING, NUMBER}

Problems:

Possible improved wording:

[line 1, col 1] The input ended unexpectedly; expecting to see an expression such as "{", "[", or "true".

Case 3

Input containing a mistake: // comment

What ANTLR says:

line 1:0 token recognition error at: '/'
line 1:1 token recognition error at: '/'
line 1:3 token recognition error at: 'c'
line 1:4 token recognition error at: 'o'
line 1:5 token recognition error at: 'm'
line 1:6 token recognition error at: 'm'
line 1:7 token recognition error at: 'e'
line 1:8 token recognition error at: 'nt'
line 2:0 mismatched input '<EOF>' expecting {'{', '[', 'true', 'false', 'null', STRING, NUMBER}

Problems:

Possible improved wording:

[line 1, col 1] The character "/" cannot be used here.
[line 1, col 2] The letter "c" cannot be used here.
[line 2, col 1] The input ended unexpectedly; expecting to see an expression such as "{", "[", or "true".

Three options for improving this

Option A: Tweak the default generated messages

The above cases suggest some straightforward fixes to the default error messages:

  1. Replace robot voice like "no viable alternative" with intelligible sentences using commonplace English phrasing
  2. Replace programmer punctuation with normal English punctuation
  3. Measure column numbers correctly
  4. Implement a heuristic to suppress explosions of duplicate error messages
  5. Implement a heuristic to avoid long lists of suggested alternatives

These all seem like relatively easy fixes. Would you accept a PR to implement some of these ideas?
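To make a few of the Option A items concrete (friendlier wording, 1-based columns, shorter lists of alternatives), here is a rough Python sketch that rewrites the default console messages quoted above. It is not ANTLR code: the function names (position, alternatives, friendly) are made up for illustration, and it works on the printed message text only, which is why the "no viable alternative" case cannot single out the offending token the way the proposed wording does. It also assumes that ANTLR's reported column is 0-based and should be shown 1-based.

```python
import re

# Shapes of the three default console messages quoted in this issue.
NO_VIABLE = re.compile(r"^line (\d+):(\d+) no viable alternative at input '(.*)'$")
MISMATCHED = re.compile(r"^line (\d+):(\d+) mismatched input '(.*)' expecting (.*)$")
TOKEN_ERROR = re.compile(r"^line (\d+):(\d+) token recognition error at: '(.*)'$")


def position(line, col):
    # Assumption: the reported column is 0-based; display it 1-based.
    return f"[line {line}, col {int(col) + 1}]"


def alternatives(expecting, limit=3):
    # Pull quoted literals and token names out of "{'{', '[', STRING}",
    # keep only the first few, and join them as an English list.
    found = re.findall(r"'(?:[^'\\]|\\.)*'|[A-Za-z_]\w*", expecting)
    shown = [f'"{n[1:-1]}"' if n.startswith("'") else n for n in found[:limit]]
    if len(shown) <= 2:
        return " or ".join(shown)
    return ", ".join(shown[:-1]) + ", or " + shown[-1]


def friendly(message):
    m = TOKEN_ERROR.match(message)
    if m:
        line, col, text = m.groups()
        if len(text) != 1:
            kind = "text"
        elif text.isalpha():
            kind = "letter"
        else:
            kind = "character"
        return f'{position(line, col)} The {kind} "{text}" cannot be used here.'
    m = MISMATCHED.match(message)
    if m:
        line, col, text, expecting = m.groups()
        if text == "<EOF>":
            problem = "The input ended unexpectedly"
        else:
            problem = f'The input "{text}" was not expected'
        return (f"{position(line, col)} {problem}; expecting to see "
                f"an expression such as {alternatives(expecting)}.")
    m = NO_VIABLE.match(message)
    if m:
        line, col, text = m.groups()
        return f'{position(line, col)} The input near "{text}" cannot be used here.'
    return message  # unknown shape: pass through unchanged


print(friendly("line 1:0 token recognition error at: '/'"))
# prints: [line 1, col 1] The character "/" cannot be used here.
```

A real fix would live inside the runtime's error strategy, where the offending token and the expected token set are available as objects instead of being parsed back out of a string.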

Option B: Tell people to roll their own ErrorListener

In theory, most of these problems can be solved by writing a bunch of custom code that hooks into the ErrorListener API. For that to be workable, we would need to greatly improve the documentation and examples. Doing this right is an advanced undertaking.
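As an illustration of what such custom code might look like, here is a minimal Python sketch of a listener that collapses the flood of lexer errors from Case 3 into one message per line. The class name CollapsingErrorListener and the one-per-line heuristic are assumptions made for this example. To keep it runnable without the ANTLR runtime installed, the class only mimics the syntaxError(recognizer, offendingSymbol, line, column, msg, e) callback; in real use it would subclass the Python runtime's ErrorListener and be registered with removeErrorListeners() and addErrorListener().

```python
class CollapsingErrorListener:
    """Collects errors and keeps only the first lexer error on each line."""

    LEXER_PREFIX = "token recognition error at: "

    def __init__(self):
        self.errors = []            # list of (line, 1-based column, text)
        self._last_lexer_line = None

    def syntaxError(self, recognizer, offendingSymbol, line, column, msg, e):
        if msg.startswith(self.LEXER_PREFIX):
            if self._last_lexer_line == line:
                return              # heuristic: suppress repeats on the same line
            self._last_lexer_line = line
            bad = msg[len(self.LEXER_PREFIX):]
            if len(bad) >= 2 and bad[0] == "'" and bad[-1] == "'":
                bad = bad[1:-1]     # drop the quotes ANTLR puts around the text
            text = f'The text starting at "{bad}" cannot be used here.'
        else:
            text = msg              # parser errors pass through unchanged here
        self.errors.append((line, column + 1, text))


# Simulate the callbacks from Case 3 without a real lexer or parser.
listener = CollapsingErrorListener()
for col, ch in [(0, "/"), (1, "/"), (3, "c"), (4, "o")]:
    listener.syntaxError(None, None, 1, col,
                         f"token recognition error at: '{ch}'", None)
listener.syntaxError(None, None, 2, 0,
                     "mismatched input '<EOF>' expecting {'{', '['}", None)

for line, col, text in listener.errors:
    print(f"[line {line}, col {col}] {text}")
```

With the simulated input, four lexer errors and one parser error are reduced to two reported messages. Even this small heuristic needs per-listener state, which gives some sense of why doing it well is an advanced undertaking.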

But doesn't that feel like a half-baked answer? Isn't it pushing responsibility to the end user, for a problem that really should be ANTLR's responsibility? The whole point of ANTLR is to avoid handcoding parser algorithms. So it seems there should be some declarative way to author contextual error messages in the grammar file.

Option C: Provide a declarative mechanism for customizing errors

I'm not sure how this would work exactly. Starting from some simple grammars, maybe we could work out a number of example inputs where the default messages are insufficient, and then provide examples of ErrorListener overrides that report a customized message. This exercise might motivate a natural way to declare these custom messages in the ANTLR grammar language.
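Purely as a thought experiment, the "declarative" part could start out as a lookup table that pairs a parsing context with a message. The Python sketch below is hypothetical from top to bottom: ANTLR has no such feature today, the table format and the lookup function are invented here, and the rule name "obj" is assumed to match the JSON.g4 demo grammar.

```python
# Hypothetical declarative table: (parser rule, offending token text, message).
# None acts as a wildcard. Entries are checked in order; the first match wins.
CUSTOM_MESSAGES = [
    ("obj", "true", 'The keyword "true" cannot be used here.'),
    (None, "<EOF>", "The input ended unexpectedly."),
]


def lookup(rule, token, default):
    """Return the custom message for this context, or the default message."""
    for want_rule, want_token, text in CUSTOM_MESSAGES:
        if want_rule in (None, rule) and want_token in (None, token):
            return text
    return default


print(lookup("obj", "true", "no viable alternative at input '{true'"))
print(lookup("arr", "null", "default message"))
```

An ErrorListener override could consult a table like this using the rule context and offending token it receives. Working through enough example inputs this way might show which keys the table really needs, and from there what a grammar-level syntax should look like.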

KvanTTT commented 3 years ago

[line 1, col 1] The input ended unexpectedly; expecting to see an expression such as "{", "[", or "true".

These are not expressions, but words.