Parser doesn't give error when encountering unknown characters

bengolds commented 2 years ago

When encountering most special characters, the parser acts as if it hasn't run into any problem at all:

All of the below simply parse as 'x':
x#y
x&y
x$y
x@y
x?y
... and probably others I can't think of.

It'd be great if the parser returned an error like, "Unrecognized symbol: ?" that we could work with to display a better error to the user.

arnog commented 2 years ago

At first glance, what happens is that the parser encounters characters that do not map to any known dictionary definition (i.e. the parser doesn't know whether they're supposed to be operators or something else), and indeed stops parsing. It should signal that the parsing is incomplete, and if it doesn't, that is indeed a bug.

However, the input string is interpreted as valid LaTeX, and some characters may have unexpected results. For example, % is the "start of comment" character, i.e. anything after this character is ignored. Some other characters that have special meaning for LaTeX include { } [ ] $ and \, so if this input is coming directly from a user (as a variable name, for example), it might be worthwhile to do a first sanitizing pass before calling parse().
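As a rough illustration of such a sanitizing pass, here is a minimal sketch. The helper names `findLatexSpecialChars` and `stripLatexSpecialChars` are made up for this example and are not part of the Compute Engine API; the character set is just the one listed above. Whether to strip, escape, or reject the input is an application decision.

```typescript
// Characters mentioned above as having special meaning in LaTeX.
const LATEX_SPECIAL_CHARS = "%{}[]$\\";

interface SpecialCharHit {
  char: string;
  index: number;
}

// Hypothetical helper: report each special character found in raw user input,
// so the caller can reject the input or show a targeted message.
function findLatexSpecialChars(input: string): SpecialCharHit[] {
  const hits: SpecialCharHit[] = [];
  for (let i = 0; i < input.length; i++) {
    const char = input.charAt(i);
    if (LATEX_SPECIAL_CHARS.indexOf(char) !== -1) {
      hits.push({ char: char, index: i });
    }
  }
  return hits;
}

// Hypothetical helper: drop the special characters entirely.
// (An assumption for this sketch; escaping may suit some inputs better.)
function stripLatexSpecialChars(input: string): string {
  let result = "";
  for (let i = 0; i < input.length; i++) {
    const char = input.charAt(i);
    if (LATEX_SPECIAL_CHARS.indexOf(char) === -1) result += char;
  }
  return result;
}
```

The cleaned string (or the list of hits) would then be checked before the input is handed to parse().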

bengolds commented 2 years ago

How should I expect to get the signal that parsing is incomplete? As part of the return value, or as a separate signal?

arnog commented 2 years ago

Right now, when a syntax-error error is returned, it (should) contain the portion that was not parsed. The current implementation is deficient, however. I will improve it as part of addressing this issue. My plan is that if the parsing runs into an unexpected operator, it would return something like this, assuming the input is x@y:

["Error", "x", "syntax-error", ["LatexForm", "@y"]]

The LatexForm expression indicates the fragment of LaTeX that could not be parsed. The first argument of Error ("x") is the part that could be parsed (it could also be a substitute value, depending on the severity of the failure).

When evaluated, the Error function returns this first argument. So the end result of evaluating this whole expression would be x, consistent with the "maximum effort" doctrine, but still preserving the information that an error did occur.
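To make that behavior concrete, here is a small sketch that mimics it on plain JSON arrays. It is not the library's evaluate(): the `Expr` type and the `stripErrors` name are assumptions for this example, and the Error shape is the one proposed above.

```typescript
// Loose MathJSON shape for this sketch: a symbol or string, a number,
// or a function application written as an array [head, ...args].
type Expr = string | number | Expr[];

// Hypothetical helper: replace every ["Error", value, ...] node with its
// first argument, which is what evaluating Error is described as doing.
function stripErrors(expr: Expr): Expr {
  if (!Array.isArray(expr)) return expr;
  const value = expr[1];
  if (expr[0] === "Error" && value !== undefined) {
    // The recovered value may itself contain errors, so recurse into it.
    return stripErrors(value);
  }
  return expr.map((arg) => stripErrors(arg));
}

const cleaned = stripErrors(["Error", "x", "syntax-error", ["LatexForm", "@y"]]);
console.log(JSON.stringify(cleaned)); // "x"
```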

Note that you can have more than one Error expression, depending on how successful the parsing recovery was (i.e. if it recovers, it can fail again later). For example, \frac{x@y}{a@b} would give:

["Divide", 
    ["Error", "x", "syntax-error", ["LatexForm", "@y"]],
    ["Error", "a", "syntax-error", ["LatexForm", "@b"]]
]
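Since errors can be nested anywhere in the result, a caller that wants to report all of them has to walk the expression. A minimal sketch, assuming the Error shape proposed above (the `Expr` and `ParseError` types and the `collectErrors` name are invented for this example):

```typescript
type Expr = string | number | Expr[];

interface ParseError {
  parsed: Expr;     // first argument of Error: the part that was parsed
  code: string;     // e.g. "syntax-error"
  unparsed: string; // LaTeX fragment from LatexForm, "" if absent
}

// Hypothetical helper: depth-first walk that records every Error node.
function collectErrors(expr: Expr, found: ParseError[] = []): ParseError[] {
  if (!Array.isArray(expr)) return found;
  if (expr[0] === "Error") {
    const parsed = expr[1];
    const code = expr[2];
    const where = expr[3];
    const fragment =
      Array.isArray(where) && where[0] === "LatexForm" ? where[1] : undefined;
    found.push({
      parsed: parsed === undefined ? "" : parsed,
      code: typeof code === "string" ? code : "",
      unparsed: typeof fragment === "string" ? fragment : "",
    });
  }
  // Keep walking the arguments: a recovered value can contain further errors.
  for (const arg of expr.slice(1)) collectErrors(arg, found);
  return found;
}

const errors = collectErrors([
  "Divide",
  ["Error", "x", "syntax-error", ["LatexForm", "@y"]],
  ["Error", "a", "syntax-error", ["LatexForm", "@b"]],
]);
console.log(errors.map((e) => e.unparsed)); // [ '@y', '@b' ]
```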

If you want to get rid of the errors, and just have a "cleaned up" expression, you simply evaluate it: expr.evaluate() -> ["Divide", "x", "a"]. If you serialize it to LaTeX, without evaluating it first, the LaTeX will highlight the error:

\frac {x \texttt{\textcolor{red}{@y}} } {a \texttt{\textcolor{red}{@b}} }

[edited to clarify that the first argument of Error would be the portion of the parsing that was successful]

strickinato commented 2 years ago

Sure enough! I think this can probably be closed, and if there are other deficiencies, we (speaking for @bengolds here) can open a new ticket 🙏

> c.parse("x@y").json
[ 'Error', 'x', "'syntax-error'", [ 'LatexForm', "'@y'" ] ]
> c.parse("x&y").json
[ 'Error', 'x', "'syntax-error'", [ 'LatexForm', "'&y'" ] ]
> c.parse("x$y").json
[ 'Error', 'x', "'syntax-error'", [ 'LatexForm', "'$y'" ] ]
> c.parse("x?y").json
[ 'Error', 'x', "'syntax-error'", [ 'LatexForm', "'?y'" ] ]
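For the user-facing message requested at the top of the thread, something like the sketch below could work on these results. It relies on two assumptions drawn only from the output shown here: string values are wrapped in single quotes, and the first character of the unparsed fragment is the offending symbol. The `unquote` and `describeSyntaxError` helpers are made up for this example.

```typescript
// Remove one pair of surrounding single quotes, if present.
function unquote(s: string): string {
  const quoted =
    s.length >= 2 && s.charAt(0) === "'" && s.charAt(s.length - 1) === "'";
  return quoted ? s.slice(1, -1) : s;
}

// Hypothetical helper: build a message from a top-level Error result,
// or return null when the result is not an Error with a LatexForm fragment.
function describeSyntaxError(json: unknown): string | null {
  if (!Array.isArray(json) || json[0] !== "Error") return null;
  const where = json[3];
  if (!Array.isArray(where) || where[0] !== "LatexForm") return null;
  const fragment = where[1];
  if (typeof fragment !== "string") return null;
  const rest = unquote(fragment);
  if (rest.length === 0) return null;
  // Assumption: parsing stopped at the first character of the fragment.
  return "Unrecognized symbol: " + rest.charAt(0);
}

const result = ["Error", "x", "'syntax-error'", ["LatexForm", "'?y'"]];
console.log(describeSyntaxError(result)); // Unrecognized symbol: ?
```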

cortex-js / compute-engine

Parser doesn't give error when encountering unknown characters #31