Khazuar commented 4 years ago

Requirements

In our application, users frequently enter math formulas, which are then parsed into semantic ASTs and used in further analysis and checking. This process needs to be highly robust and adaptable, and we need to be able to customize it for different users. We need both the ability to enter symbols and entire formulas via the virtual keyboard (mixed with the normal keyboard) and LaTeX commands for advanced users.

Current approach

Right now, we take the $latex() output of the mathfield and parse it using a family of EBNF grammars, which works rather well in general:

it makes it easy for us to extend the parser by simply supporting more latex
the parser's grammar can be altered for different clients/users
users can write latex in the mathfields and/or copy+paste from one field to another rather easily and intuitively

I was trying out the MASTON-output initially, but found it too limited.

Problem

In a lot of cases, the mathfields behave a little too "smart" and their contents are not what one would expect:

there are certain formatting operators added to it, like e.g. \mathop or \mathrm
sometimes the order of inputs is changed, like a_b^c becomes a^c_b. That can be compensated for in the parser, but it is annoying and this order feels less "semantic".
Text-areas get combined, so \mathbb{R}\mathbb{R} becomes \mathbb{RR}.

I see how these optimizations increase the overall quality of the latex and make it better visually, but they make it harder to work with in a parser (or other automation). Users enter a=b and the field converts it to a\mathop{=}b, which looks nicer because of the improved spacing. But the parser needs to understand both a=b and a\mathop{=}b.

It also seems that these improvements are added over time, so with every new version of mathlive we need to re-test our application rather thoroughly.

Suggestion

It would be great if there was a way to get the content of the mathfield in a simplified format that is optimized for being parsed. This could be another accessor like $simpleLatex(), an option for $latex(), or something entirely different that lets us work with the content more easily in an automated way. This doesn't need to solve all the problems I listed above (there's no reason to write \mathbb{R}\mathbb{R} anyways), but it would be nice if this could provide a more stable and reliable interface.

I am also very open to suggestions on how we could change the approach on our side.

arnog commented 4 years ago

Have you tried using mf.$text('json')?

Khazuar commented 4 years ago

I gave the json-output a go and found that it didn't parse a lot of things correctly, e.g. a+b\in\mathbb{N} (see the fiddle). This is the same as the MASTON-syntax I tried some time ago, right? In general I need to be able to parse depending on the math-context the user is in and I need to add new and sometimes complicated notations when necessary. Since this is a very special need, it's probably best if I do the parsing myself. I need reliable input for that though. This doesn't need to be latex of course, but I suspect the internal representation of the mathfield is even more complicated ;)

arnog commented 4 years ago

The intent of MathJSON/MASTON is to produce a structure that is both stable (i.e. independent of the rendering) and easy to parse, which seems to be your use case. I would rather fix the problems that exist in MathJSON/MASTON right now (like the one you mention, that's clearly a bug).

Khazuar commented 4 years ago

It's not just easy to parse, it's already parsed and that's my problem with MASTON. I'd rather have something like an array of tokens (["a", "plus", "b", "elementof", "mathbbStart", "N", "mathbbEnd"]) which doesn't try to guess the syntactic structure of the input yet. One case I'd find hard to solve is \frac{d}{dx} a + b:

if d is a known variable, this is "((d / (d x)) a) + b"
or it's "(a + b) derived for x"
or it's "(a derived for x) + b"

I doubt this can be solved well by a generic solution. I also wouldn't want to wait for a mathlive update every time I needed to change or add something there.
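
To make the flat token array above concrete, here is a minimal sketch of a lexer that produces it from a LaTeX string. Everything in it is an assumption for illustration: the function name tokenize, the token names, and the two lookup tables are hypothetical and not part of MathLive. It deliberately does not guess any syntactic structure beyond pairing a command with the brace group that directly follows it.

```typescript
// Hypothetical token names for a few commands and symbols (not a MathLive API).
const COMMAND_NAMES: Record<string, string> = {
  "\\in": "elementof",
  "\\mathbb": "mathbb",
};
const SYMBOL_NAMES: Record<string, string> = { "+": "plus", "=": "equals" };

function tokenize(latex: string): string[] {
  // Split into \commands and single non-space characters.
  const pieces: string[] = latex.match(/\\[a-zA-Z]+|\S/g) || [];
  const tokens: string[] = [];
  const openGroups: string[] = []; // names of the groups that are still open

  for (let i = 0; i < pieces.length; i++) {
    const piece = pieces[i];
    if (piece.charAt(0) === "\\") {
      const name = COMMAND_NAMES[piece] ?? piece.slice(1);
      if (pieces[i + 1] === "{") {
        // The command owns the group that follows: emit "<name>Start".
        tokens.push(name + "Start");
        openGroups.push(name);
        i++; // skip the opening brace
      } else {
        tokens.push(name);
      }
    } else if (piece === "{") {
      tokens.push("groupStart");
      openGroups.push("group");
    } else if (piece === "}") {
      tokens.push((openGroups.pop() ?? "group") + "End");
    } else {
      tokens.push(SYMBOL_NAMES[piece] ?? piece);
    }
  }
  return tokens;
}
```

With these tables, tokenize("a+b\\in\\mathbb{N}") returns ["a", "plus", "b", "elementof", "mathbbStart", "N", "mathbbEnd"]. Deciding whether \frac{d}{dx} is a fraction or a derivative would then be left entirely to the application's own grammar.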

Khazuar commented 4 years ago

Or maybe something that only represents the syntactic structure of the symbols, without or with only optional meaning:

[
  { text: "a" },
  { superscript: [ "2" ] },
  { text: "+", meaning: "add" },
  { text: "b" },
  { superscript: [ "2" ] },
  { text: "\in", meaning: "elementof" },
  { 
    text: "\mathbb{N}", 
    meaning: "naturalNumbers", 
    innerToken: [
      { text: "N" }
    ]
  }
]

Something that is very generic and versatile, but robust, easy to implement in mathlive and doesn't need to be changed often.
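
As a rough sketch of how such a structure could be typed and consumed, the following declares the token shapes from the example and serializes a token list back to LaTeX. The type names, the subscript field, and the toLatex function are assumptions made for illustration; none of them exist in MathLive.

```typescript
// Hypothetical shapes mirroring the example structure above.
interface SymbolToken {
  text: string;
  meaning?: string; // optional hint; a consumer is free to ignore it
  innerToken?: Token[];
}
interface ScriptToken {
  superscript?: Token[];
  subscript?: Token[]; // assumed counterpart to superscript
}
type Token = string | SymbolToken | ScriptToken;

// Serialize tokens back to LaTeX, always writing subscript before superscript.
function toLatex(tokens: Token[]): string {
  let out = "";
  for (const token of tokens) {
    let piece: string;
    if (typeof token === "string") {
      piece = token;
    } else if ("text" in token) {
      piece = token.text;
    } else {
      const sub = token.subscript ? "_{" + toLatex(token.subscript) + "}" : "";
      const sup = token.superscript ? "^{" + toLatex(token.superscript) + "}" : "";
      piece = sub + sup;
    }
    // Keep a command name from running into a following letter.
    if (/\\[a-zA-Z]+$/.test(out) && /^[a-zA-Z]/.test(piece)) out += " ";
    out += piece;
  }
  return out;
}

const example: Token[] = [
  { text: "a" },
  { superscript: ["2"] },
  { text: "+", meaning: "add" },
  { text: "b" },
  { superscript: ["2"] },
  { text: "\\in", meaning: "elementof" },
  { text: "\\mathbb{N}", meaning: "naturalNumbers", innerToken: [{ text: "N" }] },
];
```

Here toLatex(example) yields a^{2}+b^{2}\in\mathbb{N}. Because the order of scripts is fixed by the serializer rather than by the input, a consumer would never see a^c_b and a_b^c as two different inputs.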

arnog commented 4 years ago

OK, I get it. Fair point about the possible ambiguity of parsing in some cases.

(Although this ambiguity could be resolved with a semantic pass on the MathJSON output, i.e. it would return something like "(d / (d x)) (a + b)" and you could then transform it into something else based on the context)

Another approach is what you're currently doing, which is dealing with a Latex string and it might be the best, especially if your users can enter arbitrary Latex.

The structure you suggest would not be much of an improvement over Latex, and would have the same problems you point out at the beginning:

the order of subscript/superscript could be changed
some operators could be coalesced
additional formatting commands could be inserted

Either way, this is a structure that you can produce from the Latex output already generated; it doesn't seem like you would need to access any of the Mathlive internals to do this.

Khazuar commented 4 years ago

Yes, so right now I'm normalizing all the things mathlive does to the input latex first using replacements and a lexer. But it feels like this is something that mathlive could do itself, since it's applying all these changes in the first place. Hence my original idea with the simplified latex output.
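
For illustration, a replacement-based normalization pass of this kind might look like the sketch below. It only covers the three rewrites mentioned in this thread (\mathop wrapping, script reordering, merged \mathbb groups). The function name normalizeLatex and the exact patterns are assumptions, and the rewrites a given mathlive version actually applies may differ.

```typescript
// Sketch of a normalizer that undoes three rewrites described in this thread.
// The patterns are intentionally narrow and only handle simple, non-nested arguments.
function normalizeLatex(latex: string): string {
  return latex
    // a\mathop{=}b -> a=b (unwrap single-character operators only)
    .replace(/\\mathop\{([^{}\\])\}/g, "$1")
    // a^c_b -> a_b^c (restore subscript-first order)
    .replace(/\^(\{[^{}]*\}|[^{}\\\s])_(\{[^{}]*\}|[^{}\\\s])/g, "_$2^$1")
    // \mathbb{RR} -> \mathbb{R}\mathbb{R} (split merged runs of letters)
    .replace(/\\mathbb\{([A-Za-z]{2,})\}/g, (_match: string, letters: string) =>
      letters
        .split("")
        .map((letter) => "\\mathbb{" + letter + "}")
        .join("")
    );
}
```

For example, normalizeLatex("a\\mathop{=}b") gives a=b and normalizeLatex("a^c_b") gives a_b^c. Scripts whose argument is a bare command (such as ^\alpha) or a nested group are left untouched by this sketch, which is part of why a pass like this needs re-checking with each mathlive release.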

arnog commented 4 years ago

I'm going to think about this some more. This is not as easy as it sounds :) By the time the output is requested, the original input is long gone and not easy to recover. I think the best path might be a version of MathJSON that doesn't apply any transformation rules (i.e. that would return just the "tokens").

NSoiffer commented 4 years ago

I like that idea best. It can be very useful to build in heuristics for semantics, but there are times when it will be wrong, particularly for specialized areas. Having a way to get at the expression before the semantics are inferred seems like a good idea.

On Wed, Oct 30, 2019 at 2:41 PM Arno Gourdol notifications@github.com wrote:

I'm going to think about this some more. This is not as easy as it sounds :) By the time the output is requested, the original input is long gone and not easy to recover. I think the best path might be a version of MathJSON that doesn't apply any transformation rules (i.e. that would return just the "tokens").


Joewings-jw commented 2 years ago

Is there a fix for this yet? In my use case we're passing the output from the mathfield to a function for analysis, and the drawback is that the syntax of the latex is changed.

arnog commented 10 months ago

There are two ways to handle this:

when possible, mf.latex will return the "verbatim" latex, i.e. exactly as it was entered. However, if the content is changed via editing operations, there is no verbatim context and the result will be a serialized version of the content.
you can parse a LaTeX string to a non-canonical MathJSON, which minimizes the interpretation of the LaTeX (for example, it avoids inferring that "2x" is "2 times x"). Use ce.parse(latex, {canonical: false}) for this.

arnog / mathlive

Feature request: Simple latex content #293
