brucemiller / LaTeXML

LaTeXML: a TeX and LaTeX to XML/HTML/ePub/MathML translator.
http://dlmf.nist.gov/LaTeXML/

Smart ligatures for \mathrel and friends #316

Closed dginev closed 11 years ago

dginev commented 12 years ago

[Originally Ticket 1640]

TeX provides macros for explicitly declaring operator symbols, namely: \mathop, \mathrel, \mathbin, \mathopen, \mathclose, \mathpunct and \mathord.

LaTeXML handles them by creating an XMWrap element with an appropriate "role" attribute.

However, very frequently authors intend the arguments to these macros to be treated as single tokens, especially in the resulting XML representation (the original motivation for the ticket comes from our sTeX use cases).

I propose smart ligatures for these cases, where any XMWrap containing only a horizontal list of tokens with no vertical attachments (scripts, applications, etc.) gets merged into a single token element.

The code below is tested and demonstrates the functionality intended:

# If we only have XMTok children and none of them has a role ending in "SCRIPT", concatenate:
DefMathRewrite(xpath=>'descendant-or-self::ltx:XMWrap[(@role="BIGOP" or @role="OP"
 or @role="ADDOP" or @role="OPEN" or @role="CLOSE" or @role="RELOP"
 or @role="MULOP") and (not(child::*[local-name() != "XMTok"]))
 and (not(child::*["SCRIPT" = substring(@role, string-length(@role) - 5)]))]',
 replace=>sub {
   my ($document,$node) = @_;
   # Shallow clone: keeps the wrapper's attributes (including role), drops its children
   my $replacement = $node->cloneNode(0);
   # Concatenated text of all the child tokens
   my $content = $node->textContent;
   $replacement->appendText($content);
   # Turn the cloned wrapper into a single token
   $replacement->setName('ltx:XMTok');
   $document->getNode->appendChild($replacement);
 });

dginev commented 12 years ago

The original use case we had was:

\mathrel{=:}

and the relevant discussion on the sTeX Trac can be seen at #168
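
To make the intended effect concrete, here is a minimal standalone sketch of the proposed rule in Python, using only the standard library. It is a model of the rewrite, not LaTeXML code: the function name `merge_wrap`, the un-namespaced element names, and the roles assigned to `=` and `:` in the sample tree are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Wrapper roles taken from the XPath in the proposal.
WRAP_ROLES = {"OP", "BIGOP", "ADDOP", "MULOP", "OPEN", "CLOSE", "RELOP"}


def merge_wrap(wrap):
    """Return a single XMTok replacing `wrap`, or None if the rule does not apply.

    Hypothetical helper; it mirrors the three conditions of the XPath.
    """
    # 1. Only wrappers with one of the operator-like roles.
    if wrap.tag != "XMWrap" or wrap.get("role") not in WRAP_ROLES:
        return None
    children = list(wrap)
    # 2. Every child must be a plain token.
    if any(child.tag != "XMTok" for child in children):
        return None
    # 3. No child may carry a role ending in "SCRIPT".
    if any((child.get("role") or "").endswith("SCRIPT") for child in children):
        return None
    # Keep the wrapper's attributes, concatenate the children's text.
    token = ET.Element("XMTok", dict(wrap.attrib))
    token.text = "".join(child.text or "" for child in children)
    return token


# Assumed, simplified tree for \mathrel{=:} (namespace prefix omitted).
wrapped = ET.fromstring(
    '<XMWrap role="RELOP">'
    '<XMTok role="RELOP">=</XMTok>'
    '<XMTok role="METARELOP">:</XMTok>'
    '</XMWrap>'
)
merged = merge_wrap(wrapped)
print(ET.tostring(merged, encoding="unicode"))
# prints <XMTok role="RELOP">=:</XMTok>
```

The merged token inherits the wrapper's role, which is the point of the ligature: the author's \mathrel decides the role, and the argument contributes only the text.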

brucemiller commented 12 years ago

Hmmm... maybe this would be better as the constructor for \mathop (& friends), rather than a rewrite. Excluding scripts from the children seems sensible, but I wonder. Primarily \mathop affects spacing; although it seems rarely used in wild code, it still invites some kind of abuse: "Space this like an operator". (done it myself :>)

I wonder if you'd see something like \mathop{(a+b)} (c) in some sort of compositional context? (with the expectation that the space between the composed function and c would be different than if it were invisible times).

I do understand your use case; I wonder if we need to be more careful about the rules.

dginev commented 12 years ago

How about we also exclude cases where tokens with role "UNKNOWN" and "NUMBER" are present?

Or, a more exotic idea, concatenate into a single token only when the math parsing fails, which would imply this was never a well-formed thing to be parsed to begin with?

brucemiller commented 12 years ago

Funny that you'd propose that: I had made a quick experiment where I wouldn't even try running the full parser on Wrapped things; just go directly to the kludge_parser, on the grounds they aren't meaningful notation anyway. Unfortunately, it induced too many changes in DLMF, meaning it either needs to be studied more carefully, or that it's wrong (or both).

Interestingly, the experiment goes in the opposite direction from yours, avoiding the parsing rather than using it proactively (as a probe?). However, the reasons behind both may actually be the same. On the one hand, the wrapped things in DLMF are often more carefully constructed, so they still work. But on the other, a reason I wanted to try the kludge_parser was exactly to avoid getting invisible times between random pairs of wrapped tokens in "wilder" texts.

So, I think your goal is clearly right; just a question of how to get there.

brucemiller commented 11 years ago

Taking another peek at this, and the original ticket. I wonder whether XMWrap is too late, and used for too many other things, but the fact that the author used \mathrel (or friend) ought to be a strong push.....

And now glancing at previous comments, I see that I said essentially the same thing before, but got sidetracked by your grammar comments.... And just sidetracked generally.

So, I think the best idea might be to formulate a scan of the argument of \mathrel (and similar rules for the others); if it meets that criterion, convert it to a single token, rather than wrapped.

If you have candidate rules, I'll work with that...

brucemiller commented 11 years ago

In the end, I adopted your solution; or pretty close. We want to act only on XMWrap elements that have certain roles, and also to join only tokens that have certain roles (almost the same set). So, the xpath actually gets slightly worse:

       # Only XMWrap's from the above class of operators
       .'(@role="OP" or @role="BIGOP" or @role="ADDOP" or @role="MULOP" '
       . 'or @role="OPEN" or @role="CLOSE" or @role="RELOP")'
       # with only XMTok as children with the roles in (roughly) the same set
       .' and not(child::*[local-name() != "XMTok"])'
       .' and not(ltx:XMTok['
       .   '@role !="OP" and @role!="BIGOP" and @role!="ADDOP" and @role!="MULOP" '
       .   'and @role!="OPEN" and @role!="CLOSE" and @role!="RELOP" and @role!="METARELOP"'
       .   '])]'

But, it seems to work!! Hope it works in your context! Thanks;
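
For readers who want to see what the tightened condition accepts and rejects, the following is a small Python sketch of the predicate only, written against the standard library rather than LaTeXML. The name `should_merge`, the un-namespaced element names, and the roles in the two sample trees are assumptions for illustration; the role sets are copied from the XPath fragment.

```python
import xml.etree.ElementTree as ET

# Roles a wrapper must have, and roles its child tokens may have,
# as listed in the XPath fragment.
WRAP_ROLES = {"OP", "BIGOP", "ADDOP", "MULOP", "OPEN", "CLOSE", "RELOP"}
TOKEN_ROLES = WRAP_ROLES | {"METARELOP"}


def should_merge(wrap):
    """Model of the XPath predicate (hypothetical helper, not LaTeXML API)."""
    if wrap.tag != "XMWrap" or wrap.get("role") not in WRAP_ROLES:
        return False
    for child in wrap:
        # Only XMTok children are allowed.
        if child.tag != "XMTok":
            return False
        role = child.get("role")
        # In XPath 1.0, @role != "X" is false when the attribute is absent,
        # so a token without a role does not block the merge.
        if role is not None and role not in TOKEN_ROLES:
            return False
    return True


# Assumed tree for \mathrel{=:}: every token has an operator-like role.
relation = ET.fromstring(
    '<XMWrap role="RELOP">'
    '<XMTok role="RELOP">=</XMTok>'
    '<XMTok role="METARELOP">:</XMTok>'
    '</XMWrap>'
)

# Assumed tree for \mathop{(a+b)}: the identifiers have role UNKNOWN.
composite = ET.fromstring(
    '<XMWrap role="OP">'
    '<XMTok role="OPEN">(</XMTok>'
    '<XMTok role="UNKNOWN">a</XMTok>'
    '<XMTok role="ADDOP">+</XMTok>'
    '<XMTok role="UNKNOWN">b</XMTok>'
    '<XMTok role="CLOSE">)</XMTok>'
    '</XMWrap>'
)

print(should_merge(relation))   # True
print(should_merge(composite))  # False
```

Under these assumptions the allow-list on child roles covers the earlier concerns in one step: script tokens and UNKNOWN or NUMBER tokens are all outside the set, so a compositional argument such as \mathop{(a+b)} stays wrapped while \mathrel{=:} collapses to one token.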