asciidoctor / asciimath

Asciimath parser
MIT License
24 stars 18 forks source link

Buggy tokenisation of text(), font() #7

Closed opoudjis closed 6 years ago

opoudjis commented 6 years ago

The contents of text() and font() are being parsed with the same tokenisation of as the rest of AsciiMath.

As a result, text("K") is being rendered as <mtext><mtext>K</mtext></mtext>: text renders as <mtext>...</mtext>, and "K" as <mtext>K</mtext>.

Similarly, other text within text() and font() is being tokenised, which introduces whitespaces: text(“K”) (with smart quotes) is being rendered as <mtext><mrow><mi>\u201C</mi><mi>K</mi><mi>\u201D</mi></mrow></mtext>

And of course, if any of the text happens to be an AsciiMath instruction, it is treated as an AsciiMath instruction: text(int x) renders as <mtext><mrow><mo>&#x222B;</mo><mi>x</mi></mrow></mtext>. In a text we were about to go live with, text("AA") is ending up being rendered as ∀!

This is completely wrong. The contents of text() and font() are text, and must be treated as text. The culprit is the code:

def parse_simple_expression(tok, depth)
      t1 = tok.next_token
...
when :unary, :font
          s = parse_simple_expression(tok, depth)
          {:type => t1[:type], :s => s, :operator => t1[:value]}

That is only valid where t1[:value] is sqrt. For text and :font, a different tokeniser needs to kick in, which grabs all text between the following :lparen token and its matching :rparen token, and returns it to :s as just text. Something like:

when :unary_text, :font
          s = parse_text_expression(tok)
....

def parse_text_expression(tok)
  ret = ""
  while(t1 = tok.next_token)
    case t1[:type]
      when :lparen, :lrparen
        ret = ""
      when :rparen, :rlparen
        break
     else
       ret << t1.text # doesn't exist yet: get the literal token, not its symbol lookup
     end
     ret
end
pepijnve commented 6 years ago

Thanks for testing and reporting this so extensively! Simple oversight in the parser indeed. I’ll try to get this fixed ASAP.

pepijnve commented 6 years ago

Looking at this in a bit more detail on a laptop now instead of on my phone. So the root issue here is that I never really got round to properly implementing the TeX alternative part of asciimath. The fact that text is being picked up to begin with is probably more by accident than intentional.

That being said, the intention is of course to achieve parity with the js code at some point in this gem. Not sure when/if I'll actually have time to implement this myself. If you would like to contribute to the project, PRs are more than welcome of course.

Short term, I'll have a look at getting at least text fixed. Not sure what to do with font though. I can't find any grammar for that on http://asciimath.org or in the source code itself https://github.com/asciimath/asciimathml/blob/master/ASCIIMathML.js. Could describe what you expect this to do exactly?

opoudjis commented 6 years ago

For our purposes, font() is not needed: we are using this to render AsciiMath in Word, via the chain AsciiMath > MathML > OOXML. We just need text() working. Of course, the content of font() is processed in the same way as text(), so even if it doesn't render correctly in presentation, at least it won't have mangled AsciiMath in it.

I would PR, but I am deluged; what I sketched above is as far as I could get (and in fact I haven't had time to check my notifications before now). But thank you for looking at it!

opoudjis commented 6 years ago

I will test our side (when I get a minute, may be a couple of weeks—sorry), and let you know how it does in our tool chain.

pepijnve commented 6 years ago

Do you have any spec reference for font()? I would be happy to ADD it but need to know how it’s supposed to work.

opoudjis commented 6 years ago

All I know unfortunately is what http://asciimath.org says, which is that the unary operators for fonts are:

bb "AaBbCc" : boldface
bbb "AaBbCc"  : doublestruck
cc "AaBbCc" : script
tt "AaBbCc" : typewriter
fr "AaBbCc" : fraktur
sf "AaBbCc" : sans-serif  

The grammar at the bottom of the page says as much:

u ::= sqrt | text | bb | other unary symbols for font commands
opoudjis commented 6 years ago

.... Which by the way indicates that font() is not an AsciiMath operator. (And given that font() doesn't say which font, it can't be one.) It's presumably a short-hand notation for "bb or bbb or cc or tt or fr or sf" in a grammar somewhere.

pepijnve commented 6 years ago

Which by the way indicates that font() is not an AsciiMath operator

That's the impression I get as well. That's why it's not even present in the parser and gets treated like a generic function name. It's easy enough to add special logic for this once I know what that's supposed to be.

Do you know of any other asciimath parser implementations that treat this differently? Asciimath.org turns font(foo) into

<math xmlns="http://www.w3.org/1998/Math/MathML">
  <mstyle displaystyle="true">
    <mi>f</mi>
    <mi>o</mi>
    <mi>n</mi>
    <mi>t</mi>
    <mrow>
      <mo>(</mo>
      <mi>f</mi>
      <mo>&#x221E;</mo>
      <mo>)</mo>
    </mrow>
  </mstyle>
</math>

so it looks like the canonical implementation doesn't have any special treatment for font() either.

opoudjis commented 6 years ago

https://runarberg.github.io/ascii2mathml/ adds rm, bf, it, with bb for doublestruck instead of boldface. Everyone else that I've Googled has the same commands as asciimath's home page; I see no evidence anywhere that font() exists; and again, I don't see how it can exist. ("Put this text in font". What font? It can only make any sense as a binary operator, with a text argument and a style argument.)

pepijnve commented 6 years ago

Ok. Closing this for the time being then. Just FYI, the gem supports (or at least should) the following font commands:

'bb' => {:value => :bold, :type => :font},
'bbb' => {:value => :double_struck, :type => :font},
'ii' => {:value => :italic, :type => :font},
'bii' => {:value => :bold_italic, :type => :font},
'cc' => {:value => :script, :type => :font},
'bcc' => {:value => :bold_script, :type => :font},
'tt' => {:value => :monospace, :type => :font},
'fr' => {:value => :fraktur, :type => :font},
'bfr' => {:value => :bold_fraktur, :type => :font},
'sf' => {:value => :sans_serif, :type => :font},
'bsf' => {:value => :bold_sans_serif, :type => :font},
'sfi' => {:value => :sans_serif_italic, :type => :font},
'sfbi' => {:value => :sans_serif_bold_italic, :type => :font},
opoudjis commented 5 years ago

The fix is a-ok, and thank you! Would it be possible for you to do a release? We depend on the fix in our production gem suite, and Asciimath is embedded a little too deep for us to rely on gemfile to pull the fix from github.

pepijnve commented 5 years ago

I've just built and published version 1.0.7. That includes this fix.

mojavelinux commented 5 years ago

I arrived here after the assertions in Asciidoctor started failing. Now it seems like all non-ASCII characters and XML special characters are being encoded as hex character references. In other words, instead of &lt;, we get &#x003C;.

UPDATE: Actually, it's the opposite of what I said ^

The line that introduces this change is this one:

text.encode(Encoding::US_ASCII, :xml => :text)

I think AsciiMath should honor the encoding of the input string:

text.encode(:xml => :text)

or

text.encode(text.encoding, :xml => :text)

Otherwise, we end up with mixed encodings.

mojavelinux commented 5 years ago

Actually, I think this may be safe after all. It does strike me as something that could be configurable though. But it's not as much of a showstopper as I had originally thought.

mojavelinux commented 5 years ago

The main change was that &#x00B1; now becomes &#xB1; (as a result of using Ruby's XML encoding).

pepijnve commented 5 years ago

@mojavelinux agreed, the encoding should be configurable. My rationale here was that 7-bit ASCII and anything outside that as unicode code point literals would most likely be safe. But now that I think about it again that doesn't handle fixed width multi-byte encodings (e.g., UTF-16) correctly. I'll log an issue for this and get it fixed.

pepijnve commented 5 years ago

@mojavelinux on second thought, the way this is used is actually safe and does achieve the desired effect. The escaped content is append to another string and String#<< ensures encodings are consistent, converting the appended bit if necessary. It's up to the consumer of the returned mathml or html string to choose the final external encoding.

The reason I went for escaping anything that's not in 7-bit ASCII is to avoid possible mixups with the various mathematical symbols. The unicode code points are easier to check for correctness.

mojavelinux commented 5 years ago

Yep, after more thought, I agree it's safe.

We might want to consider allowing the symbols to be encoded in UTF-8, but that's really a separate "nice to have" request.

Tests are all green!