NSoiffer / MathCAT

MathCAT: Math Capable Assistive Technology for generating speech, braille, and navigation.
MIT License

Panic related to unicode characters #260

ajirving closed 7 months ago

ajirving commented 7 months ago

I've encountered an input that causes a panic with this error:

byte index 1 is not a char boundary; it is inside '𝟏' (bytes 0..4) of `𝟏` 

The MathML is below; I've trimmed it down as much as I can. The error message suggests the issue relates to the bold superscript 1: when I removed the bold on that character, it worked. However, removing other, seemingly unrelated parts of the formula also makes it work, so I don't really understand the cause.

Here's the MathML:

<math>
  <msup>
    <mi>H</mi>
    <mrow>
      <mrow>
        <mn mathvariant="bold">1</mn>
      </mrow>
    </mrow>
  </msup>
  <mrow data-mjx-texclass="INNER">
    <mo data-mjx-texclass="OPEN">(</mo>
    <mi>G</mi>
  </mrow>
</math>

NSoiffer commented 7 months ago

Thanks for the bug report. I can replicate the problem.

It happens because of the bold '1'. MathCAT has code that tries to determine whether a number has block separators, and that code appears to assume the digits are all ASCII. The length of the bold '1' (which is 4 bytes) triggers the block-separator check, but slicing at '3' to look for a block separator causes the problem.
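For anyone following along, here is a minimal sketch of the general failure mode, not MathCAT's actual code. The bold digit `𝟏` (U+1D7CF) is a single character but occupies 4 bytes in UTF-8, so `len()` reports 4 and a byte-index slice such as `&s[..1]` panics with the "not a char boundary" message. The helper names `prefix_chars` and `split_at_byte_checked` are hypothetical and exist only for illustration.

```rust
/// Hypothetical helper: returns the first `n` characters (not bytes) of `s`.
/// It never panics, because the cut point comes from `char_indices`,
/// which only yields valid char boundaries.
fn prefix_chars(s: &str, n: usize) -> &str {
    match s.char_indices().nth(n) {
        // Byte offset where character number `n` starts: a valid boundary.
        Some((byte_idx, _)) => &s[..byte_idx],
        // `s` has `n` characters or fewer, so the whole string is the prefix.
        None => s,
    }
}

/// Hypothetical helper: splits `s` at a byte index, returning `None`
/// instead of panicking when the index is out of range or falls
/// inside a multi-byte character.
fn split_at_byte_checked(s: &str, byte_idx: usize) -> Option<(&str, &str)> {
    if s.is_char_boundary(byte_idx) {
        Some(s.split_at(byte_idx))
    } else {
        None
    }
}
```

With these, `"\u{1D7CF}".len()` is 4 while `"\u{1D7CF}".chars().count()` is 1, and `split_at_byte_checked("\u{1D7CF}", 1)` returns `None` where `&s[..1]` would panic. Code that must index by bytes can also use `str::get`, which returns an `Option` rather than panicking.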

I'll get a fix into the next build and add a test for this case.

NSoiffer commented 7 months ago

I was incorrect about the location -- it is in the code for speaking ordinal numbers. There is a separate bug (#261) for canonicalization.

I have a fix for this bug that I will commit shortly.