Mathics3 / mathics-scanner

Tokenizer, character tables, operator precedence, and conversion routines for the Wolfram Language.
GNU General Public License v3.0

The parser should convert NamedCharacters into wl-code as before (1.2.4), not in unicode-equivalent #42

Closed mmatera closed 2 years ago

mmatera commented 2 years ago

After the last release, the behavior of MathicsScanner changed: named characters in strings are now mapped to their unicode-equivalent instead of to wl-code, as they were before. After fighting with the formatter code in mathics-core, I think this behavior is wrong. The reason is that the goal of having a unicode-equivalent is to provide readable output, not to have an efficient way to store characters.

The problem shows up with "\[DifferentialD]". In 1.2.4, this string was parsed as "\uf74c", a WL-specific character with a specific meaning. If the string has a form like "\[Integral]F[x]\[DifferentialD] x", it can be parsed afterward as the expression Integrate[F[x], x]. On the other hand, if we want to produce a printable version, \[DifferentialD] could be converted into d, or \U0001D451, or \, d, depending on where we need it.
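To make the distinction concrete, here is a minimal sketch of the two mappings. The table and both function names are hypothetical and written only for this illustration; they are not the mathics-scanner API. The code points are the WL private-use code U+F74C and the standard Unicode character U+2146 for \[DifferentialD].

```python
import re

# Hypothetical two-entry table for illustration only (not the mathics-scanner tables).
# "wl" is the code point WL uses internally; "unicode" is a readable equivalent.
NAMED_CHARACTERS = {
    "DifferentialD": {"wl": "\uf74c", "unicode": "\u2146"},
    "Integral": {"wl": "\u222b", "unicode": "\u222b"},
}

# Matches named-character escapes such as \[DifferentialD].
NAMED_CHAR_RE = re.compile(r"\\\[([A-Za-z]+)\]")


def named_to_wl(text: str) -> str:
    """Replace \\[Name] escapes with the WL code (the 1.2.4 behavior described above)."""
    def repl(match):
        entry = NAMED_CHARACTERS.get(match.group(1))
        # Unknown names are left untouched.
        return entry["wl"] if entry else match.group(0)
    return NAMED_CHAR_RE.sub(repl, text)


def wl_to_printable(text: str) -> str:
    """Replace WL codes with their readable Unicode equivalents, for output only."""
    wl_to_unicode = {v["wl"]: v["unicode"] for v in NAMED_CHARACTERS.values()}
    return "".join(wl_to_unicode.get(ch, ch) for ch in text)


scanned = named_to_wl(r"\[Integral]F[x]\[DifferentialD] x")
printable = wl_to_printable(scanned)
```

Here `scanned` still carries U+F74C, so a later parsing step can recognize the differential, while `printable` is only meant for display.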

With the current behavior in master, the test/format/test_format.py tests in mathics-core fail.

rocky commented 2 years ago

Investigating this, this appears to be a Boxing issue, not a scanning or parsing issue. The scanner allows three kinds of input:

As an M-Expression, this is properly turned into an Infix operator.

It is then the formatter's job, not the scanner's or the parser's, to take this correctly tagged node together with the current $CharacterEncoding value and turn it into the right symbol. Possibly a function similar to FromCharacterCode[] can be used here. The problem with FromCharacterCode[] is that we need to convert a named character into the right code based on $CharacterEncoding. We can add a special ASCII operator to WMA or standard Unicode if need be. However, we just need to find the right sequence in WMA-speak to get this done.
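A rough sketch of what such a format-stage step might look like, assuming the scanner leaves the WL code in place. The tables and the `format_for_encoding` function are hypothetical, invented for this illustration; only the encoding names "ASCII" and "UTF-8" are real $CharacterEncoding values.

```python
# Hypothetical tables for illustration only; not taken from mathics-core.
WL_CODE_TO_NAME = {"\uf74c": "DifferentialD"}
ASCII_EQUIVALENT = {"\uf74c": "d"}
UNICODE_EQUIVALENT = {"\uf74c": "\u2146"}


def format_for_encoding(text: str, character_encoding: str) -> str:
    """Render WL-specific characters according to a $CharacterEncoding-like value."""
    out = []
    for ch in text:
        name = WL_CODE_TO_NAME.get(ch)
        if name is None:
            # Ordinary character: pass through unchanged.
            out.append(ch)
        elif character_encoding == "ASCII":
            # Prefer an ASCII stand-in, else fall back to the \[Name] escape.
            out.append(ASCII_EQUIVALENT.get(ch, "\\[" + name + "]"))
        elif character_encoding == "UTF-8":
            out.append(UNICODE_EQUIVALENT.get(ch, ch))
        else:
            # Unknown encoding: keep the WL code as-is.
            out.append(ch)
    return "".join(out)


ascii_form = format_for_encoding("\uf74c x", "ASCII")
utf8_form = format_for_encoding("\uf74c x", "UTF-8")
```

The point of the sketch is that the choice of glyph is deferred until the encoding is known, so the scanner and parser never have to guess it.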