Mathics3 / mathics-scanner

Tokenizer, and character tables, operator precedence, and conversion routines for the Wolfram Language.
GNU General Public License v3.0

revert wl-code->unicode-equivalent in parsing. Adding amstex tables #43

Closed mmatera closed 2 years ago

mmatera commented 2 years ago

This PR adds the table of amstex symbols, and fixes #42

rocky commented 2 years ago

Thanks for making the addition for amslatex.

As for the differential, I will look at this when I get a chance. I need to understand why the change was made. Note that the tables are used by a number of front ends, not just by mathics core.

rocky commented 2 years ago

Today I was running tests in mathics core and I am getting a failure in test/format/test_format.py:

test/format/test_format.py:687: AssertionError
_ test_makeboxes_text[Integrate[F[x], {x, a, g[b]}]-Subsuperscript[\u222b, a, g(b)]\u2062F(x)\u2062\uf74cx-form53-Non trivial SubsuperscriptBox] _

str_expr = 'Integrate[F[x], {x, a, g[b]}]'
str_expected = 'Subsuperscript[∫, a, g(b)]\u2062F(x)\u2062\uf74cx', form = <Symbol: System`TraditionalForm>
msg = 'Non trivial SubsuperscriptBox'

    @pytest.mark.parametrize(
        ("str_expr", "str_expected", "form", "msg"),
        mandatory_tests,
    )
    def test_makeboxes_text(str_expr, str_expected, form, msg):
        result = session.evaluate(str_expr)
        format_result = result.format(session.evaluation, form)
        if msg:
>           assert (
                format_result.boxes_to_text(evaluation=session.evaluation) == str_expected
            ), msg
E           AssertionError: Non trivial SubsuperscriptBox
E           assert 'Subsuperscri...2F(x)\u2062𝑑x' == 'Subsuperscri...\u2062\uf74cx'
E             - Subsuperscript[∫, a, g(b)]⁢F(x)⁢x
E             ?                                 ^
E             + Subsuperscript[∫, a, g(b)]⁢F(x)⁢𝑑x
E             ?                                 ^

And if I read this right, it is saying that it got the standard Unicode differential "d", but the WL differential "d" is what the test was coded to expect. So here, I am inclined to say the test expectation needs adjusting and the current mathics scanner behavior is the better one.
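To see exactly which characters differ, one can list the code points at the end of each string. This is only a minimal sketch using the two string tails from the assertion above; the helper name `describe` is made up for illustration.

```python
import unicodedata

# Tails of the two strings compared in the failing assertion.
expected_tail = "\u2062\uf74cx"      # what the test expects (WL private-use code point)
result_tail = "\u2062\U0001D451x"    # what the formatter actually returned


def describe(text):
    """Return (code point, Unicode name) pairs; unnamed characters show their category."""
    pairs = []
    for ch in text:
        fallback = f"<no name, category {unicodedata.category(ch)}>"
        pairs.append((f"U+{ord(ch):04X}", unicodedata.name(ch, fallback)))
    return pairs


for label, tail in (("expected", expected_tail), ("result", result_tail)):
    print(label, describe(tail))
```

The private-use character has no Unicode name, which is why it shows up as an apparently empty position in the pytest diff.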

Looking at where the behavior starts to differ, commit df948fe44df1999ab682d1e70c1541f6fb2dcb4a is the last "good" commit, and commit 8a2dc4a298d7afe542c8836ff07a2aeeb8c01575, "More letter-like work" (with massive amounts of changes), is the first "bad" commit, although in my view it is the other way around. "More letter-like work" outputs the Unicode-rendered "d", while before that we get the WL-rendered "d".

I don't want to deal with this now, having spent too much time on the weekend looking at this stuff, which is not what I had planned to do. But I did want to report what I have found.

mmatera commented 2 years ago

I found a similar problem some time ago (see #284 in Mathics-Core)

Probably, at the level of the pytests in mathics-core, the formatter tests should check just the ASCII-encoded (or maybe even ANSI-encoded) output. But for that, we should go over the expected behavior of mathics-scanner. When I have some time, I will try to write down how I think this should go. However, I would first like to finish the format refactor, and then tackle this other aspect.

rocky commented 2 years ago

I spent some time today going over the tables in https://github.com/Mathics3/mathics-scanner/pull/51

I am ready to answer any questions you have about this stuff. Note that in this PR the standard Unicode for DifferentialD is \U0001D451, and that is not the same thing as \u1d451.
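The difference between the two escapes can be checked directly in Python, where `\U` takes exactly eight hex digits and `\u` takes exactly four. A small sketch (the variable names are only for illustration):

```python
import unicodedata

long_escape = "\U0001D451"   # eight hex digits: the single character U+1D451
short_escape = "\u1d451"     # four hex digits are consumed: U+1D45, then the digit "1"

print(len(long_escape), len(short_escape))
print(unicodedata.name(long_escape))
print(short_escape == "\u1d45" + "1")
```

So writing `\u1d451` in a table or a test silently produces a two-character string rather than the differential "d".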

This works for me as is in mathics-core. However, there is work to be done in mathics-core to get operators to display properly, and this is the role of neither the scanner nor the parser.

I can't work on this much on Sunday.

Basically, the flow is that on input we should accept both WL Unicode and standard Unicode. In mathics core, after parsing, there is no Unicode or ASCII used per se for operators; we just have a node of the operator type, e.g. "And".
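As a rough illustration of that input side, every accepted spelling can be mapped to one character name before a node is built. This is only a sketch: the table `SPELLING_TO_NAME` and the function `name_for` are hypothetical and are not the mathics-scanner API. The DifferentialD code points are the ones from this thread, and U+2227 is the standard Unicode LOGICAL AND.

```python
# Hypothetical table: each accepted input spelling maps to one character name.
SPELLING_TO_NAME = {
    "\uf74c": "DifferentialD",      # WL private-use code point
    "\U0001D451": "DifferentialD",  # standard Unicode code point
    "&&": "And",                    # ASCII operator sequence
    "\u2227": "And",                # standard Unicode LOGICAL AND
}


def name_for(spelling):
    """Return the character name for an input spelling, or None if it is unknown."""
    return SPELLING_TO_NAME.get(spelling)


# Both spellings of the differential "d" end up as the same name.
print(name_for("\uf74c"), name_for("\U0001D451"))
```

After this step nothing downstream needs to know which spelling the user typed.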

Right now, in the formatting rules, we are picking out the object's "operator" string value, which is wrong. Instead, we should use the named character, which we can now get from the ASCII operator or from one of the character codes, and use $CharacterEncoding to turn this into the right kind of string.

mmatera commented 2 years ago

@rocky, thank you for investigating and fixing this.

During the week, I am going to try fixing this in mathics-core.

mmatera commented 2 years ago

Superseded by #51

rocky commented 2 years ago

During the week, I am going to try fixing this in mathics-core.

@mmatera Good. Here are current thoughts.

Based on https://mathematica.stackexchange.com/a/3628/73996 ,

if you have the name of the operator, let us say it is "LeftVector", then:

ToExpression["\"\\[LeftVector]\""]

will produce its (standard) Unicode equivalent. But we want to do that only when $CharacterEncoding is not "ASCII" (and what else?); when it is "ASCII", we just want the class "operator" value. "operator" is a vague name for "the ASCII character string sequence for the given operator".

Based on the Mathematica StackExchange answer, I suspect that there is no WMA built-in that does this kind of conversion based on $CharacterEncoding (or anything similar). So the conversion would have to be wrapped in a WMA If[], and then it would be good to create a non-WMA function for this.

But if we add a non-WMA built-in function to do this, then ToExpression[] is not needed. And the implementation can be much more efficient, since it doesn't go through the parser; instead it could just consult the conversion tables produced by the mathics_scanner project.

Note that, in contrast to what we do now, where operator conversion (e.g. an "And" node to "&&") is done once, when done properly it has to be done every time that operator is used. But the overhead can be driven down by using the standard Python cache decorator on a function call that is passed the operator (either name or ASCII sequence) and the current $CharacterEncoding value. (Note that in the actual call this value is known, so the cache would hold one entry for each pair of operator name and character encoding. But in reality, $CharacterEncoding wouldn't change much, if at all.)
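The caching idea could be sketched roughly as follows, using `functools.lru_cache` from the standard library. The table `OPERATOR_FORMS`, the function `operator_string`, and the choice of "UTF-8" as the non-ASCII encoding name are all assumptions made for illustration; a real version would presumably read the tables generated by mathics_scanner.

```python
from functools import lru_cache

# Hypothetical table: operator name -> string to emit for each character encoding.
OPERATOR_FORMS = {
    "And": {"ASCII": "&&", "UTF-8": "\u2227"},
    "Or": {"ASCII": "||", "UTF-8": "\u2228"},
}


@lru_cache(maxsize=None)
def operator_string(name, character_encoding):
    """Return the operator's string for this encoding, falling back to ASCII."""
    forms = OPERATOR_FORMS[name]
    return forms.get(character_encoding, forms["ASCII"])


# The lookup is repeated on every use, but only the first call per
# (name, encoding) pair does any work; later calls are served from the cache.
for _ in range(3):
    operator_string("And", "UTF-8")
print(operator_string.cache_info())
```

Because both arguments are plain strings, they are hashable and can be used directly as the cache key.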

If a new builtin is used (because there is no direct equivalent in WMA), then there is a question of what conversion table to use.

Above, I used the character name. However, our code currently records the ASCII operator equivalent. So such a function might want a conversion table from ASCII operator string to standard Unicode, and maybe a different one if going to WMA Unicode. But I don't understand how the distinction between the two would be specified.

mmatera commented 2 years ago

@rocky, thanks for the reference. I have added #52 to present my thoughts about this.