utf8 characters in M2 and Emacs

mahrud commented 4 years ago

Is there any interest in using utf8 (or even unicode) characters more in Emacs? Recent versions of Emacs as well as most terminal emulators support unicode (not sure since which version). M2 itself also supports utf8 characters for variable names:

S = QQ[θ]; -- in Emacs use C-x 8 [Enter] 03B8 [Enter] to type θ
gens S
θ^2
S_0^2

Entering them is not super easy, but for instance S_0 is an easy shortcut. I don't think allowing \theta as a variable name would be possible.

As an example of an alternative way, some Sage functions have a latex_name option, which just tells Sage to print the LaTeX name of the variable or function when using latex(variable). Example:

sage: function('riemann', latex_name="\\mathcal{R}")
riemann
sage: latex(riemann(x))
\mathcal{R}\left(x\right)

We could have a similar option so that on viewers supporting utf8 we can print the utf8 character and on viewers supporting LaTeX, for instance in the documentation or the interactive shell, use the LaTeX code and let MathJax or KaTeX do the conversion.

This is somewhat related to #522
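A rough sketch of the latex_name idea above, in Python rather than M2. The table LATEX_NAMES and the function render_symbol are hypothetical names made up for illustration; nothing like them is claimed to exist in M2 or Sage.

```python
# Hypothetical lookup table: maps a UTF-8 variable character to LaTeX code.
# Both the table and the function name are illustrative assumptions.
LATEX_NAMES = {"θ": r"{\theta}", "ξ": r"{\xi}"}


def render_symbol(name: str, target: str) -> str:
    """Return `name` unchanged for UTF-8 viewers, or with known
    characters replaced by LaTeX code when target == "latex"."""
    if target == "latex":
        return "".join(LATEX_NAMES.get(ch, ch) for ch in name)
    return name


print(render_symbol("θ", "utf8"))
print(render_symbol("θ", "latex"))
```

The braces around each LaTeX name keep the macro from running into whatever character follows it.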

DanGrayson commented 4 years ago

The Agda input method, available as part of the agda package in homebrew, allows you to type \theta to get θ. If I use that with M2 I can easily use that in commands:


i1 : R = QQ[θ]

o1 = R

o1 : PolynomialRing

i2 : θ^6

       6
o2 = θ

o2 : R

i3 : θ + 1

o3 = θ + 1

o3 : R

The standard emacs function insert-char, available on the key sequence C-x 8 RET, allows you to type the unicode name of any unicode character to get it. These are the ones involving "theta" in the name:

GREEK CAPITAL LETTER THETA (Θ)
GREEK CAPITAL THETA SYMBOL (ϴ)
GREEK SMALL LETTER SCRIPT THETA (ϑ)
GREEK SMALL LETTER THETA (θ)
GREEK THETA SYMBOL (ϑ)
MATHEMATICAL BOLD CAPITAL THETA (𝚯)
MATHEMATICAL BOLD CAPITAL THETA SYMBOL (𝚹)
MATHEMATICAL BOLD ITALIC CAPITAL THETA (𝜣)
MATHEMATICAL BOLD ITALIC CAPITAL THETA SYMBOL (𝜭)
MATHEMATICAL BOLD ITALIC SMALL THETA (𝜽)
MATHEMATICAL BOLD ITALIC THETA SYMBOL (𝝑)
MATHEMATICAL BOLD SMALL THETA (𝛉)
MATHEMATICAL BOLD THETA SYMBOL (𝛝)
MATHEMATICAL ITALIC CAPITAL THETA (𝛩)
MATHEMATICAL ITALIC CAPITAL THETA SYMBOL (𝛳)
MATHEMATICAL ITALIC SMALL THETA (𝜃)
MATHEMATICAL ITALIC THETA SYMBOL (𝜗)
MATHEMATICAL SANS-SERIF BOLD CAPITAL THETA (𝝝)
MATHEMATICAL SANS-SERIF BOLD CAPITAL THETA SYMBOL (𝝧)
MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL THETA (𝞗)
MATHEMATICAL SANS-SERIF BOLD ITALIC CAPITAL THETA SYMBOL (𝞡)
MATHEMATICAL SANS-SERIF BOLD ITALIC SMALL THETA (𝞱)
MATHEMATICAL SANS-SERIF BOLD ITALIC THETA SYMBOL (𝟅)
MATHEMATICAL SANS-SERIF BOLD SMALL THETA (𝝷)
MATHEMATICAL SANS-SERIF BOLD THETA SYMBOL (𝞋)
MODIFIER LETTER SMALL THETA (ᶿ)
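For comparison, the same name-based lookup can be sketched outside Emacs with Python's standard unicodedata module. The helper names_containing is an illustrative name, and since it only sees the official character names in Python's Unicode database, its result may differ from the list Emacs shows above.

```python
import unicodedata

# Look a character up by its Unicode name, as insert-char does.
theta = unicodedata.lookup("GREEK SMALL LETTER THETA")
print(theta, hex(ord(theta)))
print(unicodedata.name(theta))


def names_containing(word: str) -> list:
    """Collect the names of all code points whose name contains `word`."""
    found = []
    for cp in range(0x110000):
        name = unicodedata.name(chr(cp), "")  # "" for unnamed code points
        if word in name:
            found.append(name)
    return found


for name in names_containing("THETA"):
    print(name)
```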

mahrud commented 4 years ago

I'll look into the Agda input method!

Could M2 convert θ to $\theta$ when outputting html?

DanGrayson commented 4 years ago

That could be made to happen, but why bother? Browsers can display unicode, and $\theta$ will be displayed as is.

pzinn commented 4 years ago

there are still issues with utf8 support I believe. this is apparent already in the improper formatting of theta^6 above. a related issue:

i1 : width "aξbc"

o1 = 5

i2 : length "aξb"

o2 = 4
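The numbers above are consistent with width and length counting UTF-8 bytes rather than characters, since ξ takes two bytes. A small Python check of that arithmetic (the function name is illustrative, and this says nothing about how M2 computes the values internally):

```python
def byte_and_char_counts(s: str) -> tuple:
    """Return (number of UTF-8 bytes, number of code points) for s."""
    return len(s.encode("utf-8")), len(s)


for s in ("aξbc", "aξb"):
    n_bytes, n_chars = byte_and_char_counts(s)
    print(s, "bytes:", n_bytes, "characters:", n_chars)
```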

DanGrayson commented 4 years ago

For proper formatting, we need to know the width of the unicode character when it gets displayed. It is not always 1. Here is an example where they are about 1.7 characters wide:

i24 : "你好你好你好你好你好"
       +++++++++++++++++

A proper solution would involve interrogating the display to discover the width, but that would take time.

A work-around for determining the number of unicode characters in a string is this:

i25 : # utf8 "你好你好你好你好你好"

o25 = 10

It's an O(N) algorithm, so it's not so bad, but who needs that number for anything?
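A minimal sketch of such an O(N) count, written in Python and not taken from the M2 source: every code point in valid UTF-8 starts with exactly one byte that is not a continuation byte (bit pattern 10xxxxxx), so it is enough to count those. The function name is an assumption, and the input is assumed to be valid UTF-8.

```python
def count_code_points(data: bytes) -> int:
    """Count code points in valid UTF-8 by skipping continuation bytes."""
    # Continuation bytes have the top two bits equal to 10.
    return sum(1 for b in data if (b & 0xC0) != 0x80)


sample = "你好你好你好你好你好".encode("utf-8")
print(len(sample), count_code_points(sample))
```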

DanGrayson commented 4 years ago

Here's the way it looks in emacs:

pzinn commented 4 years ago

The fact remains, at the moment the formatting is based on width (or length, I forget). it will go wrong:

i1 : R=QQ[a,ξ]

o1 = R

o1 : PolynomialRing

i2 :  I=ideal(a^2+ξ^2,a+31)

             2     2
o2 = ideal (a  + ξ , a + 31)

o2 : Ideal of R

i3 :  netList I_*

     +--------+
     | 2     2|
o3 = |a  + ξ |
     +--------+
     |a + 31  |
     +--------+

In a fixed-width font, shouldn't one be able to fix that satisfactorily?

DanGrayson commented 4 years ago

Is your proposal to guess that every unicode character has a width of 1 on the screen? If so, that might be a good stop-gap measure until we can determine the width of all the characters accurately on all output devices, for at least it would work for some of them.

One way to do that would be to change the routine "netWidth" in d/actors5.d to return the maximum number of utf8 characters in the rows. A new routine for computing the number of utf8 characters in a string could be modeled after the routine "utf8(y:Expr):Expr" in d/actors4.d, which converts a string (sequence of bytes) to a list of integers representing the unicode points in the string.
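The stop-gap can be sketched in Python as follows. This is not the code in d/actors5.d; net_width and pad_rows are illustrative names, and the one-column-per-character rule is exactly the guess described above.

```python
def net_width(rows: list) -> int:
    """Width of a net as the maximum number of code points in any row,
    assuming every character occupies one column."""
    return max((len(row) for row in rows), default=0)


def pad_rows(rows: list) -> list:
    """Right-pad every row with spaces to the common net width."""
    width = net_width(rows)
    return [row + " " * (width - len(row)) for row in rows]


rows = ["a + ξ", "a + 31"]
for row in pad_rows(rows):
    print("|" + row + "|")
```

Padding by byte count instead would leave the row containing ξ one column short of the others, which is the kind of misalignment visible in the netList output earlier in the thread.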

At top level, we would distinguish more between objects of class String, which would be regarded still as sequences of bytes, and objects of class Net, which are destined to be displayed on the screen, even though currently String is a type of Net in the hierarchy. Various spots in the top level formatting code that expect the width of a string to equal the width of the net that would result from it would have to be fixed to get formatting to work again.

But, what is the long-term solution? Here is an experiment that shows there may be none:

These are screen shots showing two states of the same emacs buffer before and after running M-x text-scale-adjust. The ratio between Chinese character widths and Roman character widths is not a constant independent of size. So asking the display for the ratio may be fruitless.

pzinn commented 4 years ago

the situation with asian characters does seem complicated. first, it's not clear to me monospace fonts exist, and second, there seems to exist a distinction between narrow and wide characters. however for characters such as greek characters (which occur more frequently in math), the situation is much simpler, they take exactly 1 space in monospace fonts.
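The narrow/wide distinction is recorded in the Unicode East Asian Width property, which Python exposes as unicodedata.east_asian_width. Here is a heuristic sketch (display_width is an illustrative name) that counts wide and fullwidth characters as two columns, combining marks as zero, and everything else, including Greek letters, as one. It is only an estimate: as the screenshots discussed above suggest, the rendered width ultimately depends on the font.

```python
import unicodedata


def display_width(text: str) -> int:
    """Estimate terminal columns: 2 for wide/fullwidth characters,
    0 for combining marks, 1 for everything else."""
    total = 0
    for ch in text:
        if unicodedata.combining(ch):
            continue  # combining marks are drawn over the previous character
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total


for s in ("aξbc", "你好"):
    print(s, display_width(s))
```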

DanGrayson commented 4 years ago

It's the same for Russian:

pzinn commented 4 years ago

on my to-do list.

DanGrayson commented 4 years ago

What are you intending to do? Shall we assign you to this issue or make a new one?

pzinn commented 4 years ago

You suggested modifying netWidth in d/actors5.d. I'd like to give it a try (unless someone else volunteers!). Yes, you can assign me to this issue.

DanGrayson commented 4 years ago

Okay. This is the sort of thing that requires testing, so I suggest making all the documentation for all the packages and running all the tests.

DanGrayson commented 4 years ago

And thanks!

mahrud commented 2 months ago

I just noticed this:

i1 : width "A\tB"

o1 = 3

i2 : << "A\tB";
A   B

i3 : width net "A\tB"

o3 = 9

pzinn commented 2 months ago

A related issue with \t is the fact that somewhere in the d code its width is hardcoded due to the following lines in stdiop.d:

     else if c == int('\t') then (
      o.column = ushort(((int(o.column)+8)/8)*8);
      )

I'd very much like to remove those lines since they mess up code positioning; emacs somehow magically fixes it on the fly, as explained to me by @d-torrance, so removing these lines would require doing something at the level of emacs:

Ah, just read some source code and found a nice solution. There's a variable in Emacs exactly for this sort of thing! I think if we set compilation-error-screen-columns to nil in M2-comint-mode, it should work.

(again quoting @d-torrance)
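The arithmetic in the stdiop.d lines quoted above advances the column to the next multiple of 8. A Python transcription (function names are illustrative) shows it is consistent with the value 9 reported for width net "A\tB" earlier: A moves the column to 1, the tab to 8, and B to 9. Whether width net actually goes through this code path is an assumption here, not something confirmed in the thread.

```python
def next_tab_stop(column: int) -> int:
    """Same arithmetic as the stdiop.d snippet: next multiple of 8."""
    return ((column + 8) // 8) * 8


def end_column(text: str, start: int = 0) -> int:
    """Column reached after writing text, with tabs jumping to tab stops."""
    col = start
    for ch in text:
        col = next_tab_stop(col) if ch == "\t" else col + 1
    return col


print(end_column("A\tB"))
```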

Macaulay2 / M2

utf8 characters in M2 and Emacs #1069