ascherer / cwebbin

Literate Programming in C/C++
https://github.com/ascherer/cweb
MIT License

Unicode #48

Open texdraft opened 3 years ago

texdraft commented 3 years ago

There are a couple open issues about UTF-8 and Unicode. I was going to write this as a comment on one of them, but I wanted to make a new issue to address Unicode support in general.

(I'm happy to begin working on Unicode implementation, as soon as the issues mentioned below are discussed.)


I have been contemplating what it would take to integrate Unicode into CWEB.

There are several things to consider. I am assuming that UTF-8 is the only input/output encoding that need be supported.

What should the internal representation of characters be?

The programs often advance to the next character in a string by incrementing a pointer by 1. If UTF-8 is chosen as the internal representation, then all such increments will have to be adjusted to compensate. Using UTF-32 would avoid this problem.

In summary, storing characters in UTF-32 form takes up more space, forces encoding/decoding, and requires altering most declarations related to characters; storing characters in UTF-8 saves space and allows declarations to remain unchanged, but most operations on characters would have to be changed.
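To make the tradeoff concrete, here is a rough sketch (not CWEB code, just an illustration, with a made-up name) of the kind of helper that every byte-wise pointer increment over multibyte text would turn into if UTF-8 were kept internally. With UTF-32 internally, this decoding would happen once on input instead of at every access.

    #include <stdint.h>

    /* Sketch only: decode one UTF-8 sequence starting at *p into a 32-bit
       code point and advance *p past it.  Overlong, truncated, and
       otherwise invalid sequences are not checked. */
    static uint32_t next_code_point(const unsigned char **p)
    {
        const unsigned char *s = *p;
        uint32_t c;
        int extra;
        if (s[0] < 0x80)      { c = s[0];        extra = 0; }
        else if (s[0] < 0xE0) { c = s[0] & 0x1F; extra = 1; }
        else if (s[0] < 0xF0) { c = s[0] & 0x0F; extra = 2; }
        else                  { c = s[0] & 0x07; extra = 3; }
        *p = s + 1;
        while (extra-- > 0) {
            c = (c << 6) | (**p & 0x3F);
            (*p)++;
        }
        return c;
    }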

Encoding or decoding could happen at the following points:

It might be easier to do encoding/decoding manually, not by trying to use any of C's “wide character” facilities. (Frankly, I find them obnoxious. Also, many uses of C input/output functions would have to be changed.)

One good thing about UTF-8 is that it is quite naturally expressed in octal, so CWEB's preference could be maintained through the transition.
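A tiny illustration of what I mean (the example character is my own choice):

    /* UTF-8 for U+00FC ("ü") is the byte pair 0xC3 0xBC, i.e. octal 303 274.
       In octal the byte classes are visible at a glance: continuation bytes
       are \200..\277, two-byte lead bytes \300..\337, three-byte leads
       \340..\357, four-byte leads \360..\367. */
    const char u_umlaut[] = "\303\274";  /* the same bytes as a UTF-8 "ü" */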

Unicode character data

In any case, the hardest part about supporting Unicode beyond simple encoding and decoding is dealing with the Unicode character database. Unicode 13.0 assigns (gives meaning to) 143 859 out of 1 114 112 possible code points. Every character has many properties that describe it.

Unicode distributes a bunch of plain text files that contain the property data for all characters. Unfortunately, there is no file that consolidates all information into one place, except for the Unicode XML database.

I'm going to ignore the task of reading the data in for now. The more interesting problem is this: How do we store information about every character? A full implementation of Unicode would be forced to have a way to get the value of any property, but CWEB needs only a limited set.

Width.

CWEB's error reporting routine indicates the current position in the buffer by printing it out like this:

first part of line
                  second, unread part of line

The problem is that the code assumes that all characters occupy the same amount of horizontal space. In reality, some characters have no width, some are wider than one column, etc. The amount of effort it would take to get this correct probably far outweighs the utility of the feature. But it's certainly possible; GCC handles cursor position in Unicode input just fine.
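To be fair, the loop itself would be small once the width data existed; the effort is all in building that data. A sketch of the idea, where cw_column_width is a hypothetical lookup (derived from East_Asian_Width and the combining class) that does not exist in CWEB, and next_code_point is the decoder sketched earlier:

    extern uint32_t next_code_point(const unsigned char **p);
    extern int cw_column_width(uint32_t c);  /* hypothetical: 0 for combining
                                                marks, 2 for wide characters,
                                                1 otherwise */

    /* Sketch: count the terminal columns occupied by the already-printed
       part of the buffer, advancing by code points rather than by bytes. */
    static int printed_columns(const unsigned char *start, const unsigned char *limit)
    {
        int cols = 0;
        while (start < limit)
            cols += cw_column_width(next_code_point(&start));
        return cols;
    }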

Transliteration.

For CTANGLE, we must be able to associate some string of text with a character, defining its transliteration. All that's needed is a char *.

C99 and C++98 added a syntactic feature called a “universal character name”, which is basically a four- or eight-digit hexadecimal character code embedded in regular source text. For example, a\u200Bb gives you ab, where the two characters are separated by a zero-width space. According to Annex D of the C standard and lex.name.allowed in the C++ standard, this is a perfectly valid identifier. However, both languages prohibit many characters from appearing as universal character names in identifiers. It is tempting to change CTANGLE's default transliteration to insert an equivalent universal character name, but the restrictions complicate matters.
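Ignoring those restrictions for a moment, the mechanical part of such a default transliteration would be trivial; the following is a hypothetical sketch, not a proposal for the final form.

    #include <stdio.h>
    #include <stdint.h>

    /* Hypothetical sketch of a default transliteration: spell a code point
       as a universal character name, \uXXXX inside the BMP and \UXXXXXXXX
       above it.  A real version would still have to respect the identifier
       restrictions of Annex D / lex.name.allowed. */
    static void ucn_spelling(uint32_t c, char buf[11])
    {
        if (c <= 0xFFFF)
            sprintf(buf, "\\u%04lX", (unsigned long) c);
        else
            sprintf(buf, "\\U%08lX", (unsigned long) c);
    }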

Normalization.

Some strings of Unicode characters are effectively identical while not being exactly (i.e., numerically) equal. For example, a precomposed character like “ü” (U+00FC LATIN SMALL LETTER U WITH DIAERESIS) should usually be treated identically to its decomposed counterpart “ü” (U+0075 LATIN SMALL LETTER U and U+0308 COMBINING DIAERESIS).

Therefore Unicode defines (in UAX 15) a process of normalization, which converts strings to a canonical form. There are four normalization forms (NFD, NFC, NFKD, and NFKC), depending on whether you tend towards decomposing characters or towards composing them, and on how you handle compatibility characters.

Several properties are associated with normalization, including Canonical_Combining_Class (a nonnegative integer below 256), Decomposition_Type (one of sixteen values), and Decomposition_Mapping (a string of at most eighteen code points).

It would probably be best for CWEB to normalize all strings before entering them into the character/byte memory.
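For a sense of scale: the reordering step of normalization (the Canonical Ordering Algorithm of UAX 15) is short once the combining classes are available; the decomposition and recomposition steps around it are where the real bulk lies. In the sketch below, cw_ccc stands for a lookup into the stored property data and does not exist yet.

    #include <stdint.h>
    #include <stddef.h>

    extern int cw_ccc(uint32_t c);  /* hypothetical Canonical_Combining_Class lookup */

    /* Sketch of the Canonical Ordering step from UAX 15: within each run of
       characters whose combining class is nonzero, sort by combining class
       with a stable exchange sort.  Applying the decomposition mappings
       (and, for NFC, recomposing) would come before and after this step. */
    static void canonical_order(uint32_t *s, size_t n)
    {
        int swapped = 1;
        while (swapped) {
            size_t i;
            swapped = 0;
            for (i = 1; i < n; i++) {
                int a = cw_ccc(s[i - 1]), b = cw_ccc(s[i]);
                if (b > 0 && a > b) {  /* never reorder across a starter */
                    uint32_t t = s[i - 1]; s[i - 1] = s[i]; s[i] = t;
                    swapped = 1;
                }
            }
        }
    }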

Identifiers.

If we want “extended characters” to be allowed in identifiers, we need to know exactly which code points can begin an identifier and which code points can continue an identifier. Luckily there are properties just for this, thanks to UAX 31. Specifically, if a character has the property XID_Start, it can begin an identifier, and if a character has the property XID_Continue, it can be a part of an identifier.

(There are also ID_Start and ID_Continue. The X variants are defined so that identifier status is stable under normalization: normalizing an identifier cannot turn it into a non-identifier.)
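Scanning an identifier would then look something like the sketch below, where cw_xid_start and cw_xid_continue are hypothetical property lookups and next_code_point is the decoder sketched earlier.

    #include <stdint.h>

    extern int cw_xid_start(uint32_t c);     /* hypothetical property lookups */
    extern int cw_xid_continue(uint32_t c);
    extern uint32_t next_code_point(const unsigned char **p);

    /* Sketch: recognize one UAX 31 identifier starting at *p and leave *p
       just past its last code point; returns 0 if no identifier starts here.
       Assumes the buffer ends with a byte that is not XID_Continue. */
    static int scan_identifier(const unsigned char **p)
    {
        const unsigned char *q = *p;
        if (!cw_xid_start(next_code_point(&q))) return 0;
        *p = q;
        for (;;) {
            q = *p;
            if (!cw_xid_continue(next_code_point(&q))) return 1;
            *p = q;
        }
    }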

Collation.

Here's the big one. The entirety of CWEAVE's Phase III is devoted to sorting and outputting an index. Sorting the index involves putting names in order, according to a collating sequence; in the current version of CWEAVE, the collation is represented by the collate array. Unicode collation is much more complex, due to the expanded character set.

Full details of the Unicode collation algorithm can be found in UTS 10. It is based on four levels of comparison between strings. The specification requires that strings be normalized before comparison.

Collation needs a collation element table to work. The Default Unicode Collation Element Table (DUCET) is, like the rest of the Unicode data, distributed as a plain text file (allkeys.txt). In the DUCET, only three of the four levels of comparison are used, in order to allow implementations to extend the order for whatever internal reason. Other collation element tables exist for specific languages or conventions.
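Once a normalized name has been mapped to its sequence of collation elements (the genuinely hard, table-driven part), the comparison itself is a level-by-level loop. The sketch below assumes the four-weight element proposed further down and skips zero ("ignorable") weights; it is only an outline, not a complete UTS 10 implementation.

    #include <stdint.h>
    #include <stddef.h>

    typedef struct { uint16_t a, b, c, d; } collation_element;  /* as proposed below */

    static uint16_t weight(const collation_element *e, int level)
    {
        switch (level) {
        case 0: return e->a;
        case 1: return e->b;
        case 2: return e->c;
        default: return e->d;
        }
    }

    /* Compare two collation-element sequences at one level, skipping zero
       ("ignorable") weights; a sequence that runs out first compares lower. */
    static int compare_level(const collation_element *x, size_t nx,
                             const collation_element *y, size_t ny, int level)
    {
        size_t i = 0, j = 0;
        for (;;) {
            uint16_t wx = 0, wy = 0;
            while (i < nx && wx == 0) wx = weight(&x[i++], level);
            while (j < ny && wy == 0) wy = weight(&y[j++], level);
            if (wx != wy) return wx < wy ? -1 : 1;
            if (wx == 0)  return 0;   /* both exhausted at this level */
        }
    }

    /* Sketch of the multi-level comparison of UTS 10: all primary weights
       first, then all secondary weights, and so on. */
    static int collate_compare(const collation_element *x, size_t nx,
                               const collation_element *y, size_t ny)
    {
        int level, r;
        for (level = 0; level < 4; level++)
            if ((r = compare_level(x, nx, y, ny, level)) != 0) return r;
        return 0;
    }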

Storing the data.

In general, we want a way to map a twenty-one-bit number (probably held in a thirty-two-bit integer) to some data structure containing the character properties we are interested in. Storing all the needed information straightforwardly in a statically-allocated array would occupy about 45 megabytes on a sixty-four-bit system. I'm counting

  1. The transliteration (char *)
  2. Canonical combining class (uint8_t)
  3. Decomposition type (short)
  4. Decomposition mapping (char * or code_point * depending on the internal representation of characters)
  5. XID start (bool)
  6. XID continue (bool)
  7. Collation element (struct { uint16_t a, b, c, d; })

We would have the transliteration string be NULL if no transliteration was given; then CTANGLE would compute it automatically.

I think that more attributes must be stored for normalization, so 45 megabytes is really a lower bound.

There are many ways of compressing this, of course. Full Unicode implementations typically use a kind of trie for looking up properties, because the entire set of properties for a single character takes up a lot of space. Compression is also possible because long runs of characters tend to share properties.
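For concreteness, the usual scheme is a two-stage table: the code point's high bits select a 256-entry block, identical blocks are stored only once, and each block entry points at one of the deduplicated property records. The names and sizes below are hypothetical; the tables would be generated offline from the Unicode files.

    #include <stdint.h>

    #define BLOCK_SHIFT 8
    #define BLOCK_SIZE  (1 << BLOCK_SHIFT)

    struct char_props;  /* the record enumerated above */

    /* Hypothetical generated tables: one index entry per 256-code-point
       block, shared blocks of record numbers, and the distinct records. */
    extern const uint16_t          block_index[0x110000 >> BLOCK_SHIFT];
    extern const uint16_t          block_data[][BLOCK_SIZE];
    extern const struct char_props property_records[];

    /* A pair of table lookups maps any code point to its property record;
       long runs of identical properties collapse into shared blocks. */
    static const struct char_props *lookup(uint32_t c)
    {
        uint16_t block  = block_index[c >> BLOCK_SHIFT];
        uint16_t record = block_data[block][c & (BLOCK_SIZE - 1)];
        return &property_records[record];
    }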

Since CWEAVE doesn't do transliteration, and since CTANGLE doesn't do collation, the two areas of storage could be put into a union.

Actually getting the data.

I glossed over this earlier, but it's important. How can CWEB read the character information into memory? There is far too much to compile directly into the programs; should it be read at initialization? Ideally we could do what TeX does and save the program's state after initialization, but I'm not sure if there is a good, portable way.

The property information we want is found in the files UnicodeData.txt, DerivedCoreProperties.txt, allkeys.txt, and DerivedNormalizationProps.txt. Thus if CWEAVE or CTANGLE is starting up from scratch, it must read in four very large files.

Alternatively, we could write a program to extract only the relevant data from the relevant files and write it in an especially compact form to a new file, which would be read by CWEAVE and CTANGLE. I think that the most recent version of such a file should be distributed with CWEB, but I can certainly see arguments to the contrary.

[The program could be a more general utility (serving as another example of CWEB) that creates a compressed file containing a specified set of properties for each character. For instance, you might want to know only the names and aliases of characters; you could run the program, enter “name,alias”, and it would output a file accordingly.]


Or use a library.

I'm against this option. One of CWEB's appeals is that it is very easy to set up. It has no dependencies except on the C standard library; all you need is a C compiler to run CWEB. Existing Unicode implementations are bulky and annoying, and they wouldn't fit in with the rest of CWEB.

ascherer commented 3 years ago

I'll study your text ASAP and call back.

ascherer commented 3 years ago

@igor-liferenko has written UTF-8 code for CWEB. Details in TUGboat.

KishkinJ10 commented 3 years ago

Can I contribute to this issue?

ascherer commented 3 years ago

Why not? It's open source.

At the moment, however, I have no time to dig into this issue. I'm still processing the contributions in cweb-4.3-dev and cweb-4.3. CTWILL doesn't produce a nice and fluent rendering of itself just yet.

igor-liferenko commented 3 years ago

Hi

The UTF-8 implementation is in files comm-utf8.ch, cweav-utf8.ch, ctang-utf8.ch, ASCII.w and mapping.w

https://github.com/igor-liferenko/cweb

(the last change in ctang-utf8.ch assumes that your compiler supports UTF-8)

KishkinJ10 commented 3 years ago

I am a beginner. Can you guide me on where I should start?

ascherer commented 3 years ago

First, run make; make cautiously; make fullmanual, then read the four .dvi files.

Second, read the changes for “TeX Live”, i.e., the *-changes.pdf files.

texdraft commented 3 years ago

@igor-liferenko Is collation (according to the Unicode collation algorithm) still an open question?

igor-liferenko commented 3 years ago

Simply store the characters in the xchr array in the necessary order.

igor-liferenko commented 3 years ago

It is in file mapping.w

ascherer commented 2 years ago

I haven't called back, sorry. I only recently addressed a very small part of the whole UTF-8 conundrum in CWEB and closed issues #8 and #42.

ascherer commented 11 months ago

CWEB now runs with LuaTeX. Time permitting, I'll try to write a set of \font replacements in cwebmac.tex with Latin Modern.