Interlisp / medley

The main repo for the Medley Interlisp project. Wiki and Issues are here. Other repositories include maiko (the VM implementation) and Interlisp.github.io (web site sources).
https://Interlisp.org
MIT License

XCCS mapping and extension #350

Closed: johnwcowan closed this issue 2 years ago

johnwcowan commented 3 years ago

There are two concurrent parts to this effort:

ecraven commented 3 years ago

Doesn't the wikipedia page (https://en.wikipedia.org/wiki/Xerox_Character_Code_Standard#XCCS_2.0) have a mapping to unicode? Or is that not all of XCCS 2.0? Also: do we have a copy of the actual standard document?

johnwcowan commented 3 years ago

Not even close, and with many errors.


ecraven commented 3 years ago

Is a copy of the standard available anywhere? Or do we have to "reconstruct" it based on the existing xccs 2.0 fonts and medley source code?

johnwcowan commented 3 years ago

There are now links to the 1.0 and 2.0 standards in the gist.


ecraven commented 3 years ago

I've played around with a bit of OCR and some manual work; I could do this for all the pages up to 4-28, if you want. http://ix.io/3teu

ecraven commented 3 years ago

Also, there seem to be quite a few errors in the headings. For example, page 4-19 has 364₈ = 254₁₀ = FE₁₆, but 364 octal is actually 244 decimal (F4 hex). This occurs several times. Are the code charts in the Appendix the authoritative source for the Character Set numbers?
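The discrepancy is easy to verify; a quick check of the arithmetic (Python, purely illustrative):

```python
# Verifying the heading arithmetic questioned above: octal 364 is
# decimal 244 (hex F4), not decimal 254 (hex FE) as the heading claims.
n = int("364", 8)
print(n, format(n, "X"))  # 244 F4
```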

nbriggs commented 3 years ago

As the reference for character set numbers, I would use the "Version 2.0 Set Allocation" table on page 185 (B-3): https://github.com/Interlisp/history/blob/master/1990s/XCCS/DEF-1999-00397_Character_Code_Standard_XNSS-059003_Jun-1990.pdf#page=185. In general, I would trust the octal in headings over the decimal or hex, since we almost entirely used octal when speaking of character sets. You'll notice that the font file names use the octal character set number.

masinter commented 3 years ago

I think the Issue here is constructing an "authoritative source". When you have something stable, we can request IANA to add it to the IANA registry. Confusingly, IANA uses "character set" for what we are calling "format" and what HTTP (and MIME) calls "charset", while we are (for the most part) using "character set" for subsets of XCCS codes numbered by the high-order 8-bits of the XCCS code.

For future readers, please make sure there is some context when you use these terms, e.g. by providing references. For completeness, "the gist" in this issue refers to John Cowan's XCCS 3.0 gist.

masinter commented 3 years ago

A link to the document you are citing page numbers for would be helpful. PDF supports "fragment identifiers" in the URL, so https://example.com/docname.pdf#page=185 links to page 185 (per RFC 8118).

johnwcowan commented 3 years ago

I'm wondering how control characters, especially newline, are represented in external XCCS files. The XCCS 2.0 book is silent on it, but I see two possibilities:

  1. If you are not in charset 0 already, you have to switch to charset 0 (using FF 00) in order to include them.
  2. You can use them regardless of which charset you are in. This is safe because bytes 00-1F and 7F are never assigned to characters, and bytes 01-1F and 7F are never used as character set selectors.

Probably the best way to find out is for someone to look at the logic for XCCS-format textual I/O.

masinter commented 3 years ago

Without looking at the code -- the answer must be 1. That's because the internal format uses "fat" (16-bit) characters for all strings; the run-coding is only an external optimization for mostly-ASCII text.

nbriggs commented 3 years ago

I agree with Larry -- it's 1. The external representation is the "stringlet", which can be 1-byte or 2-bytes for each character code (and can switch arbitrarily), see section 5 String Encoding -- but the character code for ASCII NL is 10 and you have to process the stringlet to get the character code before you decide what it is.

nbriggs commented 3 years ago

As they point out, 377 377 000 shifts into 16-bit mode, where NL would be 000 012 in a stream of other 2-byte codes.
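The run-coding rules quoted in this thread (FF n selects charset n; 377 377 000 shifts to 16-bit mode) can be sketched as a toy decoder. This is a hypothetical simplification written for illustration, not Medley's actual \INCCODE logic; the real stringlet encoding (section 5 of the standard) has more cases.

```python
# Hypothetical, simplified decoder for XCCS run-coded byte streams, based only
# on the rules quoted in this thread. Not Medley's actual implementation.

def decode_xccs(data: bytes) -> list[int]:
    """Return a list of 16-bit XCCS character codes."""
    codes = []
    charset = 0          # high byte of the current run
    wide = False         # True once 377 377 000 has shifted us to 16-bit mode
    i = 0
    while i < len(data):
        if not wide:
            b = data[i]
            if b == 0xFF:                        # charset-select escape
                if data[i + 1 : i + 3] == b"\xff\x00":
                    wide = True                  # 377 377 000: 16-bit mode
                    i += 3
                else:
                    charset = data[i + 1]        # FF <n>: switch to charset n
                    i += 2
            else:
                codes.append((charset << 8) | b)
                i += 1
        else:                                    # 2 bytes per character code
            codes.append((data[i] << 8) | data[i + 1])
            i += 2
    return codes
```

For example, `decode_xccs(bytes([0x41, 0xFF, 0x26, 0x10, 0xFF, 0xFF, 0x00, 0x00, 0x0A]))` yields an ASCII "A" (0041), a code from the selected charset (2610), and then a 16-bit-mode NL (000A).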

johnwcowan commented 3 years ago

Because of the way the characters are laid out in XCCS, there can be no other stray 10 bytes in the byte stream: 10 can appear only as 10 in a 1-byte stringlet or 00 10 in a 2-byte stringlet (both representing NL), because there is no charset 10, and in all charsets except 0 the character code 10 is unassigned. This was intended for compatibility with ISO 2022, a character encoding framework no longer in productive use.

This is different from UTF-16, where, for example, U+0110 would be encoded as 01 10 or 10 01 depending on the endianness. But in any case, I'll assume that control characters are only accessible from charset 0. This allows me to make use of 66 charsets that were architecturally reserved in XCCS 2.0, as well as to encode at least 67 new characters in every existing charset other than 0.
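The UTF-16 byte-order point can be checked directly with Python's codecs:

```python
# U+0110 serializes as bytes 01 10 in big-endian UTF-16 and as 10 01 in
# little-endian UTF-16, as described above.
be = "\u0110".encode("utf-16-be")
le = "\u0110".encode("utf-16-le")
print(be.hex(), le.hex())  # 0110 1001
```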

@nbriggs: Can you send me the mappings for charset 2 (Functions)? I assume that Meta is the same as charset 0: that is, since "A" is 0041, Meta-A is 0141. If this is not correct, let me know.

nbriggs commented 3 years ago

Do we have a copy of XCCS 3.0 (or whatever the Interpress fonts mean when they refer to XC1-3-3-0 rather than XC1-2-2-0)?

johnwcowan commented 3 years ago

Those appear to be XCCS 1.3.3.0 and XCCS 1.2.2.0 respectively, both of which were superseded by XCCS 2.0. There never was an XCCS 3.0 from Xerox, so I'm using that number for what I've been working on.

johnwcowan commented 3 years ago

By the way, the gist is now greatly enlarged, and thanks to the architectural changes, we see that there is enough space to do all the most important CJK characters, all the emoji, many if not all scripts used for modern languages, and plenty else.

rmkaplan commented 3 years ago

I think that adding to the XCCS/Unicode map is generally a good thing to do, but it probably won't have much impact on Medley or existing Medley files, in our current state. In particular, I would be surprised if there are any existing files, source or Tedit, that make use of characters that aren't already in the mappings that we have.

The reason is that I don't think that people would have used characters that our display fonts don't have glyphs for, and our fonts are certainly not complete. Classic may be the most complete, Terminal (fixed pitch) gives you lots of black boxes.

So a first practical increment would be to be able to use the Unicode mappings in such a way that we can make use of the larger variety of fonts that the OS comes equipped with, to rasterize into strike fonts and to extract hardcopy widths. Or to have subrs that would put the bits up on the screen on-the-fly without separately rasterizing.

And in the longer run, I think we should use these mappings to get rid of XCCS as the internal representation. For that we need the architecture of normalization that John has been thinking about.

johnwcowan commented 3 years ago

I think that adding to the XCCS/Unicode map is generally a good thing to do, but it probably won't have much impact on Medley or existing Medley files, in our current state.

Oh, absolutely: this isn't going to have a quick payoff. But it interests me, so it's what I'm going to work on at present.

So a first practical increment would be to be able use the Unicode mappings in such a way that we can make use of the larger variety of fonts that the OS comes equipped with, to rasterize into strike fonts and to extract hardcopy widths.

Yes. However, you probably don't want to rasterize OpenType fonts in Interlisp: they are complex, and it would just duplicate the large effort that other people have put in. So we should add the capability to Maiko to take a line of characters (you need a line at minimum to get bidi right) and have it invoke FreeType (rasterization) + HarfBuzz (shaping) + Pango (layout and bidi), returning the bits to Lisp.

And in the longer run, I think we should use these mappings to get rid of XCCS as the internal representation.

The difficulty there is that Unicode charcodes won't fit in a fixnum. Changing fixnums to 24/32 bits is, I assume, a much larger change, and just adopting Unicode Plane 0 only is, I think, a mistake, especially for Chinese and emoji. So by making this effort we can intelligently subset Unicode, maintain backward compatibility (no changes to the sysout required), and away we go.

For that we need the architecture of normalization that John has been thinking about.

Normalization will still be necessary.

rmkaplan commented 3 years ago

I think that bidi comes later on the incremental path, and presumably only applies reliably to Tedit-format files (I doubt that Tedit does bidi now, but maybe John did jump into that pool). We can put a bidi bit in strings and atom names, and print the characters from right to left, that’s easy enough. But what about printing a sequence of those things? Would that be sensible?

Similarly, 24 bit is (much?) later, on the road to being complete wrt Unicode. If we can do the right thing for 16 bit Unicode, including some way of dealing with modification characters and normalization within the logic of a programming environment as opposed to just a text editor or rendering engine, that would be a major milestone.


rmkaplan commented 3 years ago

With respect to control characters, the XCCS standard does not allocate charsets in that range (except for 0) and no charset other than 0 has assignments to the control subset. So it would presumably be heuristically safe, in a runcoded Greek file, to simply put out a CR or LF (or CRLF sequence) byte without changing the character set.

But Medley certainly doesn't do that. It uses the XCCS \OUTCHAR and \INCCODE functions to print and read characters to/from an XCCS-formatted file. The only funky transformation is that it converts the internal (CHARCODE EOL) to CR, LF, or CRLF sequences and then prints in its ordinary way (and reads such a sequence back to (CHARCODE EOL)). So if you are runcoded in Greek, it will put out 255 0 10 255 [greek]. If not runcoded, it will put out the 2 bytes 0 10.
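The EOL behavior described here can be sketched as follows. `emit_eol` and its arguments are invented names for illustration only, not Medley's actual \OUTCHAR interface, and this covers only the run-coded (1-byte) case:

```python
# Sketch of the described EOL handling: the internal (CHARCODE EOL) becomes
# CR, LF, or CRLF bytes in charset 0, with run-code escapes (255 = FF) around
# it when the stream is currently in another charset.

EOL_BYTES = {"CR": b"\r", "LF": b"\n", "CRLF": b"\r\n"}

def emit_eol(current_charset: int, convention: str = "LF") -> bytes:
    """Bytes to write for an internal EOL, per the description above."""
    nl = EOL_BYTES[convention]
    if current_charset == 0:
        return nl                      # already in charset 0: just the bytes
    # switch to charset 0, emit the newline, switch back
    return bytes([0xFF, 0x00]) + nl + bytes([0xFF, current_charset])
```

With a hypothetical Greek charset number of 0x26, `emit_eol(0x26)` produces the 255 0 10 255 [greek] sequence described above.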

This is unrelated to the fat/thin internal representation for atom names and strings.

johnwcowan commented 3 years ago

I think that bidi comes later on the incremental path, and presumably only applies reliably to Tedit-format files (I doubt that Tedit does bidi now, but maybe John did jump into that pool). We can put a bidi bit in strings and atom names, and print the characters from right to left, that’s easy enough. But what about printing a sequence of those things? Would that be sensible?

Bidi is a character property, not a string property, and it isn't a bit: it's a property with 14 possible values (plus 9 "bidi controls" that override the standard bidi algorithm when necessary). The good news is that bidi matters only while rendering; there is no reason why the Medley internals need to know anything about it.

If we can do the right thing for 16 bit Unicode, including some way of dealing with modification characters and normalization within the logic of a programming environment as opposed to just a text editor or rendering engine, that would be a major milestone.

I'm now proposing the following as an alternative to messing further with XCCS mapping and extension (it's also in the gist):

An alternative to XCCS 3.0 is to adopt UCS-2 (Plane 0 of Unicode) as the internal representation. Plane 0 is basically full, so we can't directly represent all of the UnihanCore2020 and emoji lists discussed above: that would require 1904 additional UnihanCore2020 characters and 1542 additional emoji. We also need 128+ Meta pseudo-characters and 112 function-key pseudo-characters. Fortunately, we can grab 3686 characters from the Private Use Area (6400 characters) for this purpose and convert them to real Unicode when reading and writing files and when rendering fonts.
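The proposed Private Use Area scheme amounts to a pair of translation tables applied at the file/font boundary. A toy sketch, with an invented assignment (the real table would come from the XCCS 3.0 work):

```python
# Toy sketch of the PUA scheme proposed above: characters that don't fit the
# internal 16-bit repertoire (extra CJK, emoji, Meta pseudo-characters) get
# PUA code points internally and are translated to real Unicode on I/O.
# The assignment below is invented purely for illustration.

PUA_START = 0xE000                      # BMP Private Use Area: U+E000..U+F8FF

# hypothetical assignment: internal PUA code -> real Unicode scalar
pua_to_unicode = {0xE000: 0x1F600}      # e.g. an emoji outside Plane 0
unicode_to_pua = {v: k for k, v in pua_to_unicode.items()}

def to_external(internal_codes):
    """Translate internal 16-bit codes to real Unicode on output."""
    return [pua_to_unicode.get(c, c) for c in internal_codes]

def to_internal(scalars):
    """Translate real Unicode to internal 16-bit codes on input."""
    return [unicode_to_pua.get(c, c) for c in scalars]
```

Codes with no table entry pass through unchanged, so the scheme is backward compatible for everything already representable in the internal 16-bit space.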

johnwcowan commented 2 years ago

See #906 for further development.