RasppleII / a2cloud

Connect your Apple // to the world via Linux

Explicitly set LANG=C for serial login; better handling for CP437 terminals #5

Open knghtbrd opened 8 years ago

knghtbrd commented 8 years ago

From @IvanExpert on October 25, 2015 22:24

UTF-16 is 2 or 4 bytes per character (not relevant, but just saying).
UTF-8 is one byte for 0-127 (ASCII compatible) and 2-4 bytes for everything else; this screws up Apple II term programs for non-ASCII chars (e.g. Unicode hyphens, smart quotes).

ISO-8859-* is one byte 0-255, with 128-255 varying by "part" 1-16.
ISO-8859-1 is "Latin-1", its revision is ISO-8859-15, and the others are language-specific.
Apple II text comm programs are going to display 0-127 anyway, since Apple II 128-255 are redundant or MouseText.
"ANSI" in a comm program means pseudo VT-100, and may also mean the "DOS Code Page 437" (IBM PC character set), as is the case with Spectrum ANSI emulation.
So it doesn't matter which ISO-8859 part, since the comm programs aren't going to use any of them. The main thing is that it's one byte per character, unlike UTF-8.
TERM=vt100 on Pi makes Linux programs mostly display B&W, and makes ctrl-chars display on Spectrum ANSI.
TERM=pcansi on Pi makes Linux programs do color for Spectrum ANSI (TERM=ansi just breaks everything).
LANG=en_US (as opposed to en_US.UTF-8) gets you ISO-8859-1, which is better for Spectrum ANSI, but the en_US ISO-8859-1 locale has to be available (from raspi-config).
See A2CLOUD setup for how to generate locales from the Linux prompt.
ProTERM VT-100 just repeats 128-255; ANSI BBS uses ASCII and MouseText to approximate DOS Code Page 437.
Spectrum VT-100 is sort of arbitrary in 128-255.
TERM=vt100 doesn't work with "ANSI" emulation because it outputs ctrl-O around text styling, which is a displayed character in CP437.

Single-byte:
ASCII is single byte 0-127 (0-31 are "C0" control codes, plus 127 is DEL).
ISO-8859-* (1-16) is ASCII for 0-127; 128-159 are "C1" control codes; 160-255 are regional characters.
ISO-8859-1 is standard "Latin-1"; ISO-8859-15 is updated for the Euro sign and other chars.

Microsoft has its own "codepage" numbers for character sets.
Codepage 437 (aka "ANSI BBS") is the DOS character set: ASCII from 32-126, plus printable chars at 1-31 and 127-255 (all chars also exist in Unicode, so they can be represented in UTF-8).
The "Linedraw" font for Windows provides characters 128+ for codepage 437: ftp://ftp.microsoft.com/Softlib/MSLFILES/GC0651.EXE (use 64.4.17.176 if it doesn't resolve).
Also, the "Terminal" font in XP provides most of it; Courier New is a Unicode font with most of the same characters.
Windows-1252 (codepage 1252) is ISO-8859-1 with additional chars from 128-159 instead of C1, including all chars in ISO-8859-15.
Mac has the "macintosh" or "MacRoman" encoding, which is ASCII for 0-127 and its own characters for 128-255.
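
To make the "same character, different byte" point concrete, here is a small sketch that asks iconv where e-acute (U+00E9) lives in a few of the single-byte sets listed above. It assumes an iconv that knows these encoding names (glibc's does); byte_of is a made-up helper name, and the character is written as octal escapes so the script itself stays pure ASCII.

```shell
# byte_of ENCODING: print the hex value of e-acute (U+00E9) in a
# single-byte ENCODING. The input is the character's UTF-8 form.
byte_of() {
    printf '\303\251' | iconv -f UTF-8 -t "$1" | od -An -tx1 | tr -d ' \n'
    echo
}

byte_of CP437        # 82 (decimal 130)
byte_of ISO-8859-1   # e9 (decimal 233)
byte_of MACINTOSH    # 8e (decimal 142)
```

A terminal that assumes one of these tables while the host sends another will show the wrong glyph for every byte above 127, which is why the host locale and the comm program's emulation have to agree.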

UTF-8 characters 0-127 are the same as ASCII.
UTF-8 characters 128+ are between two and four bytes and can represent everything (I guess).
UTF-16 characters are either two or four bytes, and are endian-sensitive.
UTF-32 characters are always four bytes, and are endian-sensitive.
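
The byte counts above can be checked from the Linux prompt. This is only a sketch, assuming an iconv that accepts these encoding names; bytes_in is an invented helper name. The explicit little-endian names are used so that no byte-order mark is added to the count.

```shell
# bytes_in ENCODING: print how many bytes e-acute (U+00E9) occupies
# in ENCODING. The input is the character's UTF-8 form (octal escapes).
bytes_in() {
    printf '\303\251' | iconv -f UTF-8 -t "$1" | wc -c | tr -d ' '
}

printf 'A' | wc -c | tr -d ' '   # 1: an ASCII character is one byte in UTF-8
bytes_in UTF-8                   # 2
bytes_in ISO-8859-1              # 1
bytes_in UTF-16LE                # 2
bytes_in UTF-32LE                # 4
```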

Copied from original issue: RasppleII/a2server#39

knghtbrd commented 8 years ago

The real solution for terminals would be to define the appropriate terminal definitions for ProTERM, Spectrum, etc. Character sets and locales are more of an issue since these tend to be offered as iso8859-* or utf-8 or sometimes multibyte character sets that don't relate to the Apple // at all. We should be able to get cp437 working for Spectrum. It's possible that we could also get MouseText working for limited boxdraw support in things like dialog.

Not sure how to tag this one. It's a bug certainly, but a bug in what exactly, aside from A2CLOUD in general? I'll move this there, but the fix is going to be complicated.

IvanExpert commented 8 years ago

This isn't a bug. It behaves correctly already. The above info is just for reference.

The most important detail is that ISO-8859-* is an available locale for TERM to use, which it isn't out of the box on Raspbian -- only UTF-8 is. That makes everything look good on an Apple II, whose comm programs don't know about Unicode, and therefore can't handle multi-byte characters.

The packaging of Raspple II ensures ISO-8859-1 is made available, and that locale is specified for the serial console during A2CLOUD setup (I just forget how offhand, but I know that's what I made it do).

If you set ProTERM, Z-Link, or Spectrum to VT-100, it works great out of the box; if you set ProTERM or Spectrum for ANSI, and type "term color" at the prompt (which is an alias to TERM=pcansi), then cp437 is used instead of VT-100. There's questionable benefit to this in ProTERM but it's great for Spectrum's color display.
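
For readers who want to see roughly what such a switch looks like, here is a minimal sketch written as a shell function. It is not the actual A2CLOUD code; only "term color" setting TERM=pcansi comes from the description above, and the "mono" argument is an invented name for switching back.

```shell
# Hypothetical stand-in for the A2CLOUD "term" command described above.
term() {
    case "$1" in
        color) TERM=pcansi ;;   # comm program set to ANSI: CP437 glyphs and color
        mono)  TERM=vt100 ;;    # invented name: back to plain VT-100 behaviour
        *) echo "usage: term color|mono" >&2; return 1 ;;
    esac
    export TERM
}
```

After "term color", curses programs started from that shell look up the pcansi terminfo entry instead of vt100.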

Of course, it would be theoretically possible to create a different emulation table for the proprietary "special" emulations offered by those programs, but a) why hardcode for one no-longer-maintained program, and b) in my observation, much of Linux is hardcoded for VT-100 and its derivatives, regardless of what TERM is set to.

See further conversation about this here, particularly my posts: https://groups.google.com/d/msg/comp.sys.apple2/WZ3p8IcrPrw/LToNxoh88IgJ http://appleii.ivanx.com/prnumber6/a2cloud-log-in-from-your-apple-ii/

knghtbrd commented 8 years ago

I don't think iso8859-1 does resolve the problem for the Apple // though—not really. The Apple // character ROM is 7 bit, and iso8859-1 is very 8 bit. Characters such as é and ü and ç and æ simply don't exist on the Apple, but are part of iso8859-1.

It's actually kind of too bad that we cannot easily load a soft-font into the Apple // text mode. Much cool stuff could be done with language support on the Apple // if we could.

IvanExpert commented 8 years ago

No, having an ISO-8859-1 locale doesn't fully resolve the problem of the Apple II not having the ISO-8859-1 character set. However, it solves a lot of problems by virtue of simply being any kind of 8-bit character set with standard ASCII in characters 0-127.

Otherwise, the Apple II comm programs attempt to render the default, multibyte UTF-8 character set by displaying every single byte of multibyte characters, which causes a host of formatting problems even in things as trivial as man pages, and certainly web pages in things like Lynx.
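
A quick way to see the effect is to reinterpret UTF-8 output one byte at a time, the way a single-byte terminal does. The sketch below assumes an iconv with CP437 support and uses CP437 as the example display table (as in Spectrum's ANSI mode); as_cp437_glyphs is a made-up helper name.

```shell
# The two bytes UTF-8 uses for e-acute (U+00E9), shown in hex.
printf '\303\251' | od -An -tx1             # c3 a9

# as_cp437_glyphs: treat each byte on stdin as one CP437 character and
# print the glyphs a CP437 display would draw (re-encoded as UTF-8 so a
# modern terminal can show them).
as_cp437_glyphs() {
    iconv -f CP437 -t UTF-8
}

printf 'caf\303\251\n' | as_cp437_glyphs    # "caf" followed by two line-drawing/symbol glyphs
```

One intended character becomes two displayed cells, so every column after it is shifted, which is where the formatting problems come from.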

I don't know whether it's possible to soft-load fonts for Spectrum based on Spectrum's ANSI mode, which is graphical -- I'd think you could replace the CP437 characters in the upper half with ISO-8859-1 characters. You could ask Ewen.

What might be interesting, and only occurs to me now, is that if CP437 is an available Linux locale, maybe that could be used instead of ISO-8859-1 for the Apple II shell login, and then Spectrum (and to a lesser extent, ProTERM) would be able to accurately represent what is intended, to the extent that CP437 can represent characters in other encodings.


IvanExpert commented 8 years ago

So, this is interesting. Due to a bug I just found in a2cloud-setup, which fails to write "en_US" into /usr/local/etc/a2cloud-lang during setup, and instead writes nothing, serial login is behaving probably better than was intended, by supporting only the ASCII character set, with no high characters. This bug should be made the permanent behavior.

This is because /usr/local/etc/a2cloudrc is supposed to be setting LANG=en_US, which uses ISO-8859-1, but instead it's setting LANG=C, which uses ANSI_X3.4-1968, aka ASCII. That is the fallback, but it's happening because of the bug.

So, having discovered the bug and realizing that all this time I had been looking at LANG=C on the Apple II, I looked at a French web site in Lynx via SSH on my Mac (LANG=en_US.UTF-8); all accented characters looked good. I then looked at it on my Apple II with LANG=C. Also looked good, with no accents, but otherwise readable. Then I set the Apple II to LANG=en_US. Looked bad, with some characters incorrectly displayed, and formatting problems. So, ISO-8859-1 should probably be avoided altogether.

I think the correct course of action is to always set LANG=C for serial login in a2cloudrc, and remove reference to a2cloud-lang; and if we do this, then there's no need to generate the en_US locale at all, either in a2cloud-setup or in the Raspple II packaging steps. (However, I'd still prefer to generate the en_US.UTF-8 locale in the packaging steps to replace the default en_GB.)
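
A minimal sketch of what that could look like in a2cloudrc follows. How a2cloudrc actually detects a serial login is not shown in this thread, so the device patterns and the helper name lang_for_tty are assumptions for illustration only.

```shell
# lang_for_tty DEVICE: print the LANG value to use for a login on DEVICE.
# Assumption: serial logins arrive on one of these tty device families.
lang_for_tty() {
    case "$1" in
        /dev/ttyUSB*|/dev/ttyAMA*|/dev/ttyS*) echo C ;;   # serial: plain ASCII
        *) echo "${LANG:-C}" ;;                           # SSH, console: unchanged
    esac
}

LANG=$(lang_for_tty "$(tty)")
export LANG
```

With this shape there is nothing to read from a2cloud-lang, and no en_US (ISO-8859-1) locale needs to exist for the serial login to work.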

As a footnote, I took a look, and there's no Debian locale in /usr/share/i18n/SUPPORTED that uses the CP437 charset, even though that charset exists in /usr/share/i18n/charmaps. I might try, just for grins, to see if I can get a locale to use it, because it would be slick if Spectrum's graphical ANSI emulation could actually represent accented letters correctly.

IvanExpert commented 8 years ago

Update: while there is no locale that supports the IBM437 charset, it might be worthwhile to create one as an option for accented character support in Spectrum's ANSI display; alternatively, users can use the included Links browser, which ignores the locale and provides its own character set menu, from which CP437 can be selected.

I took a little-used locale ('eo', which is Esperanto, and which is the only locale to use ISO-8859-3), and simply took /usr/share/i18n/charmaps/IBM437.gz and renamed it to ISO-8859-3.gz. I then created the 'eo' locale using dpkg-reconfigure, and then set LANG=eo.ibm437 and TERM=pcansi in Spectrum's ANSI display. (To make things readable in color text, in Spectrum choose Settings -> ...More Display Options -> Support color -> Use high intensity.)

Apart from the surprise of Lynx's menus being in Esperanto, I was able to render accented characters correctly. A demo of the character set can be seen here in Lynx: http://www.kostis.net/charsets/cp437.htm http://symbolcodes.tlt.psu.edu/bylanguage/french.html

And you can use them in Links as well, though you need to select Setup -> Character Set -> CP437.

TL;DR: If we think generalized, system-wide accented character support is desirable on the Apple II in Spectrum's ANSI emulation, we could make a new locale that uses the CP437 character set; if we think it's mostly useful for browsing only, users can use Links without us doing anything as long as we document how. (Or if we actually wanted to support ISO-8859-*, we could presumably edit Spectrum's ANSI character set; I assume Ewen would be receptive.)

As an aside, I noticed that Spectrum gets character 130 wrong; in CP437 it's supposed to be a lowercase e with a "forward slash" accent (acute), but it has a "two dots" (diaeresis or umlaut) instead.

knghtbrd commented 8 years ago

We could perhaps add en_US.ascii and en_US.cp437 locales. I don't know exactly how to add them to the dpkg-reconfigure menu without building some custom locale packages, but it is easy enough to generate a new locale. One reason to use something other than C for LANG is so that accented letters appear in correct sorting order. That's not really a big deal for ASCII-only terminals like ProTERM, but it would matter for CP437.
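
As a rough sketch of "generate a new locale" without touching the package menu, glibc's localedef can compile the en_US definition against the IBM437 charmap into a private directory. The locale name en_US.IBM437 and the output directory are assumptions, this has not been tried against Raspbian here, and localedef may emit warnings for this combination (hence -c).

```shell
# Sketch: build an en_US locale on top of the IBM437 (CP437) charmap.
# Assumes the locale sources in /usr/share/i18n are installed.
outdir=${TMPDIR:-/tmp}/a2cloud-locales
mkdir -p "$outdir"

# -i: locale definition to start from, -f: charmap to encode it with,
# -c: still write the output if warnings were issued.
localedef -c -i en_US -f IBM437 "$outdir/en_US.IBM437"

# Try the result without installing it system-wide: LOCPATH tells glibc
# where to look for compiled locales.
LOCPATH=$outdir LANG=en_US.IBM437 locale charmap
```

Because the locale is still en_US underneath, collation follows en_US rules rather than raw byte order, which is the sorting benefit mentioned above.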

Did you file an issue against Spectrum for its charset problem?

IvanExpert commented 8 years ago

I wrote an issue about Spectrum's character set issue here and notified Ewen, and also asked him if the character set is easily editable, for possible creation of an ISO-8859-1 character set that could be used with LANG=en_US.

This issue could probably be separated into two: one to make LANG=C the permanent default, and one to create the CP437 locale for those wanting to use Spectrum ANSI. (And possibly another for creating an ISO-8859-1 alternative character set for Spectrum ANSI.)

IvanExpert commented 8 years ago

Just did some homework on this pursuant to recent emails. Summary:

So, it might indeed be worthwhile to create a PSE termcap, based on VT-100 (because so much of Linux is hardcoded to VT-100 and derivatives), that maps box-drawing chars (Unicode? CP437?) and anything else suitable to MouseText.
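
The mechanics of "a new terminal definition based on VT-100" are small; the hard part is the character mapping itself. The sketch below only shows the skeleton: a terminfo source entry that inherits everything from vt100, compiled into a private directory. The entry name a2-pse is invented, and the line-drawing map (a capability such as acsc, placed before use=vt100) is deliberately left out rather than guessed.

```shell
# Write a terminfo source entry that inherits every capability from vt100.
# "a2-pse" is an invented name; a real entry would override the
# line-drawing capabilities with MouseText codes before the use= line.
src=${TMPDIR:-/tmp}/a2-pse.ti
dbdir=${TMPDIR:-/tmp}/a2-terminfo
printf '%s\n\t%s\n' \
    'a2-pse|hypothetical ProTERM Special entry based on vt100,' \
    'use=vt100,' > "$src"

# Compile into a private database and print the result (needs ncurses).
if command -v tic >/dev/null 2>&1; then
    mkdir -p "$dbdir"
    tic -o "$dbdir" "$src" && TERMINFO=$dbdir infocmp a2-pse
fi
```

Setting TERMINFO to that directory and TERM=a2-pse would then let curses programs such as dialog pick up whatever overrides the entry eventually contains.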

For whatever reason, raspi-config in the OS X terminal displays box drawing characters even when LANG=C and TERM=vt100; however, using Spectrum's VT-100 emulation, it uses appropriate ASCII equivalents (when TERM=vt100), while its ANSI emulation (when TERM=pcansi) shows accurate box-drawing characters.

To use ProTERM within GSport, set it to use the IIgs Modem Port with a Null Modem (RTS/CTS) driver. Then, in the ProTERM window, type ATZ followed by ATS0=1. Then in a telnet window, pipe whatever you want to send to ProTERM to nc on port 6502; for example, echo -e "\x10\x40" | nc localhost 6502 will output a solid-apple if ProTERM Special is turned on.
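
Since echo -e is not portable across shells, a printf variant of the same test may be handy. This sketch sends the same bytes as the example above (0x10, 0x40, then a newline); solid_apple_bytes is a made-up helper name.

```shell
# solid_apple_bytes: emit the bytes from the example above using octal
# escapes (\020 = 0x10, \100 = 0x40), followed by a newline.
solid_apple_bytes() {
    printf '\020\100\n'
}

# Inspect the bytes locally before sending them anywhere.
solid_apple_bytes | od -An -tx1     # 10 40 0a

# To send them to ProTERM running in GSport:
#   solid_apple_bytes | nc localhost 6502
```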

To use Spectrum, you can do the same, but you can also use Telnet if A2SERVER 1.5.0+ is running somewhere.

knghtbrd commented 7 years ago

Still not sure what exactly to do with this one, so I'm marking it for requested help in case some old UNIX hand who's had more experience with these issues can offer some advice about the best way to do this stuff.

knghtbrd commented 6 years ago

Added the informational component to the wiki page.