asciinema / asciinema

Terminal session recorder 📹
https://asciinema.org
GNU General Public License v3.0

Accept ISO-8859-1 as a locale #589

Closed h3xx closed 8 months ago

h3xx commented 8 months ago

I was unable to get locale.nl_langinfo to print 'US-ASCII'; however, in my testing on Linux, most combinations of LC_{CTYPE,*}=en_US returned 'ISO-8859-1'.

Note about the supposed Rust rewrite: even though the Rust rewrite has yet to publish its source code, much less make an official release, in keeping with the spirit of the GPL I still wish to share my changes back upstream.

ku1ik commented 8 months ago

Is ISO-8859-1 a subset of UTF-8?

h3xx commented 8 months ago

Is ISO-8859-1 a subset of UTF-8?

Short answer: No.

Longer answer: ISO-8859-1, aka latin-1, was the default terminal encoding for a long time. It is a superset of US-ASCII. Both are single-byte encodings: US-ASCII uses 7 bits per character, latin-1 uses all 8.

Every latin-1 character has a Unicode code point, but the bytes 0x80-0xFF are encoded differently in UTF-8 (as two-byte sequences), so a latin-1 byte stream is generally not valid UTF-8. There are libraries -- including Python 3's built-in codecs -- that will translate between them.

utf8_bytes = latin1_bytes.decode('latin-1').encode('utf-8')

Programs and terminals that are in UTF-8 mode still have to handle ASCII control characters; "\e" (byte 0x1b) is part of ASCII and is encoded as the same single byte in UTF-8.
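To make the byte-level difference concrete, here is a minimal Python 3 sketch. The sample string and the helper `is_valid_utf8` are made up for illustration and are not part of asciinema; the point is only that the same text produces different bytes in each encoding, and that raw latin-1 bytes above 0x7F do not decode as UTF-8.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

text = "caf\u00e9"                      # "café"; U+00E9 is outside ASCII
latin1_bytes = text.encode("latin-1")   # one byte per character
utf8_bytes = text.encode("utf-8")       # U+00E9 becomes two bytes

# Re-encode latin-1 input as UTF-8 by going through Unicode code points.
converted = latin1_bytes.decode("latin-1").encode("utf-8")

print(latin1_bytes, utf8_bytes, is_valid_utf8(latin1_bytes))
```

Pure ASCII text would come out byte-for-byte identical in both encodings; only the characters above 0x7F differ.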

ku1ik commented 8 months ago

Thanks. Given that it's not a subset of UTF-8, it wouldn't be compatible with the whole asciinema stack, which by design is built with UTF-8/ASCII in mind.

The reason for using UTF-8 only is that I want to keep asciinema player's ECMA/ANSI parser simple. By requiring UTF-8, which is a superset of ASCII, we can decode the whole data stream into Unicode code points first and then run the parser over the code points (which include control chars). This works because control chars are encoded in UTF-8 with the same bytes as in ASCII.
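As a rough illustration of the decode-first idea described above (not the actual asciinema player code), the Python sketch below decodes byte chunks into code points with an incremental UTF-8 decoder and only then classifies them. `decode_stream` and `split_controls` are hypothetical names, and the "parser" is a toy that merely tags control characters.

```python
import codecs

def decode_stream(chunks):
    """Decode an iterable of byte chunks as UTF-8, yielding code points."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="replace")
    for chunk in chunks:
        # The decoder buffers incomplete multi-byte sequences between chunks.
        yield from decoder.decode(chunk)
    # Flush whatever is left once the stream ends.
    yield from decoder.decode(b"", final=True)

def split_controls(chars):
    """Toy stand-in for a terminal parser: tag each code point."""
    for ch in chars:
        is_control = ord(ch) < 0x20 or ord(ch) == 0x7F
        yield ("control" if is_control else "print", ch)

# ESC is the same single byte in ASCII and in UTF-8.
esc_is_same_byte = "\x1b".encode("ascii") == "\x1b".encode("utf-8")

# The two-byte UTF-8 sequence for U+00E9 is deliberately split across chunks.
chunks = [b"\x1b[1mh\xc3", b"\xa9llo\r\n"]
events = list(split_controls(decode_stream(chunks)))
print(events)
```

Because decoding finishes before parsing starts, the second stage never sees partial multi-byte sequences and can work on code points alone.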

By allowing other encodings we would need to make our terminal parser way more complex, because it would have to deal with 2 levels of parsing at the same time, which is tricky (while solvable) due to various ambiguous cases.

It's exactly the same reason why mosh requires UTF-8 (see the "Why do you insist on UTF-8 everywhere?" answer in their FAQ).

Thanks anyway!

h3xx commented 8 months ago

I was unable to get locale.nl_langinfo(locale.CODESET) to give back the string 'US-ASCII' -- which locale settings would give 'US-ASCII' as the result?

My tests are a bit confounding: One machine (Ubuntu 23.04):

$ LC_ALL=en_US.us-ascii python3 <<< $'import locale\nprint(locale.nl_langinfo(locale.CODESET))'
ANSI_X3.4-1968
$ LC_ALL=en_US python3 <<< $'import locale\nprint(locale.nl_langinfo(locale.CODESET))'
ANSI_X3.4-1968

Another machine (Slackware):

$ LC_ALL=en_US.us-ascii python3 <<< $'import locale\nprint(locale.nl_langinfo(locale.CODESET))'
ANSI_X3.4-1968
$ LC_ALL=en_US python3 <<< $'import locale\nprint(locale.nl_langinfo(locale.CODESET))'
ISO-8859-1

Another (CentOS 8):

$ LC_ALL=en_US.us-ascii python3 <<< $'import locale\nprint(locale.nl_langinfo(locale.CODESET))'
ANSI_X3.4-1968
$ LC_ALL=en_US python3 <<< $'import locale\nprint(locale.nl_langinfo(locale.CODESET))'
ISO-8859-1
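For context on the outputs above: ANSI_X3.4-1968 is the registered name of the US-ASCII charset, so a comparison against the literal string 'US-ASCII' would not match it. One possible way to compare codesets regardless of spelling is to normalize them through Python's codec registry. This is only a sketch; `canonical_codec` and `is_utf8_or_ascii` are hypothetical helpers, not asciinema's actual check.

```python
import codecs

def canonical_codec(codeset: str) -> str:
    """Return Python's canonical codec name for a locale codeset string."""
    return codecs.lookup(codeset).name

def is_utf8_or_ascii(codeset: str) -> bool:
    # Hypothetical acceptance rule: only UTF-8 and its ASCII subset pass.
    return canonical_codec(codeset) in ("utf-8", "ascii")

for name in ("ANSI_X3.4-1968", "US-ASCII", "UTF-8", "ISO-8859-1"):
    print(name, canonical_codec(name), is_utf8_or_ascii(name))
```

In a real check the codeset string would come from locale.nl_langinfo(locale.CODESET), as in the shell sessions above.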