Add Support for UTF-8 Sources and I/O Stream

alan-if / alan

ALAN IF compilers and interpreters

https://alanif.se

Other

19 stars 3 forks source link

Add Support for UTF-8 Sources and I/O Stream #12

Closed tajmone closed 3 years ago

tajmone commented 5 years ago

First of all, I'd like to state that I think Alan should stick to use single-char encoded strings internally (ISO-8859-1 or Mac), for there are no valid reasons to make Alan Unicode aware — the Latin1 charset is more than sufficient to cover needs for most adventures in any western language, and the exceptions are not worth the change and the overhead that Unicode support would introduce in terms of memory and performance.

Having said that, I think that Alan should accept UTF-8 source code, and ARun should handle I/O text streams in UTF-8 by default. Here follows my rationale for this.

Basically, the Alan compiler should be able to read UTF-8 source files, and transcode them to ISO-8859-1 before parsing them. This shouldn't be a huge addition, for ALAN sources are still expected to contain only characters from the ISO range (although comments could contain Unicode, which is cool and wouldn't affect compilation). Basically, any character above 127 should be be decoded back to a single char, which is rather quick operation to do on any input stream — I've seen code examples for this in various languages, and they rarely exceeded 6 lines of code.

Also, ARun should be able to handle I/O streams in UTF-8 too, which would allow to it accept commands scripts in UTF-8, as well as generating transcripts in UTF-8 (which would simplify interactions with other tools).

As for Glk based terps, only the commands scripts and transcripts would have to deal with UTF-8, for the input player would be handled by the Glk interface there.

Whether UTF-8 streams should be the default or optional, ISO and Mac could still be supported (by a CLI option, or by default), but I'm sure that most users would choose UTF-8, especially when working on their favourite editor or IDE. (A similar faith can be seen in TADS3, one of the first IF tools to introduce an push UTF-8, which initially had UTF-8 as an option, but today is mostly used in UTF-8 only).

Editors Support

Most modern code editors assume UTF-8 as the default encoding, and many don't support well ISO-8859-1 — and those which do, they usually don't offer good protection against encoding break-up with paste operations, where pasting from the clipboard often leads to Unicode chars introduction in the document, automatically converting it to UTF-8.

When I was initially working on the Italian Library translation, I've faced frequent code corruptions during editing, until I created an Alan dedicated package for Sublime Text to enforce ISO-8859-1 (and had to open a feature request for it too, to prevent UTF-8 pasting from corrupting the source files). So I'm well aware how this can easily become a disruptive and frustrating issue.

Maybe those who work with English only don't notice this, but anyone writing in a language with accented letters and/or diacritics is soon going to bump into many issues.

LSP Support

The Language Server Protocol is proving a successful idea, and is being adopted everyday by more editors and languages as the protocol of choice for syntax highlighting, linting and even code refactoring.

The problem is that LSP requires all JSON-RPC messages to be sent in UTF-8, and LSP plug-ins working on other encodings are starting to report lot's of bugs and problems, especially in relation to text ranges and positions (which result in wrong coordinates).

See Microsoft/language-server-protocol#376 for a long discussion on this.

If ALAN could handle UTF-8 source files, it would open itself to a world of possibilities via LSP — writing a language server for ALAN wouldn't be all that difficult, its syntax is simple enough to allow writing an error tolerant parser to provide syntax highlighting. From there, we could have a binary Language Server for ALAN that could work with any editor and IDE supporting LSP, which would allow to focus all energies on a single package for all editors — and eventually even support the new LSP features for code refactoring.

LSP is a growing standard, with new feature being added in the course of time. The future is clearly heading in that direction, and even if LSP were to be replaced by another protocol in the future, most features will remain similar.

Tools Support

The above also applies to many tools, especially tools related to version control — Git doesn't offer any specific settings for ISO encodings, and most diffing tools also expect UTF-8 text streams today.

In some cases, pipelines could actually corrupt ISO sources.

Toolchains Support

Just take as an example Asciidoctor, which we are using for the Alan documentation, and all the problems we're facing due to ISO-8859-1 encoding in Alan sources and game transcripts.

EDIT: Now Asciidoctor supports use of include:: with ISO files! — Because Asciidoctor doesn't support use of include:: with ISO files (cf. asciidoctor/asciidoctor#3248), the documentation toolchains for the various libraries need to first convert all Alan sources, commands scripts and generated transcript to UTF-8, so that they might be usable in the documentation.

Similar issues are going to be everyday more common, especially with modern tools, for usage of any encodings beside UTF-8 is strongly discourages nowadays (not only in text files, but also for internal string representations):

http://utf8everywhere.org/

thoni56 commented 3 years ago

As noted in https://github.com/alan-if/alan-i18n/discussions/3#discussioncomment-632441, addition of a UTF-character based scanner/lexer would really be good.

However, I had a quick look at the code, and it is not as simple as changing code in the Alan compiler. The character sets used are actually generated by the scanner generator ("lexem parser generator") that is part of the legacy compiler-compiler toolset that is used to generate the scanner (and parser) for the Alan compiler. It supports not only the obsolete Mac OS Roman character sets but also ASCII, EBCDIC & IBM ...

Modifying this requires a deep, deep, deep dive into code as old as the one of Alan itself, using algorithms that have been lost, that noone has touched since... I've ended up here before, and that is probably a reason for it not progressing further. But I'll make a deeper dive, and document any findings here (or in the Alan design document) so that it's possible to retrieve that knowledge at a later point.

tajmone commented 3 years ago

I didn't think about this aspect, I just remembered that the ALAN sources have some custom functions for handling ISO and Mac conversions, somewhere internally. So, it seems that currently the easier workaround would be to use a tool like iconv to convert an UTF-8 source and feed it to the compiler or ARun in ISO-8859-1.

I would have thought that it was possible to intervene before the input stream, and simply convert it to ISO before feeding it to lexer, etc.: UTF-8 to ISO conversion should be a fairly simple operation in itself.

thoni56 commented 3 years ago

As a status update I can report that I now have branch with an (almost) working -charset utf-8 option.

What looked like a fairly straight forward conversion became quite hairy actually. The scanner reads the file in chunks so conversion state have to be retained between "chunks", and the conversion might also fail to convert everything since the chunk read might contain a part of, but not the complete, a multi-byte sequence. So a lot of C pointer and buffer manipulation and arithmetic. But the logic is there now. ~~Just one bug related to end of file handling that I need to fix and then clean up.~~ Now to ensure that interpreters convert back correctly (Darwin command line did not...)

But we do have a compatibility issue here. I'm foreseeing that we want -charset utf-8 as the default on platforms that use that (Linux and MacOS, primarily), as you indicated. But users that have already painstakingly gone through the steps to ensure that their sources are ISO8859-1 will suddenly have to use a switch.

But

We will introduce this for Beta8 and be very clear in the Release Notes
We should probably provide instructions on how to do the file conversions (iconv for *nixens, but how can we help Windows users?)
There's not a million Alan source files in non-ASCII out there

But it also prompts a question: how to handle Alan sources in git, like in the alanl18n project. "Git does not touch any characters other than line breaks" gives that checked out files come out with the encoding they are stored with. But given the historical adaption of UTF-8 we should probably convert those sources to UTF-8 after Beta8 is released.

tajmone commented 3 years ago

As a status update I can report that I now have branch with an (almost) working -charset utf-8 option.

That's great news!

What looked like a fairly straight forward conversion became quite hairy actually. The scanner reads the file in chunks so conversion state have to be retained between "chunks", and the conversion might also fail to convert everything since the chunk read might contain a part of, but not the complete, a multi-byte sequence.

Didn't think of that, but yes UTF-8 can be a pain when dealing with fix-sized buffers since you never know how many code points there are to a character until you actually scan it.

But we do have a compatibility issue here. I'm foreseeing that we want -charset utf-8 as the default on platforms that use that (Linux and MacOS, primarily), as you indicated. But users that have already painstakingly gone through the steps to ensure that their sources are ISO8859-1 will suddenly have to use a switch.

That's a hard decision indeed. On the one hand, we could argue that being ALAN still in Beta the breaking change is justified, and end users will have to convert their sources once.

Use UTF-8 BOM Instead of CLI Options

Another options could be to require UTF-8 files to contain a BOM, which would mean we don't even need a -charset utf-8 option: if the compiler sees a BOM then it will treat it as UTF-8, otherwise as ISO (or Mac/DOS if the users specified so), and the same with ARun regarding command scripts (and if the input command scripts is in UTF-8, then also the transcript should be in UTF-8). Usually UTF-8 files don't use a BOM, but the Unicode specification doesn't forbid to use one, and indeed many MS products do use a BOM for UTF-8 sources (for historical reasons).

We should probably provide instructions on how to do the file conversions (iconv for *nixens, but how can we help Windows users?)

Git for Windows already ships with iconv (usr/bin/iconv.exe in the install dir). Non Git users can download the Windows version from the Win32 GNU projec:

http://gnuwin32.sourceforge.net/packages/libiconv.htm

it's called Libiconv, but it also contains the iconv.exe (I've checked).

We could consider creating a small binary tool to convert ALAN sources from ISO/DOS/Mac to UTF-8, and include it in the ALAN SDK (could ignore DOS an Mac, really, who has every used them with ALAN 3?). I know it's reinventing the wheel, and that there's already iconv out there, but it's not a huge work either to build such a tool (probably less than a 100 lines of code).

There's not a million Alan source files in non-ASCII out there

True, but there are not a million ALAN 3 sources either (there definitely more ALAN 3 adventures than ALAN 3).

I think that the using the BOM would solve any backward compatibility problems, because both the compiler and ARun would work as before, unless they find a BOM in the sources, in which case they use the new system.

But it also prompts a question: how to handle Alan sources in git, like in the alanl18n project. "Git does not touch any characters other than line breaks" gives that checked out files come out with the encoding they are stored with. But given the historical adaption of UTF-8 we should probably convert those sources to UTF-8 after Beta8 is released.

I think we should adopt the new UTF-8 system in all repositories, it would make our life much easier in terms of interacting with the various tools.

The main problem I foresee is editor's support, i.e. each user has a favourite editor when it comes to working with ALAN sources, so switching to UTF-8 source might be hard if the rely on a third party ALAN editor syntax, which would require tweaking.

Probably the ALAN IDE would have to be tweaked in order to support UTF-8 files too, and I'll definitely need to update the Sublime ALAN syntax too (not sure if I can manage to keep both encodings though, because usually ST syntaxes only use one encoding for file extensions; I'll need to check if the editor API can be leveraged to auto-detect an UTF-8 BOM and switch encoding automatically).

thoni56 commented 3 years ago

Use UTF-8 BOM Instead of CLI Options

Another options could be to require UTF-8 files to contain a BOM, which would mean we don't even need a -charset utf-8 option: if the compiler sees a BOM then it will treat it as UTF-8, otherwise as ISO (or Mac/DOS if the users specified so), and the same with ARun regarding command scripts (and if the input command scripts is in UTF-8, then also the transcript should be in UTF-8). Usually UTF-8 files don't use a BOM, but the Unicode specification doesn't forbid to use one, and indeed many MS products do use a BOM for UTF-8 sources (for historical reasons).

My impression is that the recommendation for UTF-8 is to not use a BOM. But a couple of thoughts:

I'm reluctant to add even more stuff to the reading in the scanner ;-)
But, one option would be to look for the BOM, and if it looks like there is one (you can't be sure) strip it and use UTF-8 conversion, else go for paltform default or explicit option
I also would consider what editors do, e.g. if I create a file in Emacs on Linux it becomes UTF-8 without a BOM. What do other editors on UTF-8 platforms do if you do not specify encoding? UTF-8 without BOM would be my uneducated guess (except for Microsoft stuff...)

tajmone commented 3 years ago

My impression is that the recommendation for UTF-8 is to not use a BOM.

It's neither recommended nor condemned, it's perfectly OK to use one if the need be. Many applications that were originally designed for ASCII/ISO file did in fact adopt the UTF-8 BOM when switching to Unicode, to preserve backward compatibility.

I'm reluctant to add even more stuff to the reading in the scanner ;-)

Well, that should be fairly simple since the UTF-8 BOM is a fixed sequence occurring at the beginning of the file.

But, one option would be to look for the BOM, and if it looks like there is one (you can't be sure) strip it and use UTF-8 conversion, else go for paltform default or explicit option

I've used this conditional check in many apps, and I'd say its bulletproof (i.e. the BOM sequence is such an oddity that it's unlikely to naturally occur at the start of a document as something else than being a BOM).

I also would consider what editors do, e.g. if I create a file in Emacs on Linux it becomes UTF-8 without a BOM. What do other editors on UTF-8 platforms do if you do not specify encoding? UTF-8 without BOM would be my uneducated guess (except for Microsoft stuff...)

Surely, the BOM must be specified in most editors, unless you have an ALAN syntax with handles that. Usually it's as simple as choosing "convert to" (or "reopen as") and the "UTF-8 BOM" from the encodings list.

In the worst case scenario, it will be encoded as UTF-8 without BOM, which is going to be fine if the source contains only ASCII chars, but result corrupted if it contains chars above 126. But this is not different in any way from the current situation, i.e. users who might be writing ALAN code in UTF-8 instead of ISO, because of poor editor settings.

Probably for English adventures this has never been a problem, since they might not use any non-ASCII chars, and for languages with special chars the problem is that UTF-8 won't work with either the old ISO system nor the new one with UTF-8 BOM.

Unfortunately, the BOM seems the only guarantee to distinguish between ISO and UTF-8 files — as already mentioned, there is no bullet proof way to determine if a file is a true ISO-8859-1 file. Furthermore, without a BOM, what happens in a multi-source project, when the compiler is fed a first source with ASCII only chars, and then another one in UTF-8? I guess that the compiler will determine the encoding from the first file, and then assume that all other files are encoded the same. Or would it evaluate the encoding for each file? (I'd expect all imported source to be same encoded).

tajmone commented 3 years ago

The alternative would be to adopt UTF-8 only sources, and introduce a backward incompatible change — for which the simple solution is to pass old pre-Beta8 source through iconv, which is a fairly simple process, especially considering that there aren't probably more than a couple dozen ALAN 3 adventures in the wild.

Maybe this solution is cleaner, for it would allow to drop the whole ISO/DOS/Mac legacy encodings and their handling, and would allow end users to code ALAN in any modern text/code editor, out of the box (of course, they'd need to limit characters to the ISO-8859-1 charset).

We should probably ask on the newsgroup what other users think about the different solutions.

thoni56 commented 3 years ago

You are probably correct in that we need to handle a possible BOM in either case, as we can't know how the particular editor/platform do it, or indeed how the user decided to convert to UTF-8. So I'll add a BOM-recogniser/skipper to the list of things to do.

Furthermore, without a BOM, what happens in a multi-source project, when the compiler is fed a first source with ASCII only chars, and then another one in UTF-8? I guess that the compiler will determine the encoding from the first file, and then assume that all other files are encoded the same. Or would it evaluate the encoding for each file? (I'd expect all imported source to be same encoded).

In the implementation I have in mind mixed-mode will not be a technical problem as each file can be handled with different encodings. It is more a matter of determining what the possible charset options should mean then. Does a possible BOM override the option (or indeed the "native" encoding for that environment, probably). That would at least allow a "native" set of files with interspersed UTF-8 files.

The alternative would be to adopt UTF-8 only sources, and introduce a backward incompatible change — for which the simple solution is to pass old pre-Beta8 source through iconv, which is a fairly simple process, especially considering that there aren't probably more than a couple dozen ALAN 3 adventures in the wild.

Yes. I think this is what I had in mind when I wrote about the compatibility breach.

The process would go something like this

Introduce the UTF-8 encoding option, and BOM-detection, in Beta8, keeping whatever platform "native" encoding, and the encoding options (which now does not include "mac" ;-)
Go for UTF-8 as the default on all platforms that could possibly have it, but still allow charset option Beta9
Remove charset option (maybe?)

tajmone commented 3 years ago

Adding `Option` » `Encoding` Keyword?

Remove charset option (maybe?)

Maybe extending Option/Options to also include Encoding could be a solution, since it would allow in-doc specification of the required encoding, with possible values:

iso — ISO-8859-1
DOS (for backward compatibility?)
UTF-8 BOM (default?)
UTF-8 — to allow handling correctly UTF-8 sources when using a BOM is not possible (e.g. due to editor problems, etc.).

Whatever the default encoding becomes, the Encoding keyword would allow authors to specify the encoding of each source (in mixed sources projects), while the --encoding CLI option would empower end users to specify encodings (or even override the value of the Encoding keyword?).

This should solve most problems concerning ALAN source files; as for commands scripts (input) the situation should be simpler to handle via ARun CLI options, and transcripts (output) should usually follow the encoding of the command script (if any), and the default encoding (or the one specified via CLI options) when transcribing a human gameplay (non automated via commands scripts).

thoni56 commented 3 years ago

Options are not allowed per file, only at the start of the main file. So that is not an option (pun intended...).

I think we are over-thinking this. The scenario that we really want is that all authors use the natural encoding on their platform. They should be able to just start an editor and create Alan source files with localized text. That will in many cases today be UTF-8 encoded, with or without a BOM. This should go for input and log files too.

The environments that this will not be sufficient for is probably (you have better multi-language skills than I do) obscure and would fall under the "not supported" category. (As per the alan-i18n discussion.)

The migration steps would either be a one-time conversion of all the files (as you suggested), or, if your UTF-8 files uses a BOM, one file at a time (since the presence of a BOM would override whatever "global" encoding was used for that file).

So using a BOM should be allowed, and provides you with this little advantage that you can convert the source files as you happen to edit them.

(There is one possible issue that I have not explored yet, and that is how the compiler actually treats and stores the words in input, e.g. if a verb "word" is stored in native or internal encoding. If it's in "native" encoding (the encoding of the file, assumed or explicitly indicated) then having files with different encoding might not match the same words. The compiler sometimes uses those strings for presentation to the used, like in error messages, so it is not trivial. And a very long time since I touched that part. But one step at a time, I'm sure once we get the "normal" case working, I'll get to those details.)

tajmone commented 3 years ago

Options are not allowed per file, only at the start of the main file. So that is not an option (pun intended...).

I forgot about that. So it would have to be an independent keyword, placed at the beginning of each source module.

I think we are over-thinking this.

Indeed, but it's worth overdoing brainstorming, especially if we end up preserving support for legacy encodings, because there are some many pitfalls with single-char encodings that over-thinking is safer than just assuming (if developers from the pre-Unicode era had over-thought their standards we probably wouldn't be struggling so much today).

The scenario that we really want is that all authors use the natural encoding on their platform.

The problem here is that "natural encoding on a platform" is more a reminiscence of the DOS/Terminal era. Modern Desktop OSs are no longer built on such settings, although they might emulate them in the terminal. Modern OSs are all Unicode-ready, although they might be handling Unicode strings differently internally (e.g. the WinAPI tends to use UCS-2 or UTF-16).

My understanding is that the Shell in Linux distros is always using UTF-8 encoding (so does that Bash for Windows), whereas on Windows the CMD might use different encodings (CodePages, actually) depending on the system locale, but PowerShell might be in UTF-8 because it's a modern tool.

When I used to work on Windows 10 using Italian as the default language, I would have to explicitly set the CodePage to ISO-8859-1 (or UTF-8, depending on the toolchain) to ensure correct ALAN compilations and automated transcripts. Now that I recently switched to the US EN locale, I think the default settings for CMD have also changed (but I'm not sure, and in any case it was not a fresh install, but a local switch).

They should be able to just start an editor and create Alan source files with localized text. That will in many cases today be UTF-8 encoded, with or without a BOM. This should go for input and log files too.

Yes, any modern editor defaults to UTF-8 (without BOM) as the default fallback encoding for unknown extensions (and also for most known ones too).

The environments that this will not be sufficient for is probably (you have better multi-language skills than I do) obscure and would fall under the "not supported" category. (As per the alan-i18n discussion.)

In the past, Windows XP Arabic edition used to be a different product altogether, but this is no longer the case since Win 10 natively supports Arabic, Hebrew and other LTR and non-Latin languages, usually requiring only to install the feature (that ships with the basic Win10 installer, but disabled). I'm not sure what the default CMD CodePage settings would be when the default system locale is Arabic, but I expect that for GUI applications everything is pretty much the same as any other locale, since Unicode handles LTR via control character, which allow pasting around any language snippet as long the editor is capable of handling LTR locales (Arabic and Hebrew fonts are preinstalled in any Win10 edition now).

The migration steps would either be a one-time conversion of all the files (as you suggested), or, if your UTF-8 files uses a BOM, one file at a time (since the presence of a BOM would override whatever "global" encoding was used for that file).

If ALAN is going to simply switch to UTF-8 as the default encoding, I would avoid the BOM altogether, since it's not common practice (although perfectly "legal" to use). If legacy encodings will be still supported via the --enconding compiler/interpreter switch, and both the compiler and interpreter are able to convert on the fly legacy encodings to UTF-8, then everything should be fine even without BOM.

I mean, how many ALAN authors will be affected by this breaking change? Probably not more than a dozen, if not less. Players won't be affected in any way by this.

So using a BOM should be allowed, and provides you with this little advantage that you can convert the source files as you happen to edit them.

Indeed, it would be good to do so, after all an UTF-8 file with a BOM is a perfectly valid UTF-8 file; it's just that the BOM is not usually added because today UTF-8 is the de facto standard; but as mentioned, various MS tools require a BOM for legacy support and to prevent backward incompatibility (so, we're not alone in this predicament and the possible solutions).

(There is one possible issue that I have not explored yet, and that is how the compiler actually treats and stores the words in input, e.g. if a verb "word" is stored in native or internal encoding. If it's in "native" encoding (the encoding of the file, assumed or explicitly indicated) then having files with different encoding might not match the same words. The compiler sometimes uses those strings for presentation to the used, like in error messages, so it is not trivial. And a very long time since I touched that part. But one step at a time, I'm sure once we get the "normal" case working, I'll get to those details.)

I thought that ALAN only stored strings in ISO-8859-1, but from what you're writing it seems that it's actually doing it "blindly", i.e. leaving them as they are in the input?

Ideally, internally all text should be strictly in ISO-8859-1, since Latin-1 covers all the needs of ALAN. So in case of UTF-8 and DOS/Mac sources, these should be internally converted to ISO-8859-1, right?

Another potential issue that came to my mind is error reporting. If the source file is in UTF-8, the compiler (and interpreter running in DEBUG mode too) would have to re-calculate the column number depending on whether the source file was in single-char encoding or UTF-8 (and the same problem of internal text vs source file conversions applies here).

thoni56 commented 3 years ago

Options are not allowed per file, only at the start of the main file. So that is not an option (pun intended...).

I forgot about that. So it would have to be an independent keyword, placed at the beginning of each source module.

I think we are over-thinking this.

Indeed, but it's worth overdoing brainstorming, especially if we end up preserving support for legacy encodings, because there are some many pitfalls with single-char encodings that over-thinking is safer than just assuming (if developers from the pre-Unicode era had over-thought their standards we probably wouldn't be struggling so much today).

Well, I partly agree. I'm not fond of it because it tend to prevent us from going forward. And that is important to get new input, knowledge and feedback that we can't think of ourselves. But having said that, Göran and me had a saying "Let's run through the forest in any direction and see if we find anything interesting and then take us back to the path, to start taking the small steps". The problem here are two things, it's hard to interpret how "serious" a suggestion is when you only have one of your senses in the conversation (we only see the text, not the intonation or facial expressions etc.). The other part is what are the parts that we should put more thought into, and which should we just try something, because it is easy to change later.

The scenario that we really want is that all authors use the natural encoding on their platform.

The problem here is that "natural encoding on a platform" is more a reminiscence of the DOS/Terminal era. Modern Desktop OSs are no longer built on such settings, although they might emulate them in the terminal. Modern OSs are all Unicode-ready, although they might be handling Unicode strings differently internally (e.g. the WinAPI tends to use UCS-2 or UTF-16).

What I meant was just that. The "natural encoding" is nowadays almost always UTF-8. But there might be some odd environments where this is not true. But it is important from an implementation point to define the terms "native" and "internal" encodings.

... I think the default settings for CMD have also changed (but I'm not sure, and in any case it was not a fresh install, but a local switch).

The virgin CMD in my fresh install of Windows (in Swedish) says "850 OEM Multi-language Latin 1". I almost never use it so I've no idea how to change it.

They should be able to just start an editor and create Alan source files with localized text. That will in many cases today be UTF-8 encoded, with or without a BOM. This should go for input and log files too.

Yes, any modern editor defaults to UTF-8 (without BOM) as the default fallback encoding for unknown extensions (and also for most known ones too).

Again, agreed.

...

If ALAN is going to simply switch to UTF-8 as the default encoding, I would avoid the BOM altogether, since it's not common practice (although perfectly "legal" to use).

I'm not sure I understand what you mean by "avoid the BOM altogether". What I had in mind, after your suggestions, was that the compiler would peek the first three bytes and then decide the encoding for that file. If it's a BOM, select UTF-8, if not push those back on input, and fall-back to "native" encoding, which might be UTF-8 anyway.

Or are you saying that that would be a bad idea? Why?

If legacy encodings will be still supported via the --enconding compiler/interpreter switch, and both the compiler and interpreter are able to convert on the fly legacy encodings to UTF-8, then everything should be fine even without BOM.

I mean, how many ALAN authors will be affected by this breaking change? Probably not more than a dozen, if not less. Players won't be affected in any way by this.

Exactly. I think we know them ;-)

So using a BOM should be allowed, and provides you with this little advantage that you can convert the source files as you happen to edit them.

Indeed, it would be good to do so, after all an UTF-8 file with a BOM is a perfectly valid UTF-8 file; it's just that the BOM is not usually added because today UTF-8 is the de facto standard; but as mentioned, various MS tools require a BOM for legacy support and to prevent backward incompatibility (so, we're not alone in this predicament and the possible solutions).

And now you are saying that looking for the BOM is a good idea.

(There is one possible issue that I have not explored yet, and that is how the compiler actually treats and stores the words in input, e.g. if a verb "word" is stored in native or internal encoding. If it's in "native" encoding (the encoding of the file, assumed or explicitly indicated) then having files with different encoding might not match the same words. The compiler sometimes uses those strings for presentation to the used, like in error messages, so it is not trivial. And a very long time since I touched that part. But one step at a time, I'm sure once we get the "normal" case working, I'll get to those details.)

I thought that ALAN only stored strings in ISO-8859-1, but from what you're writing it seems that it's actually doing it "blindly", i.e. leaving them as they are in the input?

Ideally, internally all text should be strictly in ISO-8859-1, since Latin-1 covers all the needs of ALAN. So in case of UTF-8 and DOS/Mac sources, these should be internally converted to ISO-8859-1, right?

There is a difference between "internal" and "internal". "Internal" as in all strings and text that should be propagated to the .a3c, yes. But strings that need to be communicated back to the user must/should (also) be available in "native" otherwise the error messages containing them will be garbled, as I mentioned above. (But again, one step at a time.)

Another potential issue that came to my mind is error reporting. If the source file is in UTF-8, the compiler (and interpreter running in DEBUG mode too) would have to re-calculate the column number depending on whether the source file was in single-char encoding or UTF-8 (and the same problem of internal text vs source file conversions applies here).

Yes, error reporting is a middle ground, as I said, but this is an extra point that I did not consider. Hmm, don't exactly know what to do about this. I think leaving it as it is, and learn by feedback if this is a huge problem, is probably good enough for now, although not exactly pretty. (Considering effort versus value here.)

tajmone commented 3 years ago

What I had in mind, after your suggestions, was that the compiler would peek the first three bytes and then decide the encoding for that file. If it's a BOM, select UTF-8, if not push those back on input, and fall-back to "native" encoding, which might be UTF-8 anyway.

Or are you saying that that would be a bad idea? Why?

And now you are saying that looking for the BOM is a good idea.

Sorry for the confusion here, it's because of the different context.

To clarify, my opinion on this is that:

If we keep support for legacy encodings, then:
- The UTF-8 BOM is the safest guarantee to distinguish an ISO source from an UTF-8 one — i.e. that's assuming that preserving legacy encodings is done for backward compatibility, which means that ISO would still be the default, unless a, UTF-8 BOM is found.
- If UTF-8 become the new default encoding (but legacy encodings are supported via CLI options) then the BOM might not be necessary, but probably it would prevent many problems with users who don't know about encodings (many don't).
If legacy encodings are dropped altogether (i.e. users will have to convert old sources via iconv in order to compile them), then:
- There's no need for an UTF-8 BOM, since UTF-8 would be the default and only encoding.
- Still, an UTF-8 file containing a BOM should be considered as valid UTF-8 (which it is), and the compiler should be able to consume the BOM to prevent feeding it to the parser (otherwise compilation would fail, and the user would not know why, since the BOM is hidden by the editor).
  
  Some old editors might enforce the BOM (I actually have a couple of such editors, which support either ASCII or Unicode, using UTF-8 BOM for the latter in order to distinguish them), so it's worth being prepared for a BOM, since end users might not even be aware of its presence (editors don't show the BOM in the source file).
  
  The compiler should probably print a notice about the encountered BOM (at least in verbose mode) — again, because its presence usually goes by unnoticed, so if an author works on a third party code he/she might simply be inheriting a BOM (which is unneeded in this case, so he should get rid of it unless forced to use it by the editor).

tajmone commented 3 years ago

As a practical example, command line syntax highlighters which default to UTF-8 usually check for an UTF-8 BOM, just to skip it if found, or to convert on the fly if it's the BOM of another encoding (e.g. UTF-16, etc.). So it's good practice to allow for an UTF-8 BOM, as opposed to failing to handle the file due to the BOM being treated as garbage.

thoni56 commented 3 years ago

Another step forward, the current sources on master allows native command line interpreters to communicate with user in UTF-8 using the -u option. There is also a -i, for ISO, which does nothing, swapping old -i (for ignore errors) moved to -e. Alpha documentation is updated.

There are a couple of snags, that I'll continue to explore

[ ] Command input editing is garbled in UTF-8 (have to keep track of characters vs. bytes)
[ ] Building on various platforms seems to vary in what they expect the command input is, some use UTF-8 (if terminal is set to that) but others, mainly cross-compilation to Windows, seem to use some other "random" encoding
[ ] The iconv-library is sometimes "builtin" (Linux), separate (Cygwin, Msys) or as a DLL (Mingw32, including Msys/Mingw32) so this is a packaging issue

But basically you can create source files in UTF-8 with or without BOM, compile them and run it in a UTF-8 terminal (if you use the right command line interpreter and manage to get the iconv library to load...).

WinArun and the Gargoyle slot in arun is not affected since they use Glk encoding. glkterm did not work in the single quick test I did on one of the many platforms (Linux, Cygwin, Darwin/MacOS, Msys, Mingw, ...).

tajmone commented 3 years ago

These are great progresses indeed.

Building on various platforms seems to vary in what they expect the command input is, some use UTF-8 (if terminal is set to that) but others, mainly cross-compilation to Windows, seem to use some other "random" encoding

If I understood correctly, this pertains to shell/CMD only — i.e. which encoding to expect when playing via the terminal. I guess there's no default for Windows CMD, as the default CodePage will vary depending on the default locale of the installation (something which a cross compiler can't guess).

I've found this discussion thread on the topic, on StackOverflow:

https://serverfault.com/questions/80635/how-can-i-manually-determine-the-codepage-and-locale-of-the-current-os

The situation under Windows is particularly complex because the OS supports dual contemporary active settings, one for legacy encodings and another for Unicode applications. The thread mentions various ways to query the system to obtain info about the default settings, ranging from CMD commands, to registry and WinAPI calls.

When using Batch scripts for ALAN, I always set the CodePage to either ISO-8859-1 or UTF-8 first, depending on the tasks ahead. The main problem is when text stream are piped through some CMD commands, since not all commands properly support UTF-8, so this has to be kept into account too.

I guess that ARun needs to determine the CodePage used by the CMD process that invoked it, since this might differ from the default System settings (e.g. because the user changed CodePage manually, or due to scripts), so querying the registry might not be a safe solution. Probably the bullet-proof way is to invoke the chcp command, which should return the correct CodePage for the current process. Surely, an API call would be better, but I have no clue which one that would be.

Also, there are no guarantees that ARun is being invoked from the CMD either; it could be from PowerShell, Bash for Windows or the recently added Windows Terminal — all of which differ in regard to default encoding settings, support for legacy encodings and Unicode, and other details affecting I/O text streams.

In any case, it looks like this would require coding a Windows specific solution.

WinArun and the Gargoyle slot in arun is not affected since they use Glk encoding.

not even for commands scripts and transcripts? which encoding they expect/adopt for such external scripts/logs?

glkterm did not work in the single quick test I did on one of the many platforms (Linux, Cygwin, Darwin/MacOS, Msys, Mingw, ...).

I've never used glkterm, although I'm aware of it. I wonder how many people do use it nowadays (especially under Win OS), but I guess that it's still worth keeping an eye on it for compatibility sake.

thoni56 commented 3 years ago

These are great progresses indeed.

Yes, they feel like a big break-through. Thanks for pushing for this!

Building on various platforms seems to vary in what they expect the command input is, some use UTF-8 (if terminal is set to that) but others, mainly cross-compilation to Windows, seem to use some other "random" encoding

If I understood correctly, this pertains to shell/CMD only — i.e. which encoding to expect when playing via the terminal.

Yes.

I guess there's no default for Windows CMD, as the default CodePage will vary depending on the default locale of the installation (something which a cross compiler can't guess).

Probably not.

I've found this discussion thread on the topic, on StackOverflow:

https://serverfault.com/questions/80635/how-can-i-manually-determine-the-codepage-and-locale-of-the-current-os

The situation under Windows is particularly complex ...

Yes ;-) But a good link.

I guess that ARun needs to determine the CodePage used by the CMD process that invoked it... ... In any case, it looks like this would require coding a Windows specific solution.

Right. And all this complexity boils down to, I think, we can't solve this with an effort that warrants the value. We should probably be content with having command line interpreters that work in the environments that we are using it, basically running automated testing.

I use this to run the 1500 regression tests for Alan itself and would do that on Cygwin/Msys, Linux and Darwin. I have no use for a general Windows command line interpreter that can automatically decide which encoding and/or code page to use.

It sounds like you don't either, as you are pretty adamant about exactly which environment you are running and I presume that it would be fairly straight forward to adopt that to whatever encoding is convenient for you. So then we just need to ensure there is command line interpreter that supports that environment.

The only outside users of CMD-use case is authors that actually run command line terps to do testing of their works. But we could just not support the CMD use case for multinational character sets, just document the problem away. Authors that need it should opt for some other environment. In any case it's no worse than previous versions...

And "real" users are probably not using a CMD terminal to run games anyway.

(Do you have an example of some CMD command tool that you use and dont' support full character sets? Do you need to use CMD?)

WinArun and the Gargoyle slot in arun is not affected since they use Glk encoding.

not even for commands scripts and transcripts? which encoding they expect/adopt for such external scripts/logs?

Good point. Did not think of that one.

So they also need to adopt the -u and -i options which should control which encoding command scripts and logs will use.

I was thinking that if you selected an encoding it would apply to all input and all output, and it should, but in this exception case the output to the "screen" is in Glk Latin-1, but output to logs will be whatever encoding selected (manual or automatic). The same will have to apply to command input, if from a file then recode if UTF-8 is selected, if from "terminal" no recoding.

glkterm did not work in the single quick test I did on one of the many platforms (Linux, Cygwin, Darwin/MacOS, Msys, Mingw, ...).

I've never used glkterm, although I'm aware of it. I wonder how many people do use it nowadays (especially under Win OS), but I guess that it's still worth keeping an eye on it for compatibility sake.

It's not a high-prio or "profile" interpreter, but I like to keep it around since it gives a way to compile to "standard" GLK libraries (other than WindowsGLK) and make sure that works. Except now, if it messes up with the encodings...

tajmone commented 3 years ago

I agree, end users playing games in a terminal (e.g. CMD) are responsible for setting the correct CodePage they're expecting ARun to work with.

What I never fully understood though, is how Windows handle piping text streams from/to different applications from within a binary app that invokes them programmatically. I've used this feature in various tools I've created, and using a language which allows to specify the encoding of outgoing and incoming strings, but I never understood if the OS is handling CodePage or encoding in the middle, e.g. when piping to some built-in commands, etc. (i.e. for those tools that don't have a dedicated encoding option/switch). Does the OS intervene at all? Does it rely on the default Encoding/CodePage settings? No clue.

In any case, when building command line tools I noticed that the safest way to go is output in UTF-8 and hope that end users will have set the correct CodePage — at least the tool will also work under Bash for Windows, which only supports UTF-8.

(Do you have an example of some CMD command tool that you use and dont' support full character sets? Do you need to use CMD?)

If my memory doesn't betray me, TYPE and MORE (definitely the latter) are problematic. E.g. passing UTF-8 text to MORE should result in spaces between chars, because they are interpreted as Latin-1. But this might have also changed in the meantime, since MS has revamped the CMD a couple of times in the past two years.

A good source to look up which native commands don't support Unicode, and the general topic of encodings and CodePage in CDM, is definitely SS64, which also covers undocumented material:

https://ss64.com/nt/

I also read there, under TYPE:

It is also possible to convert files from Unicode to ASCII with TYPE or MORE see the redirection syntax page for details.

And then, under redirection:

Unicode

The CMD Shell can redirect ASCII/ANSI (the default) or Unicode (UCS-2 le) but not UTF-8. This can be selected by launching CMD /A or CMD /U

With the default settings a UCS-2 file can be converted by redirecting it (note it's the redirection not the TYPE/MORE command that makes the encoding change) TYPE unicode.txt > asciifile.txt

European characters like ABCàéÿ will usually convert correctly, but others like £¥ƒ€ will become random extended ASCII characters: œ¾Ÿ?

So it seems the problem with these native command is that by Unicode they (MS) really mean UCS-2 (that horrible Unicode subset embraced by MS, still omnipresent in many MS tools and APIs).

Hence, although you can specify UTF-8 as the default encoding in CMD (via CHCP 65001) you still have to bear in mind that some native commands expect UCS-2 in their streams (which indeed is problematic).

I think that the problem is that MS at one point didn't expect Windows users to keep on using the CMD so much, or at least was hoping they'd embrace PowerShell, and it's only of lately that they started to polish the CMD (e.g. reintroduce ANSI colors and escapes, etc.). But there's only so much they can do in revamping the CMD and its native commands without breaking batch scripts compatibility by adding new options and features (and there are enough quirks between the various Win editions to juggle with already).

thoni56 commented 3 years ago

glkterm did not work in the single quick test I did on one of the many platforms (Linux, Cygwin, Darwin/MacOS, Msys, Mingw, ...).

I've never used glkterm, although I'm aware of it. I wonder how many people do use it nowadays (especially under Win OS), but I guess that it's still worth keeping an eye on it for compatibility sake.

It's not a high-prio or "profile" interpreter, but I like to keep it around since it gives a way to compile to "standard" GLK libraries (other than WindowsGLK) and make sure that works. Except now, if it messes up with the encodings...

Building glktermw against nursesw creates a libglktermw that supports UTF-8 that works perfectly when linked to produce glkarun.

(Although the w stands for "wide" that is not completely correct as it supports UTF-8 and not UTF-16 or UTF-32, AFAICT. It should probably be ncursesv for "variable encoding" or even ncursesu. But now I'm just nit-picking...)

I was wrong about that, glkterm actually has an impressive array of supported encodings, including "wide (whatever that is on this platform)" to quote the readme.

thoni56 commented 3 years ago

So. I think UTF-8 support is now nearly complete. I decided to rip out the old charset handling completely to be able to refactor the code better (no more 'dos' or 'mac'!). This means that you can control which encoding your source has using the -charset utf8 when invoking the compiler. It will then also print error messages and list files using that encoding.

For the interpreters -u will do the same, meaning that all file reading and writing will be using UTF-8 encoding. For non-GLK interpreters command input will also be in UTF-8 encoding. They also support command editing as before but now using mixed length characters (there might still be some glitches). That was probably code from 1985 that hasn't been touched since, so I took a very long time to get support for multi-byte character to work, and be somewhat readable code, but it was a fun exercise.

The default encoding for both compiler and interpreter is now still ISO-8859-1.

My plan is to release Beta 8 rather soon, with this in place. Beta 9 will then reverse the defaults so that UTF-8 becomes the default, but you will be able to force -charset iso-8859-1.

(I might add an alternative compiler option, -encoding, with the same values and effect before Beta 8.)

So now my focus will shift to the CI/build problems.

tajmone commented 3 years ago

So. I think UTF-8 support is now nearly complete.

That's wonderful news! Looking forward for the next Alpha SDK so I can switch the StdLib repo to UTF-8, which will provide a good test ground thanks to the library size and the big test suite.

I decided to rip out the old charset handling completely to be able to refactor the code better (no more 'dos' or 'mac'!).

Sounds reasonable. These encoding are only accessible through emulators nowadays, and I can't think of anyone who'd want to use that.

This means that you can control which encoding your source has using the -charset utf8 when invoking the compiler. It will then also print error messages and list files using that encoding.

:heart:

For the interpreters -u will do the same, meaning that all file reading and writing will be using UTF-8 encoding. For non-GLK interpreters command input will also be in UTF-8 encoding. They also support command editing as before but now using mixed length characters (there might still be some glitches).

What do you mean by "support command editing"? the feature of storing player input to an external .a3s files for replaying them via a solution file?

That was probably code from 1985 that hasn't been touched since, so I took a very long time to get support for multi-byte character to work, and be somewhat readable code, but it was a fun exercise.

I can imagine it was a huge leap in term of code, some of the third party modules and libraries might have been half a century old, or almost. Hopefully, these changes will also make the ALAN sources more maintainable in the future, with most of the legacy stuff having been revamped to modern usage.

The default encoding for both compiler and interpreter is now still ISO-8859-1.

My plan is to release Beta 8 rather soon, with this in place. Beta 9 will then reverse the defaults so that UTF-8 becomes the default, but you will be able to force -charset iso-8859-1.

Sounds reasonable, and will allow us to thoroughly test it via the StdLib repository, which should catch any glitches and edge cases before making UTF-8 the default encoding.

(I might add an alternative compiler option, -encoding, with the same values and effect before Beta 8.)

You mean just an alias?

I really appreciate your great efforts and support!

As you already know, I strongly believe that ALAN will be rediscovered again by IF authors at some point, and that all the hard work on these project will then come to fruition. Interest for IF fiction fluctuates in time, with sudden bursts of interest into the genre being fuelled by an interview, a documentary, a movie, or whatever else might make the topic viral.

With all the new projects reorganizing the documentation, libraries, etc., end users will be more attracted to ALAN, and realize that it's the most actively developed IF system of the present day — beside being beautiful and easy to use.

As developers, maintainers and contributes, our mission is to keep up the good work through thick and thin, regardless of how many people are using ALAN at any given time. Our goals are projected on the long term, for ALAN has already proven its worth and usability during the golden age of IF; now we only need to keep up with the challenges and changes of time and carry on the work. The rest will come by itself.

thoni56 commented 3 years ago

For the interpreters -u will do the same, meaning that all file reading and writing will be using UTF-8 encoding. For non-GLK interpreters command input will also be in UTF-8 encoding. They also support command editing as before but now using mixed length characters (there might still be some glitches).

What do you mean by "support command editing"? the feature of storing player input to an external .a3s files for replaying them via a solution file?

Replay of solution files is of course supported, using the same encoding as everything else, which is UTF-8 if -u is used.

No, what I mean is moving the cursor and deleting characters backwards, forwards, history. This is implemented using raw reads from the terminal if you are not using a GLK-based terp. And this has to be done in "external" encoding, so there's a lot of synchronization between the character buffer, which keeps the bytes, and the cursor position on screen which is of course displaying characters/glyphs.

For a user it is not even obvious that it had to change, but It was a lot more work than I anticipated so I guess that's why I mention it ;-) I won't in release notes, it would just confuse users, like it confused you ;-)

I can imagine it was a huge leap in term of code, some of the third party modules and libraries might have been half a century old, or almost. Hopefully, these changes will also make the ALAN sources more maintainable in the future, with most of the legacy stuff having been revamped to modern usage.

I would not say most but some things have improved...

The default encoding for both compiler and interpreter is now still ISO-8859-1. My plan is to release Beta 8 rather soon, with this in place. Beta 9 will then reverse the defaults so that UTF-8 becomes the default, but you will be able to force -charset iso-8859-1.

Sounds reasonable, and will allow us to thoroughly test it via the StdLib repository, which should catch any glitches and edge cases before making UTF-8 the default encoding.

That would be good testing.

(I might add an alternative compiler option, -encoding, with the same values and effect before Beta 8.)

You mean just an alias?

Yup. But it might be the future "real" option, since it really is "encoding" we are talking about here. "charset" is probably more appropriate to use when it comes to old style with code pages and what not.

I really appreciate your great efforts and support!

As you already know, I strongly believe that ALAN will be rediscovered again by IF authors at some point, and that all the hard work on these project will then come to fruition. Interest for IF fiction fluctuates in time, with sudden bursts of interest into the genre being fuelled by an interview, a documentary, a movie, or whatever else might make the topic viral.

With all the new projects reorganizing the documentation, libraries, etc., end users will be more attracted to ALAN, and realize that it's the most actively developed IF system of the present day — beside being beautiful and easy to use.

As developers, maintainers and contributes, our mission is to keep up the good work through thick and thin, regardless of how many people are using ALAN at any given time. Our goals are projected on the long term, for ALAN has already proven its worth and usability during the golden age of IF; now we only need to keep up with the challenges and changes of time and carry on the work. The rest will come by itself.

Well spoken! I'm not in this because of the huge user base ;-) But it is very satisfying to have at least one person that is more engaged and pushes/suggests things that need to be done. So, again, thanks Tristano!

I'm actaully going to close this issue now!

alan-if / alan

Add Support for UTF-8 Sources and I/O Stream #12

Editors Support

LSP Support

Tools Support

Toolchains Support

Use UTF-8 BOM Instead of CLI Options

Use UTF-8 BOM Instead of CLI Options

Adding Option » Encoding Keyword?

Unicode

Adding `Option` » `Encoding` Keyword?