UTF-8 Support

tsjensen commented 10 years ago

Boxes currently works with ASCII characters only. Well, at least every character must consist of a single byte. Multi-byte character support is becoming increasingly important, so boxes should be upgraded to support UTF-8. (Submitted by David Karafiát on Sat, 28 Jun 2008, and requested many times since.)

tsjensen commented 10 years ago

This would be the most pressing feature request, as more and more text files use multi-byte character sets. It affects the basic operation of the program, though, because boxes does not differentiate between characters and bytes.
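To illustrate the byte/character distinction mentioned above, here is a minimal sketch in C, not taken from the boxes code base. The function name utf8_codepoints is hypothetical, and the sketch assumes its input is valid UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Count code points in a NUL-terminated UTF-8 string by counting every
 * byte that is NOT a continuation byte (continuation bytes look like
 * 10xxxxxx). Hypothetical helper; assumes the input is valid UTF-8. */
size_t utf8_codepoints(const char *s)
{
    size_t count = 0;
    const unsigned char *p;

    for (p = (const unsigned char *)s; *p != '\0'; ++p) {
        if ((*p & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "B\xC3\xB6x" (the word "Böx" encoded as UTF-8), strlen() reports 4 bytes while utf8_codepoints() reports 3 code points, which is exactly the mismatch that would throw off padding in a byte-oriented program.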

lorenzogatti commented 9 years ago

Unicode support would allow use of box-drawing characters: an important feature. The configuration file can just evolve to UTF-8; for input and output, availability of iconv(1) or the like can be assumed (allowing boxes to consume and produce UTF-8 only).

hildred commented 9 years ago

I propose adding a config option charwidth with the following possible values:

ascii (default): 7-bit characters, one per byte; the high bit must be 0.
8bit: for fixed-width national charsets like Latin-1.
16bit: for fixed-width wide characters (a 32-bit variant might be useful somewhere, but I don't know where).
utf8: variable-width character encoding (utf16 may be useful on Windows).

The advantage of this is that we can ignore the charset (let the user worry about that) as long as the width in bytes works. This also allows us to work in filters for different charsets by just changing the config file.
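A rough sketch of how the proposed charwidth values could be represented in C. Everything here (the enum, parse_charwidth, charwidth_fixed_bytes) is hypothetical and only mirrors the four values listed above; boxes has no such option.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical representation of the proposed "charwidth" setting. */
typedef enum {
    CW_INVALID = -1,
    CW_ASCII,   /* 7-bit characters, high bit must be 0 */
    CW_8BIT,    /* fixed width, one byte per character  */
    CW_16BIT,   /* fixed width, two bytes per character */
    CW_UTF8     /* variable width                       */
} charwidth_t;

/* Map a config value to the enum; a missing value falls back to ascii,
 * which the proposal names as the default. */
charwidth_t parse_charwidth(const char *value)
{
    if (value == NULL || strcmp(value, "ascii") == 0) return CW_ASCII;
    if (strcmp(value, "8bit") == 0)  return CW_8BIT;
    if (strcmp(value, "16bit") == 0) return CW_16BIT;
    if (strcmp(value, "utf8") == 0)  return CW_UTF8;
    return CW_INVALID;
}

/* Bytes per character for the fixed-width settings, 0 otherwise. */
size_t charwidth_fixed_bytes(charwidth_t cw)
{
    switch (cw) {
    case CW_ASCII:
    case CW_8BIT:  return 1;
    case CW_16BIT: return 2;
    default:       return 0;  /* variable width or invalid */
    }
}
```

The 0 returned for utf8 is where the fixed-width shortcut stops working: the number of characters can no longer be derived from the byte count alone.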

tsjensen commented 9 years ago

Interesting idea. Since UTF-8 will probably be the most popular charset in the near future, how do you propose the variable width character encoding be handled? It seems difficult to me ... how would boxes tell the number of characters without knowing the exact encoding?

hildred commented 9 years ago

The char width would be specific to each box type. Under Unix, check the locale variables (LC_*) for the string utf8: if it is present, the default is variable width; otherwise the default is 8bit. Adding options to override this on the command line allows the user to specify the character width of the text being boxed. A mismatch in character widths causes the box type not to be scanned. ASCII boxes would of course work with both 8-bit and variable-width encodings.
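The locale check suggested above could look roughly like the following sketch. The helper names are hypothetical. Consulting LC_ALL, then LC_CTYPE, then LANG follows the usual POSIX precedence for the character-type category, and accepting both "utf8" and "utf-8" in any letter case is an assumption added here because locale names are spelled both ways.

```c
#include <ctype.h>
#include <stddef.h>
#include <stdlib.h>

/* Return 1 if `value` contains "utf8" or "utf-8" (case-insensitive). */
int names_utf8(const char *value)
{
    const char *p;

    if (value == NULL) {
        return 0;
    }
    for (p = value; *p != '\0'; ++p) {
        /* The && chain short-circuits, so we never read past the NUL. */
        if (tolower((unsigned char)p[0]) == 'u' &&
            tolower((unsigned char)p[1]) == 't' &&
            tolower((unsigned char)p[2]) == 'f') {
            const char *q = p + 3;
            if (*q == '-') {
                ++q;
            }
            if (*q == '8') {
                return 1;
            }
        }
    }
    return 0;
}

/* Inspect the first locale variable that is set and non-empty. */
int locale_wants_utf8(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    size_t i;

    for (i = 0; i < sizeof vars / sizeof vars[0]; ++i) {
        const char *v = getenv(vars[i]);
        if (v != NULL && *v != '\0') {
            return names_utf8(v);
        }
    }
    return 0;  /* nothing set: fall back to the 8bit default */
}
```

A command-line override, as proposed, would simply take precedence over the result of locale_wants_utf8().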

You could also assume that the box charset is a subset of the stream charset whenever the width is smaller. This always works with ASCII, Latin-1, 16-bit Unicode, and 32-bit Unicode, and may or may not work with other charsets, but as I said, that is the user's issue. UTF-8 and UTF-16 would both be assumed to be at least 32 bits wide for comparison.

I also just had the idea that we could change the config file check order to check for charset-specific configuration files first when the LC_ variables are set, i.e. /etc/boxes/boxes-config.charset, /etc/boxes/boxes-config, ~/.boxes.charset, ~/.boxes.
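As a sketch of that lookup order, the function below builds the candidate list in the order given above. The function name, the fixed-size buffers, and passing the home directory and charset as parameters are assumptions made for illustration; this is not how boxes currently locates its config file.

```c
#include <stddef.h>
#include <stdio.h>

#define MAX_CANDIDATES 4
#define MAX_PATH_LEN   256

/* Fill `out` with config file candidates in the proposed order and
 * return how many were written. `charset` may be NULL or empty, in
 * which case only the generic file names are produced. */
int config_candidates(const char *home, const char *charset,
                      char out[MAX_CANDIDATES][MAX_PATH_LEN])
{
    int n = 0;
    int have_charset = (charset != NULL && *charset != '\0');

    if (have_charset) {
        snprintf(out[n++], MAX_PATH_LEN, "/etc/boxes/boxes-config.%s", charset);
    }
    snprintf(out[n++], MAX_PATH_LEN, "%s", "/etc/boxes/boxes-config");
    if (home != NULL) {
        if (have_charset) {
            snprintf(out[n++], MAX_PATH_LEN, "%s/.boxes.%s", home, charset);
        }
        snprintf(out[n++], MAX_PATH_LEN, "%s/.boxes", home);
    }
    return n;
}
```

The caller would try each candidate in turn and use the first file that exists.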

tsjensen commented 9 years ago

Hm... not sure. The files as well as the input consist of bytes which must be interpreted according to a character encoding in order to be converted into characters. One character in UTF-8 may consist of 1 to 4 bytes (up to 6 in the original design, before RFC 3629 restricted it), so there is really nothing much that we can assume in a variable width scenario. I think we'll have to work with the character encodings because of that, but it seems that there is quite a bit of support from the compiler suite already (see this stackoverflow post).
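To make the variable-width point concrete: in UTF-8, the first byte of each sequence determines how many bytes the character occupies. The sketch below assumes the RFC 3629 form of UTF-8, in which sequences are at most 4 bytes long; the function name is hypothetical.

```c
#include <stddef.h>

/* Length in bytes of the UTF-8 sequence that starts with `lead`, or 0
 * if `lead` cannot start a sequence (continuation byte, overlong lead
 * bytes 0xC0/0xC1, or values above 0xF4). Assumes RFC 3629 UTF-8. */
size_t utf8_seq_len(unsigned char lead)
{
    if (lead < 0x80)                  return 1;  /* 0xxxxxxx: ASCII */
    if (lead >= 0xC2 && lead <= 0xDF) return 2;  /* 110xxxxx */
    if (lead >= 0xE0 && lead <= 0xEF) return 3;  /* 1110xxxx */
    if (lead >= 0xF0 && lead <= 0xF4) return 4;  /* 11110xxx */
    return 0;
}
```

This only classifies the lead byte. A real decoder also has to validate the continuation bytes, which is one reason to leave the job to a library.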

hildred commented 9 years ago

The only assumptions I was recommending in a variable-width setting are that the charsets in different representations have the same code points, and that you have one code point per character (not per byte). So if you have a file from Windows in UTF-16 fed in on standard input and a box drawn using an umlaut from Latin-1 in your config file, we put out UTF-16 by prepending 0 to the Latin-1 umlaut, because Latin-1 is assumed to be a proper subset of UTF-16 even though UTF-16 may have characters wider than 16 bits. If the user is using Latin-2, that is his lookout; he needs to use iconv either on the config file or on stdin/stdout as needed.
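The "prepend 0" step works because every Latin-1 byte value equals the Unicode code point of the same character (U+0000 to U+00FF). A minimal sketch, with a hypothetical function name, producing 16-bit code units in host byte order (writing them out as UTF-16LE or UTF-16BE is a separate step not shown here):

```c
#include <stddef.h>
#include <stdint.h>

/* Widen Latin-1 bytes to UTF-16 code units by zero-extension.
 * Converts at most `out_cap` characters and returns the number written. */
size_t latin1_to_utf16(const unsigned char *in, size_t in_len,
                       uint16_t *out, size_t out_cap)
{
    size_t n = in_len < out_cap ? in_len : out_cap;
    size_t i;

    for (i = 0; i < n; ++i) {
        out[i] = in[i];  /* high byte becomes 0 */
    }
    return n;
}
```

The same trick does not work for Latin-2 or other 8-bit charsets, whose upper half does not coincide with Unicode code points. That is the case the comment above delegates to iconv.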

The assumption of one code point per character is broken in two ways (but dealing with it properly is a huge ball of wax). Firstly, combining characters use two code points to draw one character. Secondly, Asian and Middle Eastern languages have characters that are very wide (not just in bits: they won't fit in a character box when using a fixed-width font). I recommend ignoring the second issue until there is a consensus about double-wide characters in fixed-width fonts, and ignoring the first until we have a working prototype.

tsjensen commented 9 years ago

The fact that input encoding, output encoding, and config file encoding may differ is not a problem in my opinion, because we can configure them separately. For internal processing, a reasonable superset must be defined, for example UTF-16. We should leave the intricacies of converting sequences of bytes to code points / characters to some specialized (standard) library, and not program this on our own. As you say, that would be a huge ball of wax.

tsjensen commented 8 years ago

Since this issue has become the most prominent enhancement request by far, which is critical to the continued usefulness of boxes in the future, I hope to be able to address it this year. The current idea for implementation is to use a library (possibly ICU) for handling the encodings. Other recommendations welcome, especially ones with a low footprint.

tsjensen commented 8 years ago

ICU turned out to be too complex to handle on MinGW, which is the main environment on Windows. In fact, in the end, it did not work. So I am now investigating libiconv/libunistring, which come bundled with MinGW and are also widely supported. libunistring does not support regular expressions, but it seems that feature could be added by PCRE. So when this issue is implemented, boxes will depend on some third-party libraries.

grepsuzette commented 8 years ago

I was just going to propose UTF-8 support, glad to see it's already scheduled. Iconv does a good job at guessing a file's encoding on Linux, and I'd also say it's a better approach than relying on environment variables.

livibetter commented 7 years ago

Note: I haven't read any boxes code/implementations, nor the comments above, thoroughly. And I don't have much knowledge about locales and encodings. In case I am wrong, and I probably somewhat am, C support for Unicode and UTF-8 is a good resource.

@tsjensen I think wchar_t for internal processing could be a good choice, one wchar_t is one character, simple as that and it's a standard datatype, as long as it can get the correct conversion from input and to output.

With wchar_t you don't need to worry about counting characters (no byte-versus-encoding juggling), but there is another related issue: character columns, meaning not the character width in bytes but the visual width. For example, CJK characters have a column width of 2. That means each character is as wide as two Latin letters as seen in a terminal (or with fixed-width/programming fonts).

From a quick test, I can see that character columns are currently not considered. That means that when boxing multi-line comments with /* */, lines containing characters whose column width != 1 would be padded with the wrong number of spaces, and the result would not be a perfect rectangle.

However, the solution is simple: one wcswidth(3) call (to replace the current strlen? As I said, I haven't read the boxes code) gets the correct number of columns of a wchar_t string. (Although the tab character and other non-printable characters could be an issue for that function.)

Frankly, I think trying to tackle the locale/encoding is like trying to defuse a land mine while standing on it. I would suggest that boxes not care about specific encodings: the user should run boxes under a locale with a Unicode encoding like UTF-8 and feed it input in the same encoding. With that, mbstowcs(3) should have no trouble converting at all. But perhaps a lot of people still have to box text/comments in different encodings.
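A sketch of how the two calls mentioned in this comment, mbstowcs(3) and wcswidth(3), could be combined to measure the display width of a line. The function name display_columns is hypothetical, and the result depends entirely on the active locale, so this is an illustration rather than a portable solution.

```c
#define _XOPEN_SOURCE 700  /* needed for the wcswidth() declaration */
#include <stdlib.h>
#include <wchar.h>

/* Number of terminal columns occupied by the multibyte string `mbs` in
 * the current locale. Returns -1 if the string cannot be converted or
 * contains a non-printable character such as a tab. */
int display_columns(const char *mbs)
{
    size_t n = mbstowcs(NULL, mbs, 0);  /* length in wide characters */
    wchar_t *wcs;
    int cols;

    if (n == (size_t)-1) {
        return -1;  /* invalid multibyte sequence for this locale */
    }
    wcs = malloc((n + 1) * sizeof *wcs);
    if (wcs == NULL) {
        return -1;
    }
    mbstowcs(wcs, mbs, n + 1);
    cols = wcswidth(wcs, n);
    free(wcs);
    return cols;
}
```

The program would have to call setlocale(LC_ALL, "") at startup so that the user's locale (for example a UTF-8 one) is in effect; without it the default "C" locale applies and non-ASCII input may fail to convert. The -1 returned for tabs is the limitation noted above, so tabs would need to be expanded before measuring.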

tsjensen commented 7 years ago

Contrary to my earlier belief, I did not find the time to deal with this in the past year. So - anyone willing to tackle this? Help is much appreciated! This is certainly the most pressing feature request, and also a lot of work, because it will make boxes depend on external libraries and the whole code base must be refactored (although the code base is not huge). However, in my humble opinion, it is absolutely doable, nothing like "trying to defuse a land mine while standing on it". If you are seriously considering working on this, we should probably have a little conversation first. Not all requirements are immediately obvious from the code, such as compatibility with 20+ platforms etc. You may want to read at least my comments in this thread for starters. :smile: Btw, wchar_t does not look so good, but libunistring offers good alternatives (manual).

livibetter commented 7 years ago

@tsjensen I might have been slightly exaggerating with that "landmine", alright, overly. (Admittedly, after replying, I did think about coming back to edit it to tone it down a bit or like 10 boxes)

But considering boxes has to support 20+ platforms, and even if wchar_t does not look so good in those references, the need to handle encodings inside boxes surely complicates things and the coding, even with a cross-platform library.

I don't have Windows or anything other than Linux, but as far as I've read, wchar_t is compiler-specific, so I am not really sure whether it does have problems in terms of width, since boxes seems to use GCC through MinGW. But there is definitely something more that I don't know about; perhaps those string functions come from the target system's API, which is where the problems are.

Nonetheless, at this moment, I barely have any knowledge about how boxes works, and even if I did, I wouldn't be able to test on anything other than Linux. I am afraid I would not be much help even if I wanted to.

But I think whoever is going to tackle this had better write a small piece of code as a proof of concept first, to make sure that libunistring or whatever alternative does work on all 20+ platforms, before starting to modify the boxes code.