Closed: tsjensen closed this issue 3 years ago
This would be the most pressing feature request, as more and more text files use multi-byte character sets. It affects the basic operation of the program though, because boxes does not differentiate between characters and bytes.
Unicode support would allow use of box-drawing characters: an important feature. The configuration file can just evolve to UTF-8; for input and output, availability of iconv(1) or the like can be assumed (allowing boxes to consume and produce UTF-8 only).
I propose adding a config option charwidth with the following possible values:
The advantage of this is that we can ignore the charset (let the user worry about it) as long as the width in bytes works. This also allows us to work in filters for different charsets by just changing the config file.
Interesting idea. Since UTF-8 will probably be the most popular charset in the near future, how do you propose the variable width character encoding be handled? It seems difficult to me ... how would boxes tell the number of characters without knowing the exact encoding?
The char width would be specific to each box type. Under Unix, check the locale variables (LC_*) for the string utf8: if it is present, the default is variable width; otherwise the default is 8 bit. Adding options to override this on the command line allows the user to specify the character width for the text being boxed. A mismatch in character widths causes the box type not to be scanned. ASCII boxes would of course work with both 8 bit and variable width encodings.
You could also assume that the box charset is a subset of the stream charset whenever the width is smaller (which always works with ASCII, Latin-1, 16 bit Unicode and 32 bit Unicode, and may or may not work with other charsets, but as I said, that is the user's issue). UTF-8 and UTF-16 would both be assumed to be at least 32 bits wide for comparison.
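The locale check proposed here could look roughly like the following. This is only a sketch of the idea, not anything from the boxes code base: the function names names_utf8 and default_is_variable_width are made up for illustration, and the lookup order LC_ALL, LC_CTYPE, LANG is the usual POSIX precedence.

```c
#include <ctype.h>
#include <stdbool.h>
#include <stdlib.h>

/* Hypothetical helper: case-insensitive search for "utf8" or "utf-8"
 * inside a locale string such as "en_US.UTF-8" or "de_DE.utf8". */
bool names_utf8(const char *locale)
{
    if (locale == NULL) {
        return false;
    }
    for (const char *p = locale; *p != '\0'; p++) {
        /* && short-circuits, so we never read past the terminating NUL */
        if (tolower((unsigned char) p[0]) == 'u'
                && tolower((unsigned char) p[1]) == 't'
                && tolower((unsigned char) p[2]) == 'f') {
            const char *q = p + 3;
            if (*q == '-') {
                q++;
            }
            if (*q == '8') {
                return true;
            }
        }
    }
    return false;
}

/* Hypothetical helper: decide the default character width mode.
 * The first variable that is set and non-empty wins. */
bool default_is_variable_width(void)
{
    static const char *const vars[] = { "LC_ALL", "LC_CTYPE", "LANG" };
    for (size_t i = 0; i < sizeof vars / sizeof vars[0]; i++) {
        const char *value = getenv(vars[i]);
        if (value != NULL && *value != '\0') {
            return names_utf8(value);
        }
    }
    return false;   /* nothing set: fall back to 8 bit */
}
```

A command line override, as suggested above, would simply take precedence over the value returned by default_is_variable_width().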
I also just had the idea that we could change the config file check order to check for charset specific configuration files first when the LC_ variables are set. ie /etc/boxes/boxes-config.charset, /etc/boxes/boxes-config, ~/.boxes.charset, ~/.boxes.
Hm... not sure. The files as well as the input consist of bytes which must be interpreted according to a character encoding in order to be converted into characters. One character in UTF-8 may consist of 1 to 4 bytes (the original specification allowed up to 6), so there is really nothing much that we can assume in a variable width scenario. I think we'll have to work with the character encodings because of that, but it seems that there is quite a bit of support from the compiler suite already (see this stackoverflow post).
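To illustrate why byte counts and character counts diverge, here is a minimal sketch that counts code points in a UTF-8 string by skipping continuation bytes (those of the form 10xxxxxx). The function name is hypothetical, and the sketch assumes the input is already valid UTF-8; a real implementation would leave this to a library.

```c
#include <stddef.h>

/* Hypothetical helper: number of code points in a NUL-terminated,
 * valid UTF-8 string. Every byte that is NOT a continuation byte
 * (bit pattern 10xxxxxx) starts a new code point. */
size_t utf8_count_code_points(const char *s)
{
    size_t count = 0;
    for (; *s != '\0'; s++) {
        if (((unsigned char) *s & 0xC0) != 0x80) {
            count++;
        }
    }
    return count;
}
```

For example, the UTF-8 bytes C3 BC (the letter ü) occupy two bytes but count as one code point, which is exactly the difference that a byte-based length does not see.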
The only assumptions I was recommending in a variable width setting are that the charsets in different representations have the same code points, and that you have one code point per character (not per byte). So if you have a file from Windows in UTF-16 fed in on standard input and a box drawn using an umlaut from Latin-1 in your config file, we put out UTF-16 by prepending 0 to the Latin-1 umlaut, because Latin-1 is assumed to be a proper subset of UTF-16, even though UTF-16 may have characters wider than 16 bits. If the user is using Latin-2, that is his lookout; he needs to use iconv either on the config file or on stdin/stdout as needed.
The assumption of one code point per character is broken in two ways (but dealing with it right is a huge ball of wax). Firstly, combining characters use two code points to draw one character. Secondly, Asian and Middle Eastern languages have characters that are very wide (not just in bits: they won't fit in a character box when using a fixed-width font). I recommend ignoring the second issue until there is a consensus about double-wide characters in fixed-width fonts, and ignoring the first until we have a working prototype.
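The "prepend 0" step works because the 256 Latin-1 byte values coincide with the Unicode code points U+0000 to U+00FF, so each byte can be widened to a 16 bit code unit unchanged. A minimal sketch, with a hypothetical function name and ignoring byte order on output:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: widen Latin-1 bytes to UTF-16 code units.
 * 'out' must have room for 'len' elements. Returns the number of
 * code units written, which always equals 'len'. */
size_t latin1_to_utf16(const unsigned char *in, size_t len, uint16_t *out)
{
    for (size_t i = 0; i < len; i++) {
        out[i] = (uint16_t) in[i];   /* high byte becomes 0 */
    }
    return len;
}
```

The same trick does not work for Latin-2 or other 8 bit charsets, because their upper 128 values do not match the Unicode code points, which is why iconv is needed there.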
The fact that input encoding, output encoding, and config file encoding may differ is not a problem in my opinion, because we can configure them separately. For internal processing, a reasonable superset must be defined, for example UTF-16. We should leave the intricacies of converting sequences of bytes to code points / characters to some specialized (standard) library, and not program this on our own. As you say, that would be a huge ball of wax.
Since this issue has become the most prominent enhancement request by far, which is critical to the continued usefulness of boxes in the future, I hope to be able to address it this year. The current idea for implementation is to use a library (possibly ICU) for handling the encodings. Other recommendations welcome, especially ones with a low footprint.
ICU turned out to be too complex to handle on MinGW, which is the main environment on Windows. In fact, in the end, it did not work. So I am now investigating libiconv/libunistring, which come bundled with MinGW and are also widely supported. libunistring does not support regular expressions, but it seems that feature could be added by PCRE. So when this issue is implemented, boxes will depend on some third-party libraries.
I was just going to propose utf-8 support, glad to see it's already scheduled. Iconv does a good job at guessing a file's encoding on Linux, I'd also say it's a better approach than relying on environment variables.
Note: I have not read any boxes code/implementations nor the comments above thoroughly. And I don't have much knowledge about locales and encodings. In case I am wrong, and probably somewhat am, C support for Unicode and UTF-8 is a good resource.
@tsjensen I think wchar_t for internal processing could be a good choice: one wchar_t is one character, simple as that, and it's a standard datatype, as long as it can get the correct conversion from input and to output.
With that, you no longer need to count characters (no more byte-versus-encoding juggling), but there is another related issue: character columns. This is not the character width in bytes, but the visual width. For example, CJK characters have a character column count of 2. That means each such character is as wide as two Latin letters as you see them in a terminal (or with fixed-width/programming-specific fonts).
Currently, from a quick test, I can see that character columns aren't considered. That means that when boxing multi-line comments with /* */, the lines with character columns != 1 would be space-padded short, so the result would not be a perfect rectangular box.
However, the solution is simple: one wcswidth(3) call (to replace the current strlen? As I said, I haven't read the boxes code) can get the correct column count of a wchar_t string. (Although the tab character and other non-printable characters could be an issue for that function.)
Frankly, I think trying to tackle the locale/encoding is like trying to defuse a land mine while standing on it. I would suggest that boxes not care about specific encodings: the user should run boxes under a locale with a Unicode encoding like UTF-8 and feed it input files in the same encoding. With that, mbstowcs(3) should have no trouble converting at all. But perhaps a lot of people still have to box text/comments in different encodings.
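To make the padding problem concrete, here is a small sketch of how the number of padding spaces depends on display columns rather than on character count. It deliberately does not call wcswidth(3), so that it runs without any locale setup; instead it uses a hypothetical approx_columns function whose ranges are a rough illustration only, not the full Unicode width table. All names are made up for this example.

```c
#include <stddef.h>
#include <stdint.h>

/* Rough stand-in for wcwidth(3): 0 columns for combining diacritics,
 * 2 columns for a few common East Asian ranges, 1 otherwise.
 * Illustrative only; real code should use a complete width table. */
int approx_columns(uint32_t cp)
{
    if (cp >= 0x0300 && cp <= 0x036F) {
        return 0;                               /* combining diacritical marks */
    }
    if ((cp >= 0x3040 && cp <= 0x30FF)          /* Hiragana, Katakana */
            || (cp >= 0x4E00 && cp <= 0x9FFF)   /* CJK Unified Ideographs */
            || (cp >= 0xAC00 && cp <= 0xD7A3)   /* Hangul Syllables */
            || (cp >= 0xFF01 && cp <= 0xFF60)) {/* fullwidth forms */
        return 2;
    }
    return 1;
}

/* Display columns of a line given as code points. */
size_t line_columns(const uint32_t *line, size_t len)
{
    size_t cols = 0;
    for (size_t i = 0; i < len; i++) {
        cols += (size_t) approx_columns(line[i]);
    }
    return cols;
}

/* Spaces needed so the line fills 'box_width' columns (0 if too long). */
size_t padding_needed(const uint32_t *line, size_t len, size_t box_width)
{
    size_t cols = line_columns(line, len);
    return cols < box_width ? box_width - cols : 0;
}
```

A line of two CJK ideographs plus one Latin letter has 3 characters but occupies 5 columns, so padding computed from the character count would be two spaces too long.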
Contrary to my earlier belief, I did not find the time to deal with this in the past year. So - anyone willing to tackle this? Help is much appreciated!
This is certainly the most pressing feature request, and also a lot of work, because it will make boxes depend on external libraries and the whole code base must be refactored (although the code base is not huge). However, in my humble opinion, it is absolutely doable, nothing like "trying to defuse a land mine while standing on it".
If you are seriously considering working on this, we should probably have a little conversation first. Not all requirements are immediately obvious from the code, such as compatibility with 20+ platforms etc. You may want to read at least my comments in this thread for starters. :smile:
Btw, wchar_t does not look so good, but libunistring offers good alternatives (manual).
@tsjensen I might have been slightly exaggerating with that "landmine". Alright, overly. (Admittedly, after replying, I did think about coming back to edit it and tone it down a bit, or by like 10 boxes.)
But considering that boxes has to support 20+ platforms, and that even wchar_t does not look so good in those references, the need to handle encodings inside boxes surely complicates things and the coding, even with a cross-platform library.
I don't have Windows or anything other than Linux, but as far as I've read, wchar_t is compiler-specific, so I am not really sure whether it has problems in terms of width, since boxes seems to use GCC through MinGW. But there is definitely something more that I don't know about; perhaps those string functions come from the target system's API, which is where the problems are.
Nonetheless, at this moment, I barely have any knowledge about how boxes works, and even if I did, I wouldn't be able to test on anything other than Linux. I am afraid I am not much help even if I want to be.
But I think whoever is going to tackle this had better write a small piece of code as a proof of concept first, to make sure that libunistring or whatever alternative does work on all 20+ platforms, before starting to modify the boxes code.
Ok, for an issue with a help-wanted tag, it's still too vague. So here's my understanding of the concrete steps:
... -n iso-8859-15).
... real_text which contains the multi-byte, real text. The existing field text is filled with generated content, maybe just a sequence of xs, which represents the number of actual characters in the line. In this way, all the boxes logic can remain unchanged.
... real_text instead of the text, again observing the encoding.
The following would be out of scope, and can be added later:
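The real_text / text idea from the steps above can be sketched as follows. The field names real_text and text come from the comment; the struct layout, the type name shadow_line_t, and the function make_shadow_line are assumptions made for this example, and the input is assumed to be valid UTF-8.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical line record, not the actual boxes data structure. */
typedef struct {
    char *real_text;   /* the line as read, multi-byte (UTF-8 assumed) */
    char *text;        /* placeholder: one 'x' per actual character */
} shadow_line_t;

/* Fill both fields from a UTF-8 string. Returns 0 on success, -1 if
 * memory allocation fails. The caller frees both fields. */
int make_shadow_line(shadow_line_t *line, const char *utf8)
{
    size_t bytes = strlen(utf8);
    size_t chars = 0;
    for (size_t i = 0; i < bytes; i++) {
        /* count every byte that is not a UTF-8 continuation byte */
        if (((unsigned char) utf8[i] & 0xC0) != 0x80) {
            chars++;
        }
    }

    line->real_text = malloc(bytes + 1);
    line->text = malloc(chars + 1);
    if (line->real_text == NULL || line->text == NULL) {
        free(line->real_text);
        free(line->text);
        line->real_text = NULL;
        line->text = NULL;
        return -1;
    }

    memcpy(line->real_text, utf8, bytes + 1);   /* includes the NUL */
    memset(line->text, 'x', chars);
    line->text[chars] = '\0';
    return 0;
}
```

Existing byte-based logic could then keep measuring text, while output would be produced from real_text. Note that this sketch counts code points only and does not account for double-width or combining characters.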
I have started implementing this feature on a branch called unicode-input. At the glacial speeds you have come to expect and appreciate, of course. 🐌 Slow, but inexorable ...
The implementation will be based on libunistring and pcre2, mostly because these libraries (a) solve the problem and (b) are available for MinGW which I need to build the Windows binary. Internal multi-byte representation will be UTF-32, for the pointer arithmetic.
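The point about pointer arithmetic is that in UTF-32 every code point occupies exactly one array element, so the n-th character is simply at index n. The sketch below shows this with a hand-written decoder; it is a stand-in for illustration only (the actual implementation is stated to use libunistring), the function name is hypothetical, and valid UTF-8 input is assumed.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in decoder: convert a NUL-terminated, valid
 * UTF-8 string to UTF-32. 'out' needs room for one element per
 * input byte in the worst case. Returns the number of code points. */
size_t utf8_to_utf32(const char *s, uint32_t *out)
{
    const unsigned char *p = (const unsigned char *) s;
    size_t n = 0;

    while (*p != '\0') {
        uint32_t cp;
        int extra;   /* number of continuation bytes that follow */

        if (*p < 0x80)      { cp = *p;                    extra = 0; }
        else if (*p < 0xE0) { cp = (uint32_t) (*p & 0x1F); extra = 1; }
        else if (*p < 0xF0) { cp = (uint32_t) (*p & 0x0F); extra = 2; }
        else                { cp = (uint32_t) (*p & 0x07); extra = 3; }
        p++;

        /* each continuation byte contributes its low 6 bits */
        while (extra-- > 0 && (*p & 0xC0) == 0x80) {
            cp = (cp << 6) | (uint32_t) (*p & 0x3F);
            p++;
        }
        out[n++] = cp;
    }
    return n;
}
```

After decoding, buf + 1 points at the second character regardless of how many bytes the first one took in UTF-8, which is what makes the existing pointer-based box logic easier to carry over.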
The strategy mostly follows my previous post, except that we won't skip the regex stuff. And also, some deeper changes are required, as the devil is in the details ...
Fingers crossed!
A small milestone is reached:
I like how neatly it aligns the Chinese script 😊
Still lots of work to do though (regular expression handling, box removal/mending, windows port, you name it).
Merged the unicode-input branch onto master as implementation is complete. 👍
Created new branch prepare_2.0.0 where I'll straighten out a few things before the release.
If anyone wants to try out some stuff, that would be great, as any required fixes would be in time to do something about them. 😉
Boxes currently works with ASCII characters only. Well, at least every character must consist of a single byte. Multi-byte character support becomes increasingly important, so boxes should be upgraded to support UTF-8. (submitted by David Karafiát on Sat, 28 Jun 2008, but since requested many times)