dvdhrm / kmscon

Linux KMS/DRM based virtual Console Emulator
http://www.freedesktop.org/wiki/Software/kmscon

Not all characters have same width #58

Closed zsx closed 11 years ago

zsx commented 11 years ago

Chinese characters are usually twice as wide as Latin characters. kmscon currently seems to assume that every character can be rendered in a cell as wide as an ASCII character. As a result, only half of any wide character is shown.

Here is the related code in text_font_pango.c, in manager_getface:

str = "abcdefghijklmnopqrstuvwxyz"
      "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
      "@!\"$%&/()=?}][{°^~+*#'<>|-.:,;`´";
num = strlen(str);
pango_layout_set_text(layout, str, num);
pango_layout_get_pixel_extents(layout, NULL, &rec);

memcpy(&face->real_attr, &face->attr, sizeof(face->attr));
face->real_attr.height = rec.height;
face->real_attr.width = rec.width / num + 1;

and in get_glyph

pango_layout_line_get_pixel_extents(line, NULL, &rec);
glyph->buf.width = face->real_attr.width;
glyph->buf.height = face->real_attr.height;

dvdhrm commented 11 years ago

The real problem is that a terminal has a fixed number of columns. For most historical terminals this was 80 characters per line. That means the size of a single cell is fixed. Whatever is drawn into that cell needs to fit into it. You cannot draw something that spans multiple cells.

Later VT series introduced double-width lines. However, they don't solve the problem either, as they still don't allow characters of mixed width in a single line.

Therefore, kmscon assumes that each character it draws needs to fit into a single cell. As most tasks that you perform with a terminal require the English language, it uses the ASCII characters to compute the cell size.

Most fonts that were created for terminals take that into account and provide glyphs that all have the same width. Unfortunately, that does not seem to be true for CJK characters. They are, as you said, normally twice as wide as ASCII/European glyphs (even in "terminal fonts").

Now, let's assume you want to print CJK and ASCII characters simultaneously. The easiest way would be to double the cell-size and draw the ASCII characters at half the size. This would allow CJK characters to be drawn correctly, but ASCII text would look horrible. So this isn't acceptable.

Another solution would be to scale CJK glyphs down to fit into the ASCII cell-size. Most CJK users would have to choose a bigger font-size, but other than that, you could read both CJK and ASCII. However, the CJK characters would then probably have an empty margin at the top and bottom, because they don't have the same proportions as ASCII glyphs. This obviously looks ugly, too, but at least it is more readable than the first approach.

I could also handle CJK characters as "double-width" cells. However, this is not what any terminal application expects, so this is unacceptable.

I haven't come up with a perfect solution, yet. But I also normally never use CJK characters so I would need some native speaker to help me find a proper solution. If you have a better idea, I would be glad to hear it. I am also interested in whether other terminal emulators (like xterm, gnome-terminal, Konsole) handle this in a proper way so I can check their source code.

Thanks for the report! David

zsx commented 11 years ago

Thanks for your reply. I did a quick and dirty hack which forces every character to be double-width and sets the proper glyph width. The characters are shown fine, but as you've said, it's terrible to read ASCII characters.

I am not sure if xterm can handle this or not, but gnome-terminal does this perfectly (double-width for CJK characters and single-width for ASCII characters), so you might want to take a look at libvte's source code.

zsx commented 11 years ago

I took some time to look into how libvte handles this. The strategy looks simple: each screen has a fixed number of cells, and each cell has these attributes: {fragment, columns}. If a character can't fit into one column/cell, it is represented by multiple consecutive cells. The leading cell has the attributes {fragment = 0, columns = columns_it_occupies}, and all following cells have the attribute {fragment = 1}. Whether a character fits into one column or not is specified in the Unicode standard (libvte calls g_unichar_iswide() for this purpose).

With these attributes, the render function renders cell by cell. If it sees a leading cell, it renders the character; if it sees a fragment, it just moves the cursor to the next position.
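The scheme described above can be sketched roughly as follows. This is a minimal illustration, not libvte's actual data structure: the names `struct cell`, `put_char` and `render_line` are made up for this example, and the convention that an empty cell has `columns == 0` is an assumption of the sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cell layout modelled on the {fragment, columns} description;
 * this is not libvte's real struct. */
struct cell {
    uint32_t ch;        /* code point, stored in the leading cell only */
    unsigned fragment;  /* 1 if this cell continues a wider character */
    unsigned columns;   /* cells occupied; set in the leading cell only */
};

/* Store ch at column x, spanning `width` cells.
 * Returns the next free column, or x unchanged if the character does not fit. */
static size_t put_char(struct cell *line, size_t cols, size_t x,
                       uint32_t ch, unsigned width)
{
    if (width == 0 || x + width > cols)
        return x;

    line[x].ch = ch;
    line[x].fragment = 0;
    line[x].columns = width;

    /* Mark the remaining cells as fragments of the leading cell. */
    for (unsigned i = 1; i < width; i++) {
        line[x + i].ch = 0;
        line[x + i].fragment = 1;
        line[x + i].columns = 0;
    }
    return x + width;
}

/* Walk a line cell by cell: draw leading cells, skip fragments and
 * empty cells. Returns the number of glyphs drawn. */
static size_t render_line(const struct cell *line, size_t cols,
                          void (*draw)(size_t x, const struct cell *c))
{
    size_t drawn = 0;

    for (size_t x = 0; x < cols; x++) {
        if (line[x].fragment || line[x].columns == 0)
            continue;
        if (draw)
            draw(x, &line[x]);
        drawn++;
    }
    return drawn;
}
```

A renderer would pass a `draw` callback that blits a glyph `columns` cells wide at column `x`; passing NULL only counts the glyphs.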

Of course, selection, cursor movement, and line wrapping also need to change for this.

dvdhrm commented 11 years ago

Thanks a lot for the detailed description. I also found the "wcwidth()" function which tells us the width of a character. So you're right, we simply have to span these characters across multiple cells.
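As a rough sketch of how wcwidth() could feed the cell count: the helper name `cell_width` is hypothetical, and mapping non-printable and zero-width characters to one cell is a simplification made for this example, not something decided in this thread. Note that wcwidth() results depend on the current locale, so a real program would normally call setlocale(LC_ALL, "") first.

```c
#define _XOPEN_SOURCE 700  /* exposes wcwidth() in <wchar.h> on glibc */
#include <wchar.h>

/* Hypothetical helper: number of cells a character should occupy. */
static int cell_width(wchar_t ch)
{
    int w = wcwidth(ch);

    /* wcwidth() returns -1 for non-printable characters and 0 for
     * zero-width ones. This sketch maps both to a single cell. */
    return w > 0 ? w : 1;
}
```

In a UTF-8 locale this is expected to report two cells for most CJK ideographs and one for ASCII letters; in the plain "C" locale, non-ASCII characters typically fall back to one cell here.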

However, there are some corner-cases regarding line-editing: What to do if the last character is a multi-cell character? What to do if the user places the cursor at the right side of a multi-cell character? What to do if the user erases only one half of the multi-cell character? And so on.

I have a text file with several Unicode characters in ./docs/unicode-test.txt, and when editing it with vim in kmscon I can guess some of the answers to the previous questions. However, I'd prefer a document where this is described thoroughly (although I doubt that one exists).

It looks like it is really easy to adjust ./src/tsm_screen.c to deal with it and then make the ./src/text*.c renderers aware of it. However, please bear with me if this takes some time.

Thanks for your investigation! David

zsx commented 11 years ago

What to do if the last character is a multi-cell character?

I've tested this on gnome-terminal. If the multi-cell character doesn't fit into the line and linewrap is enabled, it moves to the next line and the rest of the current line is filled with a space. I think this is what the user wants.
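The wrapping behavior observed here could be sketched like this. The names `struct cursor` and `wrap_for_width` are invented for the example, scrolling at the bottom of the screen is left out, and the sketch assumes the cursor column never exceeds the line width.

```c
#include <stddef.h>

/* Hypothetical cursor state; names are illustrative only. */
struct cursor {
    size_t x;  /* column, 0-based */
    size_t y;  /* row, 0-based */
};

/*
 * Decide where a character of `width` cells goes on a line of `cols` cells.
 * If it does not fit into the remaining cells and linewrap is enabled, the
 * cursor moves to the start of the next line. The return value is the number
 * of leftover cells on the old line that the caller should fill with spaces.
 */
static size_t wrap_for_width(struct cursor *cur, size_t cols,
                             size_t width, int linewrap)
{
    size_t pad = 0;

    if (linewrap && cur->x + width > cols) {
        pad = cols - cur->x;
        cur->x = 0;
        cur->y++;
    }
    return pad;
}
```

For an 80-column line with the cursor in the last column, a two-cell character yields one padding cell and lands at the start of the next row.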

What to do if the user places the cursor at the right side of a multi-cell character?

I assume you meant the left side? If so, the whole multi-cell character to the right should be shown in inverse color (as if selected).

What to do if the user erases only one half of the multi-cell character?

They shouldn't be able to erase part of a character. Because kmscon can figure out how wide the character is, it should erase the whole character for the user.

The same goes for cursor movement: you can't move the cursor onto half of a multi-cell character.

In simple words, all user visible operations should be based on characters instead of cells.

dvdhrm commented 11 years ago

In simple words, all user visible operations should be based on characters instead of cells.

That's the problem, I don't think this is true. For instance moving a cursor is still based on cells, not characters. The application has to correctly address character-boundaries now. If it does otherwise, kmscon should just do the best it can, but behavior is probably undefined. I tested some more and this really seems to be the case. And it makes sense because it makes the whole stuff more backwards compatible to single-cell characters.

And it makes it a lot easier to implement. So the only thing we need to do is write a multi-cell character to multiple cells and fix the selection algorithm. Everything else basically stays the same. The only thing missing is how multi-cell characters are destroyed. But that's probably not that complicated, either.
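One possible policy for destroying multi-cell characters, sketched under assumptions: erasing either half clears the whole character and leaves blank single-width cells behind. The thread does not settle this, so both the policy and the names (`struct cell`, `erase_char_at`) are hypothetical and do not describe what TSM actually does.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical cell layout, using leading/fragment cells as discussed. */
struct cell {
    uint32_t ch;        /* code point, stored in the leading cell */
    unsigned fragment;  /* 1 if this cell continues a wider character */
    unsigned columns;   /* cells occupied; set in the leading cell only */
};

/* Clear the whole character covering column x, whichever half x points at.
 * Returns the number of cells that were blanked. */
static size_t erase_char_at(struct cell *line, size_t cols, size_t x)
{
    if (x >= cols)
        return 0;

    /* Step back from a continuation cell to its leading cell. */
    while (x > 0 && line[x].fragment)
        x--;

    size_t n = line[x].columns ? line[x].columns : 1;
    if (x + n > cols)
        n = cols - x;

    /* Replace every cell of the character with a blank single-width cell. */
    for (size_t i = 0; i < n; i++) {
        line[x + i].ch = ' ';
        line[x + i].fragment = 0;
        line[x + i].columns = 1;
    }
    return n;
}
```

The point of the design is that no orphaned fragment cell is ever left on the line, so the renderer never has to deal with half a glyph.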

zsx commented 11 years ago

For instance moving a cursor is still based on cells, not characters.

Hmm, imagine a user types some multi-cell characters on the command line and wants to change some of them, so they move the cursor back. I don't think they expect to see the cursor move cell by cell, or to insert something in the middle of a multi-cell character. Or do you expect bash/sh/whatever to send multiple cursor-back commands for each multi-cell character in this case?

dvdhrm commented 11 years ago

Yes, I expect bash to send two "cursor-back" commands in this case. It seems to be the current behavior so I have to implement it this way.
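From the application's side, this means cursor movement has to be counted in cells rather than characters. A minimal sketch, assuming the application already knows each character's cell width (for example from wcwidth()); the function names are hypothetical, and what bash or readline actually emits has not been checked here. The escape sequence used is the standard CUB (cursor backward) control, `ESC [ n D`.

```c
#include <stddef.h>
#include <stdio.h>

/* Sum the cell widths of the last n_chars characters of a line.
 * widths[i] is the cell width of character i. */
static size_t cells_to_move_back(const int *widths, size_t len, size_t n_chars)
{
    size_t cells = 0;

    if (n_chars > len)
        n_chars = len;
    for (size_t i = len - n_chars; i < len; i++)
        cells += (size_t)widths[i];
    return cells;
}

/* Format a CUB sequence moving the cursor back by `cells` cells.
 * Returns the value of snprintf(), i.e. the length of the sequence. */
static int format_cursor_back(char *buf, size_t size, size_t cells)
{
    return snprintf(buf, size, "\033[%zuD", cells);
}
```

Moving back over one double-width character therefore takes two cells, which matches the "two cursor-back commands" expectation above.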

dvdhrm commented 11 years ago

I fixed the font-renderers and console-renderers to correctly render multi-cell glyphs. After that I pushed commit 03aab2b54b115b8f01856a24d3a00eb7a9f12e14 which implements basic multi-cell character support in TSM.

I have still some issues to work out, but it works for me. Could you give it a try?

Thanks for your input! David

zsx commented 11 years ago

Thanks David. It worked fine in my limited testing. To be exact, I only tested editing with vim and showing doc/unicode-test.txt on the screen. I have no way to input a multi-cell character, so I didn't test backspace or cursor-back.

dvdhrm commented 11 years ago

Perfect. If further issues show up, simply file new bug reports and I will tackle them. Thanks for testing it!