gchp / iota

A terminal-based text editor written in Rust
MIT License

Fix unicode support #69

Open gchp opened 9 years ago

gchp commented 9 years ago

Since #50 was merged, Unicode support is broken. @P1start mentioned in the comments that fixing this shouldn't be too involved.

suhr commented 9 years ago

There's actually two issues:

crespyl commented 9 years ago

#95 (specifically https://github.com/crespyl/iota/commit/e643737a449851aa068f9e7a5fca8a528d7181b5) has some changes that should hopefully fix unicode rendering (it seems to work for the minimal cases in the buffer.rs tests section).

I'm not sure what to do about input; does termbox work with unicode in the first place, and might we need to fix rustbox?

gchp commented 9 years ago

From @crespyl on Gitter:

Due to the nature of UTF-8, the nth char in a buffer is not necessarily at the nth byte. It should be possible to use something like self.chars().indices().take(n).last().map(|(byte_index, character)| byte_index) to correctly handle multi-byte characters.

Related to cursor movement over multi-byte characters.
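
For reference, here is a minimal sketch of that idea against today's standard library. The function name `byte_index_of_char` is made up for illustration and is not part of iota. It assumes the current `str::char_indices` method, and it uses `nth(n)` so that `n` is a 0-based character position. Note that it counts `char`s (Unicode scalar values), not graphemes, so combining sequences would still need separate handling.

```rust
/// Returns the byte offset of the `n`-th character (0-based) in `text`,
/// or `None` if `text` has `n` or fewer characters.
///
/// Hypothetical helper for illustration; not taken from iota's source.
fn byte_index_of_char(text: &str, n: usize) -> Option<usize> {
    // `char_indices` yields (byte_offset, char) pairs, so the offset
    // accounts for any multi-byte characters that come before position `n`.
    text.char_indices().nth(n).map(|(byte_index, _)| byte_index)
}
```

With the string "héllo", the character at position 2 (the first `l`) starts at byte 3, because `é` takes two bytes in UTF-8. The lookup is linear in the length of the text, which may matter for large buffers.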

ghost commented 9 years ago

I've been messing around with trying to add unicode support, and it is turning out to be complicated. The biggest problem I have found is that termbox expects each cell to be a single codepoint, even though there sometimes needs to be multiple codepoints per cell. It probably wouldn't be too hard to modify termbox to store each cell as an array of chars rather than a single char, although it would take away some of the simplicity of the library. And, of course, UIBuffer would have to do this as well.

I think that some problems could be solved by using iterators over cells (where 1 cell = 1 character width) rather than over bytes, chars, or graphemes. For example, an iterator yielding Option<&str> which, for each grapheme, yields the grapheme first and then yields None for each extra character-width the grapheme takes up.
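
A rough sketch of such a cell iterator, using only the standard library, might look like the following. The name `cells` and both of its parameters are hypothetical. Grapheme segmentation and display width are left to the caller, on the assumption that they would come from something like the unicode-segmentation and unicode-width crates.

```rust
use std::iter;

/// Expands graphemes into terminal cells: each grapheme is yielded once as
/// `Some(grapheme)`, followed by one `None` for every extra column it covers.
///
/// `graphemes` is an already-segmented sequence and `width` reports how many
/// columns a grapheme occupies. Both are hypothetical parameters supplied by
/// the caller; neither comes from iota or termbox.
fn cells<'a, I, W>(graphemes: I, width: W) -> impl Iterator<Item = Option<&'a str>>
where
    I: Iterator<Item = &'a str>,
    W: Fn(&str) -> usize,
{
    graphemes.flat_map(move |g| {
        // A grapheme of width 0 or 1 produces no padding cells.
        let extra = width(g).saturating_sub(1);
        iter::once(Some(g)).chain(iter::repeat(None).take(extra))
    })
}
```

Given the graphemes "a", "漢", "b" and a width function that reports 2 for "漢" and 1 otherwise, the iterator yields `Some("a")`, `Some("漢")`, `None`, `Some("b")`, so every item corresponds to exactly one column on screen.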

I'm guessing it would be easiest to have Buffer be an abstraction layer for all the byte-level stuff and let every other part of the code deal in characters/graphemes. This would, of course, require heavy changes to the interface of Buffer... but so would changing the data structure backing it, which might inevitably happen anyways.
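
To make the abstraction-layer idea concrete, here is a small sketch of a buffer type that keeps byte offsets private and exposes only character positions. `TextBuffer` and all of its methods are invented for this example and do not reflect iota's actual Buffer interface. It also indexes by `char` rather than by grapheme to keep the example short.

```rust
/// Hypothetical buffer that hides byte offsets behind a char-indexed API.
struct TextBuffer {
    text: String,
}

impl TextBuffer {
    fn new(text: &str) -> Self {
        TextBuffer { text: text.to_string() }
    }

    fn as_str(&self) -> &str {
        &self.text
    }

    /// Number of characters (not bytes) in the buffer.
    fn len_chars(&self) -> usize {
        self.text.chars().count()
    }

    /// Private: maps a character position to a byte offset.
    /// Positions past the end map to the end of the text.
    fn byte_offset(&self, char_idx: usize) -> usize {
        self.text
            .char_indices()
            .nth(char_idx)
            .map(|(i, _)| i)
            .unwrap_or(self.text.len())
    }

    /// Inserts `ch` before the character at `char_idx`.
    fn insert_char(&mut self, char_idx: usize, ch: char) {
        let at = self.byte_offset(char_idx);
        self.text.insert(at, ch);
    }

    /// Removes and returns the character at `char_idx`, if there is one.
    fn remove_char(&mut self, char_idx: usize) -> Option<char> {
        if char_idx >= self.len_chars() {
            return None;
        }
        let at = self.byte_offset(char_idx);
        Some(self.text.remove(at))
    }
}
```

Callers only ever pass character positions, so swapping the backing `String` for another data structure later would not change their code. The linear scans here are a simplification; a real buffer would likely want something faster.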

In summary, it seems like an implementation of unicode support could start from two places: termbox and Buffer.

Fixing the display of @suhr's example text wasn't too hard, but the fix shows why it is probably important not to make code outside of Buffer deal with data on the byte level.

[see spaghetti code here] [see screen shot here]