Mon-Ouie / coolline

Simple readline-like tool able to change representation of input

Unicode bugs #15

Open epitron opened 11 years ago

epitron commented 11 years ago

I just tested out editing UTF-8 in coolline, and it doesn't seem to work properly.

@pos appears to be counting bytes, not chars, which puts the cursor way off in space. Predictably, backspace removes a byte at a time. I assume everything will exhibit this behaviour. :)
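To make the bytes-versus-characters symptom concrete, here is a minimal plain-Ruby sketch. It does not use coolline at all; the variable names are made up for illustration, and the "backspace" is simulated with string slicing rather than with coolline's own editing code.

```ruby
line = "a\uFE35" # "a" followed by ︵ (U+FE35, three bytes in UTF-8)

char_count = line.length   # counts characters
byte_count = line.bytesize # counts bytes

# Character-wise backspace: drop the last character.
after_char_backspace = line[0...-1]

# Byte-wise backspace: drop only the last byte, which leaves a
# truncated multi-byte sequence behind.
after_byte_backspace = line.b[0...-1].force_encoding(Encoding::UTF_8)

puts "chars=#{char_count} bytes=#{byte_count}"
puts "char backspace still valid? #{after_char_backspace.valid_encoding?}"
puts "byte backspace still valid? #{after_byte_backspace.valid_encoding?}"
```

A cursor position derived from the byte count would land two columns further right than one derived from the character count for this string.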

Here's a good piece of Unicode for testing: ┻━┻ ︵ヽ(`Д´)ノ︵ ┻━┻

Mon-Ouie commented 11 years ago

I'm not entirely sure, because it does work with certain characters (for example été, π, λ). I don't know what is different about your string.

All the indices should be in characters provided Ruby knows the proper encoding of the string, since string manipulation functions are character based as of 1.9.
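The point about Ruby needing to know the encoding can be checked in isolation. The sketch below is plain Ruby, not coolline code, and the variable names are illustrative: the same five bytes are measured under two different encoding tags.

```ruby
utf8 = "\u00E9t\u00E9" # "été", tagged UTF-8

# Same bytes, but tagged ASCII-8BIT (binary): indices become byte based.
binary = utf8.b

# Re-tagging the bytes as UTF-8 restores character-based indexing.
retagged = binary.dup.force_encoding(Encoding::UTF_8)

puts "#{utf8.encoding}: length #{utf8.length}"
puts "#{binary.encoding}: length #{binary.length}"
puts "#{retagged.encoding}: length #{retagged.length}"
```

So string methods are character based only as long as the string carries the right encoding tag; a binary-tagged copy of the same text behaves as a byte array.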

In your case, the problematic characters seem to be ︵ヽノ︵.

epitron commented 11 years ago

Hmm! Okay, thanks for testing. I've narrowed the problem down -- for some reason, when I run a script as an executable using #!/usr/bin/env ruby, default_line="" becomes ASCII-8BIT encoded. If I run ruby scriptname, it uses UTF-8.

This is a bit weird -- I'm not really sure whose fault this is. :)

epitron commented 11 years ago

Wait.. OMG, this is so weird.

Okay, so, it has nothing to do with the script being executed like a binary.

When I first run readline, the encoding is UTF-8. When I paste " ︵ヽノ︵", the encoding becomes ASCII-8BIT.

Something stinks here. :)

Here's my test script (Alt-E prints encoding/length):

#!/usr/bin/env ruby
require 'coolline'

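cool = Coolline.new do |c|
  # Alt-E: print the current line's size (in characters) and its encoding
  c.bind "\ee" do |c2|
    p [c2.line.size, c2.line.encoding]
  end
end

# Read one line interactively; press Alt-E while editing to inspect it
cool.readline

Mon-Ouie commented 11 years ago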

I suspect the problems happen at insertion time. For example, maybe the character doesn't get inserted in one go, and when we insert part of it, the string becomes invalid as UTF-8 and the encoding gets changed.
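One way an encoding flip at insertion time can be reproduced in plain Ruby is sketched below. It assumes the input arrives as a binary-tagged chunk, which the thread does not confirm for coolline; the variable names are illustrative. Note that, in plain Ruby, a truncated sequence on its own leaves the UTF-8 tag in place and only makes the string invalid; the tag changes when a binary-tagged chunk is appended to an ASCII-only string.

```ruby
# An ASCII-only line, tagged UTF-8 (the default source encoding).
line = "ab".dup

# The three bytes of ︵ (U+FE35), tagged ASCII-8BIT as raw input might be.
chunk = [0xEF, 0xB8, 0xB5].pack("C*")

# Appending non-ASCII binary data to an ASCII-only string makes the
# result take the chunk's encoding.
line << chunk
puts "after insert: #{line.encoding}, length #{line.length}"

# Re-tagging the bytes recovers the intended characters.
repaired = line.dup.force_encoding(Encoding::UTF_8)
puts "after retag:  #{repaired.encoding}, length #{repaired.length}"

# By contrast, a partial sequence tagged UTF-8 keeps its tag but is invalid.
partial = [0xEF, 0xB8].pack("C*").force_encoding(Encoding::UTF_8)
puts "partial: #{partial.encoding}, valid? #{partial.valid_encoding?}"
```

If something like this is happening, tagging each incoming chunk with the terminal's encoding before inserting it would be one possible remedy.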

Oddly enough, here, after pasting the same string, I get the right position and UTF-8 as an encoding, editing works, but the cursor is definitely not rendered at the right position (it appears one line below, one character to the left).

epitron commented 11 years ago

Oh man, that's weird. Now I'm getting your behaviour. Everything stays UTF-8, but I get new lines.

epitron commented 11 years ago

It only happens with double-wide characters, it seems. ノ is fine, ︵ prints a new line.

epitron commented 11 years ago

After poking around with ANSI cursor positioning and double-wide UTF-8 characters, it appears that they actually take up 2 columns on the display.

For example, if you print "ab︵c", then position the cursor on the screen using ANSI codes, the column of each character is as follows:

a = 1
b = 2
︵ = 3-4
c = 5
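The column layout above can be computed with a small sketch like the following. Everything here is hypothetical and not part of coolline: WIDE_RANGES is a rough, deliberately incomplete table of double-width code point ranges, and display_width, column_spans, and cursor_to_column are made-up helper names. A real implementation would want a proper East Asian Width table.

```ruby
# Rough, incomplete list of double-width ranges (assumption for illustration).
WIDE_RANGES = [
  0x1100..0x115F, # Hangul Jamo initial consonants
  0x3041..0x30FF, # Hiragana and Katakana
  0x4E00..0x9FFF, # CJK Unified Ideographs
  0xAC00..0xD7A3, # Hangul syllables
  0xFE30..0xFE4F, # CJK Compatibility Forms, includes ︵ (U+FE35)
  0xFF01..0xFF60  # Fullwidth forms
].freeze

# Number of terminal columns a single character is assumed to occupy.
def display_width(char)
  WIDE_RANGES.any? { |range| range.cover?(char.ord) } ? 2 : 1
end

# For each character, return [char, first_column, last_column], 1-based.
def column_spans(str)
  col = 1
  str.each_char.map do |ch|
    width = display_width(ch)
    span = [ch, col, col + width - 1]
    col += width
    span
  end
end

# ANSI "cursor horizontal absolute" sequence for a 1-based column.
def cursor_to_column(col)
  "\e[#{col}G"
end

column_spans("ab\uFE35c").each do |ch, first, last|
  puts "#{ch} = #{first == last ? first : "#{first}-#{last}"}"
end
```

Positioning the cursor after the wide character would then use the column following its span, for example cursor_to_column(5) for the "c" in this string, rather than the character index plus one.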

I'm still stumped as to why it's adding a linefeed.