Hi @yous,
We are kind of busy these days, but we'll give you an answer soon. My first guess is that it's a mistake, but I have to check with @waj and others.
I'm curious, how did you find this little error in the code?
I was expecting that "\u{FF}".bytes returns [255], but it returns [195, 191]. So I started to find where it comes from.
Good catch!
Strings in Crystal are always represented in memory as UTF-8. That's why the character U+00FF is represented by two bytes.
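To spell out the math (just a sketch, not code from the repo): U+00FF doesn't fit in 7 bits, so UTF-8 uses the 2-byte form 110xxxxx 10xxxxxx:

# U+00FF needs two UTF-8 bytes: 110xxxxx 10xxxxxx
codepoint = 0xFF
byte1 = 0xC0 | (codepoint >> 6)   # top 2 bits of the value -> 0xC3 (195)
byte2 = 0x80 | (codepoint & 0x3F) # low 6 bits of the value -> 0xBF (191)
p [byte1, byte2]   # => [195, 191]
p "\u{FF}".bytes   # => [195, 191]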
I think this line of CharReader has also been affected. But we may have to be a bit more specific, by checking the first two bytes:
if first < 0xf4 || (first == 0xf4 && second < 0x90)
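To show why the second byte matters when the first one is 0xf4 (plain bit math, not CharReader's actual code): U+10FFFF is F4 8F BF BF, so F4 followed by 0x90 or more already decodes past the Unicode range:

# Decode F4 90 80 80 with the usual 4-byte formula: the result is one past
# U+10FFFF, which is why a lead byte of 0xF4 needs the extra check on the second byte.
b1, b2, b3, b4 = 0xF4, 0x90, 0x80, 0x80
value = ((b1 & 0x07) << 18) | ((b2 & 0x3F) << 12) | ((b3 & 0x3F) << 6) | (b4 & 0x3F)
puts value == 0x110000 # => true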
:+1:
@yous You are right. We should actually copy the algorithm of read_code_point_from_utf8 from Wikipedia, right?
#include <stdio.h> /* for getchar, ungetc, stdin */

unsigned read_code_point_from_utf8()
{
  int code_unit1, code_unit2, code_unit3, code_unit4;

  code_unit1 = getchar();
  if (code_unit1 < 0x80) {
    return code_unit1;
  } else if (code_unit1 < 0xC2) {
    /* continuation or overlong 2-byte sequence */
    goto ERROR1;
  } else if (code_unit1 < 0xE0) {
    /* 2-byte sequence */
    code_unit2 = getchar();
    if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
    return (code_unit1 << 6) + code_unit2 - 0x3080;
  } else if (code_unit1 < 0xF0) {
    /* 3-byte sequence */
    code_unit2 = getchar();
    if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
    if (code_unit1 == 0xE0 && code_unit2 < 0xA0) goto ERROR2; /* overlong */
    code_unit3 = getchar();
    if ((code_unit3 & 0xC0) != 0x80) goto ERROR3;
    return (code_unit1 << 12) + (code_unit2 << 6) + code_unit3 - 0xE2080;
  } else if (code_unit1 < 0xF5) {
    /* 4-byte sequence */
    code_unit2 = getchar();
    if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
    if (code_unit1 == 0xF0 && code_unit2 < 0x90) goto ERROR2; /* overlong */
    if (code_unit1 == 0xF4 && code_unit2 >= 0x90) goto ERROR2; /* > U+10FFFF */
    code_unit3 = getchar();
    if ((code_unit3 & 0xC0) != 0x80) goto ERROR3;
    code_unit4 = getchar();
    if ((code_unit4 & 0xC0) != 0x80) goto ERROR4;
    return (code_unit1 << 18) + (code_unit2 << 12) + (code_unit3 << 6) + code_unit4 - 0x3C82080;
  } else {
    /* > U+10FFFF */
    goto ERROR1;
  }

ERROR4:
  ungetc(code_unit4, stdin);
ERROR3:
  ungetc(code_unit3, stdin);
ERROR2:
  ungetc(code_unit2, stdin);
ERROR1:
  return code_unit1 + 0xDC00;
}
I think it catches more invalid sequences than what we have right now (and I'd like to have a spec for each invalid case).
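The kind of specs I have in mind would look roughly like this. It's only a sketch against a throwaway invalid_utf8_start? helper that mirrors the first/second byte checks, since the real specs would exercise CharReader itself:

require "spec"

# Sketch only: a stand-in helper mirroring the first/second byte checks, so
# each class of invalid input can get its own example. Real specs would
# drive CharReader instead of this helper.
def invalid_utf8_start?(bytes : Array(UInt8)) : Bool
  first = bytes[0]
  return false if first < 0x80                   # ASCII is fine
  return true if first < 0xC2                    # continuation byte or overlong 2-byte lead
  return true if first >= 0xF5                   # would decode above U+10FFFF
  second = bytes[1]? || return true              # truncated sequence
  return true if (second & 0xC0_u8) != 0x80_u8   # second byte is not a continuation byte
  return true if first == 0xE0 && second < 0xA0  # overlong 3-byte
  return true if first == 0xF0 && second < 0x90  # overlong 4-byte
  return true if first == 0xF4 && second >= 0x90 # above U+10FFFF
  false
end

describe "invalid UTF-8 starts" do
  it "rejects a lone continuation byte" do
    invalid_utf8_start?([0x80_u8]).should be_true
  end

  it "rejects an overlong 2-byte sequence" do
    invalid_utf8_start?([0xC0_u8, 0xAF_u8]).should be_true
  end

  it "rejects sequences above U+10FFFF" do
    invalid_utf8_start?([0xF4_u8, 0x90_u8, 0x80_u8, 0x80_u8]).should be_true
  end

  it "accepts the start of U+10FFFF" do
    invalid_utf8_start?([0xF4_u8, 0x8F_u8, 0xBF_u8, 0xBF_u8]).should be_false
  end
end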
@yous Could you take a look at the last commit to see if it's ok? It's more or less Wikipedia's code with some common code refactored.
I see Wikipedia's code returns an invalid code point if it encounters an invalid byte sequence, instead of throwing an exception. I wonder if that's what we really need to do.
Another thing that we don't do very well is this:
puts 0x1FFFF.chr # assume utf-8 encoding, works but maybe should raise
But in Ruby:
puts 0x110000.chr(Encoding::UTF_8)
Gives:
foo.cr:1:in `chr': invalid codepoint 0x110000 in UTF-8 (RangeError)
Do you think we should raise as well?
@asterite I think the commit is okay.
For the first thing, seeing Codepage layout:
Red cells must never appear in a valid UTF-8 sequence. The first two (C0 and C1) could only be used for overlong encoding of basic ASCII characters (i.e., trying to encode a 7-bit ASCII value between 0 and 127 using 2 bytes instead of 1).
I think we don't need to think about encoding a 7-bit ASCII value using 2 bytes. So raising makes sense. See Overlong encodings for further details:
The standard specifies that the correct encoding of a code point use only the minimum number of bytes required to hold the significant bits of the code point. Longer encodings are called overlong and are not valid UTF-8 representations of the code point. This rule maintains a one-to-one correspondence between code points and their valid encodings, so that there is a unique valid encoding for each code point; this makes string comparisons and searches well-defined.
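For example (just bit math, nothing from the repo): mechanically running '/' (U+002F) through the 2-byte formula gives the classic overlong form C0 AF, which is exactly why C0 and C1 can never appear in valid UTF-8:

# Sketch: the 2-byte formula applied to an ASCII character produces an
# overlong encoding. Valid UTF-8 must use the shortest form (one byte, 0x2F).
ord = '/'.ord              # => 47 (0x2F)
b1 = 0xC0 | (ord >> 6)     # => 0xC0
b2 = 0x80 | (ord & 0x3F)   # => 0xAF
p [b1, b2].map(&.to_s(16)) # => ["c0", "af"], overlong, must be rejected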
Are we already raising an error for 0x110000.chr? Also, Ruby works for 0x1FFFF.chr(Encoding::UTF_8). If you are thinking about passing the encoding to chr, I think this depends on our specification. If we presume every character is UTF-8 encoded, then the current way would be okay. We already encode 0x80-0xFF using 2 bytes. See the difference between the two in Ruby:
>> 0x80.chr.encoding
=> #<Encoding:ASCII-8BIT>
>> 0x80.chr.bytes
=> [128]
>> 0x80.chr(Encoding::UTF_8).encoding
=> #<Encoding:UTF-8>
>> 0x80.chr(Encoding::UTF_8).bytes
=> [194, 128]
Thanks for the detailed answer!
0x110000.chr is not raising an exception right now. I was thinking maybe we could raise. But maybe not... I always saw chr as a way to convert a number to a Char, almost like a cast. I didn't know it had a check in Ruby. That means that it'll be a bit slower with that check. And it worries me because chr is used in some low-level code, or at least in some parsers we have. So for now I think we can leave it without the check. Maybe later we can add chr! that does the check.
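Something along these lines is what I imagine for chr! (only a sketch; the name, the error type and the exact range check are all up for discussion):

# Hypothetical chr! sketch: like chr, but validates that the value is a
# Unicode scalar value before converting. Not existing API, just an idea.
struct Int
  def chr!
    if self > 0x10FFFF || (self >= 0xD800 && self <= 0xDFFF) || self < 0
      raise ArgumentError.new("invalid codepoint #{self}")
    end
    chr
  end
end

p 0x41.chr!     # => 'A'
p 0x110000.chr! # raises ArgumentError: invalid codepoint 1114112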
About your last comment, yes, Char always represents a codepoint in the UTF-8 encoding, because when you ask bytes it gives them assuming that encoding. We could add overloads for bytes and each_byte to take an encoding, or maybe that would be the job of an Encoding type. But we can think about it later :-)
Seeing this line, a Char can have its ord at most 0x1FFFFF. But for UTF-8, it ends at 0x10FFFF by RFC 3629. At first it ended at 0x1FFFFF, but it was restricted in November 2003; Invalid byte sequences indicates this too. Also, the write_utf8 method of its sample code is the same as our each_byte, but it uses 0x10FFFF, not 0x1FFFFF. Does Char support UTF-16 for 0x10FFFF < ord <= 0x1FFFFF, or is this a mistake?
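To make the gap concrete (plain bit math, not the actual each_byte code): feeding an ord just past U+10FFFF into the generic 4-byte formula still produces bytes; they just aren't valid UTF-8:

# Sketch only: apply the generic 4-byte UTF-8 formula to an ord just past
# U+10FFFF, with no upper-bound check, to see what bytes would come out.
ord = 0x110000
b1 = 0xF0 | (ord >> 18)
b2 = 0x80 | ((ord >> 12) & 0x3F)
b3 = 0x80 | ((ord >> 6) & 0x3F)
b4 = 0x80 | (ord & 0x3F)
p [b1, b2, b3, b4].map(&.to_s(16)) # => ["f4", "90", "80", "80"], invalid per RFC 3629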