crystal-lang / crystal

The Crystal Programming Language
https://crystal-lang.org
Apache License 2.0
19.36k stars 1.62k forks source link

Range of the Char #317

Closed yous closed 9 years ago

yous commented 9 years ago

Seeing this line, a Char can have its ord at most 0x1FFFFF. But for UTF-8, it ends at 0x10FFFF by RFC 3629. At first it ends at 0x1FFFFF, but restricted on November 2003.

Invalid byte sequences indicates:

A 4-byte sequence (starting with 0xF4) that decodes to a value greater than U+10FFFF

Also the write_utf8 method of its sample code is as same as our each_byte, but 0x10FFFF not 0x1FFFFF.

Is Char support UTF-16 for 0x10FFFF < ord <= 0x1FFFFF, or is this a mistake?

asterite commented 9 years ago

Hi @yous,

We are kind of busy these days, but we'll give you an answer soon. My first guess is that it's a mistake, but I have to check with @waj and others.

I'm curious, how did you find this little error in the code?

yous commented 9 years ago

I was expecting that "\u{FF}".bytes returns [255] but it returns [195, 191]. So I started find where it comes.

waj commented 9 years ago

Good catch!

The strings in Crystal are always represented in memory as UTF-8. That's why the character FF is represented by two bytes.

yous commented 9 years ago

I think this line of CharReader also have been affected. But we may have to be a bit more specific, by checking the first two bytes:

if first < 0xf4 || first == 0xf4 && second < 0x90
simnalamburt commented 9 years ago

:+1:

asterite commented 9 years ago

@yous You are right. We should actually copy the algorithm of the read_code_point_from_utf8 from Wikipedia, right?

unsigned read_code_point_from_utf8()
{
  int code_unit1, code_unit2, code_unit3, code_unit4;

  code_unit1 = getchar();
  if (code_unit1 < 0x80) {
    return code_unit1;
  } else if (code_unit1 < 0xC2) {
    /* continuation or overlong 2-byte sequence */
    goto ERROR1;
  } else if (code_unit1 < 0xE0) {
    /* 2-byte sequence */
    code_unit2 = getchar();
    if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
    return (code_unit1 << 6) + code_unit2 - 0x3080;
  } else if (code_unit1 < 0xF0) {
    /* 3-byte sequence */
    code_unit2 = getchar();
    if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
    if (code_unit1 == 0xE0 && code_unit2 < 0xA0) goto ERROR2; /* overlong */
    code_unit3 = getchar();
    if ((code_unit3 & 0xC0) != 0x80) goto ERROR3;
    return (code_unit1 << 12) + (code_unit2 << 6) + code_unit3 - 0xE2080;
  } else if (code_unit1 < 0xF5) {
    /* 4-byte sequence */
    code_unit2 = getchar();
    if ((code_unit2 & 0xC0) != 0x80) goto ERROR2;
    if (code_unit1 == 0xF0 && code_unit2 < 0x90) goto ERROR2; /* overlong */
    if (code_unit1 == 0xF4 && code_unit2 >= 0x90) goto ERROR2; /* > U+10FFFF */
    code_unit3 = getchar();
    if ((code_unit3 & 0xC0) != 0x80) goto ERROR3;
    code_unit4 = getchar();
    if ((code_unit4 & 0xC0) != 0x80) goto ERROR4;
    return (code_unit1 << 18) + (code_unit2 << 12) + (code_unit3 << 6) + code_unit4 - 0x3C82080;
  } else {
    /* > U+10FFFF */
    goto ERROR1;
  }

  ERROR4:
    ungetc(code_unit4, stdin);
  ERROR3:
    ungetc(code_unit3, stdin);
  ERROR2:
    ungetc(code_unit2, stdin);
  ERROR1:
    return code_unit1 + 0xDC00;
}

I think it catches more invalid sequences than what we have right now (and I'd like to have a spec for each invalid case).

asterite commented 9 years ago

@yous Could you take a look at the last commit to see if it's ok? It's more or less Wikipedia's code with some common code refactored.

I see Wikipeadia's code returns an invalid code point if it encounters an invalid byte sequence, instead of throwing an exception. I wonder if that's what we really need to do.

Another thing that we don't do very well is this:

puts 0x1FFFF.chr # assume utf-8 encoding, works but maybe should raise

But in Ruby:

puts 0x110000.chr(Encoding::UTF_8)

Gives:

foo.cr:1:in `chr': invalid codepoint 0x110000 in UTF-8 (RangeError)

Do you think we should raise as well?

yous commented 9 years ago

@asterite I think the commit is okay.

For the first thing, seeing Codepage layout:

Red cells must never appear in a valid UTF-8 sequence. The first two (C0 and C1) could only be used for overlong encoding of basic ASCII characters (i.e., trying to encode a 7-bit ASCII value between 0 and 127 using 2 bytes instead of 1).

I think we don't need to think about encoding 7-bit ASCII value using 2 bytes. So raising makes sense. See Overlong encodings for further details:

The standard specifies that the correct encoding of a code point use only the minimum number of bytes required to hold the significant bits of the code point. Longer encodings are called overlong and are not valid UTF-8 representations of the code point. This rule maintains a one-to-one correspondence between code points and their valid encodings, so that there is a unique valid encoding for each code point, this makes string comparisons and searches well-defined.

Are we already raising an error for 0x110000.chr? Also Ruby works for:

0x1FFFF.chr(Encoding::UTF_8)

If you are thinking about pass the encoding to chr, I think this depends to our specification. If we presume every character as UTF-8 encoding, than the current way would be okay. We already encodes 0x80-0xFF using 2 bytes. See the difference between two in Ruby:

>> 0x80.chr.encoding
=> #<Encoding:ASCII-8BIT>
>> 0x80.chr.bytes
=> [128]
>> 0x80.chr(Encoding::UTF_8).encoding
=> #<Encoding:UTF-8>
>> 0x80.chr(Encoding::UTF_8).bytes
=> [194, 128]
asterite commented 9 years ago

Thanks for the detailed answer!

0x110000.chr is not raising an exception right now. I was thinking maybe we could raise. But maybe not... I always saw chr as a way to convert a number to a Char, almost like a cast. I didn't know it had a check in Ruby. That means that it'll be a bit slower with that check. And it worries me because chr is used in some low-level code, or at least in some parsers we have. So for now I think we can leave it without the check. Maybe later we can add chr! that does the check.

About your last comment, yes, Char always represents a codepoint in the UTF-8 encoding, because when you ask bytes it gives them assuming that encoding. We could add overloads for bytes and each_byte to take an encoding, or maybe that would be the job of an Encoding type. But we can think about it later :-)