Open LPeter1997 opened 2 years ago
Regarding Unicode escape sequences, here's what C# has (from that link).
It uses \u for UTF-16, \U for UTF-32, and \x for variable-length escapes, which have ambiguity problems. Let us see how we can do it universally.
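To make the ambiguity concrete, here is a minimal sketch of a variable-length hex escape, assuming the rule is "one to four hex digits, matched greedily" (which is how C# treats \x). The function name `decode_variable_hex` is made up for illustration.

```python
import re

# Assumed rule: a backslash, "x", then one to four hex digits, taken greedily.
_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{1,4})")


def decode_variable_hex(text: str) -> str:
    # Replace every variable-length hex escape with the character it denotes.
    return _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Intended meaning: a tab (U+0009) followed by the word "Bad".
# Because "B", "a" and "d" are hex digits too, the escape swallows them
# and the result is the single character U+9BAD.
surprising = decode_variable_hex(r"\x9Bad")

# Padding the escape to four digits avoids the problem.
intended = decode_variable_hex(r"\x0009Bad")
```

A fixed digit count or an explicit terminator removes this dependency on whatever text happens to follow the escape.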
I suggest \u8, \u16, \u32 for UTF-8, UTF-16, UTF-32 respectively.
\u1699... is UTF-16 of code 99...
With a separator it would be \u16_99...
It makes it more readable, but longer.
Should it be variable length, or exactly 2, 4, 8 digits for those encodings?
\u8HH
\u16HHHH
\u32HHHHHHHH
Variable length would need a terminator, e.g. u:
\u8H*u
\u16H*u
\u32H*u
Let's go over combinations
\u870 = p
\u16AAAA = ꪪ
\u320001F47D = 👽
\u870u = p
\u16AAAAu = ꪪ
\u321F47Du = 👽
\u8_70 = p
\u16_AAAA = ꪪ
\u32_0001F47D = 👽
\u8_70u = p
\u16_AAAAu = ꪪ
\u32_1F47Du = 👽
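The combinations above could be recognized along these lines. This is only a sketch: it assumes exactly 2, 4 and 8 hex digits for the fixed-width variant, treats the `_` separator as optional, and returns the encoding width together with the raw code unit value. The names `parse_fixed` and `parse_terminated` are hypothetical.

```python
import re
from typing import Tuple

# Assumed digit counts for the fixed-width variant.
FIXED_WIDTHS = {8: 2, 16: 4, 32: 8}

# Encoding width, optional "_" separator, then hex digits.
_FIXED = re.compile(r"\\u(8|16|32)_?([0-9A-Fa-f]+)")
# Same, but closed by the "u" terminator.
_TERMINATED = re.compile(r"\\u(8|16|32)_?([0-9A-Fa-f]+)u")


def parse_fixed(escape: str) -> Tuple[int, int]:
    # Parse one fixed-width escape into (bits, code unit value).
    m = _FIXED.fullmatch(escape)
    if m is None:
        raise ValueError(f"not a fixed-width escape: {escape!r}")
    bits, digits = int(m.group(1)), m.group(2)
    if len(digits) != FIXED_WIDTHS[bits]:
        raise ValueError(f"expected {FIXED_WIDTHS[bits]} hex digits: {escape!r}")
    return bits, int(digits, 16)


def parse_terminated(escape: str) -> Tuple[int, int]:
    # Parse one terminated, variable-length escape into (bits, value).
    m = _TERMINATED.fullmatch(escape)
    if m is None:
        raise ValueError(f"not a terminated escape: {escape!r}")
    return int(m.group(1)), int(m.group(2), 16)
```

For example, `parse_fixed(r"\u16_AAAA")` and `parse_terminated(r"\u16AAAAu")` both yield `(16, 0xAAAA)`, while `parse_fixed(r"\u1699")` is rejected because only two digits follow the width.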
My question is, do we want different encoding escapes? Wouldn't a single Unicode code point escape suffice? C++ has 4-character and 8-character Unicode escapes, unrelated to any kind of encoding. With that, only a single escape, like \u{[Hh]+}, could work:
\u{70} = p
\u{AAAA} = ꪪ
\u{1F47D} = 👽
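A single code point escape is also simple to decode. The sketch below assumes the braces contain one or more hex digits and, as an extra assumption not discussed in the thread, rejects values above U+10FFFF and surrogates. `decode_codepoint_escapes` is a hypothetical helper name.

```python
import re

# A backslash, "u", then one or more hex digits in braces.
_CODEPOINT_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]+)\}")


def decode_codepoint_escapes(text: str) -> str:
    # Replace every braced code point escape with the character it names.
    def replace(match):
        value = int(match.group(1), 16)
        # Assumption: only Unicode scalar values are accepted.
        if value > 0x10FFFF or 0xD800 <= value <= 0xDFFF:
            raise ValueError(f"not a valid code point: {match.group(0)}")
        return chr(value)

    return _CODEPOINT_ESCAPE.sub(replace, text)
```

The closing brace doubles as the terminator, so no digit count has to be fixed and no encoding name appears in the escape.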
How do escape sequences handle characters that require multiple code units in a given encoding?
For example, consider U+20AC Euro Sign. Could I specify it as both "\u16_20AC" and "\u8_E2\u8_82\u8_AC"? And would the value of "\u8_E2\u8_82\u8_AC".Length be 1? (Assuming that the natural type of a string literal is the UTF-16 System.String.)
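One possible reading of the question, sketched below, is that consecutive code-unit escapes are collected and transcoded into the string's own encoding. The thread does not settle this; the sketch only shows what the lengths would be under that assumption, with `utf16_length` standing in for a UTF-16 string's Length.

```python
def utf16_length(text: str) -> int:
    """Count UTF-16 code units, the unit a UTF-16 string length is measured in."""
    return len(text.encode("utf-16-le")) // 2


# The three UTF-8 code units from the question: E2 82 AC.
euro_from_utf8 = bytes([0xE2, 0x82, 0xAC]).decode("utf-8")

# The single UTF-16 code unit 20AC.
euro_from_utf16 = (0x20AC).to_bytes(2, "little").decode("utf-16-le")

# Both spellings name the same code point, U+20AC.
same_character = euro_from_utf8 == euro_from_utf16
```

Under this reading both literals denote the same one-character string, and its UTF-16 length is 1 even though three UTF-8 escapes were written.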
We have discussed that on the server (and we'll hopefully document the results soon). So far we have settled on the \u{...} idea so as not to mess with encodings: we only specify code points there. The encoding will depend on what the escape sequence is embedded in. For string literals, it would depend on what encoding we use for strings.
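To illustrate why a code point escape can stay encoding-agnostic, the following sketch counts how many code units one code point occupies in each encoding. `code_units` is a hypothetical helper, not part of any proposal here.

```python
def code_units(codepoint: int) -> dict:
    """Return the number of code units a code point needs per encoding."""
    ch = chr(codepoint)
    return {
        "utf-8": len(ch.encode("utf-8")),            # 1-byte units
        "utf-16": len(ch.encode("utf-16-le")) // 2,  # 2-byte units
        "utf-32": len(ch.encode("utf-32-le")) // 4,  # 4-byte units
    }
```

The escape itself never changes; only the surrounding literal's encoding decides whether, say, U+1F47D becomes four, two, or one code unit.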
Important: Parts of this proposal depend on what we end up deciding in the type inference issue (#42). If we decide that literals always have a fixed type, then we can introduce the usual suffixes for literals. I'm personally not a fan of those, so for now, this proposal assumes that we can agree on the types of literals being determined during inference.
Integer literals
[0-9]+. Examples: 0, 123, 9625
0x[0-9a-fA-F]+. Examples: 0x0, 0xbadc0fee, 0x2f5a
0b[01]+. Examples: 0b0, 0b011101
We could introduce a separator character for large constants to make them more readable. Some languages use _ for this. The only rule would be that _ can't be the first significant digit. Examples: 12_000_000_000, 0xffff_0000, 0b1100_0000_0101_1110
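The three integer forms and the separator rule could be sketched as follows. It reads "can't be the first significant digit" as "the first character after any prefix must be a digit", which is an interpretation. Since the text states no other rule, trailing or doubled separators are accepted here. `parse_int_literal` is a hypothetical name.

```python
import re

# (pattern, base) pairs; the first character of the digit run must be a digit,
# later characters may be digits or "_" separators.
_INT_PATTERNS = [
    (re.compile(r"0x([0-9a-fA-F][0-9a-fA-F_]*)"), 16),
    (re.compile(r"0b([01][01_]*)"), 2),
    (re.compile(r"([0-9][0-9_]*)"), 10),
]


def parse_int_literal(text: str) -> int:
    """Return the value of a decimal, hexadecimal or binary literal."""
    for pattern, base in _INT_PATTERNS:
        match = pattern.fullmatch(text)
        if match:
            # Separators carry no value, so drop them before converting.
            return int(match.group(1).replace("_", ""), base)
    raise ValueError(f"not an integer literal: {text!r}")
```

With this, `parse_int_literal("0xffff_0000")` gives the same value as `0xffff0000`, while `"_12"` and `"0x_ff"` are rejected.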
Boolean literals
The keywords true and false.
Floating-point literals
They would have two forms, the normal decimal-separated form and a scientific form.
[0-9]+\.[0-9]+. Examples: 0.0, 0.123, 25.0, 62.73. Note that omitting either side of the decimal point is deliberately not allowed.
[0-9]+(\.[0-9]+)?[eE][+-]?[0-9]+. Examples: 10E3, 0.1e+4, 123.345E-12
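The two floating-point forms can be checked directly with the patterns given above. This sketch leans on Python's own `float` for the conversion, and `parse_float_literal` is a hypothetical name.

```python
import re

# The decimal-separated form: digits are required on both sides of the point.
_DECIMAL = re.compile(r"[0-9]+\.[0-9]+")
# The scientific form: optional fraction, mandatory exponent.
_SCIENTIFIC = re.compile(r"[0-9]+(\.[0-9]+)?[eE][+-]?[0-9]+")


def parse_float_literal(text: str) -> float:
    """Return the value of a literal in either floating-point form."""
    if _DECIMAL.fullmatch(text) or _SCIENTIFIC.fullmatch(text):
        return float(text)
    raise ValueError(f"not a floating-point literal: {text!r}")
```

Spellings such as `.5` or `5.` fail both patterns, which matches the note that neither side may be omitted.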
Escape sequences
They would be enclosed in single-quotes. Escaping would be the usual \. Escape sequences would be:
\': Just a '. It does not have to be escaped in a string literal, but simplifies code-generation for the users. Since it's otherwise meaningless, it's essentially no effort to allow it in string literals. (inspired by C#)
\": Just a ". It does not have to be escaped in a character literal, but simplifies code-generation for the users. Since it's otherwise meaningless, it's essentially no effort to allow it in character literals. (inspired by C#)
\\: Escapes the \ to literally mean a \.
\[0abfnrtv]: Same as in every C-like programming language (reference)
Character literals
They are enclosed in single-quotes ('), like in C#. Any visible character can be inside (no control characters), or an escape sequence.
String literals
They are enclosed in double-quotes ("), like in C#. Any visible character can be inside (no control characters), or an escape sequence.
Verbatim strings and string interpolation are not yet specified; they will come in a later issue. For now, I believe the default strings should allow for string interpolation, so there should be no need for a separate annotation.
The issue for string interpolation is #53.
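The escape list and the character and string literal rules from the proposal could be combined roughly as below. This is a sketch under assumptions: "no control characters" is approximated by the Unicode category Cc, a character literal is taken to hold exactly one code point, and all function names are made up. Interpolation and verbatim strings are left out.

```python
import unicodedata

# The simple escapes from the proposal, keyed by the character after the backslash.
SIMPLE_ESCAPES = {
    "0": "\0", "a": "\a", "b": "\b", "f": "\f",
    "n": "\n", "r": "\r", "t": "\t", "v": "\v",
    "\\": "\\", "'": "'", '"': '"',
}


def unescape(body: str, delimiter: str) -> str:
    """Resolve escapes in a literal body; reject raw delimiters and control characters."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch == "\\":
            if i + 1 >= len(body) or body[i + 1] not in SIMPLE_ESCAPES:
                raise ValueError(f"invalid escape at index {i}")
            out.append(SIMPLE_ESCAPES[body[i + 1]])
            i += 2
            continue
        if ch == delimiter:
            raise ValueError(f"unescaped delimiter at index {i}")
        if unicodedata.category(ch) == "Cc":
            raise ValueError(f"control character at index {i}")
        out.append(ch)
        i += 1
    return "".join(out)


def parse_string_literal(source: str) -> str:
    """Parse a double-quoted literal, quotes included, into its value."""
    if len(source) < 2 or source[0] != '"' or source[-1] != '"':
        raise ValueError("string literals are enclosed in double quotes")
    return unescape(source[1:-1], '"')


def parse_char_literal(source: str) -> str:
    """Parse a single-quoted literal, quotes included, into one character."""
    if len(source) < 2 or source[0] != "'" or source[-1] != "'":
        raise ValueError("character literals are enclosed in single quotes")
    value = unescape(source[1:-1], "'")
    if len(value) != 1:
        raise ValueError("a character literal holds exactly one character")
    return value
```

Because both quote escapes live in one shared table, a generator can always emit the escaped form regardless of which kind of literal it is producing, which is the convenience the proposal mentions.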