Draco-lang / Language-suggestions

Collecting ideas for a new .NET language that could replace C#

[WIP] Literal values #50

Open LPeter1997 opened 2 years ago

LPeter1997 commented 2 years ago

Important: Parts of this proposal depend on what we end up deciding in the type inference issue (#42). If we decide that literals always have a fixed type, then we can introduce the usual suffixes for literals. I'm personally not a fan of those, so for now, this proposal assumes that we can agree on the type of a literal being determined during inference.

Integer literals

We could introduce a separator character for large constants to make them more readable. Some languages use _ for this. The only rule would be that _ can't be the first significant digit. Examples: 12_000_000_000, 0xffff_0000, 0b1100_0000_0101_1110
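A minimal Python sketch of how such literals could be read, to make the separator rule concrete. The function name `parse_int_literal` is hypothetical, and the sketch assumes the rule means "the digit sequence after any 0x/0b prefix may not start with _"; the proposal does not say whether doubled or trailing separators are allowed, so they are accepted here.

```python
def parse_int_literal(text):
    """Parse a decimal, hex (0x) or binary (0b) literal with '_' separators."""
    prefixes = {"0x": 16, "0b": 2}
    base = prefixes.get(text[:2].lower(), 10)
    digits = text[2:] if base != 10 else text

    # Assumed rule: the digit sequence is non-empty and does not start with '_'.
    if not digits or digits[0] == "_":
        raise ValueError(f"invalid integer literal: {text!r}")

    # Drop the separators, then check every character against the base.
    cleaned = digits.replace("_", "").lower()
    valid = "0123456789abcdef"[:base]
    if any(ch not in valid for ch in cleaned):
        raise ValueError(f"invalid digit in integer literal: {text!r}")

    return int(cleaned, base)


print(parse_int_literal("12_000_000_000"))         # 12000000000
print(parse_int_literal("0xffff_0000"))            # 4294901760
print(parse_int_literal("0b1100_0000_0101_1110"))  # 49246
```

The separators carry no meaning of their own; they are stripped before conversion, so 12_000 and 12000 denote the same value.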

Boolean literals

The keywords true and false.

Floating-point literals

They would have two forms, the normal decimal-separated form and a scientific form.
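The exact grammar is not given here, so the following Python sketch is only one plausible reading of the two forms. It assumes the decimal form is digits, a dot, then digits, and that the scientific form adds an e or E exponent with an optional sign; both patterns and the name `classify_float_literal` are assumptions, not part of the proposal.

```python
import re

# Assumed shapes of the two forms; the proposal does not pin these down.
DECIMAL = re.compile(r"[0-9]+\.[0-9]+")
SCIENTIFIC = re.compile(r"[0-9]+(\.[0-9]+)?[eE][+-]?[0-9]+")


def classify_float_literal(text):
    """Return 'decimal', 'scientific', or None if text is neither form."""
    if SCIENTIFIC.fullmatch(text):
        return "scientific"
    if DECIMAL.fullmatch(text):
        return "decimal"
    return None


print(classify_float_literal("3.14"))    # decimal
print(classify_float_literal("1.5e-3"))  # scientific
print(classify_float_literal("12"))      # None
```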

Escape sequences

They would be enclosed in single-quotes. Escaping would be the usual \. Escape sequences would be:

Character literals

They are enclosed in single-quotes ('), like in C#. Any visible character can be inside (no control characters), or an escape sequence.

String literals

They are enclosed in double-quotes ("), like in C#. Any visible character can be inside (no control characters), or an escape sequence.

Verbatim strings and string interpolation are not yet specified; they will come in a later issue. For now, I believe the default strings should allow for string interpolation, with no need for a separate annotation.

Issue for string interpolation is #53.

WhiteBlackGoose commented 2 years ago

Regarding Unicode escape sequences, here's what C# has (from that link): [image]

So it uses \u for UTF-16, \U for UTF-32, and \x for variable-length escapes, the last of which has ambiguity problems. Let us see how we can do it universally.
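To make the \x ambiguity concrete: in C#, \x is followed by one to four hex digits and takes as many as it can. The Python sketch below imitates that greedy rule (the helper name `decode_x_escapes` is made up for illustration) and shows how the meaning of an escape changes depending on the text that happens to follow it.

```python
import re

# Imitates C#'s \x escape: one to four hex digits, matched greedily.
_X_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{1,4})")


def decode_x_escapes(source):
    """Replace every \\x escape in source with the character it denotes."""
    return _X_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), source)


# 'G' is not a hex digit, so the escape stops after the 9 (a tab character).
print(repr(decode_x_escapes(r"\x9Good")))     # '\tGood'

# 'B', 'a' and 'd' are all hex digits, so the whole thing is read as U+9BAD.
print(len(decode_x_escapes(r"\x9Bad")))       # 1
```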

Prefix

I suggest \u8, \u16, \u32 for UTF-8, UTF-16, UTF-32 respectively.

Option 1: stick the numbers right after, e.g.

\u1699...

is UTF-16 of code 99...

Option 2: separate it with something

\u16_99...

It makes it more readable, but longer.

Avoiding ambiguity for encoding

Should it be variable length, or exactly 2, 4, 8 digits for those encodings?

Option 1: fixed lengths

\u8HH
\u16HHHH
\u32HHHHHHHH

Option 2: terminating symbol

E.g. u:

\u8H*u
\u16H*u
\u32H*u

Examples

Let's go over the combinations (prefix option & length option):

Opt 1 & Opt 1

\u870 = p
\u16AAAA = ꪪ
\u320001F47D = 👽

Opt 1 & Opt 2

\u870u = p
\u16AAAAu = ꪪ
\u321F47Du = 👽

Opt 2 & Opt 1

\u8_70 = p
\u16_AAAA = ꪪ
\u32_0001F47D = 👽

Opt 2 & Opt 2

\u8_70u = p
\u16_AAAAu = ꪪ
\u32_1F47Du = 👽
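As a rough illustration of the separator plus fixed-length combination, here is a Python sketch that splits one escape into its code unit width and value and enforces the 2, 4 and 8 digit lengths. The name `parse_fixed_escape` is hypothetical, and the sketch only parses a single escape; it does not decide how several code units would combine into one character.

```python
import re

# Hypothetical syntax from this thread: \u8_HH, \u16_HHHH, \u32_HHHHHHHH.
_DIGITS_FOR_WIDTH = {8: 2, 16: 4, 32: 8}
_ESCAPE = re.compile(r"\\u(8|16|32)_([0-9A-Fa-f]*)")


def parse_fixed_escape(text):
    """Return (code unit width in bits, value) for one fixed-length escape."""
    match = _ESCAPE.fullmatch(text)
    if match is None:
        raise ValueError(f"not an escape sequence: {text!r}")
    width = int(match.group(1))
    digits = match.group(2)
    # Fixed lengths: reject anything but the exact digit count for this width.
    if len(digits) != _DIGITS_FOR_WIDTH[width]:
        raise ValueError(f"expected {_DIGITS_FOR_WIDTH[width]} hex digits in {text!r}")
    return width, int(digits, 16)


for escape in [r"\u8_70", r"\u16_AAAA", r"\u32_0001F47D"]:
    width, value = parse_fixed_escape(escape)
    print(escape, "->", width, hex(value), chr(value))
```

With fixed lengths the parser never has to look at what follows the escape, which is the property the terminating-symbol variant buys in a different way.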

LPeter1997 commented 2 years ago

My question is, do we want different encoding escapes? Wouldn't a single Unicode codepoint escape suffice? C++ has 4-character and 8-character Unicode escapes, unrelated to any kind of encoding. With that, only a single escape, like \u{[Hh]+} could work:

\u{70} = p
\u{AAAA} = ꪪ
\u{1F47D} = 👽
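A short Python sketch of how a single codepoint escape of this shape could be decoded. The function name `decode_codepoint_escapes` is hypothetical, and the upper bound check (U+10FFFF, the largest Unicode code point) is an assumption about what a compiler would reject.

```python
import re

# \u{...} with one or more hex digits between the braces.
_CODEPOINT_ESCAPE = re.compile(r"\\u\{([0-9A-Fa-f]+)\}")


def decode_codepoint_escapes(source):
    """Replace every \\u{...} escape in source with its code point."""
    def replace(match):
        codepoint = int(match.group(1), 16)
        if codepoint > 0x10FFFF:
            raise ValueError(f"code point out of range: {match.group(0)}")
        return chr(codepoint)

    return _CODEPOINT_ESCAPE.sub(replace, source)


print(decode_codepoint_escapes(r"\u{70}"))     # p
print(decode_codepoint_escapes(r"\u{AAAA}"))   # ꪪ
print(decode_codepoint_escapes(r"\u{1F47D}"))  # 👽
```

The closing brace ends the escape explicitly, so neither a fixed digit count nor a lookahead rule is needed.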

svick commented 2 years ago

How do escape sequences handle characters that require multiple code units in a given encoding?

For example, consider U+20AC Euro Sign. Could I specify it as both "\u16_20AC" and "\u8_E2\u8_82\u8_AC"? And would the value of "\u8_E2\u8_82\u8_AC".Length be 1? (Assuming that the natural type of a string literal is the UTF-16 System.String.)
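For reference, the code units in question can be listed with a few lines of Python. The helper `code_units` is made up for this illustration; it only shows what each encoding produces and does not settle how the proposed escapes should behave.

```python
def code_units(text, encoding, unit_bytes):
    """Return the encoded form of text as a list of hex code units."""
    data = text.encode(encoding)
    return [data[i:i + unit_bytes].hex() for i in range(0, len(data), unit_bytes)]


euro = "\u20ac"  # U+20AC Euro Sign
print(code_units(euro, "utf-8", 1))      # ['e2', '82', 'ac']
print(code_units(euro, "utf-16-be", 2))  # ['20ac']

# A code point outside the Basic Multilingual Plane needs two UTF-16 units.
print(code_units("\U0001F47D", "utf-16-be", 2))  # ['d83d', 'dc7d']
```

So the Euro sign is three UTF-8 code units but a single UTF-16 code unit, which is why the two spellings above would have to agree on one resulting string.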

LPeter1997 commented 2 years ago

We have discussed that on the server (and we'll hopefully document the results soon). So far we have settled on the \u{...} idea so as not to mess with encodings: we only specify codepoints there. The encoding will depend on what the escape sequence is embedded in. For string literals, it would depend on what encoding we use for strings.