[WIP] Literal values

LPeter1997 commented 2 years ago

Important: Parts of this proposal depend on what we end up deciding in the type inference issue (#42). If we decide that literals always have a fixed type, then we can introduce the usual suffixes for literals. I'm personally not a fan of those, so for now, this proposal assumes that we can agree on literal types being determined during inference.

Integer literals

Decimal integers would match the regex [0-9]+. Examples: 0, 123, 9625
Hexadecimal integers would match the regex 0x[0-9a-fA-F]+. Examples: 0x0, 0xbadc0fee, 0x2f5a
Binary integers would match the regex 0b[01]+. Examples: 0b0, 0b011101

We could introduce a separator character for large constants to make them more readable. Some languages use _ for this. The only rule would be that _ can't appear before the first digit. Examples: 12_000_000_000, 0xffff_0000, 0b1100_0000_0101_1110
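To make the integer rules above concrete, here is a minimal Python sketch of a recognizer. The function name parse_int_literal is hypothetical, and the exact separator rule is an assumption: the sketch allows _ anywhere after the first digit (including doubled or trailing), because the proposal only rules out the leading position.

```python
import re

# Patterns transcribed from the regexes in the proposal, extended with '_'.
# Assumption: '_' is allowed anywhere after the first digit.
DECIMAL = re.compile(r"[0-9][0-9_]*")
HEX = re.compile(r"0x[0-9a-fA-F][0-9a-fA-F_]*")
BINARY = re.compile(r"0b[01][01_]*")


def parse_int_literal(text: str) -> int:
    """Return the value of an integer literal, or raise ValueError."""
    # Prefixed forms are checked first; the decimal pattern would not
    # accept "0x10" as a whole anyway, but the order keeps intent clear.
    if HEX.fullmatch(text):
        return int(text[2:].replace("_", ""), 16)
    if BINARY.fullmatch(text):
        return int(text[2:].replace("_", ""), 2)
    if DECIMAL.fullmatch(text):
        return int(text.replace("_", ""), 10)
    raise ValueError(f"not an integer literal: {text!r}")
```

With these assumptions, parse_int_literal("0xffff_0000") yields the same value as 0xffff0000, while "_12" and "0x_ff" are rejected.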

Boolean literals

The keywords true and false.

Floating-point literals

They would have two forms, the normal decimal-separated form and a scientific form.

Decimal-separated form would match the regex [0-9]+\.[0-9]+. Examples: 0.0, 0.123, 25.0, 62.73. Note that omitting either side completely is intentionally not allowed.
Scientific notation form would match the regex [0-9]+(\.[0-9]+)?[eE][+-]?[0-9]+. Examples: 10E3, 0.1e+4, 123.345E-12
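As a quick check of the two regexes above, the following Python sketch tests them against the listed examples and against the forms that are meant to be rejected. The helper name is_float_literal is hypothetical; the patterns are copied from the proposal unchanged.

```python
import re

# Both patterns are taken verbatim from the proposal.
DECIMAL_FLOAT = re.compile(r"[0-9]+\.[0-9]+")
SCIENTIFIC_FLOAT = re.compile(r"[0-9]+(\.[0-9]+)?[eE][+-]?[0-9]+")


def is_float_literal(text: str) -> bool:
    """True if the whole text matches either floating-point form."""
    return bool(
        DECIMAL_FLOAT.fullmatch(text) or SCIENTIFIC_FLOAT.fullmatch(text)
    )
```

Under these patterns ".5" and "5." are not floating-point literals, and a plain "123" is left to the integer rules.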

Escape sequences

Escaping would use the usual \ character. Escape sequences would be:

\': Just a '. It does not have to be escaped in a string literal, but simplifies code-generation for the users. Since it's otherwise meaningless, it's essentially no effort to allow it in string literals. (inspired by C#)
\": Just a ". It does not have to be escaped in a character literal, but simplifies code-generation for the users. Since it's otherwise meaningless, it's essentially no effort to allow it in character literals. (inspired by C#)
\\: Escapes the \ to literally mean a \.
\[0abfnrtv]: Same as in every C-like programming language (reference)
TODO: How do we want Unicode escape sequences?
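The escape list above can be sketched as a small lookup table. This is only an illustration in Python, not a proposed implementation: the names SIMPLE_ESCAPES and unescape are hypothetical, the character values follow the usual C meanings, and Unicode escapes are left out because they are still a TODO.

```python
# Maps the character after the backslash to the character it stands for.
SIMPLE_ESCAPES = {
    "'": "'",
    '"': '"',
    "\\": "\\",
    "0": "\x00",  # null
    "a": "\x07",  # alert (bell)
    "b": "\x08",  # backspace
    "f": "\x0c",  # form feed
    "n": "\n",    # line feed
    "r": "\r",    # carriage return
    "t": "\t",    # horizontal tab
    "v": "\x0b",  # vertical tab
}


def unescape(body: str) -> str:
    """Expand simple escapes in the text between the quotes."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out.append(ch)
            i += 1
            continue
        # A backslash must be followed by one of the known characters.
        if i + 1 >= len(body) or body[i + 1] not in SIMPLE_ESCAPES:
            raise ValueError(f"unknown or incomplete escape at index {i}")
        out.append(SIMPLE_ESCAPES[body[i + 1]])
        i += 2
    return "".join(out)
```

Because \' and \" are both in the table, the same function can serve character and string literals, which is the point of allowing both escapes everywhere.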

Character literals

They are enclosed in single-quotes ('), like in C#. Any visible character can be inside (no control characters), or an escape sequence.

String literals

They are enclosed in double-quotes ("), like in C#. Any visible character can be inside (no control characters), or an escape sequence.

Verbatim strings and string interpolation are not yet specified; they will come in a later issue. For now, I believe the default strings should allow string interpolation, so there should be no need for a separate annotation.

Issue for string interpolation is #53.

WhiteBlackGoose commented 2 years ago

Regarding Unicode escape sequences, here's what C# has (from that link):

So it uses \u for UTF-16, \U for UTF-32, and \x for variable length, which has ambiguity problems. Let us see how we can do it universally.

Prefix

I suggest \u8, \u16, \u32 for UTF-8, UTF-16, UTF-32 respectively.

Option 1: stick the numbers right after, e. g.

\u1699...

is UTF-16 of code 99...

Option 2: separate it with something

\u16_99...

It makes it more readable, but longer.

Avoiding ambiguity for encoding

Should it be variable length, or exactly 2, 4, and 8 hex digits for UTF-8, UTF-16, and UTF-32 respectively?

Option 1: fixed lengths

\u8HH
\u16HHHH
\u32HHHHHHHH

Option 2: terminating symbol

E. g. u:

\u8H*u
\u16H*u
\u32H*u

Examples

Let's go over combinations

Opt 1 & Opt 1

\u870 = p
\u16AAAA = ꪪ
\u320001F47D = 👽

Opt 1 & Opt 2

\u870u = p
\u16AAAAu = ꪪ
\u321F47Du = 👽

Opt 2 & Opt 1

\u8_70 = p
\u16_AAAA = ꪪ
\u32_0001F47D = 👽

Opt 2 & Opt 2

\u8_70u = p
\u16_AAAAu = ꪪ
\u32_1F47Du = 👽

LPeter1997 commented 2 years ago

My question is, do we want different encoding escapes? Wouldn't a single Unicode codepoint escape suffice? C++ has 4-character and 8-character Unicode escapes, unrelated to any kind of encoding. With that, only a single escape, like \u{[Hh]+} could work:

\u{70} = p
\u{AAAA} = ꪪ
\u{1F47D} = 👽
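A minimal Python sketch of how such a codepoint escape could be expanded, reading [Hh]+ as one or more hex digits. The name expand_unicode_escapes is hypothetical, and rejecting surrogates and values above U+10FFFF is an assumption, not something decided in this thread. The sketch also ignores the interaction with \\ (an escaped backslash followed by u{...}), which a real lexer would have to handle.

```python
import re

# \u{...} with one or more hex digits inside the braces.
UNICODE_ESCAPE = re.compile(r"\\u\{([0-9a-fA-F]+)\}")


def expand_unicode_escapes(body: str) -> str:
    """Replace every \\u{H+} in body with the codepoint it names."""

    def replace(match: re.Match) -> str:
        codepoint = int(match.group(1), 16)
        # Assumption: only Unicode scalar values are accepted.
        if codepoint > 0x10FFFF or 0xD800 <= codepoint <= 0xDFFF:
            raise ValueError(f"invalid codepoint: {match.group(0)}")
        return chr(codepoint)

    return UNICODE_ESCAPE.sub(replace, body)
```

The braces make the escape self-terminating, so no fixed digit count or terminating symbol is needed.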

svick commented 2 years ago

How do escape sequences handle characters that require multiple code units in a given encoding?

For example, consider U+20AC Euro Sign. Could I specify it as both "\u16_20AC" and "\u8_E2\u8_82\u8_AC"? And would the value of "\u8_E2\u8_82\u8_AC".Length be 1? (Assuming that the natural type of a string literal is the UTF-16 System.String.)
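To illustrate the code-unit counts behind this question, here is a small Python sketch. It only shows how one codepoint maps to code units in each encoding; it says nothing about how Draco would treat per-code-unit escapes. The helper names are hypothetical.

```python
def utf8_units(text: str) -> bytes:
    """UTF-8 code units (bytes) of the text."""
    return text.encode("utf-8")


def utf16_length(text: str) -> int:
    """Number of UTF-16 code units; each unit is 2 bytes."""
    return len(text.encode("utf-16-le")) // 2


EURO = "\u20ac"       # U+20AC Euro Sign
ALIEN = "\U0001F47D"  # U+1F47D, outside the Basic Multilingual Plane

euro_bytes = utf8_units(EURO)        # three bytes: E2 82 AC
euro_utf16 = utf16_length(EURO)      # one UTF-16 code unit
alien_utf16 = utf16_length(ALIEN)    # surrogate pair: two code units
```

So the Euro sign is three UTF-8 code units but a single UTF-16 code unit, while a codepoint such as U+1F47D needs two UTF-16 code units.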

LPeter1997 commented 2 years ago

We have discussed that on the server (and we'll hopefully document the results of that soon). So far we have settled on the \u{...} idea to avoid messing with encodings; we only specify codepoints there. The encoding will depend on what the escape sequence is embedded in. For string literals, it would depend on what encoding we will use for strings.

Draco-lang / Language-suggestions

[WIP] Literal values #50
