Reconsider digit separators

jonmeow commented 2 years ago

At present Carbon restricts integer digit separators to every 3 digits, going back to https://github.com/carbon-language/carbon-lang/blob/trunk/proposals/p0143.md.
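As a rough illustration of the restriction described above, here is a minimal Python sketch of a checker for "separators every 3 digits" in decimal integer literals. The function name and regular expression are hypothetical and are not taken from the Carbon toolchain; the sketch ignores other literal rules such as leading zeros.

```python
import re

# Hypothetical checker: a decimal literal either has no separators at all,
# or has 1-3 leading digits followed by one or more "_ddd" groups.
_EVERY_THREE = re.compile(r"[0-9]+|[0-9]{1,3}(?:_[0-9]{3})+")


def follows_every_three_rule(literal: str) -> bool:
    """Return True if `literal` uses no separators or groups every 3 digits."""
    return _EVERY_THREE.fullmatch(literal) is not None


print(follows_every_three_rule("1_000_000"))    # True
print(follows_every_three_rule("1_2345_6789"))  # False: 4-digit groups
print(follows_every_three_rule("12_34_567"))    # False: Indian-style groups
```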

The Indian convention had been mentioned as a counterexample. However, it looks like CJK cultures were overlooked, maybe due to conflicting information in https://en.wikipedia.org/wiki/Decimal_separator#Digit_grouping (which says eastern countries have switched to 3 digit groups). https://www.statisticalconsultants.co.nz/blog/how-the-world-separates-its-digits.html says that China groups every 4 digits.

In light of the wider range of conventions, it may be worth supporting more variations (e.g., support 3 different conventions for digit groupings), or otherwise loosening restrictions. While that could end up with ambiguous placement for some numbers, larger numbers would be less ambiguous because the groupings would repeat.
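To make the three conventions concrete, here is a small Python sketch that groups the same digit string in three ways. The helper `group_digits` and the `CONVENTIONS` table are hypothetical names used only for illustration; the Indian pattern is assumed to be a rightmost group of 3 followed by groups of 2.

```python
def group_digits(digits: str, sizes: list[int]) -> str:
    """Insert '_' into a digit string, grouping from the right.

    `sizes` lists group widths starting at the rightmost group; the last
    width repeats for all remaining digits.
    """
    groups = []
    end = len(digits)
    i = 0
    while end > 0:
        width = sizes[min(i, len(sizes) - 1)]
        start = max(0, end - width)
        groups.append(digits[start:end])
        end = start
        i += 1
    return "_".join(reversed(groups))


# Hypothetical table of the conventions discussed in this thread.
CONVENTIONS = {
    "every 3": [3],
    "every 4": [4],
    "Indian": [3, 2],
}

for name, sizes in CONVENTIONS.items():
    print(f"{name:8} {group_digits('123456789', sizes)}")
# every 3  123_456_789
# every 4  1_2345_6789
# Indian   12_34_56_789
```

A short number such as `12345` comes out as `12_345` under both the every-3 and the Indian pattern, which is the kind of ambiguity mentioned above; the nine-digit example is distinct under all three.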

Note, I think this arose from this tweet

lexi-nadia commented 2 years ago

Besides international variations, there are also microformats. For example:

let mac_address: i64 = 0xa1_b2_c3_d4_e5_f6;
let uuid: i128 = 0x123e4567_e89b_12d3_a456_426614174000;
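For comparison, Python already accepts separators between any two digits, so the same microformat groupings can be written directly. This is a sketch in Python rather than Carbon; it only checks that the grouped spellings denote the expected values and that the UUID grouping lines up with the usual 8-4-4-4-12 text form.

```python
import uuid

# Separators follow the microformat, not a fixed digit count.
mac_address = 0xa1_b2_c3_d4_e5_f6
uuid_value = 0x123e4567_e89b_12d3_a456_426614174000

# The separators do not change the value.
assert mac_address == 0xa1b2c3d4e5f6

# The literal's groups match the standard hyphenated UUID text form.
assert str(uuid.UUID(int=uuid_value)) == "123e4567-e89b-12d3-a456-426614174000"
```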

mo-xiaoming commented 2 years ago

As a Chinese developer, I can say:

- Yes, in our culture, we're used to 4-digit groups.
- However, as a developer, I'm quite comfortable with 3-digit groups (Stockholm syndrome?).
- @lexi-nadia has a very good point on hex numbers.

So maybe adding this kind of variation is worthwhile.

nigeltao commented 2 years ago

I'm not saying you should do this, just throwing out a related idea...

In Carbon, 0x1A is valid but 0x1a is not. Unlike C/C++, hex digits are case sensitive.

In Wuffs, both are valid (from the compiler's point of view) but the formatter (the equivalent of clang-format, gofmt, rustfmt, etc) canonicalizes it as 0x1A. The convention is to run the formatter regularly (e.g. in on-file-save or pre-commit hooks) and so, in practice, you only see 0x1A and never 0x1a. But this still lets you copy/paste 0x1a from a StackOverflow post, even if that post discusses a different programming language.

FWIW, Wuffs' formatter's canonicalization of numeric literals also inserts underscores at every 6 digits for decimal and at every 4 digits for hexadecimal: it's 3_141592 and 0xDEAD_BEEF. The point being that there is one canonical spelling of every numeric literal, just like there's one canonical indentation style (and no endless tabs vs spaces debate). Whether it's every 3 or 6 digits, for decimal, isn't that important. As said about Go: "Gofmt's style is nobody's favourite, but gofmt is everybody's favourite".
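The canonicalization described above can be sketched in a few lines of Python. This is not Wuffs' actual formatter code; `canonicalize` is a hypothetical helper that assumes its input is already a valid decimal or `0x`-prefixed hexadecimal integer literal and does no validation.

```python
def canonicalize(literal: str) -> str:
    """Return one canonical spelling of a decimal or hex integer literal.

    Hex digits are upper-cased and grouped every 4 digits; decimal digits
    are grouped every 6. Groups are counted from the right.
    """
    text = literal.replace("_", "")  # drop whatever grouping the author used
    if text[:2].lower() == "0x":
        prefix, digits, width = "0x", text[2:].upper(), 4
    else:
        prefix, digits, width = "", text, 6
    groups = []
    end = len(digits)
    while end > 0:
        groups.append(digits[max(0, end - width):end])
        end -= width
    return prefix + "_".join(reversed(groups))


print(canonicalize("3141592"))      # 3_141592
print(canonicalize("0xdeadbeef"))   # 0xDEAD_BEEF
print(canonicalize("0xDE_ADBEEF"))  # 0xDEAD_BEEF
```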

chandlerc commented 2 years ago

> In Carbon, 0x1A is valid but 0x1a is not. Unlike C/C++, hex digits are case sensitive.
>
> In Wuffs, both are valid (from the compiler's point of view) but the formatter (the equivalent of clang-format, gofmt, rustfmt, etc) canonicalizes it as 0x1A. The convention is to run the formatter regularly (e.g. in on-file-save or pre-commit hooks) and so, in practice, you only see 0x1A and never 0x1a. But this still lets you copy/paste 0x1a from a StackOverflow post, even if that post discusses a different programming language.

I think this is a pretty separate question, so if you'd like to pursue it I would move it. FWIW, we can have a near perfect recovery here in the frontend and suggest edits, so I think the difference isn't huge, but it is a difference.

> FWIW, Wuffs' formatter's canonicalization of numeric literals also inserts underscores at every 6 digits for decimal and at every 4 digits for hexadecimal: it's 3_141592 and 0xDEAD_BEEF. The point being that there is one canonical spelling of every numeric literal, just like there's one canonical indentation style (and no endless tabs vs spaces debate). Whether it's every 3 or 6 digits, for decimal, isn't that important. As said about Go: "Gofmt's style is nobody's favourite, but gofmt is everybody's favourite".

Given the semantically meaningful different groupings mentioned here, I think this question should also cover not canonicalizing digit grouping in the formatter. FWIW, I'm sufficiently convinced by things like credit card numbers, UUIDs, and MAC addresses that we should have this flexibility even outside of any ideas around regional differences or different bases.

nigeltao commented 2 years ago

FWIW, being hexadecimal, UUIDs and MAC addresses aren't unusably bad if you enforce underscores every 4 digits. The natural microformat boundaries are already multiples of two bytes. Even if the natural UUID grouping involves the last 12 hex digits, that's still easy to see here:

let mac_address: i64 = 0xa1b2_c3d4_e5f6;
let uuid: i128 = 0x123e_4567_e89b_12d3_a456_4266_1417_4000;

In the MAC address case, "what's the 3rd byte" is still much easier to eyeball with "underscore every 4" than with no underscores at all.

As for credit card numbers, do people actually process them as numbers (as opposed to strings)?

chandlerc commented 2 years ago

> FWIW, being hexadecimal, UUIDs and MAC addresses aren't unusably bad if you enforce underscores every 4 digits. The natural microformat boundaries are already multiples of two bytes. Even if the natural UUID grouping involves the last 12 hex digits, that's still easy to see here:
>
> let mac_address: i64 = 0xa1b2_c3d4_e5f6;
> let uuid: i128 = 0x123e_4567_e89b_12d3_a456_4266_1417_4000;
>
> In the MAC address case, "what's the 3rd byte" is still much easier to eyeball with "underscore every 4" than with no underscores at all.

I still find the versions above significantly more readable than these. I agree that no digit separators would be even worse, but I don't think that's really the question. I think the readability gain of format-specific grouping is worthwhile based on the examples here.

zygoloid commented 2 years ago

We seem to have good evidence here that we should reconsider this decision, and a good level of consensus for making a change. The next step would be for someone to write a proposal presenting these arguments.

ethomag commented 2 years ago

Maybe I misinterpreted this (in docs/design/lexical_conventions/numeric_literals.md)

> For real-number literals, digit separators can appear in the decimal and hexadecimal integer portions (prior to the period and after the optional e or mandatory p)

I don't understand the restriction of having digit separators only to the left of the decimal point for real numbers and I could not find any rationale behind it in the docs. Consider:

let nanosecond: f64 = 0.000000001;

vs

let nanosecond: f64 = 0.000_000_001;

I think that improves readability as much as digit separators in the integer part.
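As a point of comparison from another language, Python accepts separators on both sides of the decimal point. The sketch below is Python, not Carbon, and only shows that the grouped and ungrouped spellings denote the same value.

```python
# Separators in the fractional part are purely visual.
nanosecond_plain = 0.000000001
nanosecond_grouped = 0.000_000_001

assert nanosecond_plain == nanosecond_grouped == 1e-9

# The same grouping works when parsing from text.
assert float("0.000_000_001") == nanosecond_grouped
```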

jonmeow commented 2 years ago

Created a proposal in #1983. Let me know if I've misunderstood the leads' direction there; I can always flip around alternatives if the leads want a different choice.

> I don't understand the restriction of having digit separators only to the left of the decimal point for real numbers and I could not find any rationale behind it in the docs.

AFAICT your interpretation is correct, although the proposal has some conflicting examples in ties. Anyways, I think #1983 should produce clear rationale either way.

ethomag commented 2 years ago

Thanks @jonmeow for your reply. My concern was not about ties, but strictly readability. I think scientific notation is symmetric around the decimal point. Being able to group decimal digits in the integer part, so that you can easily eyeball which parts are grams, kilograms etc, can help avoid mistakes when defining constants. I just think the same argument holds for milligrams, micrograms etc.

I could not find any rationale that I could understand in the referred links, but it seems you have already considered this. I was just naïvely thinking that this was something that was overlooked.

I am truly amazed by your work, it's quite a challenge you have taken on!

chandlerc commented 2 years ago

(removing good-first-issue label as this is now in progress)

jonmeow commented 2 years ago

I believe this is resolved by #1983 though I still need to update the design (but I think we can call the leads question closed).

carbon-language / carbon-lang

Reconsider digit separators #1485