carbon-language / carbon-lang

Carbon Language's main repository: documents, design, implementation, and related tools. (NOTE: Carbon Language is experimental; see README)
http://docs.carbon-lang.dev/

Is concatenation going to be written `+` or something else? #457

Closed josh11b closed 1 year ago

josh11b commented 3 years ago

For example, will you concatenate strings with + or another operator? Context: https://en.wikipedia.org/wiki/Concatenation

Advantage of +:

Advantage of something else:

zygoloid commented 3 years ago

Even if there were no types that support both addition and concatenation, using distinct operators for addition versus concatenation seems reasonable for expressivity purposes. And I think concatenation is an operation that's common enough that allocating an extra symbol for it would be reasonable; I don't consider allocating an extra symbol to be a significant disadvantage. However, it does seem to create a potential interoperability friction with C++. When calling Carbon from C++, would we map a C++ + to both addition and concatenation operators and see which one works? When calling C++ from Carbon, would we map a Carbon + and a Carbon concatenation operator to the single C++ + operator? Or maybe we could add a source-level annotation on the C++ side to say which kind of plus an operator+ provides?

If we are able to support a single operator token being all of prefix, infix, and postfix, we could use an infix ++ for this purpose (not to avoid allocating an operator, but simply because it seems to have familiar "additive" connotations while not being +); this symbol is used for concatenation in some functional programming languages.

Concatenation is not commutative, so + is unnatural mathematically.

I think there's a broader question here: do we want to have some notion of semantics associated with our (overloadable) operators, or is it just a free-for-all? For example, should we be able to assume that + and * form a ring, and perhaps even optimize accordingly?

I think we need to be careful here, and consider how we would fit floating-point types and integer overflow into such a model. Commutativity of + is rare in being a mathematical property that actually holds for integer and floating-point types (at least, if we view the choice of which payload to propagate in NaN + NaN as being nondeterministic). I wonder how much we should worry about ensuring this property holds, if we can't do that for any other mathematical property. Perhaps we can still provide certain guidelines about how overloads should work, even if they are not enforceable or even strictly true in practice -- it seems helpful to a reader to be assured that + does something plusish -- but that sounds like a style guide rule rather than a language rule. Nonetheless if we have a separate concatenation operator I would expect people to use it rather than overloading + to work on strings.
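The floating-point point above can be checked directly. The following is a minimal sketch in C++, assuming IEEE 754 double precision and default (non fast-math) compilation; the helper names `AddIsCommutativeFor`, `SumLeft`, and `SumRight` are hypothetical and exist only for this illustration. It shows that swapping operands of `+` gives the same result, while regrouping them (associativity) does not.

```cpp
#include <cassert>

// Hypothetical helpers for illustration only.

// Commutativity: a + b compared with b + a.
inline bool AddIsCommutativeFor(double a, double b) { return a + b == b + a; }

// Associativity: the same three operands, grouped two different ways.
inline double SumLeft(double a, double b, double c) { return (a + b) + c; }
inline double SumRight(double a, double b, double c) { return a + (b + c); }

// With a = 1e20, b = -1e20, c = 1.0:
//   SumLeft  computes (1e20 - 1e20) + 1.0, so the 1.0 survives.
//   SumRight computes 1e20 + (-1e20 + 1.0); the 1.0 is far below the
//   precision available at magnitude 1e20, so it is rounded away first.
```

So a rule such as "`+` and `*` form a ring" would already fail for the built-in floating-point types, while "`+` commutes" would not.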

tkoeppe commented 3 years ago

Concatenation is not commutative, so + is unnatural mathematically.

I think there's a broader question here: do we want to have some notion of semantics associated with our (overloadable) operators, or is it just a free-for-all? For example, should we be able to assume that + and * form a ring, and perhaps even optimize accordingly?

This already assumes that these binary operators are homogeneous (T, T) -> T, and precludes simple affine constructions like pointer and iterator arithmetic. Is that a restriction we would want?

geoffromer commented 3 years ago

I think that the canonical syntax for concatenating N strings in Carbon should have optimal performance in at least the following senses:

It's far from clear that an infix binary operator will be able to satisfy these requirements, regardless of how we spell it; as far as I can tell, the only way to do so is for the language to provide some way of transforming a chain of operators into the equivalent of a single function call. That could take the form of something like expression templates, or some sort of bespoke language-level rewrite rule as discussed in #451, or maybe some kind of static reflection, or who knows what else.

On the other hand, I am very confident that a function call syntax will be able to satisfy these requirements in Carbon, because it can already do so pretty straightforwardly in C++ (see absl::StrCat for an example), so a function-call syntax seems like the option that will impose the least burden on the design of the language. I think it will also impose the least burden on the reader, because the syntax directly corresponds to the semantics, without an intervening transformation step, and because we can use a meaningful name rather than repurposing or inventing some punctuation mark.
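To make the function-call approach concrete, here is a minimal sketch in C++17. It is not `absl::StrCat` and does not claim to match its implementation; `ConcatAll` is a hypothetical name. The idea it illustrates is that a single call sees every operand at once, so it can compute the final length up front and allocate once, whereas a chain such as `a + b + c + d` may build an intermediate string at each step.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <string>
#include <string_view>

// Hypothetical helper for illustration: concatenates all pieces into one
// string, reserving the full result size before copying anything.
inline std::string ConcatAll(std::initializer_list<std::string_view> pieces) {
  std::size_t total = 0;
  for (std::string_view piece : pieces) total += piece.size();

  std::string out;
  out.reserve(total);  // at most one allocation, sized for the final result
  for (std::string_view piece : pieces) out.append(piece.data(), piece.size());

  assert(out.size() == total);
  return out;
}
```

A call looks like `ConcatAll({"Hello", ", ", name, "!"})`, where `name` is any `std::string` or string literal. The operands are only viewed, never copied into temporaries.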

jonmeow commented 3 years ago

I think it's worth covering cross-language precedent a bit more if breaking away from C++ syntax, particularly for string concatenation. Going through the top 20 at https://pypl.github.io/PYPL.html, comparing with https://rosettacode.org/wiki/String_concatenation:

zygoloid commented 3 years ago

There are a collection of operations in this space, including at least these:

We presumably want some combination of these operations to be available, but not necessarily all of them. What use cases do we want to address with concatenation in particular rather than one of the other operations?

Given @geoffromer's comment and Carbon's efficiency goal, we should consider eschewing concatenation in favor of other options.

geoffromer commented 3 years ago

  • Given two strings, form a third string that is the concatenation of those two as efficiently as possible ("concatenation").

I suggest calling this "binary concatenation"; in my experience the term "concatenation" is often applied to APIs taking an arbitrary number of operands (e.g. StrCat, the unix cat command, etc), so trying to use it in this more restrictive sense seems likely to cause confusion.

  • Given only a list of arguments, format them in a canonical way and append them, as if formatting with a format string "%0%1%2..." or similar ("StrCat"). Note that this is a special case of both interpolation and concatenation.

I think that's a generalization of binary concatenation rather than a special case, for two reasons. The boring reason is that "an arbitrary number" is a generalization rather than special case of "two", but the interesting reason is the addition of "format them in a canonical way": as defined, this operation can take any operand that has a canonical format-as-string operation, whereas concatenation was defined to take only string operands.

Interpolation and streaming can likewise be generalized to support arbitrary string-formattable types (e.g. printf and <<, respectively). In principle binary concatenation could be generalized in that way, but I've never seen that done in practice, possibly because such a generalized binary concatenation operation can't be spelled as infix + (or any other overloaded spelling), because that would lead to ambiguity about e.g. whether 1 + 2 is 3 or "12".

I should note that my use of StrCat as an example wasn't intended to focus on the fact that it supports non-string types, and in fact I'm somewhat skeptical of generalizing concatenation (or streaming) in that way. But to the extent that we do want to generalize concatenation in that way, that's another reason to avoid using a binary operator syntax for it.
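The ambiguity around `1 + 2` can be illustrated with a small C++17 sketch. `FormatCat` is a hypothetical name, and formatting through `operator<<` on a string stream is only one assumed choice of "canonical format". Because the formatting concatenation has its own name, it never competes with arithmetic `+` for the same operands.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper for illustration: formats each argument with its
// stream operator and appends the results in order.
template <typename... Args>
std::string FormatCat(const Args&... args) {
  std::ostringstream os;
  (os << ... << args);  // left fold: ((os << a) << b) << c ...
  return os.str();
}

// 1 + 2 stays integer addition; FormatCat(1, 2) is the formatting
// concatenation. Neither spelling has to guess which one was meant.
```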

zygoloid commented 3 years ago

Interpolation and streaming can likewise be generalized to support arbitrary string-formattable types [...]. In principle binary concatenation could be generalized in that way, but I've never seen that done in practice

This is commonplace in dynamically-typed languages -- for example, JavaScript and Perl both do this. VBScript does too, but uses & as the formatting binary concatenation operator rather than +, so eg 1 + 2 is 3 but 1 & 2 is "12" (though "1" + "2" is also "12", so it's not the case that + is only a numeric operation). It seems like a desirable operation in at least some domains, but I share your inclination to avoid using a binary operator for this purpose. But perhaps that doesn't strongly inform the question of whether we should expose a non-formatting binary concatenation operator as +.

github-actions[bot] commented 3 years ago

We triage inactive PRs and issues in order to make it easier to find active work. If this issue should remain active or becomes active again, please comment or remove the inactive label. The long term label can also be added for issues which are expected to take time. This issue is labeled inactive because the last activity was over 90 days ago.

lexi-nadia commented 2 years ago

This already assumes that these binary operators are homogeneous (T, T) -> T, and precludes simple affine constructions like pointer and iterator arithmetic. Is that a restriction we would want?

Context: https://en.m.wikipedia.org/wiki/Level_of_measurement

I think the examples you bring up, along with others (timestamps; temperatures in °C or °F), all fit vaguely under the "interval scales" category -- they're references with no magnitude.

For an interval type T and a corresponding offset type O, we need to support the following operations:

  1. Offset: T ± O -> T
  2. Difference: T - T -> O

Hypothetically, we could have different spellings for these operations, but it's hard to imagine this being a problem in practice. The types are different. (I might even suggest that we use different interfaces to represent these operations, even if they map to the same operator tokens.)
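C++'s `std::chrono` already follows this shape, which makes it a convenient illustration. In the sketch below the aliases `Point` and `Offset`, the `CanAdd` trait, and the `Elapsed` function are hypothetical names introduced here; the `time_point` and `duration` operators are the standard library's. The static assertions check that offsetting a point yields a point, subtracting two points yields an offset, and adding two points is not defined at all.

```cpp
#include <cassert>
#include <chrono>
#include <type_traits>
#include <utility>

using Clock = std::chrono::steady_clock;
using Point = Clock::time_point;  // interval type T
using Offset = Clock::duration;   // offset type O

// Hypothetical detection trait: true when A + B is a valid expression.
template <typename A, typename B, typename = void>
struct CanAdd : std::false_type {};
template <typename A, typename B>
struct CanAdd<A, B,
              std::void_t<decltype(std::declval<A>() + std::declval<B>())>>
    : std::true_type {};

// Offset: T + O -> T and T - O -> T.
static_assert(std::is_same_v<
              decltype(std::declval<Point>() + std::declval<Offset>()), Point>);
static_assert(std::is_same_v<
              decltype(std::declval<Point>() - std::declval<Offset>()), Point>);
// Difference: T - T -> O.
static_assert(std::is_same_v<
              decltype(std::declval<Point>() - std::declval<Point>()), Offset>);
// T + T is rejected: points have no magnitude to add.
static_assert(!CanAdd<Point, Point>::value);

// Moves a point forward by `step`, then measures how far it moved.
inline Offset Elapsed(Point start, Offset step) {
  Point later = start + step;
  return later - start;
}
```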

Concatenation is a very different case for me. Rather than defining new operations, it reuses T + T -> T in a way that breaks commutativity. Multiplication barely makes sense here, and subtraction and division make no sense at all. (And there are cases, like vectors, where both addition and concatenation could make sense!) That's why i'm so uneasy about using + for this case; it's just not addition-like at all.

lexi-nadia commented 2 years ago

(As an aside, i think vector concatenation may be a better motivating example than string concatenation.)

mossaiby commented 2 years ago

Let me add that in Julia, the string concatenation is performed using *. It was odd for me at first, but became natural when I thought of it as in math; a * b = ab, hence 'a' * 'b' = 'ab'. Hope it helps.

chandlerc commented 2 years ago

Fundamentally, I think Carbon should consider the + operator symbol to represent some kind of "add" operation where "add" is an abbreviation for addition. The language should not decide that + can also mean concatenation, it should only consider it addition.

Whether it makes sense to use that language concept of addition for a type to mean concatenation is a question for the author of that type.

I can imagine types which have really good reasons that the only possible and useful model of addition on the type is concatenation. But I can also imagine many, many types for which that is not the case. @lexi-nadia gives a great example of vectors.

I don't think we can hope for Carbon to have the language indicate one way or another here, at least not at this stage.

This in turn raises a few tightly related questions we need to answer to close this out:


First: should we add a new operator symbol to Carbon so that it could be a language-level symbol for concatenation?

I suggest that we do not do this (for now). I think there are a lot of healthy ideas for how to make even types where concatenation is unambiguously not well aliased to addition reasonably ergonomic.

We can revisit this in the future if we get substantial information indicating that many types would benefit from this expansion of our operator set. But operators are a reasonably expensive syntactic space to begin with, and I think at least today I don't see nearly enough motivation. I would much prefer to invest in other syntax tools that address the same or similar use cases.


Second: should strings in Carbon use addition to mean concatenation?

I somewhat strongly think this is not the right direction.

There are many challenges with this model raised in this thread already. I'll add one more that for me is particularly important: strings are especially likely to be used accidentally in place of some other type. There is even the joking phrase "stringly typed APIs" because strings get (over)used so often when there is notionally some other typed data serialized within the string.

Because of this frequent type confusion, I think we should be especially careful in using expression syntaxes with strings that might be surprising to apply to a string. I worry that indeed, they will, and the code will become harder to read as the reader reasonably assumes the wrong type rather than deducing an unconventional operation.

Combining this with the other issues, I think we should focus on techniques like string interpolation, APIs like StrCat, etc. These will still give us good migration strategies for C++ code that uses string addition, and will IMO at least result in more readable code.


Aside: I'd prefer to not anchor this around commutativity FWIW. I think that we should start without assuming commutativity for operators, even where it is extremely common. For example, if we assume + is commutative I think it will be surprising that we don't assume * is commutative. But * has many more cases where this isn't appropriate. If we want to add the ability to reason about commutativity of operators, I think we should do so in a way that can be controlled by the types in question so that both commutative and non-commutative cases can be supported for the same syntax, and so that we can use the tools for any particular operator.
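One way to picture type-controlled commutativity is an opt-in trait. The following C++17 sketch is purely illustrative: `AddCommutes` and `MayReorderAdd` are hypothetical names, nothing like them is proposed in this thread, and Carbon's eventual mechanism (if any) may look quite different. The default is conservative, and each type decides for itself whether its `+` may be treated as commutative.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical opt-in trait: by default nothing is assumed about operator+.
template <typename T>
struct AddCommutes : std::false_type {};

// Integer addition commutes, so int opts in.
template <>
struct AddCommutes<int> : std::true_type {};

// std::string spells concatenation as +, which does not commute, so it
// simply keeps the conservative default.

// Generic code (or a tool) may reorder operands of + only for opted-in types.
template <typename T>
constexpr bool MayReorderAdd() {
  return AddCommutes<T>::value;
}

static_assert(MayReorderAdd<int>());
static_assert(!MayReorderAdd<std::string>());
```

The same syntax, `a + b`, then serves both the commutative and the non-commutative case, and a parallel trait could be declared for `*` or any other operator.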

eeshvardasikcm commented 1 year ago

This is an issue for leads. Please stop me from commenting on issues for leads at any time if you need to. I will be studying more Carbon docs soon. One reason for this comment is to encourage Carbon to be able to answer the question "Why is Carbon language not yet making a decision about the usage of the + operator with strings?"

I don't consider allocating an extra symbol to be a significant disadvantage.

Creativity should never be stifled. However, an extra symbol can be a disadvantage if it's not thoroughly referenced against math theory and potential linguistic evolution. In this regard, first implementation operators looking and compiling similar to C++ and other popular languages is very likable.

I think it's worth covering cross-language precedent a bit more if breaking away from C++ syntax

The difficult question in this regard is "How do we allow advances to be made using creativity, allow C++ people to have a familiar experience, and at the same time prevent Carbon from suffering expensive language maintenance?" Interesting to note is that following the stringent laws of mathematics closely will not allow character flaws that themselves may lead to new innovations.

This Carbon evolution question appears to be less difficult if Carbon is going to allow the early stages of Carbon language 2.0 to start from scratch again with another round of experimental growth, followed by another roadmap date that ends experimental development for 2.0.

Right now my favorite let command statement when chatting with clang is "let it be so for now because I either don't know what's going to happen next or I don't have time to implement what I plan on allocating." Followed by the typo excuse "canst boolean u = do(it, this, time.now())." Seriously, if const is already Provisional, then let's make const a reality before defining interstellar '0' dials.

Concatenation is not commutative, so + is unnatural mathematically.

Kotlin uses s.plus(any) , while clanging heavily at +. Hoisting any it in Kotlin makes for easy coding and powerful results. Carbon is likely to favor more of a concrete mathematics based linguistics before encountering state. The word 'plus' is synonymous with the word 'add' and the phrase 'in addition to'.

ambiguity about e.g. whether 1 + 2 is 3 or "12".

The epsilon calculus of this issue is endless. Carbon should continue proceeding in this regard with labels like Operator Proposal Known Constraint, Provisional, and the productivity favorite Roadmap Horizon Concept Identified.

Perhaps we can still provide certain guidelines about how overloads should work, even if they are not enforceable or even strictly true in practice

General guidelines on making operator proposals. Guidelines should then evolve incrementally. Each new unique operator proposal should be bounded by the practical implementation of a function-call syntax version being created at Carbon's current roadmap position, upon instance of new operator proposal. Carbon should be able to have practical mastery of the English and Greek languages, in addition to mathematics terminology, before making any new operator proposal; in order to achieve the utmost perfection of science. Right now, Carbon has keywords like let and match. Function-call style syntax can take over where keywords stop for now. That's okay with me. New operators could follow thereafter for all-time-Carbon-greatness, but Carbon is an evolving language. Carbon doesn't need to advance its own scientific operator potential until concrete decisions are made about how to describe operators and how operators are used.

This already assumes that these binary operators are homogeneous (T, T) -> T, and precludes simple affine constructions like pointer and iterator arithmetic. Is that a restriction we would want?

Right now, I don't like restrictions early on like this that may lead to Carbon evolutions of its own operators. Carbon's ability to self-evolve its own operators is best left for later. Contributors don't need the psychological responsibility of, today, envisioning and manifesting every detail of an ideal Carbon future in mathematics. Mathematicians, statisticians and ecologists are already taking those responsibilities. Today's automobile requires Earth's carbon. The creation of Carbon language is an effect, caused by previous scientific advances.

Multiplication barely makes sense here, and subtraction and division make no sense at all.

Carbon has yet to take shape, assume identity, and look in the mirror. I don't think Carbon is yet ready for the white rabbit. Generally speaking, it's possible to set an operator proposal guideline that requires discrimination of various mathematical descriptions of operator use potential, and then requires concrete use case examples of each distinctive mathematical possibility. These guidelines may be designed to keep developers focused on the big picture; while allowing mathematicians, data scientists, and other scientists to make operator proposals. This proposal guideline may take shape and easily advance to roadmap and implementation of operator proposals, without the need to hold scientific discourse every time a new operator possibility appears. During this time now of first implementation, I like these:

I think it's worth covering cross-language precedent a bit more if breaking away from C++ syntax

There are a collection of operations in this space, including at least these: ...

a function-call syntax seems like the option that will impose the least burden on the design of the language.

Clever operators can be proposed after a function-call syntax is in place. Once practical implementations are in place, operator proposals will practically write themselves according to the laws of functional decomposition. I like the idea of composing Carbon in a way that any conductor may effortlessly sing while Carbon naturally decomposes according to score at optional tempos, moods or situational nuances. At release points the composition performance is debuted, and then another Carbon score will be requested.

Pursuit of the technically accurate, practically impractical, deprecated, usable, good idea, bad solution:

Roadmap Horizon Identified Solution is probably not the best name. Operator Proposal Known Constraint isn't a great name either.

Maybe:

is a good solution? idk.

I have just referenced some of the project docs. I think this is not my area for making changes or updates to the roadmap. There's a suggestion in the docs that I may be able to get involved with the roadmap of my subteam. I haven't been officially assigned to any subteam. The nature of the tasks that I have been asked to get involved with indicates that there may be a possibility of forming a subteam. I may be able to translate these suggestions I made directly to the work I'm doing for Carbon right now assuming a loosely formed subteam. Maybe if I go ahead and consider that I am my own subteam, then my project management skills can be put to use immediately without any worry of overlap and my personal roadmap may tend to merging with the larger subteam. https://github.com/carbon-language/carbon-lang/blob/trunk/docs/project/roadmap_process.md

I should probably study more of what chandlerc has already elaborated upon in volumes worth of his work on Carbon Project thus far.

Combining this with the other issues, I think we should focus on techniques like string interpolation, APIs like StrCat, etc. These will still give us good migration strategies for C++ code that uses string addition, and will IMO at least result in more readable code.

Practical. Productive.

Second: should strings in Carbon use addition to mean concatenation? I somewhat strongly think this is not the right direction. I don't think we can hope for Carbon to have the language indicate one way or another here, at least not at this stage.

Roadmap Horizon Concept Identified.

I think there are a lot of healthy ideas for how to make even types where concatenation is unambiguously not well aliased to addition reasonably ergonomic.

Accurate.

Second: should strings in Carbon use addition to mean concatenation? I somewhat strongly think this is not the right direction.

Carbon earns respect by ending problems like this. If it's not an overly strong indicator, I propose 'defining that addition not being considered a similar meaning to concatenation' be considered an Operator Proposal Known Constraint.

Aside: I'd prefer to not anchor this around commutativity FWIW. I think that we should start without assuming commutativity for operators, even where it is extremely common. For example, if we assume + is commutative I think it will be surprising that we don't assume * is commutative.

If it's not an overly strong indicator, I propose that 'the concept of commutativity of operators' becomes an Operator Proposal Known Constraint.

If we want to add the ability to reason about commutativity of operators, I think we should do so in a way that can be controlled by the types in question so that both commutative and non-commutative case can be supported for the same syntax, and so that we can use the tools for any particular operator.

Efficient.

zygoloid commented 1 year ago

Leads decision: we follow @chandlerc's most recent comment. + is overloadable but we intend for it to mean "add" not "concatenate", and Carbon's string type will use some other mechanism for string concatenation.